Abstract
Recently, multimodal relation extraction (MRE) and multimodal named entity recognition (MNER) have attracted widespread attention. However, prior work faces several challenges, including inadequate semantic representation of images, difficulty in fusing cross-modal information, and irrelevance between some images and their accompanying text. To enhance semantic representation, we employ CLIP's image encoder, a Vision Transformer (ViT), to generate visual features of different semantic intensities. To bridge the cross-modal semantic gap, we introduce an image caption generation model and BERT to sequentially generate image captions and their features, casting both modalities as text. Dynamic gates and attention mechanisms are introduced to efficiently fuse the visual features, image caption text features, and sentence text features, mitigating noise caused by image-text irrelevance. With these components, we construct an efficient MRE and MNER model. Experimental results show that the proposed model improves performance by 2.2% and 0.18% on the MRE and MNER datasets, respectively. Our code is available at https://github.com/SiweiWei6/VIT-CMNet.
This paper was recommended by Regional Editor Takuro Sato.
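As a rough illustration of the pipeline summarized in the abstract, the sketch below extracts ViT visual features with CLIP's image encoder, encodes the sentence together with a generated caption using BERT, and fuses the two streams with a dynamic gate plus cross-attention. This is a minimal sketch under stated assumptions, not the released implementation: the module name GatedCrossModalFusion, the exact gating formula, the placeholder caption string, and the Hugging Face checkpoints are illustrative choices rather than the authors' code.

```python
# Hypothetical sketch of the abstract's pipeline (not the authors' implementation).
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, BertModel, BertTokenizer

class GatedCrossModalFusion(nn.Module):
    """Illustrative dynamic-gate + cross-attention fusion of text and visual features."""
    def __init__(self, text_dim=768, vis_dim=768, heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, text_dim)  # align visual dim to text dim
        self.cross_attn = nn.MultiheadAttention(text_dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * text_dim, text_dim), nn.Sigmoid())

    def forward(self, text_feats, vis_feats):
        vis = self.vis_proj(vis_feats)
        # Text tokens attend to visual patch features (cross-modal attention).
        attended, _ = self.cross_attn(query=text_feats, key=vis, value=vis)
        # A token-wise gate decides how much visual evidence to admit,
        # suppressing noise when the image is irrelevant to the text.
        g = self.gate(torch.cat([text_feats, attended], dim=-1))
        return g * attended + (1 - g) * text_feats

# Encoders (standard pretrained checkpoints from the Hugging Face hub).
vit = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
bert = BertModel.from_pretrained("bert-base-uncased")
tok = BertTokenizer.from_pretrained("bert-base-uncased")

# Dummy tensor stands in for a preprocessed image of shape (batch, 3, 224, 224).
pixel_values = torch.randn(1, 3, 224, 224)
vis_feats = vit(pixel_values=pixel_values).last_hidden_state  # (1, 50, 768) patch features

# "caption" stands in for the output of an image caption generation model.
sentence = "JFK and Obama at the White House ."
caption = "two men standing in front of a building"
enc = tok(sentence, caption, return_tensors="pt")       # sentence + caption as one sequence
text_feats = bert(**enc).last_hidden_state               # (1, seq_len, 768)

fused = GatedCrossModalFusion()(text_feats, vis_feats)    # (1, seq_len, 768)
print(fused.shape)  # fused token representations for downstream MRE / MNER heads
```

In this sketch, the gated residual keeps the original BERT representation when the gate closes, which is one common way to realize the "mitigating noise from image-text irrelevance" behavior described above; the paper's actual gating and attention design may differ.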