Efficient Image Semantic Representation and Visual–Textual Semantic Fusion for Multimodal Relation Extraction and Multimodal-Named Entity Recognition

    https://doi.org/10.1142/S0218126625501142

    Recently, multimodal relation extraction (MRE) and multimodal named entity recognition (MNER) have attracted widespread attention. However, prior work faces challenges including inadequate semantic representation of images, difficult cross-modal information fusion, and irrelevance between some images and their accompanying text. To enhance semantic representation, we employ CLIP’s image encoder, a vision transformer (VIT), to generate visual features representing different semantic intensities. To address cross-modal semantic gaps, we introduce an image caption generation model and BERT to generate image captions and their features, transforming both modalities into text. Dynamic gates and attention mechanisms are introduced to efficiently fuse the visual features, image caption features, and text features, mitigating noise from image–text irrelevance. Combining these components, we construct an efficient MRE and MNER model. Experimental results show that the proposed model improves performance by 2.2% and 0.18% on the MRE and MNER datasets, respectively. Our code is available at https://github.com/SiweiWei6/VIT-CMNet.
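
    The abstract describes fusing visual features, caption features, and text features through dynamic gates and attention. The sketch below illustrates one plausible form of such gated cross-modal fusion in PyTorch; the module name, dimensions, and gating formulation are illustrative assumptions rather than the authors' published implementation (see the linked repository for the actual code).

    ```python
    # Illustrative sketch of gated visual–textual fusion, assuming 768-dim
    # encoder outputs (e.g., BERT text/caption features and projected ViT
    # patch features). Not the authors' exact architecture.
    import torch
    import torch.nn as nn

    class GatedCrossModalFusion(nn.Module):
        def __init__(self, dim: int = 768, num_heads: int = 8):
            super().__init__()
            # Text tokens attend to visual patches and to caption tokens.
            self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.cap_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Dynamic gate: controls how much visual evidence is kept per token,
            # suppressing noise when the image is irrelevant to the text.
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
            self.out = nn.Linear(3 * dim, dim)

        def forward(self, text_feat, vis_feat, cap_feat):
            # text_feat: (B, Lt, D) BERT features of the sentence
            # vis_feat:  (B, Lv, D) ViT patch features projected to D
            # cap_feat:  (B, Lc, D) BERT features of the generated image caption
            vis_ctx, _ = self.vis_attn(text_feat, vis_feat, vis_feat)
            cap_ctx, _ = self.cap_attn(text_feat, cap_feat, cap_feat)
            g = self.gate(torch.cat([text_feat, vis_ctx], dim=-1))
            fused = torch.cat([text_feat, g * vis_ctx, cap_ctx], dim=-1)
            return self.out(fused)  # (B, Lt, D) features for the RE / NER heads

    # Toy usage with random tensors standing in for encoder outputs.
    fusion = GatedCrossModalFusion()
    text = torch.randn(2, 32, 768)
    vision = torch.randn(2, 50, 768)
    caption = torch.randn(2, 16, 768)
    print(fusion(text, vision, caption).shape)  # torch.Size([2, 32, 768])
    ```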

    This paper was recommended by Regional Editor Takuro Sato.