To address the limitations of visible-light cameras, which cannot function effectively at night, infrared cameras have become the optimal supplement. However, current methods for visible–infrared cross-modality person re-identification focus solely on feature combination and fusion, neglecting the importance of feature alignment. To address this issue, we introduce a novel Hierarchical Feature Fusion (HFF) network, which comprehensively integrates features across levels through sequential feature extraction. Specifically, we design a pixel-level contrastive loss function that makes pixels in the same region of cross-modality images more similar while distinguishing pixel features at different locations, thereby extracting similar low-frequency information in the shallow network. Furthermore, in the deep network, we extract high-frequency information from the different modalities through a Bi-Transformer Layer and propose Node-level Coupling Attention and Modality-level Decoupling Attention. Coupling attention couples high-frequency information within the same modality, while decoupling attention decouples high-frequency information between different modalities to obtain richer texture and detail information. Experimental results validate the superiority of the proposed HFF network in cross-modality person re-identification: our method achieves 87.16% and 95.23% Rank-1 accuracy on the SYSU-MM01 and RegDB datasets, respectively, and extensive experiments confirm its effectiveness in feature alignment.
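The pixel-level contrastive loss described in this abstract can be read as an InfoNCE-style objective over spatially aligned pixel features: the visible and infrared features at the same location form a positive pair, while features at all other locations serve as negatives. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the function name, the temperature value, and the simplification of treating each spatial location as a single feature vector are our assumptions.

```python
import numpy as np

def pixel_contrastive_loss(feat_vis, feat_ir, temperature=0.1):
    """Hypothetical sketch of a pixel-level cross-modality contrastive loss.

    feat_vis, feat_ir: arrays of shape (N, C) holding C-dimensional features
    for N pixel locations of a visible and an infrared image. Features at
    the same row index are treated as a positive pair (same region); every
    other cross-modality pair is a negative.
    """
    # L2-normalise so dot products become cosine similarities
    v = feat_vis / np.linalg.norm(feat_vis, axis=1, keepdims=True)
    r = feat_ir / np.linalg.norm(feat_ir, axis=1, keepdims=True)
    # (N, N) similarity matrix between every cross-modality pixel pair
    sim = (v @ r.T) / temperature
    # InfoNCE: the diagonal entry (same location) is the positive class;
    # subtract the row max first for numerical stability
    sim = sim - sim.max(axis=1, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Under this formulation, perfectly aligned modalities drive the loss toward zero, while mismatched features at the same location are penalised, which matches the stated goal of pulling same-region pixels together and pushing different locations apart.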
Visual question answering is a cross-modality task that requires simultaneously understanding multi-modality inputs and then reasoning to provide a correct answer. A number of creative works have been carried out in this field, but most of them are not truly end-to-end systems, as the commonly used fine-tuning technique for the convolutional neural network module tends to trap the system in a local optimum. This problem is concealed in most current works, which, limited by computing ability, use precomputed image features only. The development of graphics processing units offers an opportunity to solve this problem. In this work, the challenge is analysed and an effective solution is proposed. Experiments on two public datasets demonstrate the effectiveness of the proposed method.