Hierarchical Feature Fusion for Cross-Modality Person Re-identification

    https://doi.org/10.1142/S0218001424570179

    Abstract

    To address the limitations of visible light cameras that cannot function effectively at night, infrared cameras have become the optimal supplement. However, current methods for visible–infrared cross-modality person re-identification focus solely on feature combination and fusion, neglecting the importance of feature alignment. To address this issue, we introduce a novel Hierarchical Feature Fusion (HFF) network, which comprehensively integrates features across various levels through sequential feature extraction. Specifically, we design a pixel-level contrastive loss function that makes pixels in the same region of cross-modality images more similar and distinguishes pixel features at different locations, thereby extracting similar low-frequency information in the shallow network. Furthermore, in the deep network, we extract high-frequency information of different modalities through the Bi-Transformer Layer and propose Node-level Coupling Attention and Modality-level Decoupling Attention. Coupling attention is used for high-frequency information coupling within the same modality while decoupling attention is used for high-frequency information decoupling between different modalities to obtain more texture and detail information. Through a series of experimental results, we validate the superiority of the proposed HFF network in cross-modality person re-identification. Our proposed method achieved 87.16% and 95.23% Rank-1 on the SYSU-MM01 and RegDB datasets, respectively, and extensive experiments have validated its effectiveness in feature alignment.

    1. Introduction

    Visible–Infrared cross-modality person re-identification (VI-Re-ID) remains a critical challenge in intelligent video surveillance systems, serving an essential role in applications like 24/7 monitoring and advanced security solutions.23,40 However, the accuracy of conventional VI-Re-ID approaches is severely hampered by variations between modalities and the complexity of the environment.2,34 These issues include changes in lighting conditions and feature differences between modalities,17 which, compounded by viewpoint changes and occlusions, further reduce the model’s performance.

    As a retrieval task, feature alignment in person re-identification (Re-ID) can effectively improve retrieval accuracy. However, VI-Re-ID is significantly different from visible-to-visible re-identification (VV-Re-ID). Due to the differences between visible and infrared cameras, the most important color features in visible images are not available in infrared images, which severely impacts the alignment and matching of features between different modalities. To address this issue, some methods stack Multi-Granularity (MG) features,39,44 which improves feature discrimination but also further increases the difficulty of feature alignment. Feature fusion methods,30,46 although reducing the difficulty of aligning features between different modalities, also narrow the differences among all features, reducing the model’s robustness to interference. Therefore, we believe that extracting modality-independent features is key to cross-modal retrieval. However, simply aligning all features is not feasible, as features at different levels represent distinct semantics. Thus, it is crucial to consider the connections between features at various levels. In both modalities, low-frequency information, such as background and large-scale environmental features, shares common elements, while high-frequency details are unique to each modality. For example, texture and detail features primarily characterize visible light images, while thermal radiation features define infrared images. Therefore, our goal is to enhance the correlation among low-frequency features while processing high-frequency features through different pathways. This approach aims to improve the representational capacity of intra-modality features and reduce the correlation between high-frequency features across modalities,11,12 thereby enhancing the extraction of cross-modal and shared features.

    To address this, we introduce the Hierarchical Feature Fusion (HFF) network, a novel VI-Re-ID model that enhances efficiency by deeply integrating features across multiple levels. In the shallow stage of the network, we adopt a shared ResNet-34 backbone to extract shallow features of different modalities. Based on this, we integrate the suggested Pixel-level Contrastive Loss (PCL) function to refine features at the pixel level. Specifically, the objective of this loss function is to reduce the distance between pixel features within the same region while increasing the distance between pixel features in different regions, thereby achieving alignment of shallow semantic information. In this way, we can effectively align the low-frequency information of different modalities, such as background and large-scale environmental features, in the initial stages of the network, laying the foundation for the model to further extract deeper features. After completing the extraction of shallow features, we input the extracted feature maps into the deep Bi-Transformer Layer to extract higher-level information. The Bi-Transformer Layer can capture high-frequency features in the images, such as texture and detail information. To fully utilize the features of different modalities, we design two attention mechanisms in the Bi-Transformer Layer: Node-level Coupling Attention (NCA) and Modality-level Decoupling Attention (MDA). The NCA is used for coupling local node features within the same modality. Specifically, this mechanism enhances the feature representation ability of each node by focusing on the relationships between nodes within the same modality, promoting information exchange between nodes. The MDA is used for decoupling global node features between modalities. In general, the high-frequency information between different modalities represents distinct semantics, but this does not imply that they can be arbitrarily combined and fused. To address this, the two modules we proposed further enhance the feature representation capability within each modality and reduce the interference between features from different modalities.

    Overall, our work contributes in three key aspects:

    (1)

    We propose a PCL function, which promotes the alignment and fusion of pixel-level low-frequency features between different modalities in the shallow network, guiding the model to extract modality-shared shallow features and providing support for the subsequent extraction of more robust deep features.

    (2)

    An NCA module is designed to calculate the coupling coefficients of nodes within the same modality, integrating the information of local nodes into the cls token to enhance the feature representation capability.

    (3)

    An MDA module is designed to calculate the decoupling coefficients of cls tokens between different modalities, reducing interference between modalities and thereby improving the distinguishability of cross-modal features.

    2. Related Work

    2.1. Visible–visible person re-identification

    The primary goal of traditional models is to create unique handcrafted characteristics, like color, texture, and certain regular patterns.47 Lately, VV-Re-ID models based on CNNs have reached unprecedented levels of performance. To be more precise, a subset of these models focuses on representation learning, trying to identify person-related characteristics that allow for individual differentiation. Jiang et al.13 introduced a channel attention model with global awareness that operates at multiple scales. To effectively address viewpoint variation and occlusion issues in real-world scenarios, the model employs a multiscale structural design, combining techniques for global structural information collection and feature enhancement. It gathers global information from images through a global perception module and enhances distinguishing features in images using an adaptive feature fusion module. Wang et al.33 believe that person Re-ID should not be limited to extracting features from unoccluded human body areas. They treat occluded pedestrian images as adversarial examples to train the model to resist such attacks, thereby enhancing its generalization ability. Further research focuses on metric learning, which seeks to learn an embedding space in which feature similarity across distinct identities is minimized and feature similarity within the same identity is maximized. For example, Qin et al.25 introduced an enhanced triplet loss function that categorizes the penalty metric space into a consistency space and an autonomy space. A shared penalty intensity in the consistency space keeps the model simple, while in the autonomy space the penalty is customized according to the sample distribution, leading to better embedding space learning and more effective person Re-ID. However, these methods do not perform well in cross-domain or domain adaptation tasks, primarily because domain shifts make the alignment of local features more challenging. To address this issue, Wang et al.31 proposed a method that leverages the differences between different body parts to aggregate local features of the same body part, achieving domain alignment at the body part level. In contrast, Li et al.16 transformed pedestrian attributes into latent attributes that do not contain domain information, serving as substitutes for pedestrian attributes to achieve alignment of the same attributes across multiple pedestrian images. Additionally, Li et al.15 ensured domain-invariant features at the camera level through adversarial learning between the feature extractor and camera classifier. By competing with classifiers integrated with identity and domain, they achieved joint alignment of identity and domain.

    2.2. Visible–infrared person re-identification

    Current VI-Re-ID models typically fall into two categories: those that emphasize modality-specific feature compensation and those that concentrate on learning shared features across modalities. The former tackles cross-modality differences by generating information for the missing modality using data from the available one. This synthesized information, along with the original data, is then employed to manage variations within the same modality. Du and Zhang5 introduced a feature confusion baseline designed to mitigate the influence of modality-specific features by blending visible light color and infrared spectrum. To enhance the model’s ability to compensate for missing information, they developed a similarity feature refinement module. This module models sample similarity as affinity, updates the original features accordingly, and utilizes intra-modality relationships to compensate for invariant information. Similar to this, Qi et al.24 proposed a contrastive-learning-based image generation network that uses adversarial generation techniques to create cross-modality paired images. This approach enables the generation of missing modality information from one modality, addressing inconsistencies in cross-modality data. They also designed a dual-modality feature fusion module that operates on local features to extract and integrate features from both modalities into a unified representation.

    Contrary to the methods previously discussed, models utilizing modality-shared feature learning strive to extract unique features common to multiple modalities. Semantic alignment feature learning and affinity inference modules are used in the framework that Fang et al.6 devised to learn modality-shared features. First, it achieves feature alignment across several modalities by aggregating possible semantic partial features through the use of pixel-level feature similarity with learnable prototypes. Then, it optimizes the relationship between pedestrian features through an affinity matrix to enhance the accuracy of feature matching, thus enabling the learning of modality-shared features. Huang et al.10 proposed a sub-network for feature extraction that utilizes a multi-level, two-stream modality-shared approach, which captures appearance features common to both modalities and relational features that remain invariant across modalities, operating within shared 2D and 3D feature spaces. To effectively learn these shared features and minimize variability both within and between modalities, the extracted features are fused and enhanced using a proposed cross-modality center alignment loss. Similar to this, Li et al.14 developed a correlation-guided semantic consistency network designed to detect and utilize relationships across different modalities. This network employs a cross-modality semantic alignment module to extract features that remain consistent regardless of the modality. Additionally, it incorporates a cross-granularity discrepancy awareness module to maintain unique attributes through intra-modality correlations, and a probability consistency constraint module that reduces modality-specific differences at the probabilistic level. Together, these components enhance the network’s ability to learn common features across diverse modalities.

    Compared to the existing methods, our Hierarchical Feature Fusion Network (HFF), by introducing the PCL function and designing the NCA and MDA mechanisms, not only optimizes the fusion and alignment of low-frequency information but also effectively decouples high-frequency features between modalities. This overcomes the limitations of traditional methods in terms of generated information quality and consistency, resulting in improved robustness and recognition performance for cross-modality person Re-ID.

    3. Proposed Method

    In this section, we will provide an in-depth description of the proposed approach. We will start by outlining the structure of the model, as depicted in Fig. 1. Then, we will elaborate on the HFF network, including two key components: (1) the PCL function, which optimizes the pixel feature distances within and between regions during training, and (2) the NCA and MDA mechanisms, which enhance the model’s feature representation and robustness by coupling local features within modalities and decoupling global features between modalities.

    Fig. 1.

    Fig. 1. Framework diagram of HFF network.

    We input RGB image samples and infrared image samples into the ResNet-34 backbone network to extract low-level shared information. The input can be represented as input = [I_{IR}, I_{RGB}], with the image dimensions being [256, 128, 3], representing height, width, and the number of channels, respectively. It is important to note that since infrared images have only one channel, we replicate that channel along the channel dimension to expand the images to three channels.
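    The channel replication is a simple tensor operation. Below is a minimal sketch, assuming a channel-first PyTorch layout; the function name and shapes are illustrative, not taken from the paper:

```python
import torch

def ir_to_three_channels(ir: torch.Tensor) -> torch.Tensor:
    """Replicate a single-channel infrared image along the channel dimension.

    ir: [B, 1, H, W] tensor; returns a [B, 3, H, W] tensor that can be fed to
    the same ResNet-34 stem as the RGB images.
    """
    return ir.repeat(1, 3, 1, 1)  # copy the single channel three times
```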

    Given the excellent performance of Vision Transformers in the image domain,32 we have chosen the Bi-Transformer Layer as the feature extractor for the deep network. However, instead of simply concatenating Transformer Layers for different modalities, we have introduced two unique attention mechanism modules between the Transformer Layers. These modules not only enhance high-frequency information within each modality but also ensure that high-frequency information between modalities remains independent, thereby promoting richer and deeper feature representations.

    Specifically, the features FIR and FRGB output by the ResNet-34 network have dimensions of [8,4,512]. We first treat each pixel as a patch, with each patch having dimensions of [1,1,512]. Thus, we obtain a total of 32 patches. Through Linear Projection, the feature maps for each modality are converted into 32 tokens, with each token having dimensions of [1,1,768]. It is important to note that pixels at the same position in different modalities share the same position embedding to maintain spatial consistency of the features. Next, all tokens are input into an L-layer Bi-Transformer Layer. Within each modality, the tokens undergo NCA, which enhances the feature representation ability of each node and effectively integrates high-frequency information within the modality. Tokens between modalities are processed through MDA, which reduces mutual interference of high-frequency features between different modalities, promoting the model to learn richer features. Finally, the decoupled cls tokens are concatenated to serve as representative features for retrieval.
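    As a concrete illustration of this projection step, the following sketch turns the [8, 4, 512] feature map into 768-dimensional tokens with a position embedding shared between modalities. The class name, initialization, and batching are assumptions for illustration, not the authors' code:

```python
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Treat each pixel of an [8, 4, 512] CNN feature map as a patch, project it
    to a 768-d token, and add position embeddings shared across modalities."""

    def __init__(self, in_dim: int = 512, embed_dim: int = 768, num_patches: int = 32):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)                  # linear projection per pixel/patch
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # one position table shared by the IR and RGB branches (spatial consistency)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: [B, 512, 8, 4] from the ResNet-34 backbone
        B, C, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)                  # [B, 32, 512]
        tokens = self.proj(tokens)                                # [B, 32, 768]
        cls = self.cls_token.expand(B, -1, -1)                    # [B, 1, 768]
        tokens = torch.cat([cls, tokens], dim=1)                  # [B, 33, 768]
        return tokens + self.pos_embed                            # shared position embedding
```

    Applying the same module instance to both F_{IR} and F_{RGB} ensures that tokens at the same spatial location receive the same position embedding.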

    3.1. Pixel-level contrastive loss

    To enhance the distinguishability of different identity samples in cross-modality person Re-ID tasks, we have specifically designed a novel loss function, named PCL function. The loss function guides the model to learn and align pixel-level modality-invariant features by reducing the distance between pixel features within the same region and increasing the distance between pixel features in different regions. First, we extract features of different modalities from ResNet-34, as shown in the following equation :

    [F_{IR}, F_{RGB}] = R_{34}(\mathrm{input}_{IR}, \mathrm{input}_{RGB}),   (1)

    where F_{IR} and F_{RGB} represent the extracted features of infrared and visible images, respectively, R_{34} denotes the ResNet-34 backbone network, and \mathrm{input}_{IR} and \mathrm{input}_{RGB} denote the input infrared and visible images, respectively. After feature extraction, we calculate the pixel-level feature distances between different modalities using dot products, and then exponentiate to amplify the differences in feature similarity, as shown in the following equation:

    \mathrm{dist} = \exp(\langle q, p \rangle),   (2)

    where q denotes the pixel points in F_{IR} and p denotes the pixel points in F_{RGB}. Our goal is to enhance the feature similarity in corresponding regions of different modality images while reducing the similarity between features in different regions. To achieve this, we use the loss function defined in Eq. (3) to optimize the features:

    \mathrm{Loss}_{pixel} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\langle q_i, p_i \rangle / \varepsilon)}{\exp(\langle q_i, p_i \rangle / \varepsilon) + \sum_{j=1, j \neq i}^{N} \exp(\langle q_i, p_j \rangle)},   (3)

    where ε represents the scaling factor, which adjusts the influence of the feature similarity in corresponding regions on the model, and N denotes the number of pixel points in the feature map. In each iteration, our network is optimized by leveraging the distances between features from two different modalities corresponding to the same identity. The introduced PCL function compels the network to emphasize features that are shared across modalities for samples of the same identity. This approach enhances both the clarity of features and the precision of their alignment.
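    For concreteness, here is a minimal sketch of Eq. (3) under the reading above (dot-product similarity, with ε applied only to the matching pair); tensor shapes and the function name are illustrative assumptions:

```python
import torch

def pixel_contrastive_loss(f_ir: torch.Tensor, f_rgb: torch.Tensor,
                           eps: float = 0.75) -> torch.Tensor:
    """Sketch of the pixel-level contrastive loss of Eq. (3).

    f_ir, f_rgb: [N, C] pixel features of the same identity, where row i of
    each tensor comes from the same spatial location (N = 8 * 4 = 32 here).
    """
    sim = f_ir @ f_rgb.t()                                    # [N, N] dot products <q_i, p_j>
    pos = torch.exp(sim.diag() / eps)                         # matching locations, scaled by epsilon
    neg = torch.exp(sim).sum(dim=1) - torch.exp(sim.diag())   # non-matching locations
    loss = -torch.log(pos / (pos + neg))                      # contrast each pixel against the rest
    return loss.mean()
```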

    3.2. Node-level coupling attention

    The core idea of NCA is based on the attention mechanism. It calculates the coupling coefficients between local feature nodes within the same modality and aggregates the information from these local feature nodes into the global feature node. This process focuses on the relationships among nodes within the modality, thereby facilitating effective information exchange. The process of NCA is illustrated in Fig. 2.

    Fig. 2.

    Fig. 2. Flowchart of node-level coupling attention and modality-level decoupling attention.

    We use the ResNet-34 network to extract features FIR and FRGB from images of different modalities. These features are processed through Linear Projection and combined with the shared position embeddings at corresponding locations. They are then input into the Bi-Transformer Layer, generating tokens tIR and tRGB corresponding to the infrared and visible modalities, respectively.

    For infrared images, we first reduce the dimension of the tokens t^{IR} using a weight-shared mapping \phi \in R^{768\times256}, and similarly reduce the dimension of t_{cls}^{IR} using \varphi \in R^{768\times256}. Next, we define a coupling factor \theta_{\phi}, which calculates the coupling relationships between different tokens using Eq. (4) to quantify the interaction between the target token t_i^{IR} and the other tokens:

    \theta_{\phi}^{i} = \frac{1}{|\mathcal{V}_{\phi}|} \sum_{j \in \mathcal{V}_{\phi}} q_{\phi}^{T} \tanh\big(W_{\phi}[t_i^{IR} \,\|\, t_j^{IR}] + b_{\phi}\big),   (4)

    where \mathcal{V}_{\phi} represents the set of tokens t^{IR}, and W_{\phi} \in R^{256\times512}, q_{\phi} \in R^{256\times1}, and b_{\phi} \in R^{256\times1} are the trainable parameter matrices and bias, respectively. After calculating the coupling factor \theta_{\phi} between the target token t_i^{IR} and the other tokens, we compute the coupling coefficient \alpha_i using the following equation:

    \alpha_i = \frac{\exp(\theta_{\phi}^{i})}{\sum_{j \in \mathcal{V}_{\phi}} \exp(\theta_{\phi}^{j})}.   (5)

    This step ensures that the model can adjust the weights between different tokens based on the value of the coupling factor \theta_{\phi}. Finally, through Eq. (6), we multiply the coupling coefficients \alpha_i with the corresponding tokens t_i^{IR}, aggregating all intra-modality information into the cls token t_{cls}^{IR}:

    \bar{t}_{cls}^{IR} = t_{cls}^{IR} + \sum_{i \in \mathcal{V}_{\phi}} \alpha_i t_i^{IR}.   (6)

    For visible light images, we use the same procedure to aggregate all intra-modality information into the cls token \bar{t}_{cls}^{RGB}.
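    To make the computation in Eqs. (4)–(6) concrete, here is an illustrative PyTorch sketch of NCA operating on the dimension-reduced tokens of one modality; the class name, batching, and initialization are assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class NodeLevelCouplingAttention(nn.Module):
    """Sketch of Node-level Coupling Attention, Eqs. (4)-(6)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.W = nn.Linear(2 * dim, dim, bias=True)   # W_phi in R^{256x512} with bias b_phi
        self.q = nn.Parameter(torch.randn(dim))       # q_phi in R^{256}

    def forward(self, cls_tok: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # cls_tok: [B, 256]; tokens: [B, N, 256] reduced-dimension patch tokens of one modality
        B, N, D = tokens.shape
        # pair every target token t_i with every other token t_j: [B, N, N, 2D]
        ti = tokens.unsqueeze(2).expand(B, N, N, D)
        tj = tokens.unsqueeze(1).expand(B, N, N, D)
        pair = torch.cat([ti, tj], dim=-1)
        # Eq. (4): coupling factor theta_i, averaged over the token set V_phi
        theta = torch.tanh(self.W(pair)) @ self.q     # [B, N, N]
        theta = theta.mean(dim=2)                     # [B, N]
        # Eq. (5): softmax over tokens gives the coupling coefficients alpha_i
        alpha = torch.softmax(theta, dim=1)           # [B, N]
        # Eq. (6): aggregate the weighted tokens into the cls token
        return cls_tok + (alpha.unsqueeze(-1) * tokens).sum(dim=1)
```

    The visible-light branch applies the same computation to obtain its own aggregated cls token.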

    3.3. Modality-level decoupling attention

    While the representation capability of intra-modality features is important, reducing feature interference between modalities is equally crucial. MDA is also based on the attention mechanism, but it operates across different modalities by calculating decoupling features and decoupling coefficients between global feature nodes. This helps reduce interference from high-frequency features between modalities, allowing each modality to independently learn richer feature representations, thereby enhancing the robustness and distinctiveness of cross-modal feature representations.

    Specifically, there is often redundancy or conflict of high-frequency features between the tokens \bar{t}_{cls}^{IR} and \bar{t}_{cls}^{RGB} representing the different modalities. We define a decoupling factor \theta_{\varphi} and a decoupling feature \tau to quantify the semantic difference and the degree of decoupling. The calculation process is shown in the following equations:

    \theta_{\varphi} = [\gamma_{IR}, \gamma_{RGB}] = q_{\varphi}^{T}\,[\bar{t}_{cls}^{IR} \,\|\, \bar{t}_{cls}^{RGB}],   (7)

    \tau = \tanh\big(W_{\varphi}[\bar{t}_{cls}^{IR} \,\|\, \bar{t}_{cls}^{RGB}] + b_{\varphi}\big),   (8)

    where W_{\varphi} \in R^{256\times512}, q_{\varphi} \in R^{512\times2}, and b_{\varphi} \in R^{256\times1} are the trainable parameter matrices and bias, respectively. \gamma_{IR} and \gamma_{RGB} represent the decoupling coefficients for the infrared image cls token and the visible light image cls token, respectively.

    The decoupling factor \theta_{\varphi} helps the model understand the independence between features of different modalities, thereby reducing their mutual interference and allowing each modality to independently learn richer feature information. The decoupling feature \tau is a feature representation related to the decoupling factor \theta_{\varphi}, used to adjust the meaning of features between different modalities to reduce interference from high-frequency features. Based on this, the following equations further optimize the cls tokens of the two modalities:

    \hat{t}_{cls}^{IR} = \bar{t}_{cls}^{IR} - \gamma_{IR}\,\tau,   (9)

    \hat{t}_{cls}^{RGB} = \bar{t}_{cls}^{RGB} - \gamma_{RGB}\,\tau.   (10)

    Finally, all dimension-reduced tokens \{t_i^{IR}, t_i^{RGB}, \hat{t}_{cls}^{IR}, \hat{t}_{cls}^{RGB}\} \in R^{1\times256} are elevated in dimension using \{\Phi, \Psi\} \in R^{256\times512} and then passed to the next Bi-Transformer Layer.
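    Below is an illustrative sketch of Eqs. (7)–(10); the subtraction used to remove the weighted decoupling feature reflects our reading of the equations, and the class and parameter names are assumptions:

```python
import torch
import torch.nn as nn

class ModalityLevelDecouplingAttention(nn.Module):
    """Sketch of Modality-level Decoupling Attention, Eqs. (7)-(10),
    assuming the cls tokens have already been reduced to 256 dimensions."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(2 * dim, 2, bias=False)    # q_varphi in R^{512x2}
        self.W = nn.Linear(2 * dim, dim, bias=True)   # W_varphi in R^{256x512} with bias b_varphi

    def forward(self, cls_ir: torch.Tensor, cls_rgb: torch.Tensor):
        # cls_ir, cls_rgb: [B, 256] aggregated cls tokens of the two modalities
        joint = torch.cat([cls_ir, cls_rgb], dim=-1)  # [B, 512]
        gamma = self.q(joint)                         # Eq. (7): [gamma_IR, gamma_RGB]
        tau = torch.tanh(self.W(joint))               # Eq. (8): decoupling feature tau, [B, 256]
        # Eqs. (9)-(10): remove the weighted decoupling feature from each modality's cls token
        cls_ir_hat = cls_ir - gamma[:, 0:1] * tau
        cls_rgb_hat = cls_rgb - gamma[:, 1:2] * tau
        return cls_ir_hat, cls_rgb_hat
```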

    3.4. Loss function

    In the training phase, we employ three distinct loss functions: cross-entropy loss,38 triplet loss,9 and a novel PCL that we have introduced. The calculation process of cross-entropy loss is shown in the following equation :

    L_{CE} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{M} y_{ij} \log(p_{ij}),   (11)

    where B signifies the batch size, y_{ij} refers to the actual label assigned to each sample, M indicates the total count of unique identities in the training dataset, and p_{ij} denotes the likelihood that sample i is associated with identity j. The calculation process of the triplet loss is shown in the following equation:

    L_{Tri} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[ m + \max_{p=1,\ldots,K} Dis(f_a^i, f_p^i) - \min_{\substack{n=1,\ldots,K \\ j \neq i}} Dis(f_a^i, f_n^j) \Big]_{+},   (12)

    where P is the number of identities in the batch, K is the number of samples per identity, m is the margin, Dis(f_a, f_n) denotes the distance to the hardest negative sample, f_a is the anchor feature, f_p is the positive sample feature with the same identity as f_a, and f_n is the negative sample feature with a different identity in the batch.
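    A compact batch-hard formulation of the triplet term in Eq. (12) might look as follows; the margin value and the Euclidean distance are common choices, not settings reported in the paper:

```python
import torch

def batch_hard_triplet_loss(feats: torch.Tensor, labels: torch.Tensor,
                            margin: float = 0.3) -> torch.Tensor:
    """Minimal batch-hard triplet loss sketch in the spirit of Eq. (12).

    feats: [B, D] features; labels: [B] identity labels.
    """
    dist = torch.cdist(feats, feats)                        # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # [B, B] same-identity mask
    hardest_pos = (dist * same.float()).max(dim=1).values   # furthest positive per anchor
    dist_neg = dist.masked_fill(same, float("inf"))         # ignore same-identity pairs
    hardest_neg = dist_neg.min(dim=1).values                # closest negative per anchor
    return torch.clamp(margin + hardest_pos - hardest_neg, min=0).mean()
```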

    The overall loss function is given by

    L = \lambda_{CE} L_{CE} + \lambda_{Tri} L_{Tri} + \lambda_{Pixel} L_{Pixel},   (13)

    where \lambda_{CE}, \lambda_{Tri}, and \lambda_{Pixel} are the weighting coefficients for each loss function, respectively.
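    As a simple sketch, the weighted combination in Eq. (13) can be written directly, with default weights taken from the settings reported in Sec. 4.3:

```python
import torch

def total_loss(loss_ce: torch.Tensor, loss_tri: torch.Tensor, loss_pixel: torch.Tensor,
               lambda_ce: float = 0.5, lambda_tri: float = 0.5,
               lambda_pixel: float = 1.0) -> torch.Tensor:
    """Weighted combination of the three training losses (Eq. (13))."""
    return lambda_ce * loss_ce + lambda_tri * loss_tri + lambda_pixel * loss_pixel
```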

    4. Experiment and Analysis

    4.1. Dataset

    We test the efficacy of the suggested model using the RegDB22 and SYSU-MM0135 datasets. The SYSU-MM01 dataset comprises images captured in various environments, both indoors and outdoors, using a combination of two near-infrared and four visible-light cameras. It encompasses a total of 491 unique identities, represented by 15 014 near-infrared and 30 071 visible light photographs. For training purposes, the dataset includes 22 258 visible light images and 11 909 near-infrared images. The testing set is drawn from the remaining identities: the query and gallery sets consist of 3803 near-infrared images and 301 visible light images, featuring 96 randomly selected identities. The RegDB dataset utilizes a pair of cameras, consisting of one visible light camera and one infrared camera, to capture a total of 8240 images. These images depict 412 unique individuals, with each identity represented by ten photographs from the visible light camera and an equal number from the infrared camera. The training and testing sets are randomly and evenly split from the dataset. Following the conventional experimental procedure, we evaluate the model in both the visible-to-infrared and infrared-to-visible settings by alternating which modality makes up the gallery set.

    4.2. Evaluation metrics

    To evaluate the efficacy of the proposed approach and other methodologies, we utilize Cumulative Matching Characteristics (CMC) and Mean Average Precision (mAP) metrics on standard datasets, adhering to the protocols established for the Re-ID task. The accuracy of the retrieval process is indicated by the CMC score, calculated using the Euclidean distance metric. The CMC Rank-N specifically highlights the top N matches of an individual’s identity that are most similar to the query image, as captured by different cameras.
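    For reference, a toy computation of Rank-k and mAP from a query-by-gallery distance matrix is sketched below; it ignores the camera-based filtering rules of the official evaluation protocols:

```python
import numpy as np

def rank_k_and_map(dist: np.ndarray, q_ids: np.ndarray, g_ids: np.ndarray, k: int = 1):
    """Toy CMC Rank-k and mAP from a [num_query, num_gallery] distance matrix."""
    order = np.argsort(dist, axis=1)                  # gallery indices sorted by distance
    matches = g_ids[order] == q_ids[:, None]          # True where the identity matches
    rank_k = matches[:, :k].any(axis=1).mean()        # fraction of queries hit within top-k
    aps = []
    for row in matches:
        hits = np.where(row)[0]                       # ranks (0-indexed) of correct matches
        if hits.size == 0:
            continue
        precision = np.arange(1, hits.size + 1) / (hits + 1)  # precision at each hit
        aps.append(precision.mean())
    return rank_k, float(np.mean(aps))
```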

    4.3. Implementation details

    Our proposed technique is implemented within the PyTorch environment and leverages the computational power of an Nvidia RTX 4090 GPU for training. The ResNet-348 and Vision Transformer4 models, which have been pre-trained on ImageNet,3 are utilized with input dimensions adjusted to 288 by 144 pixels. Each batch read from the dataset contains four randomly selected identities, with each identity having eight visible images and eight infrared images. The model optimization is conducted via Stochastic Gradient Descent (SGD) over eight epochs, beginning with a learning rate set at 0.01 and reducing it by an order of magnitude after every subsequent 10 epochs. The scaling factor ε is set to 0.75, and the loss function weights λCE, λTri, and λPixel are set to 0.5, 0.5, and 1, respectively. The Bi-Transformer Layer is configured with eight layers.
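    The optimizer and schedule described above can be configured as in the following sketch; the momentum and weight-decay values are common defaults rather than settings reported in the paper, and the placeholder model only serves to keep the snippet self-contained:

```python
import torch
import torch.nn as nn

# SGD with an initial learning rate of 0.01, reduced by an order of magnitude
# every 10 epochs via a step scheduler.
model = nn.Linear(512, 512)  # placeholder module standing in for the HFF network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)  # assumed defaults
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Inside the training loop: step the optimizer per batch and the scheduler per epoch,
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
#   ... after each epoch: scheduler.step()
```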

    4.4. Experiment

    4.4.1. Ablation experiment

    To verify the effectiveness of the proposed PCL function, NCA, and MDA, we tested these components on the SYSU-MM01 dataset. The results are shown in Table 1.

    Table 1. Performance comparison (%) of the proposed HFF network and the Baseline, showing the impact of PCL, NCA, and MDA on model performance.

    Method   | NCA | MDA | PCL | All Search Rank-1 | All Search mAP | Indoor Search Rank-1 | Indoor Search mAP
    Baseline | ×   | ×   | ×   | 65.24 | 61.92 | 78.52 | 66.18
             | ×   | ×   | ✓   | 69.31 | 67.17 | 83.26 | 78.81
             | ✓   | ×   | ×   | 67.26 | 62.82 | 80.46 | 71.50
             | ×   | ✓   | ×   | 68.38 | 64.47 | 81.83 | 73.26
             | ✓   | ✓   | ×   | 72.40 | 69.91 | 83.19 | 80.55
    HFF      | ✓   | ✓   | ✓   | 76.32 | 73.27 | 87.16 | 86.73

    Through the experimental results shown in Table 1, we can clearly observe the specific impacts of different components on the model’s performance. In the Baseline model, all performance metrics are at their lowest, indicating the limitations of lacking effective feature extraction and similarity constraints. After introducing the PCL function, the model’s Rank-1 accuracy in All Search increased from 65.24% to 69.31%, and the mAP improved from 61.92% to 67.17%. This indicates that PCL can effectively enhance the alignment ability of cross-modal features and improve the model’s discrimination of fine-grained features.

    However, relying solely on NCA or MDA did not yield significant performance improvements. Specifically, when using NCA, the Rank-1 was 67.26% and the mAP was 62.82%, showing only a slight increase over the Baseline. This suggests that the node coupling alone does not sufficiently address the feature inconsistencies between modalities. In comparison, the performance of MDA alone was slightly better, with a Rank-1 of 68.38% and an mAP of 64.47%. This implies that MDA has certain advantages in reducing interference between modalities, but it is still insufficient to independently tackle the complexities of cross-modal feature integration.

    The combination of NCA and MDA showed significant improvement, with Rank-1 rising to 72.40% and mAP reaching 69.91%. This further emphasizes the complementarity of these two attention mechanisms; NCA enhances feature coupling within the same modality, while MDA ensures the independence of cross-modal features, thereby improving the overall model performance.

    Finally, when utilizing all three components (HFF), the Rank-1 accuracy surged to 76.32% and the mAP reached 73.27%. This result not only reflects the synergistic effect of PCL, NCA, and MDA, but also highlights the importance of fine-grained feature learning and interference reduction between modalities in cross-modal person Re-ID tasks. Overall, the model incorporating PCL, NCA, and MDA shows an improvement of approximately 11.08% in Rank-1 and 11.35% in mAP compared to the Baseline, demonstrating the significant effectiveness of the proposed method in feature extraction and integration.

    4.4.2. Model analysis

    To further investigate the impact of hyperparameter settings on model performance, we tested the scaling factor ε, the loss function allocation coefficients λCE, λTri, and λPixel, and the number of Bi-Transformer Layers L. The scaling factor ε is used to control the amplification of feature similarity differences, thereby affecting the model’s sensitivity to feature similarity. We tested different values of the scaling factor ε, and the results are shown in Fig. 3.

    Fig. 3.

    Fig. 3. Impact of scaling factor ε on model performance (%).

    It can be observed that when ε=0.75, the model achieves the best performance on both Rank-1 and mAP metrics. This indicates that the model has good sensitivity to the differences in feature similarity, allowing it to more accurately distinguish samples of different identities. When the scaling factor ε is too small, the differences in feature similarity are excessively amplified, causing the model to become overly sensitive and treat noise and minor irrelevant features as important features, thereby affecting overall performance. Conversely, when the scaling factor ε is too large, the differences in feature similarity are weakened, making it difficult for the model to effectively distinguish between samples of different identities. This insufficient differentiation in feature similarity leads to challenges in accurate matching and recognition.

    We train the model utilizing three distinct loss functions: cross-entropy loss to enhance prediction accuracy, triplet loss to cluster features of the same identity, and PCL to achieve feature alignment and differentiation. To explore the influence of each loss function on the model’s performance, we modified the coefficients of these loss functions, with the experimental results presented in Table 2.

    Table 2. Impact of loss function coefficients on model performance (%).

    λ_CE | λ_Tri | λ_Pixel | All Search Rank-1 | All Search mAP | Indoor Search Rank-1 | Indoor Search mAP
    1    | 1     | 1       | 75.83 | 72.29 | 85.91 | 83.65
    0.5  | 0.5   | 1       | 76.32 | 73.27 | 87.16 | 86.73
    0.25 | 0.25  | 1       | 74.25 | 70.65 | 83.86 | 80.95
    1    | 1     | 0.5     | 74.38 | 71.60 | 85.59 | 84.72
    1    | 1     | 0.25    | 73.54 | 70.98 | 84.67 | 81.61

    It is evident that the model performs best when the loss function allocation coefficients λCE, λTri, and λPixel are set to 0.5, 0.5, and 1, respectively. We believe that with this loss function coefficient allocation, PCL can better achieve feature alignment and differentiation in cross-modality person Re-ID tasks. When the proportion of the PCL coefficient is too large, the model may overly focus on pixel-level alignment, neglecting the relationships between global features, which can lead to reduced feature generalization ability and affect model performance. Conversely, when the proportion of the PCL coefficient is too small, the role of PCL in the loss function is weakened, failing to fully leverage its advantages in feature alignment and differentiation, resulting in decreased model performance.

    The number of layers L in the Bi-Transformer Layer determines the complexity and depth of feature extraction in the deep network. More layers mean that the model can capture higher-level features and more complex patterns, but it also increases the computational complexity and the risk of overfitting. To determine the optimal number of layers for the Bi-Transformer Layer, we conducted a series of experiments, and the results are shown in Fig. 4.

    Fig. 4.

    Fig. 4. The impact of the number of Bi-Transformer Layers on model performance (%).

    From Fig. 4, it can be observed that as the number of Bi-Transformer Layers increases, the overall performance of the model improves. This is because more layers imply a deeper network structure, capable of extracting higher-level and more abstract features. Each layer of the Bi-Transformer Layer processes and transforms the input features, and through multi-level gradual processing, the features are more fully integrated and refined, resulting in more discriminative final feature representations. Additionally, the Bi-Transformer Layer not only processes single-modal features but also facilitates information interaction between different modalities. More layers can better capture the correlations and complementary information between modalities, enhancing the accuracy of cross-modal matching. Although increasing the number of layers can improve model performance, it also brings increased computational complexity and potential overfitting risks. In practical applications, a balance must be found between performance improvement and computational resource constraints. Based on the experimental results, we chose 8 layers for the Bi-Transformer Layer, ensuring high model performance while avoiding excessive computational overhead and model complexity.

    To further showcase the advantages of the proposed Hierarchical Feature Fusion (HFF) network, we evaluated its performance against several state-of-the-art (SOTA) methods. The results of this comparison are presented in Tables 3 and 4, where the methods are organized into categories, including Loss-based, Attention-based, and MG-based approaches.

    Table 3. Performance (%) comparison of the proposed HFF network and other SOTA methods on the SYSU-MM01 dataset.

    Category        | Method     | All Search Rank-1 | All Search mAP | Indoor Search Rank-1 | Indoor Search mAP
    Loss-Based      | DCLNet28   | 70.79 | 65.18 | 73.51 | 76.80
                    | LCCRF7     | 43.23 | 43.09 | 50.07 | 58.88
                    | DEEN45     | 74.70 | 71.80 | 80.30 | 83.30
                    | MTMFE10    | 69.47 | 66.41 | 71.72 | 76.38
                    | G2DA29     | 63.94 | 60.73 | 71.06 | 76.01
                    | MAUM18     | 71.68 | 68.79 | 76.97 | 81.94
                    | DART36     | 68.72 | 66.29 | 72.52 | 78.17
                    | FMCNet43   | 66.34 | 62.51 | 68.15 | 74.09
                    | IMG19      | 69.79 | 51.01 | 78.14 | 65.51
                    | MG-DDCL27  | 76.05 | 57.70 | 85.29 | 73.06
    Attention-Based | AGCC41     | 75.91 | 72.96 | 79.34 | 84.62
                    | SPOT1      | 65.34 | 62.25 | 69.42 | 74.63
                    | MIA48      | 75.23 | 72.36 | 83.56 | 85.67
    MG-Based        | JMMRL21    | 71.27 | 68.11 | 77.64 | 81.06
                    | TOPLight42 | 66.76 | 64.01 | 72.89 | 76.70
    —               | Baseline   | 65.24 | 61.92 | 78.52 | 66.18
    —               | HFF (Ours) | 76.32 | 73.27 | 87.16 | 86.73

    Table 4. Performance (%) comparison of the proposed HFF network and other SOTA methods on the RegDB dataset.

    Category        | Method     | Visible to Infrared Rank-1 | Visible to Infrared mAP | Infrared to Visible Rank-1 | Infrared to Visible mAP
    Loss-Based      | DCLNet9    | 81.21 | 74.33 | 78.06 | 70.60
                    | LCCRF22    | 79.27 | 77.69 | 80.97 | 79.92
                    | DEEN35     | 91.60 | 85.10 | 89.50 | 83.40
                    | MTMFE8     | 85.04 | 82.52 | 81.11 | 79.59
                    | G2DA4      | 73.95 | 65.49 | 69.67 | 61.98
                    | MAUM3      | 87.87 | 85.09 | 86.95 | 84.34
                    | DART28     | 83.60 | 75.67 | 81.97 | 73.78
                    | FMCNet7    | 89.12 | 84.43 | 88.38 | 83.86
                    | IMG45      | 89.70 | 85.82 | 87.64 | 84.03
                    | MG-DDCL10  | 84.02 | 79.26 | 83.39 | 77.24
    Attention-Based | AGCC29     | 92.59 | 86.18 | 91.35 | 84.92
                    | SPOT18     | 80.35 | 72.46 | 79.37 | 72.26
                    | MIA36      | 92.75 | 85.40 | 91.82 | 84.64
    MG-Based        | JMMRL43    | 94.18 | 86.54 | 91.16 | 83.67
                    | TOPLight19 | 85.51 | 79.95 | 80.65 | 75.91
    —               | Baseline   | 87.29 | 86.07 | 83.62 | 77.83
    —               | HFF (Ours) | 95.23 | 88.05 | 93.26 | 86.27

    It’s evident that our proposed HFF network outperforms all the compared models on the SYSU-MM01 dataset, achieving the best performance. Among the models, the Loss-Based MG-DDCL’s performance is quite close to HFF. The MG-DDCL model excels due to its dynamic dual-task collaborative learning strategy, which effectively reduces background interference and enhances cross-modality person Re-ID performance. However, MG-DDCL does not fully utilize the complementary information between modalities, resulting in suboptimal feature alignment and separation, and its loss function design lacks specificity. In contrast, the proposed HFF, through the introduction of NCA, MDA, and PCL, fully leverages multi-modality information, optimizes feature alignment and separation, and significantly enhances feature representation ability and robustness.

    Similarly, our HFF network achieves the highest performance on the RegDB dataset. Among the compared models, the MG-Based JMMRL model’s performance is the closest to ours. The advantage of JMMRL lies in its MG feature extraction method, which captures more levels of feature information, resulting in high performance. However, JMMRL falls short in handling cross-modality feature alignment, failing to sufficiently reduce the differences between modalities. The HFF, by introducing PCL, enhances feature alignment capability and further optimizes feature representation through the combination of global and local features.

    Finally, to verify the complexity of our proposed model, we compared it with other models in terms of the number of parameters and computational complexity, as shown in Table 5. It can be seen that AGW is based on the ResNet-50 architecture and PMT is based on ViT, while our proposed HFF combines ResNet-34 and ViT, resulting in a significant increase in both parameter count and computational complexity. However, despite this increased complexity, HFF achieves a notable improvement in accuracy in the All Search mode of SYSU-MM01. To facilitate model deployment, we will further investigate model pruning in future work to reduce the number of parameters.

    Table 5. The proposed HFF compared with other models in terms of performance (%), model size, and computational cost.

    Method | Rank-1 | mAP   | FLOPs (G) | Parameters (M)
    AGW37  | 58.19  | 56.60 | 7.5       | 30.2
    PMT20  | 67.53  | 64.98 | 18.8      | 86.4
    HFF    | 76.32  | 73.27 | 25.2      | 116.4

    4.4.3. Visualization experiment

    To further validate the effectiveness of our proposed HFF network, we conducted visualization experiments from three aspects. First, we visualized the feature maps extracted by the Baseline and HFF networks using Gradient-weighted Class Activation Mapping.26 The comparison of the visualized results is shown in Fig. 5.

    Fig. 5.

    Fig. 5. Visual comparison of features extracted by HFF network and Baseline.

    According to the feature visualization comparison results in Fig. 5, the Baseline model’s extracted features exhibit two scenarios: sometimes they are dispersed across various small features, and other times they are concentrated on a few highly discriminative local features. In contrast, the features extracted by the HFF network provide a more comprehensive coverage of the entire body of the person, with a more uniform and holistic distribution. This indicates that HFF is more effective in capturing critical information across the whole body rather than relying solely on local features. We attribute this improvement to the proposed PCL, which allows the model to extract shared features between different modalities. Additionally, the proposed NCA and MDA mechanisms further enable the model to preserve unique information from each modality. In comparison, the Baseline model’s feature extraction is more susceptible to interference from local information, leading to features being either dispersed or overly concentrated in notable local areas.

    Then, to further verify whether the proposed PCL facilitates feature alignment, we visualized specific pixel points in the feature maps extracted by ResNet-34 before and after adding PCL. The visualization results are shown in Fig. 6.

    Fig. 6.

    Fig. 6. Visualization of some pixels in the feature map extracted by ResNet-34.

    It can be observed that the proposed PCL significantly improves feature alignment. Without using PCL, the model’s attention areas are scattered, and the features between different modalities are not fully aligned, showing noticeable differences in attention to the same local semantics. After incorporating PCL, the model’s attention to features becomes more focused, and the attention levels of semantic features in the same regions become increasingly similar. By enforcing pixel-level alignment of features from samples of the same identity across different modalities, PCL effectively reduces the feature discrepancies between modalities, thereby enhancing the model’s cross-modal feature alignment capability.

    Finally, we visualized the cross-modal person retrieval results and compared them with the retrieval results from the Baseline model. The comparison results are shown in Fig. 7.

    Fig. 7.

    Fig. 7. Top-5 visualization ranking of pedestrian retrieval of HFF network and Baseline.

    From the visual retrieval results in Fig. 7, it is evident that the Baseline model suffers from errors in matching, particularly for similar-looking individuals, indicating its reliance on local discriminative features and lack of global feature understanding. In contrast, the HFF model shows significant improvement, effectively extracting comprehensive full-body features of the person, reducing erroneous matches, and significantly enhancing the accuracy of cross-modal person Re-ID.

    5. Conclusion

    To address the issue of ineffective alignment caused by environmental complexities in cross-modal person Re-ID, we propose a novel HFF network. In the shallow layers of the network, we introduce a PCL function to achieve effective alignment of low-frequency information with the same semantics across different modalities. In the deeper layers, we design NCA and MDA mechanisms to enhance the representational capability of features within the same modality while reducing the interference between features from different modalities. Extensive experimental results show that HFF significantly improves performance on the SYSU-MM01 and RegDB datasets, providing new insights into addressing the alignment of features across different modal levels. Although our model effectively handles complex backgrounds and modality variations, the introduction of the hierarchical feature fusion mechanism does increase the complexity of the network. In the future, we plan to explore the feasibility of integrating this loss function into Vision Transformer and to simplify the network structure using a teacher-student model approach.

    Acknowledgments

    This research was supported by the project “Deep Analysis of Vocational Education Intelligent Classroom Behavior Based on Multimodal Data” (No. KJZD-K202303105).

    ORCID

    Wen Fu  https://orcid.org/0009-0008-0421-5558

    Monghao Lim  https://orcid.org/0009-0007-4255-6516

    Biography

    Wen Fu, a member of the Communist Party of China, is a Professor at Chongqing Polytechnic University of Electronic Technology. She holds a Master’s degree from Wuhan University and is currently a Visiting Scholar at Peking University. Her research interests include Artificial Intelligence & Big Data Software Technology and Management Science. Professor Wen has published 18 papers and four academic monographs. She is the Chief Editor of nine textbooks, a co-editor of seven textbooks, and holds four invention patents and nine utility model patents. Additionally, she has eight software copyrights.

    She led the construction of three national-level projects and five provincial and ministerial-level projects. She has also guided students to participate in the “Software Testing” competition of the National Vocational College Skills Competition, where they won the first prize. In the “Big Data Technology and Application” competition, her students won the second prize. Furthermore, she has received 28 provincial and ministerial-level industry competition awards.

    Monghao Lim graduated with a Bachelor’s degree in Computer Science and Information Systems from the National University of Singapore in 1989. He was awarded a Master of Business Administration (International Business) in 2007 from the University of Wales, UK. In addition, he pursued vocational education, earning a Certificate IV in Vocational Education and Training (VET) in 2010 and a Graduate Diploma in VET from Charles Sturt University in 2012.

    With extensive experience in vocational education, Lim has delivered educational programs internationally across Singapore, Australia, and China. He is not only a practitioner in education delivery but also an expert in educational leadership, having managed foreign lecturers at five Chinese university colleges across China.

    Technically, Lim’s research interests include IT Networking and Cybersecurity. Instructionally, he specializes in working with students with special needs.