Hierarchical Feature Fusion for Cross-Modality Person Re-identification

    https://doi.org/10.1142/S0218001424570179

    Abstract

    To address the limitations of visible light cameras that cannot function effectively at night, infrared cameras have become the optimal supplement. However, current methods for visible–infrared cross-modality person re-identification focus solely on feature combination and fusion, neglecting the importance of feature alignment. To address this issue, we introduce a novel Hierarchical Feature Fusion (HFF) network, which comprehensively integrates features across various levels through sequential feature extraction. Specifically, we design a pixel-level contrastive loss function that makes pixels in the same region of cross-modality images more similar and distinguishes pixel features at different locations, thereby extracting similar low-frequency information in the shallow network. Furthermore, in the deep network, we extract high-frequency information of different modalities through the Bi-Transformer Layer and propose Node-level Coupling Attention and Modality-level Decoupling Attention. Coupling attention is used for high-frequency information coupling within the same modality while decoupling attention is used for high-frequency information decoupling between different modalities to obtain more texture and detail information. Through a series of experimental results, we validate the superiority of the proposed HFF network in cross-modality person re-identification. Our proposed method achieved 87.16% and 95.23% Rank-1 on the SYSU-MM01 and RegDB datasets, respectively, and extensive experiments have validated its effectiveness in feature alignment.

    1. Introduction

    Visible–Infrared cross-modality person re-identification (VI-Re-ID) remains a critical challenge in intelligent video surveillance systems, serving an essential role in applications like 24/7 monitoring and advanced security solutions.23,40 However, the accuracy of conventional VI-Re-ID approaches is severely hampered by variations between modalities and the complexity of the environment.2,34 These issues include changes in lighting conditions and feature differences between modalities,17 which, compounded by viewpoint changes and occlusions, further reduce the model’s performance.

    As a retrieval task, feature alignment in person re-identification (Re-ID) can effectively improve retrieval accuracy. However, VI-Re-ID is significantly different from visible-to-visible re-identification (VV-Re-ID). Due to the differences between visible and infrared cameras, the most important color features in visible images are not available in infrared images, which severely impacts the alignment and matching of features between different modalities. To address this issue, some methods stack Multi-Granularity (MG) features,39,44 which improves feature discrimination but also further increases the difficulty of feature alignment. Feature fusion methods,30,46 although reducing the difficulty of aligning features between different modalities, also narrow the differences among all features, reducing the model’s robustness to interference. Therefore, we believe that extracting modality-independent features is key to cross-modal retrieval. However, simply aligning all features is not feasible, as features at different levels represent distinct semantics. Thus, it is crucial to consider the connections between features at various levels. In both modalities, low-frequency information, such as background and large-scale environmental features, shares common elements, while high-frequency details are unique to each modality. For example, texture and detail features primarily characterize visible light images, while thermal radiation features define infrared images. Therefore, our goal is to enhance the correlation among low-frequency features while processing high-frequency features through different pathways. This approach aims to improve the representational capacity of intra-modality features and reduce the correlation between high-frequency features across modalities,11,12 thereby enhancing the extraction of cross-modal and shared features.

    To address this, we introduce the Hierarchical Feature Fusion (HFF) network, a novel VI-Re-ID model that enhances efficiency by deeply integrating features across multiple levels. In the shallow stage of the network, we adopt a shared ResNet-34 backbone to extract shallow features of different modalities. Based on this, we integrate the suggested Pixel-level Contrastive Loss (PCL) function to refine features at the pixel level. Specifically, the objective of this loss function is to reduce the distance between pixel features within the same region while increasing the distance between pixel features in different regions, thereby achieving alignment of shallow semantic information. In this way, we can effectively align the low-frequency information of different modalities, such as background and large-scale environmental features, in the initial stages of the network, laying the foundation for the model to further extract deeper features. After completing the extraction of shallow features, we input the extracted feature maps into the deep Bi-Transformer Layer to extract higher-level information. The Bi-Transformer Layer can capture high-frequency features in the images, such as texture and detail information. To fully utilize the features of different modalities, we design two attention mechanisms in the Bi-Transformer Layer: Node-level Coupling Attention (NCA) and Modality-level Decoupling Attention (MDA). The NCA is used for coupling local node features within the same modality. Specifically, this mechanism enhances the feature representation ability of each node by focusing on the relationships between nodes within the same modality, promoting information exchange between nodes. The MDA is used for decoupling global node features between modalities. In general, the high-frequency information between different modalities represents distinct semantics, but this does not imply that they can be arbitrarily combined and fused. To address this, the two modules we proposed further enhance the feature representation capability within each modality and reduce the interference between features from different modalities.

    Overall, our work contributes in three key aspects:

    (1)

    We propose a PCL function, which promotes the alignment and fusion of pixel-level low-frequency features between different modalities in the shallow network, guiding the model to extract modality-shared shallow features and providing support for the subsequent extraction of more robust deep features.

    (2)

    An NCA module is designed to calculate the coupling coefficients of nodes within the same modality, integrating the information of local nodes into the cls token to enhance the feature representation capability.

    (3)

    An MDA module is designed to calculate the decoupling coefficients of cls tokens between different modalities, reducing interference between modalities and thereby improving the distinguishability of cross-modal features.

    2. Related Work

    2.1. Visible–visible person re-identification

    The primary goal of traditional models is to create unique handcrafted characteristics, like color, texture, and certain regular patterns.47 Lately, VV-Re-ID models based on CNNs have reached unprecedented levels of performance. To be more precise, a subset of these models focuses on representation learning, trying to identify person-related characteristics that allow for individual differentiation. Jiang et al.13 introduced a channel attention model with global awareness that operates at multiple scales. To effectively address viewpoint variation and occlusion issues in real-world scenarios, the model employs a multiscale structural design, combining techniques for global structural information collection and feature enhancement. It gathers global information from images through a global perception module and enhances distinguishing features in images using an adaptive feature fusion module. Wang et al.33 believe that person Re-ID should not be limited to extracting features from unoccluded human body areas. They treat occluded pedestrian images as adversarial examples to train the model to resist such attacks, thereby enhancing its generalization ability. Further research focuses on metric learning, which seeks to learn an embedding space in which feature similarity across distinct identities is minimized and feature similarity within the same identity is maximized. For example, Qin et al.25 introduced an enhanced triplet loss function that categorizes the penalty metric space into a consistency space and an autonomy space. A shared penalty intensity in the consistency space keeps the model simple, while in the autonomy space the penalty is customized according to the sample distribution, leading to better embedding space learning and more effective person Re-ID. However, these methods do not perform well in cross-domain or domain adaptation tasks, primarily because domain shifts make the alignment of local features more challenging. To address this issue, Wang et al.31 proposed a method that leverages the differences between different body parts to aggregate local features of the same body part, achieving domain alignment at the body part level. In contrast, Li et al.16 transformed pedestrian attributes into latent attributes that do not contain domain information, serving as substitutes for pedestrian attributes to achieve alignment of the same attributes across multiple pedestrian images. Additionally, Li et al.15 ensured domain-invariant features at the camera level through adversarial learning between the feature extractor and camera classifier. By competing with classifiers integrated with identity and domain, they achieved joint alignment of identity and domain.

    2.2. Visible–infrared person re-identification

    Current VI-Re-ID models typically fall into two categories: those that emphasize modality-specific feature compensation and those that concentrate on learning shared features across modalities. The former tackles cross-modality differences by generating information for the missing modality using data from the available one. This synthesized information, along with the original data, is then employed to manage variations within the same modality. Du and Zhang5 introduced a feature confusion baseline designed to mitigate the influence of modality-specific features by blending visible light color and infrared spectrum. To enhance the model’s ability to compensate for missing information, they developed a similarity feature refinement module. This module models sample similarity as affinity, updates the original features accordingly, and utilizes intra-modality relationships to compensate for invariant information. Similar to this, Qi et al.24 proposed a contrastive-learning-based image generation network that uses adversarial generation techniques to create cross-modality paired images. This approach enables the generation of missing modality information from one modality, addressing inconsistencies in cross-modality data. They also designed a dual-modality feature fusion module that operates on local features to extract and integrate features from both modalities into a unified representation.

    Contrary to the methods previously discussed, models utilizing modality-shared feature learning strive to extract unique features common to multiple modalities. Semantic alignment feature learning and affinity inference modules are used in the framework that Fang et al.6 devised to learn modality-shared features. First, it achieves feature alignment across several modalities by aggregating possible semantic partial features through the use of pixel-level feature similarity with learnable prototypes. Then, it optimizes the relationship between pedestrian features through an affinity matrix to enhance the accuracy of feature matching, thus enabling the learning of modality-shared features. Huang et al.10 proposed a sub-network for feature extraction that utilizes a multi-level, two-stream modality-shared approach, which captures appearance features common to both modalities and relational features that remain invariant across modalities, operating within shared 2D and 3D feature spaces. To effectively learn these shared features and minimize variability both within and between modalities, the extracted features are fused and enhanced using a proposed cross-modality center alignment loss. Similar to this, Li et al.14 developed a correlation-guided semantic consistency network designed to detect and utilize relationships across different modalities. This network employs a cross-modality semantic alignment module to extract features that remain consistent regardless of the modality. Additionally, it incorporates a cross-granularity discrepancy awareness module to maintain unique attributes through intra-modality correlations, and a probability consistency constraint module that reduces modality-specific differences at the probabilistic level. Together, these components enhance the network’s ability to learn common features across diverse modalities.

    Compared to the existing methods, our Hierarchical Feature Fusion Network (HFF), by introducing the PCL function and designing the NCA and MDA mechanisms, not only optimizes the fusion and alignment of low-frequency information but also effectively decouples high-frequency features between modalities. This overcomes the limitations of traditional methods in terms of generated information quality and consistency, resulting in improved robustness and recognition performance for cross-modality person Re-ID.

    3. Proposed Method

    In this section, we will provide an in-depth description of the proposed approach. We will start by outlining the structure of the model, as depicted in Fig. 1. Then, we will elaborate on the HFF network, including two key components: (1) the PCL function, which optimizes the pixel feature distances within and between regions during training, and (2) the NCA and MDA mechanisms, which enhance the model’s feature representation and robustness by coupling local features within modalities and decoupling global features between modalities.

    Fig. 1.

    Fig. 1. Framework diagram of HFF network.

    We input RGB image samples and infrared image samples into the ResNet-34 backbone network to extract low-level shared information. The input can be represented as input = [I_{IR}, I_{RGB}], with the image dimensions being [256, 128, 3], representing height, width, and the number of channels, respectively. It is important to note that since infrared images have only one channel, we replicate that channel along the channel dimension to expand the images to three channels.
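    The channel replication is a simple tensor operation. Below is a minimal sketch, assuming a channel-first PyTorch layout; the function name and shapes are illustrative, not taken from the paper:

```python
import torch

def ir_to_three_channels(ir: torch.Tensor) -> torch.Tensor:
    """Replicate a single-channel infrared image along the channel dimension.

    ir: [B, 1, H, W] tensor; returns a [B, 3, H, W] tensor that can be fed to
    the same ResNet-34 stem as the RGB images.
    """
    return ir.repeat(1, 3, 1, 1)  # copy the single channel three times
```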

    Given the excellent performance of Vision Transformers in the image domain,32 we have chosen the Bi-Transformer Layer as the feature extractor for the deep network. However, instead of simply concatenating Transformer Layers for different modalities, we have introduced two unique attention mechanism modules between the Transformer Layers. These modules not only enhance high-frequency information within each modality but also ensure that high-frequency information between modalities remains independent, thereby promoting richer and deeper feature representations.

    Specifically, the features FIR and FRGB output by the ResNet-34 network have dimensions of [8,4,512]. We first treat each pixel as a patch, with each patch having dimensions of [1,1,512]. Thus, we obtain a total of 32 patches. Through Linear Projection, the feature maps for each modality are converted into 32 tokens, with each token having dimensions of [1,1,768]. It is important to note that pixels at the same position in different modalities share the same position embedding to maintain spatial consistency of the features. Next, all tokens are input into an L-layer Bi-Transformer Layer. Within each modality, the tokens undergo NCA, which enhances the feature representation ability of each node and effectively integrates high-frequency information within the modality. Tokens between modalities are processed through MDA, which reduces mutual interference of high-frequency features between different modalities, promoting the model to learn richer features. Finally, the decoupled cls tokens are concatenated to serve as representative features for retrieval.
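    As a concrete illustration of this projection step, the following sketch turns the [8, 4, 512] feature map into 768-dimensional tokens with a position embedding shared between modalities. The class name, initialization, and batching are assumptions for illustration, not the authors' code:

```python
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Treat each pixel of an [8, 4, 512] CNN feature map as a patch, project it
    to a 768-d token, and add position embeddings shared across modalities."""

    def __init__(self, in_dim: int = 512, embed_dim: int = 768, num_patches: int = 32):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)                  # linear projection per pixel/patch
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # one position table shared by the IR and RGB branches (spatial consistency)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: [B, 512, 8, 4] from the ResNet-34 backbone
        B, C, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)                  # [B, 32, 512]
        tokens = self.proj(tokens)                                # [B, 32, 768]
        cls = self.cls_token.expand(B, -1, -1)                    # [B, 1, 768]
        tokens = torch.cat([cls, tokens], dim=1)                  # [B, 33, 768]
        return tokens + self.pos_embed                            # shared position embedding
```

    Applying the same module instance to both F_{IR} and F_{RGB} ensures that tokens at the same spatial location receive the same position embedding.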

    3.1. Pixel-level contrastive loss

    To enhance the distinguishability of different identity samples in cross-modality person Re-ID tasks, we have specifically designed a novel loss function, named PCL function. The loss function guides the model to learn and align pixel-level modality-invariant features by reducing the distance between pixel features within the same region and increasing the distance between pixel features in different regions. First, we extract features of different modalities from ResNet-34, as shown in the following equation :

    [F_{IR}, F_{RGB}] = R_{34}(\mathrm{input}_{IR}, \mathrm{input}_{RGB}),   (1)

    where F_{IR} and F_{RGB} represent the extracted features of infrared and visible images, respectively, R_{34} denotes the ResNet-34 backbone network, and \mathrm{input}_{IR} and \mathrm{input}_{RGB} denote the input infrared and visible images, respectively. After feature extraction, we calculate the pixel-level feature distances between different modalities using dot products, and then exponentiate to amplify the differences in feature similarity, as shown in the following equation:

    \mathrm{dist} = \exp(\langle q, p \rangle),   (2)

    where q denotes the pixel points in F_{IR} and p denotes the pixel points in F_{RGB}. Our goal is to enhance the feature similarity in corresponding regions of different modality images while reducing the similarity between features in different regions. To achieve this, we use the loss function defined in Eq. (3) to optimize the features:

    \mathrm{Loss}_{pixel} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\langle q_i, p_i \rangle / \varepsilon)}{\exp(\langle q_i, p_i \rangle / \varepsilon) + \sum_{j=1, j \neq i}^{N} \exp(\langle q_i, p_j \rangle)},   (3)

    where ε represents the scaling factor, which adjusts the influence of the feature similarity in corresponding regions on the model, and N denotes the number of pixel points in the feature map. In each iteration, our network is optimized by leveraging the distances between features from two different modalities corresponding to the same identity. The introduced PCL function compels the network to emphasize features that are shared across modalities for samples of the same identity. This approach enhances both the clarity of features and the precision of their alignment.
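    For concreteness, here is a minimal sketch of Eq. (3) under the reading above (dot-product similarity, with ε applied only to the matching pair); tensor shapes and the function name are illustrative assumptions:

```python
import torch

def pixel_contrastive_loss(f_ir: torch.Tensor, f_rgb: torch.Tensor,
                           eps: float = 0.75) -> torch.Tensor:
    """Sketch of the pixel-level contrastive loss of Eq. (3).

    f_ir, f_rgb: [N, C] pixel features of the same identity, where row i of
    each tensor comes from the same spatial location (N = 8 * 4 = 32 here).
    """
    sim = f_ir @ f_rgb.t()                                    # [N, N] dot products <q_i, p_j>
    pos = torch.exp(sim.diag() / eps)                         # matching locations, scaled by epsilon
    neg = torch.exp(sim).sum(dim=1) - torch.exp(sim.diag())   # non-matching locations
    loss = -torch.log(pos / (pos + neg))                      # contrast each pixel against the rest
    return loss.mean()
```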

    3.2. Node-level coupling attention

    The core idea of NCA is based on the attention mechanism. It calculates the coupling coefficients between local feature nodes within the same modality and aggregates the information from these local feature nodes into the global feature node. This process focuses on the relationships among nodes within the modality, thereby facilitating effective information exchange. The process of NCA is illustrated in Fig. 2.

    Fig. 2.

    Fig. 2. Flowchart of node-level coupling attention and modality-level decoupling attention.

    We use the ResNet-34 network to extract features FIR and FRGB from images of different modalities. These features are processed through Linear Projection and combined with the shared position embeddings at corresponding locations. They are then input into the Bi-Transformer Layer, generating tokens tIR and tRGB corresponding to the infrared and visible modalities, respectively.

    For infrared images, we first reduce the dimension of the tokens t^{IR} using a weight-shared mapping \phi \in R^{768\times256}, and similarly reduce the dimension of t_{cls}^{IR} using \varphi \in R^{768\times256}. Next, we define a coupling factor \theta_{\phi}, which calculates the coupling relationships between different tokens using Eq. (4) to quantify the interaction between the target token t_i^{IR} and the other tokens:

    \theta_{\phi}^{i} = \frac{1}{|\mathcal{V}_{\phi}|} \sum_{j \in \mathcal{V}_{\phi}} q_{\phi}^{T} \tanh\big(W_{\phi}[t_i^{IR} \,\|\, t_j^{IR}] + b_{\phi}\big),   (4)

    where \mathcal{V}_{\phi} represents the set of tokens t^{IR}, and W_{\phi} \in R^{256\times512}, q_{\phi} \in R^{256\times1}, and b_{\phi} \in R^{256\times1} are the trainable parameter matrices and bias, respectively. After calculating the coupling factor \theta_{\phi} between the target token t_i^{IR} and the other tokens, we compute the coupling coefficient \alpha_i using the following equation:

    \alpha_i = \frac{\exp(\theta_{\phi}^{i})}{\sum_{j \in \mathcal{V}_{\phi}} \exp(\theta_{\phi}^{j})}.   (5)

    This step ensures that the model can adjust the weights between different tokens based on the value of the coupling factor \theta_{\phi}. Finally, through Eq. (6), we multiply the coupling coefficients \alpha_i with the corresponding tokens t_i^{IR}, aggregating all intra-modality information into the cls token t_{cls}^{IR}:

    \bar{t}_{cls}^{IR} = t_{cls}^{IR} + \sum_{i \in \mathcal{V}_{\phi}} \alpha_i t_i^{IR}.   (6)

    For visible light images, we use the same procedure to aggregate all intra-modality information into the cls token \bar{t}_{cls}^{RGB}.
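    To make the computation in Eqs. (4)–(6) concrete, here is an illustrative PyTorch sketch of NCA operating on the dimension-reduced tokens of one modality; the class name, batching, and initialization are assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class NodeLevelCouplingAttention(nn.Module):
    """Sketch of Node-level Coupling Attention, Eqs. (4)-(6)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.W = nn.Linear(2 * dim, dim, bias=True)   # W_phi in R^{256x512} with bias b_phi
        self.q = nn.Parameter(torch.randn(dim))       # q_phi in R^{256}

    def forward(self, cls_tok: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # cls_tok: [B, 256]; tokens: [B, N, 256] reduced-dimension patch tokens of one modality
        B, N, D = tokens.shape
        # pair every target token t_i with every other token t_j: [B, N, N, 2D]
        ti = tokens.unsqueeze(2).expand(B, N, N, D)
        tj = tokens.unsqueeze(1).expand(B, N, N, D)
        pair = torch.cat([ti, tj], dim=-1)
        # Eq. (4): coupling factor theta_i, averaged over the token set V_phi
        theta = torch.tanh(self.W(pair)) @ self.q     # [B, N, N]
        theta = theta.mean(dim=2)                     # [B, N]
        # Eq. (5): softmax over tokens gives the coupling coefficients alpha_i
        alpha = torch.softmax(theta, dim=1)           # [B, N]
        # Eq. (6): aggregate the weighted tokens into the cls token
        return cls_tok + (alpha.unsqueeze(-1) * tokens).sum(dim=1)
```

    The visible-light branch applies the same computation to obtain its own aggregated cls token.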

    3.3. Modality-level decoupling attention

    While the representation capability of intra-modality features is important, reducing feature interference between modalities is equally crucial. MDA is also based on the attention mechanism, but it operates across different modalities by calculating decoupling features and decoupling coefficients between global feature nodes. This helps reduce interference from high-frequency features between modalities, allowing each modality to independently learn richer feature representations, thereby enhancing the robustness and distinctiveness of cross-modal feature representations.

    Specifically, there is often redundancy or conflict of high-frequency features between the tokens \bar{t}_{cls}^{IR} and \bar{t}_{cls}^{RGB} representing the different modalities. We define a decoupling factor \theta_{\varphi} and a decoupling feature \tau to quantify the semantic difference and the degree of decoupling. The calculation process is shown in the following equations:

    \theta_{\varphi} = [\gamma_{IR}, \gamma_{RGB}] = q_{\varphi}^{T}\,[\bar{t}_{cls}^{IR} \,\|\, \bar{t}_{cls}^{RGB}],   (7)

    \tau = \tanh\big(W_{\varphi}[\bar{t}_{cls}^{IR} \,\|\, \bar{t}_{cls}^{RGB}] + b_{\varphi}\big),   (8)

    where W_{\varphi} \in R^{256\times512}, q_{\varphi} \in R^{512\times2}, and b_{\varphi} \in R^{256\times1} are the trainable parameter matrices and bias, respectively. \gamma_{IR} and \gamma_{RGB} represent the decoupling coefficients for the infrared image cls token and the visible light image cls token, respectively.

    The decoupling factor \theta_{\varphi} helps the model understand the independence between features of different modalities, thereby reducing their mutual interference and allowing each modality to independently learn richer feature information. The decoupling feature \tau is a feature representation related to the decoupling factor \theta_{\varphi}, used to adjust the meaning of features between different modalities to reduce interference from high-frequency features. Based on this, the following equations further optimize the cls tokens of the two modalities:

    \hat{t}_{cls}^{IR} = \bar{t}_{cls}^{IR} - \gamma_{IR}\,\tau,   (9)

    \hat{t}_{cls}^{RGB} = \bar{t}_{cls}^{RGB} - \gamma_{RGB}\,\tau.   (10)

    Finally, all dimension-reduced tokens \{t_i^{IR}, t_i^{RGB}, \hat{t}_{cls}^{IR}, \hat{t}_{cls}^{RGB}\} \in R^{1\times256} are elevated in dimension using \{\Phi, \Psi\} \in R^{256\times512} and then passed to the next Bi-Transformer Layer.
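    Below is an illustrative sketch of Eqs. (7)–(10); the subtraction used to remove the weighted decoupling feature reflects our reading of the equations, and the class and parameter names are assumptions:

```python
import torch
import torch.nn as nn

class ModalityLevelDecouplingAttention(nn.Module):
    """Sketch of Modality-level Decoupling Attention, Eqs. (7)-(10),
    assuming the cls tokens have already been reduced to 256 dimensions."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(2 * dim, 2, bias=False)    # q_varphi in R^{512x2}
        self.W = nn.Linear(2 * dim, dim, bias=True)   # W_varphi in R^{256x512} with bias b_varphi

    def forward(self, cls_ir: torch.Tensor, cls_rgb: torch.Tensor):
        # cls_ir, cls_rgb: [B, 256] aggregated cls tokens of the two modalities
        joint = torch.cat([cls_ir, cls_rgb], dim=-1)  # [B, 512]
        gamma = self.q(joint)                         # Eq. (7): [gamma_IR, gamma_RGB]
        tau = torch.tanh(self.W(joint))               # Eq. (8): decoupling feature tau, [B, 256]
        # Eqs. (9)-(10): remove the weighted decoupling feature from each modality's cls token
        cls_ir_hat = cls_ir - gamma[:, 0:1] * tau
        cls_rgb_hat = cls_rgb - gamma[:, 1:2] * tau
        return cls_ir_hat, cls_rgb_hat
```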

    3.4. Loss function

    In the training phase, we employ three distinct loss functions: cross-entropy loss,38 triplet loss,9 and a novel PCL that we have introduced. The calculation process of cross-entropy loss is shown in the following equation :

    L_{CE} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{M} y_{ij} \log(p_{ij}),   (11)

    where B signifies the batch size, y_{ij} refers to the actual label assigned to each sample, M indicates the total count of unique identities in the training dataset, and p_{ij} denotes the likelihood that sample i is associated with identity j. The calculation process of the triplet loss is shown in the following equation:

    L_{Tri} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[ m + \max_{p=1,\ldots,K} Dis(f_a^i, f_p^i) - \min_{\substack{n=1,\ldots,K \\ j \neq i}} Dis(f_a^i, f_n^j) \Big]_{+},   (12)

    where P is the number of identities in the batch, K is the number of samples per identity, m is the margin, Dis(f_a, f_n) denotes the distance to the hardest negative sample, f_a is the anchor feature, f_p is the positive sample feature with the same identity as f_a, and f_n is the negative sample feature with a different identity in the batch.
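    A compact batch-hard formulation of the triplet term in Eq. (12) might look as follows; the margin value and the Euclidean distance are common choices, not settings reported in the paper:

```python
import torch

def batch_hard_triplet_loss(feats: torch.Tensor, labels: torch.Tensor,
                            margin: float = 0.3) -> torch.Tensor:
    """Minimal batch-hard triplet loss sketch in the spirit of Eq. (12).

    feats: [B, D] features; labels: [B] identity labels.
    """
    dist = torch.cdist(feats, feats)                        # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # [B, B] same-identity mask
    hardest_pos = (dist * same.float()).max(dim=1).values   # furthest positive per anchor
    dist_neg = dist.masked_fill(same, float("inf"))         # ignore same-identity pairs
    hardest_neg = dist_neg.min(dim=1).values                # closest negative per anchor
    return torch.clamp(margin + hardest_pos - hardest_neg, min=0).mean()
```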

    The overall loss function is given by

    L = \lambda_{CE} L_{CE} + \lambda_{Tri} L_{Tri} + \lambda_{Pixel} L_{Pixel},   (13)

    where \lambda_{CE}, \lambda_{Tri}, and \lambda_{Pixel} are the weighting coefficients for each loss function, respectively.
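    As a simple sketch, the weighted combination in Eq. (13) can be written directly, with default weights taken from the settings reported in Sec. 4.3:

```python
import torch

def total_loss(loss_ce: torch.Tensor, loss_tri: torch.Tensor, loss_pixel: torch.Tensor,
               lambda_ce: float = 0.5, lambda_tri: float = 0.5,
               lambda_pixel: float = 1.0) -> torch.Tensor:
    """Weighted combination of the three training losses (Eq. (13))."""
    return lambda_ce * loss_ce + lambda_tri * loss_tri + lambda_pixel * loss_pixel
```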

    4. Experiment and Analysis

    4.1. Dataset

    We test the efficacy of the suggested model using the RegDB22 and SYSU-MM0135 datasets. The SYSU-MM01 dataset comprises images captured in various environments, both indoors and outdoors, using a combination of two near-infrared and four visible-light cameras. It encompasses a total of 491 unique identities, represented by 15 014 near-infrared and 30 071 visible light photographs. For training purposes, the dataset includes 22 258 visible light images and 11 909 near-infrared images. The testing set is drawn from the remaining identities: the query and gallery sets consist of 3803 near-infrared images and 301 visible light images, featuring 96 randomly selected identities. The RegDB dataset utilizes a pair of cameras, consisting of one visible light camera and one infrared camera, to capture a total of 8240 images. These images depict 412 unique individuals, with each identity represented by ten photographs from the visible light camera and an equal number from the infrared camera. The training and testing sets are randomly and evenly split from the dataset. Following the conventional experimental procedure, we evaluate the model in both the visible-to-infrared and infrared-to-visible settings by alternating which modality makes up the gallery set.

    4.2. Evaluation metrics

    To evaluate the efficacy of the proposed approach and other methodologies, we utilize Cumulative Matching Characteristics (CMC) and Mean Average Precision (mAP) metrics on standard datasets, adhering to the protocols established for the Re-ID task. The accuracy of the retrieval process is indicated by the CMC score, calculated using the Euclidean distance metric. The CMC Rank-N specifically highlights the top N matches of an individual’s identity that are most similar to the query image, as captured by different cameras.
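    For reference, a toy computation of Rank-k and mAP from a query-by-gallery distance matrix is sketched below; it ignores the camera-based filtering rules of the official evaluation protocols:

```python
import numpy as np

def rank_k_and_map(dist: np.ndarray, q_ids: np.ndarray, g_ids: np.ndarray, k: int = 1):
    """Toy CMC Rank-k and mAP from a [num_query, num_gallery] distance matrix."""
    order = np.argsort(dist, axis=1)                  # gallery indices sorted by distance
    matches = g_ids[order] == q_ids[:, None]          # True where the identity matches
    rank_k = matches[:, :k].any(axis=1).mean()        # fraction of queries hit within top-k
    aps = []
    for row in matches:
        hits = np.where(row)[0]                       # ranks (0-indexed) of correct matches
        if hits.size == 0:
            continue
        precision = np.arange(1, hits.size + 1) / (hits + 1)  # precision at each hit
        aps.append(precision.mean())
    return rank_k, float(np.mean(aps))
```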

    4.3. Implementation details

    Our proposed technique is implemented within the PyTorch environment and leverages the computational power of an Nvidia RTX 4090 GPU for training. The ResNet-348 and Vision Transformer4 models, which have been pre-trained on ImageNet,3 are utilized with input dimensions adjusted to 288 by 144 pixels. Each batch read from the dataset contains four randomly selected identities, with each identity having eight visible images and eight infrared images. The model optimization is conducted via Stochastic Gradient Descent (SGD) over eight epochs, beginning with a learning rate set at 0.01 and reducing it by an order of magnitude after every subsequent 10 epochs. The scaling factor ε is set to 0.75, and the loss function weights λCE, λTri, and λPixel are set to 0.5, 0.5, and 1, respectively. The Bi-Transformer Layer is configured with eight layers.
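    The optimizer and schedule described above can be configured as in the following sketch; the momentum and weight-decay values are common defaults rather than settings reported in the paper, and the placeholder model only serves to keep the snippet self-contained:

```python
import torch
import torch.nn as nn

# SGD with an initial learning rate of 0.01, reduced by an order of magnitude
# every 10 epochs via a step scheduler.
model = nn.Linear(512, 512)  # placeholder module standing in for the HFF network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)  # assumed defaults
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Inside the training loop: step the optimizer per batch and the scheduler per epoch,
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
#   ... after each epoch: scheduler.step()
```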

    4.4. Experiment

    4.4.1. Ablation experiment

    To verify the effectiveness of the proposed PCL function, NCA, and MDA, we tested these components on the SYSU-MM01 dataset. The results are shown in Table 1.

    Table 1. Performance comparison (%) of the proposed HFF network and the Baseline, showing the impact of PCL, NCA, and MDA on model performance.

    Method   | NCA | MDA | PCL | All Search Rank-1 | All Search mAP | Indoor Search Rank-1 | Indoor Search mAP
    Baseline | ×   | ×   | ×   | 65.24 | 61.92 | 78.52 | 66.18
             | ×   | ×   | ✓   | 69.31 | 67.17 | 83.26 | 78.81
             | ✓   | ×   | ×   | 67.26 | 62.82 | 80.46 | 71.50
             | ×   | ✓   | ×   | 68.38 | 64.47 | 81.83 | 73.26
             | ✓   | ✓   | ×   | 72.40 | 69.91 | 83.19 | 80.55
    HFF      | ✓   | ✓   | ✓   | 76.32 | 73.27 | 87.16 | 86.73

    Through the experimental results shown in Table 1, we can clearly observe the specific impacts of different components on the model’s performance. In the Baseline model, all performance metrics are at their lowest, indicating the limitations of lacking effective feature extraction and similarity constraints. After introducing the PCL function, the model’s Rank-1 accuracy in All Search increased from 65.24% to 69.31%, and the mAP improved from 61.92% to 67.17%. This indicates that PCL can effectively enhance the alignment ability of cross-modal features and improve the model’s discrimination of fine-grained features.

    However, relying solely on NCA or MDA did not yield significant performance improvements. Specifically, when using NCA, the Rank-1 was 67.26% and the mAP was 62.82%, showing only a slight increase over the Baseline. This suggests that the node coupling alone does not sufficiently address the feature inconsistencies between modalities. In comparison, the performance of MDA alone was slightly better, with a Rank-1 of 68.38% and an mAP of 64.47%. This implies that MDA has certain advantages in reducing interference between modalities, but it is still insufficient to independently tackle the complexities of cross-modal feature integration.

    The combination of NCA and MDA showed significant improvement, with Rank-1 rising to 72.40% and mAP reaching 69.91%. This further emphasizes the complementarity of these two attention mechanisms; NCA enhances feature coupling within the same modality, while MDA ensures the independence of cross-modal features, thereby improving the overall model performance.

    Finally, when utilizing all three components (HFF), the Rank-1 accuracy surged to 76.32% and the mAP reached 73.27%. This result not only reflects the synergistic effect of PCL, NCA, and MDA, but also highlights the importance of fine-grained feature learning and interference reduction between modalities in cross-modal person Re-ID tasks. Overall, the model incorporating PCL, NCA, and MDA shows an improvement of approximately 11.08% in Rank-1 and 11.35% in mAP compared to the Baseline, demonstrating the significant effectiveness of the proposed method in feature extraction and integration.

    4.4.2. Model analysis

    To further investigate the impact of hyperparameter settings on model performance, we tested the scaling factor ε, the loss function allocation coefficients λCE, λTri, and λPixel, and the number of Bi-Transformer Layers L. The scaling factor ε is used to control the amplification of feature similarity differences, thereby affecting the model’s sensitivity to feature similarity. We tested different values of the scaling factor ε, and the results are shown in Fig. 3.

    Fig. 3.

    Fig. 3. Impact of scaling factor ε on model performance (%).

    It can be observed that when ε=0.75, the model achieves the best performance on both Rank-1 and mAP metrics. This indicates that the model has good sensitivity to the differences in feature similarity, allowing it to more accurately distinguish samples of different identities. When the scaling factor ε is too small, the differences in feature similarity are excessively amplified, causing the model to become overly sensitive and treat noise and minor irrelevant features as important features, thereby affecting overall performance. Conversely, when the scaling factor ε is too large, the differences in feature similarity are weakened, making it difficult for the model to effectively distinguish between samples of different identities. This insufficient differentiation in feature similarity leads to challenges in accurate matching and recognition.

    We train the model utilizing three distinct loss functions: cross-entropy loss to enhance prediction accuracy, triplet loss to cluster features of the same identity, and PCL to achieve feature alignment and differentiation. To explore the influence of each loss function on the model’s performance, we modified the coefficients of these loss functions, with the experimental results presented in Table 2.

    Table 2. Impact of loss function coefficients on model performance (%).

    λ_CE | λ_Tri | λ_Pixel | All Search Rank-1 | All Search mAP | Indoor Search Rank-1 | Indoor Search mAP
    1    | 1     | 1       | 75.83 | 72.29 | 85.91 | 83.65
    0.5  | 0.5   | 1       | 76.32 | 73.27 | 87.16 | 86.73
    0.25 | 0.25  | 1       | 74.25 | 70.65 | 83.86 | 80.95
    1    | 1     | 0.5     | 74.38 | 71.60 | 85.59 | 84.72
    1    | 1     | 0.25    | 73.54 | 70.98 | 84.67 | 81.61

    It is evident that the model performs best when the loss function allocation coefficients λCE, λTri, and λPixel are set to 0.5, 0.5, and 1, respectively. We believe that with this loss function coefficient allocation, PCL can better achieve feature alignment and differentiation in cross-modality person Re-ID tasks. When the proportion of the PCL coefficient is too large, the model may overly focus on pixel-level alignment, neglecting the relationships between global features, which can lead to reduced feature generalization ability and affect model performance. Conversely, when the proportion of the PCL coefficient is too small, the role of PCL in the loss function is weakened, failing to fully leverage its advantages in feature alignment and differentiation, resulting in decreased model performance.

    The number of layers L in the Bi-Transformer Layer determines the complexity and depth of feature extraction in the deep network. More layers mean that the model can capture higher-level features and more complex patterns, but it also increases the computational complexity and the risk of overfitting. To determine the optimal number of layers for the Bi-Transformer Layer, we conducted a series of experiments, and the results are shown in Fig. 4.

    Fig. 4.

    Fig. 4. The impact of the number of Bi-Transformer Layers on model performance (%).

    From Fig. 4, it can be observed that as the number of Bi-Transformer Layers increases, the overall performance of the model improves. This is because more layers imply a deeper network structure, capable of extracting higher-level and more abstract features. Each layer of the Bi-Transformer Layer processes and transforms the input features, and through multi-level gradual processing, the features are more fully integrated and refined, resulting in more discriminative final feature representations. Additionally, the Bi-Transformer Layer not only processes single-modal features but also facilitates information interaction between different modalities. More layers can better capture the correlations and complementary information between modalities, enhancing the accuracy of cross-modal matching. Although increasing the number of layers can improve model performance, it also brings increased computational complexity and potential overfitting risks. In practical applications, a balance must be found between performance improvement and computational resource constraints. Based on the experimental results, we chose 8 layers for the Bi-Transformer Layer, ensuring high model performance while avoiding excessive computational overhead and model complexity.

    To further showcase the advantages of the proposed Hierarchical Feature Fusion (HFF) network, we evaluated its performance against several state-of-the-art (SOTA) methods. The results of this comparison are presented in Tables 3 and 4, where the methods are organized into categories, including Loss-based, Attention-based, and MG-based approaches.

    Table 3. Performance (%) comparison of the proposed HFF network and other SOTA methods on the SYSU-MM01 dataset.

    Category        | Method     | All Search Rank-1 | All Search mAP | Indoor Search Rank-1 | Indoor Search mAP
    Loss-Based      | DCLNet28   | 70.79 | 65.18 | 73.51 | 76.80
                    | LCCRF7     | 43.23 | 43.09 | 50.07 | 58.88
                    | DEEN45     | 74.70 | 71.80 | 80.30 | 83.30
                    | MTMFE10    | 69.47 | 66.41 | 71.72 | 76.38
                    | G2DA29     | 63.94 | 60.73 | 71.06 | 76.01
                    | MAUM18     | 71.68 | 68.79 | 76.97 | 81.94
                    | DART36     | 68.72 | 66.29 | 72.52 | 78.17
                    | FMCNet43   | 66.34 | 62.51 | 68.15 | 74.09
                    | IMG19      | 69.79 | 51.01 | 78.14 | 65.51
                    | MG-DDCL27  | 76.05 | 57.70 | 85.29 | 73.06
    Attention-Based | AGCC41     | 75.91 | 72.96 | 79.34 | 84.62
                    | SPOT1      | 65.34 | 62.25 | 69.42 | 74.63
                    | MIA48      | 75.23 | 72.36 | 83.56 | 85.67
    MG-Based        | JMMRL21    | 71.27 | 68.11 | 77.64 | 81.06
                    | TOPLight42 | 66.76 | 64.01 | 72.89 | 76.70
    —               | Baseline   | 65.24 | 61.92 | 78.52 | 66.18
    —               | HFF (Ours) | 76.32 | 73.27 | 87.16 | 86.73

    Table 4. Performance (%) comparison of the proposed HFF network and other SOTA methods on the RegDB dataset.

    Category        | Method     | Visible to Infrared Rank-1 | Visible to Infrared mAP | Infrared to Visible Rank-1 | Infrared to Visible mAP
    Loss-Based      | DCLNet9    | 81.21 | 74.33 | 78.06 | 70.60
                    | LCCRF22    | 79.27 | 77.69 | 80.97 | 79.92
                    | DEEN35     | 91.60 | 85.10 | 89.50 | 83.40
                    | MTMFE8     | 85.04 | 82.52 | 81.11 | 79.59
                    | G2DA4      | 73.95 | 65.49 | 69.67 | 61.98
                    | MAUM3      | 87.87 | 85.09 | 86.95 | 84.34
                    | DART28     | 83.60 | 75.67 | 81.97 | 73.78
                    | FMCNet7    | 89.12 | 84.43 | 88.38 | 83.86
                    | IMG45      | 89.70 | 85.82 | 87.64 | 84.03
                    | MG-DDCL10  | 84.02 | 79.26 | 83.39 | 77.24
    Attention-Based | AGCC29     | 92.59 | 86.18 | 91.35 | 84.92
                    | SPOT18     | 80.35 | 72.46 | 79.37 | 72.26
                    | MIA36      | 92.75 | 85.40 | 91.82 | 84.64
    MG-Based        | JMMRL43    | 94.18 | 86.54 | 91.16 | 83.67
                    | TOPLight19 | 85.51 | 79.95 | 80.65 | 75.91
    —               | Baseline   | 87.29 | 86.07 | 83.62 | 77.83
    —               | HFF (Ours) | 95.23 | 88.05 | 93.26 | 86.27

    It’s evident that our proposed HFF network outperforms all the compared models on the SYSU-MM01 dataset, achieving the best performance. Among the models, the Loss-Based MG-DDCL’s performance is quite close to HFF. The MG-DDCL model excels due to its dynamic dual-task collaborative learning strategy, which effectively reduces background interference and enhances cross-modality person Re-ID performance. However, MG-DDCL does not fully utilize the complementary information between modalities, resulting in suboptimal feature alignment and separation, and its loss function design lacks specificity. In contrast, the proposed HFF, through the introduction of NCA, MDA, and PCL, fully leverages multi-modality information, optimizes feature alignment and separation, and significantly enhances feature representation ability and robustness.

    Similarly, our HFF network achieves the highest performance on the RegDB dataset. Among the compared models, the MG-Based JMMRL model’s performance is the closest to ours. The advantage of JMMRL lies in its MG feature extraction method, which captures more levels of feature information, resulting in high performance. However, JMMRL falls short in handling cross-modality feature alignment, failing to sufficiently reduce the differences between modalities. The HFF, by introducing PCL, enhances feature alignment capability and further optimizes feature representation through the combination of global and local features.

    Finally, to verify the complexity of our proposed model, we compared it with other models in terms of the number of parameters and computational complexity, as shown in Table 5. It can be seen that AGW is based on the ResNet-50 architecture and PMT is based on ViT, while our proposed HFF combines ResNet-34 and ViT, resulting in a significant increase in both parameter count and computational complexity. However, despite this increased complexity, HFF achieves a notable improvement in accuracy in the All Search mode of SYSU-MM01. To facilitate model deployment, we will further investigate model pruning in future work to reduce the number of parameters.

    Table 5. The proposed HFF compared with other models in terms of performance (%), model size, and computational cost.

    Method | Rank-1 | mAP   | FLOPs (G) | Parameters (M)
    AGW37  | 58.19  | 56.60 | 7.5       | 30.2
    PMT20  | 67.53  | 64.98 | 18.8      | 86.4
    HFF    | 76.32  | 73.27 | 25.2      | 116.4

    4.4.3. Visualization experiment

    To further validate the effectiveness of our proposed HFF network, we conducted visualization experiments from three aspects. First, we visualized the feature maps extracted by the Baseline and HFF networks using Gradient-weighted Class Activation Mapping.26 The comparison of the visualized results is shown in Fig. 5.

    Fig. 5.

    Fig. 5. Visual comparison of features extracted by HFF network and Baseline.

    According to the feature visualization comparison results in Fig. 5, the Baseline model’s extracted features exhibit two scenarios: sometimes they are dispersed across various small features, and other times they are concentrated on a few highly discriminative local features. In contrast, the features extracted by the HFF network provide a more comprehensive coverage of the entire body of the person, with a more uniform and holistic distribution. This indicates that HFF is more effective in capturing critical information across the whole body rather than relying solely on local features. We attribute this improvement to the proposed PCL, which allows the model to extract shared features between different modalities. Additionally, the proposed NCA and MDA mechanisms further enable the model to preserve unique information from each modality. In comparison, the Baseline model’s feature extraction is more susceptible to interference from local information, leading to features being either dispersed or overly concentrated in notable local areas.

    Then, to further verify whether the proposed PCL facilitates feature alignment, we visualized specific pixel points in the feature maps extracted by ResNet-34 before and after adding PCL. The visualization results are shown in Fig. 6.

    Fig. 6.

    Fig. 6. Visualization of some pixels in the feature map extracted by ResNet-34.

    It can be observed that the proposed PCL significantly improves feature alignment. Without using PCL, the model’s attention areas are scattered, and the features between different modalities are not fully aligned, showing noticeable differences in attention to the same local semantics. After incorporating PCL, the model’s attention to features becomes more focused, and the attention levels of semantic features in the same regions become increasingly similar. By enforcing pixel-level alignment of features from samples of the same identity across different modalities, PCL effectively reduces the feature discrepancies between modalities, thereby enhancing the model’s cross-modal feature alignment capability.

    Finally, we visualized the cross-modal person retrieval results and compared them with the retrieval results from the Baseline model. The comparison results are shown in Fig. 7.

    Fig. 7.

    Fig. 7. Top-5 visualization ranking of pedestrian retrieval of HFF network and Baseline.

    From the visual retrieval results in Fig. 7, it is evident that the Baseline model suffers from errors in matching, particularly for similar-looking individuals, indicating its reliance on local discriminative features and lack of global feature understanding. In contrast, the HFF model shows significant improvement, effectively extracting comprehensive full-body features of the person, reducing erroneous matches, and significantly enhancing the accuracy of cross-modal person Re-ID.

    5. Conclusion

    To address the issue of ineffective alignment caused by environmental complexities in cross-modal person Re-ID, we propose a novel HFF network. In the shallow layers of the network, we introduce a PCL function to achieve effective alignment of low-frequency information with the same semantics across different modalities. In the deeper layers, we design NCA and MDA mechanisms to enhance the representational capability of features within the same modality while reducing the interference between features from different modalities. Extensive experimental results show that HFF significantly improves performance on the SYSU-MM01 and RegDB datasets, providing new insights into addressing the alignment of features across different modal levels. Although our model effectively handles complex backgrounds and modality variations, the introduction of the hierarchical feature fusion mechanism does increase the complexity of the network. In the future, we plan to explore the feasibility of integrating this loss function into Vision Transformer and to simplify the network structure using a teacher-student model approach.

    Acknowledgments

    This research was supported by the project “Deep Analysis of Vocational Education Intelligent Classroom Behavior Based on Multimodal Data” (No. KJZD-K202303105).

    ORCID

    Wen Fu  https://orcid.org/0009-0008-0421-5558

    Monghao Lim  https://orcid.org/0009-0007-4255-6516

    Biography

    Wen Fu, a member of the Communist Party of China, is a Professor at Chongqing Polytechnic University of Electronic Technology. She holds a Master’s degree from Wuhan University and is currently a Visiting Scholar at Peking University. Her research interests include Artificial Intelligence & Big Data Software Technology and Management Science. Professor Wen has published 18 papers and four academic monographs. She is the Chief Editor of nine textbooks, a co-editor of seven textbooks, and holds four invention patents and nine utility model patents. Additionally, she has eight software copyrights.

    She led the construction of three national-level projects and five provincial and ministerial-level projects. She has also guided students to participate in the “Software Testing” competition of the National Vocational College Skills Competition, where they won the first prize. In the “Big Data Technology and Application” competition, her students won the second prize. Furthermore, she has received 28 provincial and ministerial-level industry competition awards.

    Monghao Lim graduated with a Bachelor’s degree in Computer Science and Information Systems from the National University of Singapore in 1989. He was awarded a Master of Business Administration (International Business) in 2007 from the University of Wales, UK. In addition, he pursued vocational education, earning a Certificate IV in Vocational Education and Training (VET) in 2010 and a Graduate Diploma in VET from Charles Sturt University in 2012.

    With extensive experience in vocational education, Lim has delivered educational programs internationally across Singapore, Australia, and China. He is not only a practitioner in education delivery but also an expert in educational leadership, having managed foreign lecturers at five Chinese university colleges across China.

    Technically, Lim’s research interests include IT Networking and Cybersecurity. Instructionally, he specializes in working with students with special needs.