Cascaded information enhancement and cross-modal attention feature fusion for multispectral pedestrian detection

Multispectral pedestrian detection aims to detect and locate pedestrians in Color and Thermal images and has been widely used in autonomous driving, video surveillance, and related applications. Most existing multispectral pedestrian detection algorithms achieve only limited success because they do not account for the confusion between pedestrian information and background noise in Color and Thermal images. Here we propose a multispectral pedestrian detection algorithm that consists mainly of a cascaded information enhancement module and a cross-modal attention feature fusion module. The cascaded information enhancement module applies channel and spatial attention to the features fused by a cascaded feature fusion block, and multiplies the single-modal features by the resulting attention weights element by element, which enhances the pedestrian features in each modality and suppresses background interference. The cross-modal attention feature fusion module mines the features of the Color and Thermal modalities so that they complement each other; global features are then constructed by adding the cross-modal complemented features element by element and are attention-weighted to fuse the two modal features effectively. Finally, the fused features are fed into the detection head to detect and locate pedestrians. Extensive experiments were performed on two improved versions of the annotations (sanitized annotations and paired annotations) of the public KAIST dataset. The results show that our method achieves a lower pedestrian miss rate and more accurate pedestrian detection boxes than the compared methods. The ablation experiments further confirm the effectiveness of each proposed module.


INTRODUCTION
Pedestrian detection, which parses visual content to identify and locate pedestrians in images or video, is regarded as an essential task in computer vision and is widely employed in applications such as autonomous driving, video surveillance, and person re-identification (Jeong et al., 2017; Zhang et al., 2018; Li et al., 2021a,b, 2022a; Wang et al., 2023). The performance of this technology has advanced greatly with the help of convolutional neural networks (CNN). Typically, pedestrian detectors take Color images as input and try to retrieve pedestrian information from them. However, the quality of Color images depends heavily on the lighting conditions. Pedestrians are frequently missed when detectors process Color images whose resolution and contrast are degraded by unfavorable lighting. Consequently, such models have seen limited use in all-weather devices.
Thermal imaging captures the infrared radiation of pedestrians and is barely affected by changes in ambient light. Combining Color and Thermal images has therefore been explored in recent years (Hwang et al., 2015; Liu et al., 2016; González et al., 2016; Zhang et al., 2020; Liu et al., 2020; Li et al., 2018a; Xie et al., 2021; Wang et al., 2022a). These methods have been shown to improve pedestrian detection in complex environments because they can retrieve more pedestrian information. Despite this initial success, two major challenges remain. First, as shown in Figure 1, pedestrians tend to blend with the background in nighttime Color images because of insufficient light (Zhu et al., 2021), and in daytime Thermal images because the human body and the ambient environment have similar temperatures (Yang et al., 2022). Second, there is an essential difference between Color and Thermal images: the former display the color and texture details of pedestrians, while the latter show temperature information. Solutions are therefore needed that augment the pedestrian features in the Color and Thermal modalities to suppress background interference, and that integrate both modalities more effectively, in order to improve the accuracy of pedestrian detection in complex environments.
To address these challenges, Guan et al. (2019) and Zhou et al. (2020) designed illumination-aware networks to obtain illumination-measured parameters for Color and Thermal images respectively, and used them as fusion weights so that the two modal features are fused adaptively. However, the illumination-measured parameters rely heavily on classification scores, whose accuracy is limited by the performance of the classifier. Li et al. (Li et al., 2022c) reported confidence-aware networks that predict the confidence of the detection boxes for each modality, and then employed the Dempster-Shafer combination rule to fuse the results of the different branches based on uncertainty. Nevertheless, the accuracy of the predicted box confidence is likewise affected by the performance of the confidence-aware network. A cyclic fusion and refinement scheme was introduced by (Zhang et al., 2020b) to gradually improve the quality of Color and Thermal features and to automatically balance the complementary and consistent information of the two modalities. However, this method used only a simple feature cascade operation to fuse Color and Thermal features and did not fully exploit the complementary features of the two modalities.
To tackle the problems above, we propose a multispectral pedestrian detection algorithm with cascaded information enhancement and cross-modal attention feature fusion. The cascaded information enhancement module (CIEM) is designed to enhance the pedestrian information that is suppressed by the background in Color and Thermal images. CIEM uses a cascaded feature fusion block to fuse Color and Thermal features into fused features of both modalities. Since the fused features contain the consistent and complementary information of the Color and Thermal modalities, they can be used to enhance the Color and Thermal features respectively and thereby reduce the interference of the background with pedestrian information. Inspired by the attention mechanism, the attention weights of the fused features are obtained sequentially by channel and spatial attention learning, and the Color and Thermal features are each multiplied by these attention weights element by element. In this way, each single-modal feature carries the combined information of the two modalities, and the single-modal information is enhanced from the perspective of the fused features. Although CIEM enriches single-modal pedestrian features, simple fusion of the enhanced single-modal features is still insufficient for robust multispectral pedestrian detection. Thus, we design the cross-modal attention feature fusion module (CAFFM) to fuse Color and Thermal features efficiently. Cross-modal attention is used in this module to distinguish the pedestrian features of the different modalities. To supplement each modality with pedestrian information from the other, the attention of the other modality is used to augment the pedestrian features of the current modality. A global feature is constructed by adding the Color and Thermal features after cross-modal feature enhancement, and this global feature is used to guide the fusion of the Color and Thermal features. Overall, the proposed method acquires more comprehensive pedestrian features through cascaded information enhancement and cross-modal attention feature fusion, which improves the accuracy of multispectral pedestrian detection. The main contributions of this paper are summarized as follows:
(1) A cascaded information enhancement module is proposed. From the perspective of fused features, it reduces the interference of the background in the Color and Thermal modalities with pedestrian detection and augments the pedestrian features of the two modalities separately through an attention mechanism.
(2) The designed cross-modal attention feature fusion module first mines the features of the Color and Thermal modalities separately through a cross-modal attention network and adds them to the other modality for cross-modal feature enhancement. The cross-modal enhanced Color and Thermal features are then used to construct global features that guide the fusion of the two modalities.
(3) Extensive experiments are conducted on the public KAIST dataset to demonstrate the effectiveness and superiority of the proposed method. In addition, ablation experiments demonstrate the effectiveness of each proposed module.

Multispectral Pedestrian Detection
Multispectral sensors can obtain paired Color-Thermal images that provide complementary information about pedestrian targets. A large multispectral pedestrian detection dataset (KAIST) was constructed by (Hwang et al., 2015). In the same work, the traditional aggregated channel feature (ACF) pedestrian detector (Dollár et al., 2014) was combined with the HOG algorithm (Dalal and Triggs, 2005) to obtain an extended ACF method (ACF+T+THOG) that fuses Color and Thermal features. In 2016, Liu et al. (Liu et al., 2016) proposed four fusion strategies with VGG16 as the backbone network, namely low-layer feature fusion, middle-layer feature fusion, high-layer feature fusion, and confidence-score fusion, and showed that middle-layer feature fusion integrates Color and Thermal features best. Inspired by this, (König et al., 2017) developed a multispectral region proposal network based on Faster RCNN (Region with CNN features, RCNN) (Ren et al., 2017) and replaced the original classifier in Faster RCNN with a boosted decision tree classifier to reduce missed and false detections of pedestrians. Recently, Kim et al. (Kim et al., 2021a) used EfficientDet as the backbone network and proposed an EfficientDet-based fusion framework for multispectral pedestrian detection that improves detection accuracy by adding and cascading the Color and Thermal features. Although these studies (Hwang et al., 2015; Liu et al., 2016; König et al., 2017; Kim et al., 2021a) fused Color and Thermal features for pedestrian detection, they mainly explored the impact of the fusion stage on detection, adopted only simple feature fusion, and did not address the confusion between pedestrians and background.
In 2019, Zhang et al. (Zhang et al., 2019a) observed a weak alignment problem of pedestrian positions between Color and Thermal images. They re-annotated the KAIST dataset and proposed Aligned Region CNN (AR-CNN) to handle weakly aligned multispectral pedestrian detection data in an end-to-end manner. However, this algorithm requires paired annotations, and annotating a dataset is time-consuming and labor-intensive, which makes the algorithm difficult to apply in realistic scenes. Kim et al. (Kim et al., 2021b) proposed a single-stage multispectral pedestrian detection framework that uses multi-label learning to learn input state-aware features according to the state of the input image pair. Each pedestrian is assigned an individual label vector: if the pedestrian is visible in only one image of the pair, the label vector is $y_1 = [0, 1]$ or $y_2 = [1, 0]$; if the pedestrian is visible in both images, the label vector is $y_3 = [1, 1]$. This addresses the weak alignment of pedestrian locations between Color and Thermal images, but the model still requires paired annotations during training. Guan et al. (Guan et al., 2019) designed illumination-aware networks to obtain illumination-measured parameters for Color and Thermal images separately and used them as the fusion weights for Color and Thermal features. Zhou et al. designed a differential modality perception fusion module to guide the features of the two modalities to become similar, and then used an illumination perception network to assign fusion weights to the Color and Thermal features. Kim et al. (Kim et al., 2022) reported an uncertainty-aware cross-modal guidance (UCG) module that guides the distribution of modal features with high prediction uncertainty to align with the distribution of modal features with low prediction uncertainty.
The studies (Guan et al., 2019; Zhou et al., 2020) noticed that pedestrians in Color and Thermal images are easily confused with the background and used illumination-aware networks to assign fusion weights to Color and Thermal features. However, the illumination-measured parameters rely heavily on classification scores, whose accuracy is limited by the performance of the classifier. In contrast, the method proposed in this paper both considers the confusion of pedestrians and background in Color and Thermal images and fuses the two modal features effectively.

Attention Mechanisms
Attention mechanisms (Vaswani et al., 2017) are used in computer vision to focus the processing of visual information on the most relevant content. They have been widely applied in semantic segmentation (Li et al., 2020a), image captioning (Li et al., 2020b), image fusion (Xiao et al., 2022; Li et al., 2021c), image dehazing (Li et al., 2022d), salient object detection (Xu et al., 2021), person re-identification (Li et al., 2022e; Zhang et al., 2022; Wang et al., 2022b), etc. Hu et al. (Hu et al., 2020) introduced the squeeze-and-excitation network (SENet), which models the interdependence between feature channels to generate channel attention that recalibrates the channel-wise feature maps. Li et al. (Li et al., 2019a) proposed the selective kernel unit (SKNet) to adaptively fuse branches with different kernel sizes based on the input information. Inspired by this, Dai et al. (Dai et al., 2021) designed a multi-scale channel attention feature fusion network that uses channel attention to replace simple fusion operations, such as feature cascade or summation, and thus produces richer feature representations. However, progress in multispectral pedestrian detection is still limited by two main challenges: the interference caused by the background and the fundamental difference in characteristics between Color and Thermal images. Therefore, we propose a multispectral pedestrian detection algorithm with cascaded information enhancement and cross-modal attention feature fusion based on the attention mechanism.

METHODS
The overall network framework of the proposed algorithm is shown in Figure 2. The network consists of an encoder, a cascaded information enhancement module (CIEM), a cross-modal attention feature fusion module (CAFFM) and a detection head. Specifically, ResNet-101 (He et al., 2016) is used as the backbone network of the encoder to encode the input Color images $X_c$ and Thermal images $X_t$ into the corresponding feature maps $F_c \in \mathbb{R}^{W \times H \times C}$ and $F_t \in \mathbb{R}^{W \times H \times C}$ ($W$, $H$ and $C$ denote the width, height and number of channels of the feature maps, respectively). CIEM enhances single-modal information from the perspective of fused features: a cascaded feature fusion block fuses $F_c$ and $F_t$, and the fused features are attention-weighted to enrich pedestrian features. CAFFM complements the features of each modality by mining the complementary features between the two modalities, and constructs global features to guide the effective fusion of the two modal features. The detection head performs pedestrian recognition and localization on the final fused features.

Cascaded Information Enhancement Module
Considering the confusion of pedestrians with the background in Color and Thermal images, we design a cascaded information enhancement module (CIEM) to augment the pedestrian features of both modalities and mitigate the effect of background interference on pedestrian detection. Specifically, a cascaded feature fusion block is used to fuse the Color features $F_c$ and Thermal features $F_t$. The block consists of a feature cascade, a 1 × 1 convolution, a 3 × 3 convolution, a BN layer, and a ReLU activation function. The feature cascade operation concatenates $F_c$ and $F_t$ along the channel direction. The 1 × 1 convolution enables cross-channel feature interaction and reduces the number of channels of the concatenated feature map, while the 3 × 3 convolution enlarges the receptive field and fuses the features more comprehensively to generate the fused feature $F_{ct}$:
$$F_{ct} = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_3\left(\mathrm{Conv}_1\left([F_c, F_t]\right)\right)\right)\right),$$
where BN denotes batch normalization, $\mathrm{Conv}_n(\cdot)$ denotes a convolution with kernel size $n \times n$, $[\cdot, \cdot]$ denotes the cascade of features along the channel direction, and $\mathrm{ReLU}(\cdot)$ denotes the ReLU activation function. The fused feature $F_{ct}$ is used to enhance the single-modal information because it combines the consistency and complementarity of the Color features $F_c$ and Thermal features $F_t$. Using $F_{ct}$ to enhance a single-modal feature reduces the interference of noise in that feature (for example, when pedestrian information is difficult to distinguish from background noise).
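The cascaded feature fusion block can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the channel-first layout (C, H, W), the random weights, the omission of bias terms and of the learned BN scale and shift, and the helper names `conv1x1`, `conv3x3`, `batch_norm` and `cascaded_fusion` are all assumptions made for illustration.

```python
import numpy as np

def conv1x1(x, weight):
    # x: (C_in, H, W); weight: (C_out, C_in).
    # A 1x1 convolution is a per-pixel linear map across channels.
    return np.einsum("oc,chw->ohw", weight, x)

def conv3x3(x, weight):
    # x: (C_in, H, W); weight: (C_out, C_in, 3, 3).
    # Zero padding of one pixel keeps the spatial size unchanged.
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx],
                             padded[:, dy:dy + h, dx:dx + w])
    return out

def batch_norm(x, eps=1e-5):
    # Per-channel normalisation over spatial positions: a single-sample
    # stand-in for BN without the learned scale and shift.
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def cascaded_fusion(f_c, f_t, w1, w3):
    cat = np.concatenate([f_c, f_t], axis=0)   # feature cascade: 2C channels
    fused = conv3x3(conv1x1(cat, w1), w3)      # 2C -> C, then 3x3 context
    return np.maximum(batch_norm(fused), 0.0)  # BN followed by ReLU

rng = np.random.default_rng(0)
C, H, W = 4, 6, 5
f_c = rng.standard_normal((C, H, W))           # stand-in Color features
f_t = rng.standard_normal((C, H, W))           # stand-in Thermal features
w1 = rng.standard_normal((C, 2 * C)) * 0.1
w3 = rng.standard_normal((C, C, 3, 3)) * 0.1
f_ct = cascaded_fusion(f_c, f_t, w1, w3)
print(f_ct.shape)
```

The fused map has the same shape as each single-modal input, which is what allows attention weights derived from it to be applied to both modalities.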

Figure 2. Overall framework of the proposed algorithm
To enhance pedestrian features effectively, the fused feature $F_{ct}$ is sent into the channel attention module (CAM) and the spatial attention module (PAM) (Woo et al., 2018) so that the network attends to pedestrian features. The network structure of CAM and PAM is shown in Figure 3. CAM first learns the channel attention weight $w_{ca} \in \mathbb{R}^{1 \times 1 \times C}$ from $F_{ct}$; $w_{ca}$ is then used to weight $F_{ct}$, and PAM obtains the spatial attention weight $w_{pa} \in \mathbb{R}^{W \times H \times 1}$ from the weighted features.
The single-modal Color features $F_c$ and Thermal features $F_t$ are multiplied element by element with the attention weights $w_{ca}$ and $w_{pa}$ to enhance the single-modal features from the perspective of the fused features. The whole process can be described as follows:
$$w_{ca} = \mathrm{CAM}(F_{ct}), \qquad w_{pa} = \mathrm{PAM}(w_{ca} \otimes F_{ct}),$$
$$F'_c = F_c \otimes w_{ca} \otimes w_{pa}, \qquad F'_t = F_t \otimes w_{ca} \otimes w_{pa},$$
where $F'_c$ and $F'_t$ denote the Color features and Thermal features obtained by the cascaded information enhancement module, respectively, and $\otimes$ denotes element-by-element multiplication.
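A minimal sketch of this weighting step is given below. It assumes CBAM-style attention in the spirit of (Woo et al., 2018), with two simplifications that are not taken from the paper: the spatial attention mixes the channel-wise average and maximum maps with two scalars instead of a learned convolution, and all weights are random. The names `channel_attention`, `spatial_attention` and `ciem_enhance` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f, w_down, w_up):
    """Channel weights of shape (C, 1, 1) from a (C, H, W) feature map."""
    avg = f.mean(axis=(1, 2))                 # global average pooling -> (C,)
    mx = f.max(axis=(1, 2))                   # global max pooling -> (C,)

    def mlp(v):                               # shared bottleneck MLP
        return w_up @ np.maximum(w_down @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]

def spatial_attention(f, k_avg=1.0, k_max=1.0):
    """Spatial weights of shape (1, H, W); scalar mix replaces the usual conv."""
    avg = f.mean(axis=0, keepdims=True)
    mx = f.max(axis=0, keepdims=True)
    return sigmoid(k_avg * avg + k_max * mx)

def ciem_enhance(f_c, f_t, f_ct, w_down, w_up):
    w_ca = channel_attention(f_ct, w_down, w_up)   # learned from fused feature
    w_pa = spatial_attention(w_ca * f_ct)          # learned from weighted feature
    # Both modalities are re-weighted by the same fused-feature attention.
    return f_c * w_ca * w_pa, f_t * w_ca * w_pa, w_ca, w_pa

rng = np.random.default_rng(0)
C, H, W = 8, 6, 5
f_c, f_t, f_ct = (rng.standard_normal((C, H, W)) for _ in range(3))
w_down = rng.standard_normal((C // 4, C)) * 0.1
w_up = rng.standard_normal((C, C // 4)) * 0.1
f_c_enh, f_t_enh, w_ca, w_pa = ciem_enhance(f_c, f_t, f_ct, w_down, w_up)
print(w_ca.shape, w_pa.shape, f_c_enh.shape)
```

Because both weights pass through a sigmoid, every entry lies strictly between 0 and 1, so the step can only attenuate responses; regions that the fused feature marks as background are suppressed more strongly than pedestrian regions.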

Cross-modal Attention Feature Fusion Module
There is an essential difference between Color and Thermal images: Color images reflect the color and texture details of pedestrians, while Thermal images contain their temperature information. The two modalities nevertheless carry complementary information. To explore the complementary features of the two image modalities and fuse them effectively, we design a cross-modal attention feature fusion module.
Specifically, the Color features $F'_c$ and Thermal features $F'_t$ enhanced by CIEM are first mapped into feature vectors $v_c \in \mathbb{R}^{1 \times 1 \times C}$ and $v_t \in \mathbb{R}^{1 \times 1 \times C}$, respectively, by global average pooling. The cross-modal attention network consists of a set of symmetric 1 × 1 convolutions, ReLU activation functions, and Sigmoid activation functions. To obtain the complementary features of the two modalities, more pedestrian features need to be mined from each single modality. The cross-modal attention network learns the attention weights $w_t \in \mathbb{R}^{1 \times 1 \times C}$ and $w_c \in \mathbb{R}^{1 \times 1 \times C}$ from the feature vectors $v_t$ and $v_c$. The Color features $F'_c$ are then multiplied element by element with the attention weights $w_t$ of the Thermal modality, and the Thermal features $F'_t$ are multiplied element by element with the attention weights $w_c$ of the Color modality, so that each modality is complemented with features of the other. The pooling and cross-weighting steps can be expressed as follows:
$$v_c = \mathrm{GAP}(F'_c), \qquad v_t = \mathrm{GAP}(F'_t),$$
$$F'_{ct} = F'_c \otimes w_t, \qquad F'_{tc} = F'_t \otimes w_c,$$
where $\mathrm{GAP}(\cdot)$ denotes global average pooling.
To fuse the two modal features efficiently, the features $F'_{ct}$ and $F'_{tc}$ are added element by element to obtain a global feature vector that contains both Color and Thermal features. Then, the features $F'_t$ and $F'_c$ are added element by element and multiplied element by element with the attention weight $w_{ct}$ of the global feature vector, which guides the fusion of Color and Thermal features from the perspective of global features and yields the final fused feature $F$. The fused feature $F$ is input to the detection head to obtain the pedestrian detection results. The feature fusion process can be expressed as follows:
$$F = (F'_c \oplus F'_t) \otimes w_{ct},$$
where $w_{ct}$ is the attention weight learned from the global feature $F'_{ct} \oplus F'_{tc}$ and $\oplus$ denotes element-by-element addition.
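The data flow of the fusion module can be sketched as follows. Several details are assumptions rather than facts from the paper: each attention branch is modelled as 1 × 1 convolution, ReLU, 1 × 1 convolution, Sigmoid applied to a pooled vector (where a 1 × 1 convolution reduces to a matrix product), the global weight is obtained by pooling the global feature and passing it through a branch of the same form, and the weights are random. The names `attention_weights`, `caffm_fuse` and `params` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_weights(v, w_a, w_b):
    # On a pooled (C,) vector a 1x1 convolution is a matrix product.
    # Assumed order: conv -> ReLU -> conv -> Sigmoid.
    return sigmoid(w_b @ np.maximum(w_a @ v, 0.0))

def caffm_fuse(f_c, f_t, params):
    v_c = f_c.mean(axis=(1, 2))                    # GAP of enhanced Color features
    v_t = f_t.mean(axis=(1, 2))                    # GAP of enhanced Thermal features
    w_c = attention_weights(v_c, *params["color"])
    w_t = attention_weights(v_t, *params["thermal"])
    f_ct = f_c * w_t[:, None, None]                # Color weighted by Thermal attention
    f_tc = f_t * w_c[:, None, None]                # Thermal weighted by Color attention
    global_feat = f_ct + f_tc                      # element-wise addition
    w_ct = attention_weights(global_feat.mean(axis=(1, 2)), *params["global"])
    return (f_c + f_t) * w_ct[:, None, None]       # globally guided fusion

rng = np.random.default_rng(0)
C, H, W = 8, 6, 5

def branch():
    # One pair of (reduce, expand) matrices per attention branch.
    return (rng.standard_normal((C // 2, C)) * 0.1,
            rng.standard_normal((C, C // 2)) * 0.1)

params = {"color": branch(), "thermal": branch(), "global": branch()}
f_c = rng.standard_normal((C, H, W))
f_t = rng.standard_normal((C, H, W))
fused = caffm_fuse(f_c, f_t, params)
print(fused.shape)
```

The key design point the sketch preserves is the cross-wiring: the weights applied to one modality are always computed from the other, and the final sum is gated by weights computed from both.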

Loss Function
The loss function in this paper follows (Ren et al., 2017) and uses the Region Proposal Network (RPN) loss $L_{RPN}$ and the Fast RCNN (Girshick, 2015) loss $L_{FR}$ to jointly optimize the network:
$$L = L_{RPN} + L_{FR}.$$
Both $L_{RPN}$ and $L_{FR}$ consist of a classification loss $L_{cls}$ and a bounding box regression loss $L_{reg}$:
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*),$$
where $N_{cls}$ is the number of anchors, $N_{reg}$ is the total number of positive and negative samples, $p_i$ is the probability that the $i$-th anchor is predicted to be the target, $p_i^*$ is 1 when the anchor is a positive sample and 0 otherwise, $t_i$ denotes the predicted bounding box regression parameters of the $i$-th anchor, $t_i^*$ denotes the ground-truth (GT) bounding box parameters of the $i$-th anchor, and $\lambda = 1$.
The classification losses of the RPN and of Fast RCNN differ in that the RPN only distinguishes foreground from background, so its loss is the binary cross-entropy
$$L_{cls}(p_i, p_i^*) = -\left[p_i^* \log p_i + (1 - p_i^*) \log(1 - p_i)\right],$$
whereas Fast RCNN classifies the target category and uses a multi-category cross-entropy loss. The bounding box regression loss of both the RPN and Fast RCNN is the Smooth $L_1$ loss:
$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*),$$

where $R$ denotes the Smooth $L_1$ function,
$$R(x) = \begin{cases} 0.5\,\sigma^2 x^2, & |x| < 1/\sigma^2, \\ |x| - 0.5/\sigma^2, & \text{otherwise}. \end{cases}$$
The regression losses of the RPN and of Fast RCNN differ only in the value of $\sigma$: the RPN is trained with $\sigma = 3$ and Fast RCNN with $\sigma = 1$.
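The effect of $\sigma$ can be checked numerically with a short sketch. It assumes the common $\sigma$-parameterised Smooth $L_1$ form used in Faster RCNN implementations (quadratic below $1/\sigma^2$, linear above); the function names and the example box offsets are illustrative only.

```python
def smooth_l1(x, sigma=1.0):
    """Sigma-parameterised Smooth L1: quadratic near zero, linear elsewhere."""
    s2 = sigma * sigma
    ax = abs(x)
    if ax < 1.0 / s2:
        return 0.5 * s2 * x * x
    return ax - 0.5 / s2

def box_regression_loss(t, t_star, sigma=1.0):
    """Sum of Smooth L1 terms over the four box regression parameters."""
    return sum(smooth_l1(a - b, sigma) for a, b in zip(t, t_star))

# Hypothetical predicted and ground-truth regression parameters (tx, ty, tw, th).
t_pred = (0.10, -0.20, 0.05, 0.40)
t_gt = (0.00, 0.00, 0.00, 0.00)

rpn_loss = box_regression_loss(t_pred, t_gt, sigma=3.0)   # RPN setting
head_loss = box_regression_loss(t_pred, t_gt, sigma=1.0)  # Fast RCNN setting
print(rpn_loss, head_loss)
```

With $\sigma = 3$ the quadratic region shrinks to $|x| < 1/9$, so the RPN loss behaves like an $L_1$ loss for all but very small residuals, while $\sigma = 1$ keeps the quadratic region up to $|x| < 1$.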

Datasets
This paper evaluates the algorithm on the KAIST pedestrian dataset (Hwang et al., 2015), which is composed of 95,328 pairs of Color and Thermal images captured during daytime and nighttime. It is currently the most widely used multispectral pedestrian detection dataset. The dataset is labeled with four categories: person, people, person?, and cyclist. Considering the application areas of multispectral pedestrian detection (e.g., autonomous driving), all four categories are treated as positive examples in this paper. To address the annotation errors and missing annotations in the original KAIST annotations, several studies (Liu et al., 2016; Li et al., 2018b; Zhang et al., 2019a) cleaned and re-annotated the original data. Since the annotations used in different studies are not consistent, we use 7601 pairs of Color and Thermal images with the sanitized annotation (SA) (Li et al., 2018b) and 8892 pairs of Color and Thermal images with the paired annotation (PA) (Zhang et al., 2019a) for model training. The test set consists of 2252 pairs of Color and Thermal images, of which 1455 pairs were captured in the daytime and 797 pairs at nighttime. For a fair comparison with other methods, the test experiments follow the reasonable setting proposed in (Hwang et al., 2015).

Evaluation Indexes
In this paper, the log-average miss rate (MR) proposed by Dollar et al. (Dollar et al., 2012) is employed as the evaluation index and is combined with the Miss Rate-FPPI curve to assess the effectiveness of the algorithm. The horizontal axis of the Miss Rate-FPPI curve is the average number of False Positives Per Image (FPPI), and the vertical axis is the miss rate. The two quantities are defined as:
$$\mathrm{MissRate} = \frac{FN}{TP + FN}, \qquad \mathrm{FPPI} = \frac{FP}{\mathrm{Total}(\mathrm{images})},$$
where $FN$ denotes the number of false negatives, $TP$ the number of true positives, $FP$ the number of false positives, $TP + FN$ is the number of all positive samples, and $\mathrm{Total}(\mathrm{images})$ denotes the total number of predicted images. The lower the Miss Rate-FPPI curve and the smaller the MR value, the better the detection performance. To calculate MR, 9 points evenly spaced in logarithmic space are taken on the horizontal axis of the Miss Rate-FPPI curve within the range $[10^{-2}, 10^{0}]$, which gives 9 corresponding miss rates $m_1, m_2, \ldots, m_9$. Averaging these values in logarithmic space gives MR:
$$\mathrm{MR} = \exp\left(\frac{1}{n} \sum_{i=1}^{n} \ln m_i\right),$$
where $n = 9$.
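The computation of MR can be sketched in plain Python. The sampling rule is an assumption borrowed from common pedestrian-benchmark evaluation code rather than from this paper: for each reference FPPI the miss rate at the largest curve FPPI not exceeding it is used, a miss rate of 1 is used when the curve has no such point, and values are clamped from below before the logarithm. The function names and the toy curve are illustrative.

```python
import math

def miss_rate(tp, fn):
    return fn / (tp + fn)

def fppi(fp, num_images):
    return fp / num_images

def log_average_miss_rate(fppi_values, mr_values, n=9):
    """Geometric mean of miss rates at n log-spaced FPPI points in [1e-2, 1e0].

    fppi_values must be sorted in ascending order, with mr_values aligned.
    """
    refs = [10.0 ** (-2.0 + 2.0 * i / (n - 1)) for i in range(n)]
    sampled = []
    for ref in refs:
        current = 1.0                      # no curve point reached yet
        for x, m in zip(fppi_values, mr_values):
            if x <= ref:
                current = m                # last point at or below the reference
            else:
                break
        sampled.append(max(current, 1e-10))  # avoid log(0)
    return math.exp(sum(math.log(m) for m in sampled) / n)

# Toy curve: the miss rate falls as more false positives are tolerated.
curve_fppi = [0.005, 0.03, 0.2, 0.8]
curve_mr = [0.40, 0.25, 0.15, 0.10]
print(log_average_miss_rate(curve_fppi, curve_mr))
```

Averaging in logarithmic space means that MR is a geometric mean, so it is less dominated by the high miss rates at very low FPPI than an arithmetic mean would be.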

Implementation Details
The method is implemented with the PyTorch 1.7 deep learning framework. The experimental platform is the Ubuntu 18.04 operating system with a single NVIDIA GeForce RTX 2080Ti GPU. The network is optimized with Stochastic Gradient Descent (SGD) with a momentum of 0.9, a weight decay of $5 \times 10^{-4}$, and an initial learning rate of $1 \times 10^{-3}$. The model is trained for 5 epochs with a batch size of 4, and the learning rate decays to $1 \times 10^{-4}$ after the 3rd epoch.
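The optimisation settings can be summarised in a framework-free sketch. The step-wise schedule follows the description above; the SGD update uses the usual convention in which weight decay is added to the gradient before the momentum buffer is updated, which is an assumption about the training code. The function names are hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, decayed_lr=1e-4, decay_after=3):
    """Learning rate for a 1-indexed epoch: base value up to and including
    epoch `decay_after`, decayed value afterwards."""
    return base_lr if epoch <= decay_after else decayed_lr

def sgd_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=5e-4):
    """One scalar SGD update with momentum and L2 weight decay."""
    grad = grad + weight_decay * weight        # weight decay as an L2 gradient term
    velocity = momentum * velocity + grad      # momentum buffer
    return weight - lr * velocity, velocity

schedule = [learning_rate(epoch) for epoch in range(1, 6)]   # 5 epochs
print(schedule)

# Single illustrative update from weight 1.0 with gradient 0.5.
new_weight, new_velocity = sgd_step(1.0, 0.5, 0.0, lr=schedule[0])
print(new_weight, new_velocity)
```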

Construction of the Baseline
This work constructs a baseline architecture based on the ResNet-101 backbone network and the Faster RCNN detection head. Three sets of experiments are carried out in which the Color and Thermal features output by the backbone network are combined by simple feature fusion: feature cascade, element-by-element addition, and element-by-element multiplication. The fused feature is used as the input of the detection head. To ensure that the baseline itself is effective, the sanitized annotation (SA) is used to train and test it. The test results are shown in Table 1. The MR values obtained with feature cascade, element-by-element addition and element-by-element multiplication in the all-weather scene are 14.62%, 13.84% and 14.26%, respectively. Element-by-element addition thus gives the best performance, and we adopt it as the fusion method of the baseline.
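The three baseline fusion operations differ only in how the two feature maps are combined, as the following NumPy sketch shows. The channel-first layout, the random stand-in features and the function name `fuse` are assumptions for illustration.

```python
import numpy as np

def fuse(f_c, f_t, mode):
    """Combine two (C, H, W) feature maps with a simple fusion operation."""
    if mode == "cascade":
        return np.concatenate([f_c, f_t], axis=0)   # channels double to 2C
    if mode == "add":
        return f_c + f_t                            # element-by-element addition
    if mode == "multiply":
        return f_c * f_t                            # element-by-element product
    raise ValueError(f"unknown fusion mode: {mode}")

rng = np.random.default_rng(0)
f_c = rng.standard_normal((4, 6, 5))   # stand-in Color features
f_t = rng.standard_normal((4, 6, 5))   # stand-in Thermal features

shapes = {mode: fuse(f_c, f_t, mode).shape
          for mode in ("cascade", "add", "multiply")}
print(shapes)
```

Only the cascade changes the channel count, so a detection head fed with cascaded features has to accept twice as many input channels as one fed with the added or multiplied features.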

Performance comparison of different methods
The performance of the proposed method is compared with several state-of-the-art methods. The compared methods include a method based on hand-crafted features, ACF+T+THOG (Hwang et al., 2015), and deep learning-based methods, namely Halfway Fusion (Liu et al., 2016), CMT CNN (Xu et al., 2017), CIAN (Zhang et al., 2019b), IAF R-CNN (Li et al., 2019b), IATDNN+IAMSS (Guan et al., 2019), CS-RCNN (Zhang et al., 2020a), IT-MN (Zhuang et al., 2022), and DCRD (Liu et al., 2022). The model is trained separately with the 7601 pairs of Color and Thermal images of SA and the 8892 pairs of Color and Thermal images of PA, and is tested on the 2252 pairs of Color and Thermal images of the test set. Table 2 lists the experimental results. When the model is trained with SA, the MRs of the proposed method are 10.71%, 13.09% and 8.45% for the all-weather, daytime and nighttime scenes, respectively. Compared with CS-RCNN, the best-performing compared method, these values are 0.72% lower in the all-weather scene, 1.23% higher in the daytime scene, and 0.37% lower in the nighttime scene. PA (Color) and PA (Thermal) in Table 2 denote training with the Color annotation and the Thermal annotation of the paired annotation PA, respectively. Table 2 shows that the MRs of the proposed method in the all-weather scene are 11.11% with the Color annotation and 10.98% with the Thermal annotation, which are 2.53% and 3.70% lower, respectively, than those of the best-performing compared method. In addition, the results obtained with the two improved versions of the annotations differ, which indicates the importance of the annotations.

Analysis of Ablation Experiments
(1) Complementarity and importance of Color and Thermal features
This section compares the effect of different input sources on pedestrian detection performance. To exclude the influence of the proposed modules, three sets of experiments are conducted on the baseline: 1) Color and Thermal images as the input source (the two branches of the backbone network receive Color and Thermal images, respectively); 2) dual-stream Color images as the input source (Color images replace the Thermal images, so both branches receive Color images); 3) dual-stream Thermal images as the input source (Thermal images replace the Color images, so both branches receive Thermal images). The model is trained on the 7601 pairs of images of SA and tested on the 2252 pairs of Color and Thermal images of the test set. Table 3 shows the MRs of these three input sources for the all-weather, daytime, and nighttime scenes. With Color and Thermal images as input, the MRs are 13.84%, 15.35% and 12.48% for the all-weather, daytime and nighttime scenes, respectively. These values are 11.53%, 3.96% and 18.70% lower than those obtained with Color images alone, and 3.71%, 7.46% and 0.13% lower than those obtained with Thermal images alone. The results show that the detection network combining Color and Thermal features performs better, which indicates that both Color and Thermal features are important for pedestrian detection. Figure 4 shows the Miss Rate-FPPI curves of the detection results for the three input sources in the all-weather, daytime, and nighttime scenes (blue, red and green curves indicate dual-stream Thermal images, dual-stream Color images, and Color and Thermal images, respectively). The curve trends, together with the data in Table 3, show that Color images outperform Thermal images as the input source in the daytime scene, while the opposite holds in the nighttime scene, and that the combination of Color and Thermal images outperforms either single-modal input in both daytime and nighttime. This shows that the Color and Thermal modalities have complementary features and that fusing the two modal features improves pedestrian detection performance.
(2) Ablation experiments
In this section, ablation experiments are conducted to demonstrate the effectiveness of the proposed cascaded information enhancement module (CIEM) and cross-modal attention feature fusion module (CAFFM). The 7601 pairs of SA images are used to train the model, and the 2252 pairs of Color and Thermal images of the test set are used to test it.
Effectiveness of CIEM: CIEM is used to enhance the pedestrian features in Color and Thermal images to reduce the interference from the background. The experimental results are shown in Table  4. The MRs of baseline on SA are 13.84%, 15.35% and 12.48% for all-weather, daytime and nighttime scenes, respectively. When CIEM is additionally employed, the MRs are 11.21%, 13.15% and 9.07% for all-weather, daytime and nighttime scenes, respectively, which are reduced by 2.63%, 2.20% and 3.41% compared to the baseline, respectively. It is shown that the proposed CIEM effectively enhances the pedestrian features in both modalities, reduces the interference of background, and improves the pedestrian detection performance. Validity of CAFFM: CAFFM is used to effectively fuse Color and Thermal features. The experimental results are shown in Table 4. On the SA, when the baseline is used with CAFFM, the MRs are 11.68%, 13.81% and 9.50% in all-weather, daytime and nighttime scenes, respectively, which are reduced by 2.16%, 1.54% and 2.98% compared baseline, respectively. It shows that the proposed CAFFM effectively fuses the two modal features to achieve robust multispectral pedestrian detection.
Overall effectiveness: Both CIEM and CAFFM are added to the baseline. Experimental results show that the MRs are reduced by 3.13, 2.26 and 4.03 percentage points for all-weather, daytime and nighttime scenes, respectively, compared to the baseline, indicating the overall effectiveness of the proposed method. A closer look reveals that adding CIEM or CAFFM alone decreases the MR by 2.63 and 2.16 percentage points, respectively, in the all-weather scene, whereas the overall model reduces it by 3.13 percentage points. Because the combined reduction exceeds that of either module alone, the two modules are to some extent complementary; because it is smaller than the sum of the individual reductions (4.79 percentage points), their contributions also partly overlap. Figure 5 shows the Miss Rate-FPPI curves for the CIEM and CAFFM ablation studies in all-weather, daytime and nighttime scenes (blue, red, orange and green curves represent the baseline, baseline + CIEM, baseline + CAFFM and the overall model, respectively). The curves of each module and of the overall model lie below that of the baseline, which further supports the effectiveness of the method presented in this work.
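The arithmetic behind this ablation can be checked with a short Python sketch. The miss rates are the values quoted in the text for the SA annotations. The variable names are illustrative, and the full-model miss rates are derived here from the quoted reductions rather than quoted directly.

```python
# Miss rates (%) on SA as quoted in the text (Table 4).
baseline = {"all": 13.84, "day": 15.35, "night": 12.48}
with_ciem = {"all": 11.21, "day": 13.15, "night": 9.07}
with_caffm = {"all": 11.68, "day": 13.81, "night": 9.50}
# The text gives the full model only as reductions relative to the baseline.
overall_gain = {"all": 3.13, "day": 2.26, "night": 4.03}


def reduction(base, variant):
    """Absolute MR reduction in percentage points, per scene."""
    return {k: round(base[k] - variant[k], 2) for k in base}


ciem_gain = reduction(baseline, with_ciem)
caffm_gain = reduction(baseline, with_caffm)

# Derived (not quoted) miss rates of the full model.
overall_mr = {k: round(baseline[k] - overall_gain[k], 2) for k in baseline}

# Sum of the two single-module reductions, for comparison with the full model.
summed_gain = {k: round(ciem_gain[k] + caffm_gain[k], 2) for k in baseline}

for scene in baseline:
    print(scene, ciem_gain[scene], caffm_gain[scene],
          overall_gain[scene], summed_gain[scene])
```

In every scene the full-model reduction is larger than either single-module reduction but smaller than their sum, which matches a reading of the two modules as complementary with partly overlapping effects.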
Furthermore, to qualitatively analyze the effectiveness of the proposed CIEM and CAFFM, four pairs of Color and Thermal images (two pairs taken in the daytime and two pairs taken at night) are selected from the test set. The pedestrian detection results of the baseline and each proposed module are shown in Figure 6. The first row shows the labeled boxes for the Color and Thermal images, and the second to fifth rows show the labeled and predicted boxes for the baseline, baseline + CIEM, baseline + CAFFM, and the overall model, with green and red boxes representing the labeled and predicted boxes, respectively. It can be seen that the proposed method addresses the problem of missed pedestrian detections in complex environments and produces more accurate detection boxes. For example, in the second row, the baseline misses pedestrians in the first, third, and fourth pairs of images, whereas these missed detections are resolved once CIEM and CAFFM are added to the baseline, and the overall model produces more accurate pedestrian detection boxes.
Figure 6. Pedestrian detection results of the baseline and each module. (The first row shows the labeled boxes for the Color and Thermal images, and the second to fifth rows show the labeled and predicted boxes for the baseline, baseline + CIEM, baseline + CAFFM and the overall model, with green and red boxes representing the labeled and predicted boxes, respectively.)

CONCLUSION
In this paper, we propose a multispectral pedestrian detection algorithm consisting of a cascaded information enhancement module and a cross-modal attention feature fusion module. The proposed method improves the accuracy of pedestrian detection in multispectral images (Color and Thermal images) by effectively fusing the features of the two modalities and enhancing the pedestrian features. Specifically, on the one hand, a cascaded information enhancement module (CIEM) is designed to enhance single-modal features, enriching the pedestrian features and suppressing interference from background noise. On the other hand, unlike previous methods that simply concatenate Color and Thermal features directly, a cross-modal attention feature fusion module (CAFFM) is introduced to mine the features of the Color and Thermal modalities so that they complement each other; the complementary enhanced modal features are then used to construct global features. Extensive experiments have been conducted on two improved annotations of the public dataset KAIST. The experimental results show that the proposed method helps obtain more comprehensive pedestrian features and improves the accuracy of multispectral pedestrian detection.