- Qujing Power Supply Bureau, Yunnan Power Grid Co., Ltd., Kunming, China
Introduction: Existing image fusion methods primarily focus on obtaining high-quality features from source images to enhance the quality of the fused image, often overlooking the impact of improved image quality on downstream task performance.
Methods: To address this issue, this paper proposes a novel infrared and visible image fusion approach driven by multimodal large language models, aiming to improve the performance of pedestrian detection tasks. The proposed method fully considers how enhancing image quality can benefit pedestrian detection. By leveraging a multimodal large language model, we analyze the fused images based on user-provided questions related to improving pedestrian detection performance and generate suggestions for enhancing image quality. To better incorporate these suggestions, we design a Text-Driven Feature Harmonization (Text-DFH) module. Text-DFH refines the features produced by the fusion network according to the recommendations from the multimodal large language model, enabling the fused image to better meet the needs of pedestrian detection tasks.
Results: Compared with existing methods, the key advantage of our approach lies in utilizing the strong semantic understanding and scene analysis capabilities of multimodal large language models to provide precise guidance for improving fused image quality. As a result, our method enhances image quality while maintaining strong performance in pedestrian detection. Extensive qualitative and quantitative experiments on multiple public datasets validate the effectiveness and superiority of the proposed method.
Discussion: In addition to its effectiveness in infrared and visible image fusion, the method also demonstrates promising application potential in the field of nuclear medical imaging.
1 Introduction
Multimodal sensor technology has facilitated the application of multimodal images across various fields. Among them, infrared and visible images have been widely used in diverse tasks due to the complementary nature of the information they contain. Specifically, infrared images provide thermal radiation information of objects and are not affected by lighting conditions, but they lack detailed textures. In contrast, visible images capture rich texture details of the scene but are highly sensitive to lighting variations. Therefore, numerous methods [1–7] have focused on fusing infrared and visible images, aiming to integrate the complementary information from both modalities into a single, more informative fused image. This facilitates better decision-making and judgment in downstream tasks such as object detection [8–10] and semantic segmentation [11–14].
Current approaches that couple infrared-visible image fusion with downstream tasks can be broadly categorized into two types: independent optimization and joint optimization. Independent optimization methods first train a fusion network for infrared and visible images and then use the resulting fused images to train a downstream task network, as shown in Figure 1a. Accordingly, most independent optimization methods focus on improving fusion quality, for example, by designing new network architectures [15–19] or introducing specific constraints [20–23]. However, such approaches neglect the potential guidance from downstream tasks and fail to establish a deep connection between fusion and task performance, often leading to suboptimal results. Simply chaining the fusion and downstream networks makes it difficult for the fused image to specifically cater to the downstream task’s requirements. On the other hand, joint optimization methods use the downstream task network as a constraint to train the image fusion network, thereby forcing it to produce fused images that meet task-specific needs [24–28], as illustrated in Figure 1b. Nevertheless, the effectiveness of directly using high-level vision task supervision to guide fusion remains limited.
Recently, Multimodal Large Language Models (MLLMs) have gained popularity due to their strong capability in modeling data across different modalities, such as images and text. For instance, Text-IF [29] and TeRF [30] leverage large models to encode user instructions and guide various types of fusion tasks. However, these methods do not consider the possibility of using large language models to feed back the specific needs of high-level vision tasks to the image fusion process, which could further improve the quality of fused images.
To address this challenge, we propose a novel infrared and visible image fusion method driven by a Multimodal Large Language Model, aiming to simultaneously enhance fusion quality and pedestrian detection accuracy, as shown in Figure 1c. By leveraging the deep semantic understanding and scene analysis capabilities of MLLMs, we provide precise guidance for improving fused image quality while ensuring better pedestrian detection performance. Specifically, our method analyzes the fused images based on user-provided questions related to pedestrian detection, then generates optimization suggestions using feedback from the language model. To fully utilize these suggestions, we design a Text-Driven Feature Harmonization (Text-DFH) module, which refines the fusion network’s output features under the guidance of the MLLM, allowing the fused images to better meet the demands of pedestrian detection.
In summary, the main contributions of this paper are as follows:
(1) We are the first to leverage Multimodal Large Language Models to provide feedback on the quality of fused images based on the specific requirements of downstream tasks, thus further improving infrared and visible image fusion.
(2) We propose an effective Text-Driven Feature Harmonization (Text-DFH) module that enables text-based guidance to assist in enhancing image quality.
(3) Our proposed method achieves excellent performance in infrared and visible image fusion, nuclear medical imaging, and pedestrian detection across multiple datasets.
The remainder of this paper is organized as follows. Section 2 provides a brief overview of related work on multimodal large language models, infrared and visible image fusion, and pedestrian detection. Section 3 presents our proposed method in detail. Section 4 discusses the experimental results and analysis. Section 5 concludes the paper.
2 Related work
In this section, we first briefly introduce multimodal large language models, and then review existing infrared and visible image fusion methods.
2.1 Multimodal large language models
With the advent of the multimodal data fusion era, the capability of unimodal systems is no longer sufficient to handle complex real-world tasks. As a result, multimodal large language models (MLLMs) have been proposed to integrate information from multiple data sources, enabling more comprehensive and accurate representations. These models have demonstrated significant practical value across various domains, including natural language processing, vision tasks, and audio tasks. In the visual domain, MLLMs enhance the performance of tasks such as image classification, object detection, and image captioning by combining textual descriptions with visual instructions. For example, GPT-4V [31] and Gemini [32] integrate image content with natural language descriptions to produce more vivid and accurate annotations. NExT-GPT [33] and Sora [34] are at the forefront of multimodal video generation, producing rich and realistic content by learning from multimodal data. Moreover, VideoChat [35] and Video-LLaVA [36] demonstrate excellent capabilities in analyzing and understanding video content in intelligent video understanding scenarios.
In the field of image fusion, Text-IF [29] and MGFusion [37] use CLIP [38] to encode user requirement texts, guiding the model to fuse images. TeRF [30] utilizes LLaMA [39] to encode user instruction texts and generate prompts for guiding image fusion across different tasks. Although these methods employ MLLMs to tackle some challenges in image fusion, they do not consider the specific requirements of high-level downstream visual tasks for image fusion quality, which limits the application of infrared and visible image fusion in such tasks.
2.2 Infrared and visible image fusion
Conventional infrared and visible image fusion methods mainly focus on designing sophisticated feature extraction networks and fusion strategies to ensure the quality of the fused results. From the perspective of network design, these methods can be broadly categorized into CNN-based methods, CNN-Transformer hybrid methods, and GAN-based methods. CNN-based methods [40–45] typically apply convolution, activation, and pooling operations to extract features from the input images, then fuse and reconstruct the final result using the extracted features. However, since CNNs can only perceive local features within a limited receptive field, they struggle to capture long-range contextual information, limiting their representational capacity. In contrast, Transformers [46] are better at modeling long-range dependencies and are more suited for capturing global features in images. ViT [47] was the first to introduce Transformer architectures into computer vision, achieving promising results. Subsequently, to combine the respective strengths of CNNs and Transformers, hybrid methods have gained increasing attention in the image fusion domain. For instance, CGTF [48], SwinFusion [16], YDTR [17], and DATFuse [49] insert Transformer layers after CNN layers to jointly leverage local and global feature extraction. CDDFuse [50] and EMMA [51] adopt dual-branch architectures combining CNNs and Transformers to simultaneously extract features from the input images and integrate them for fusion.
GAN-based methods enhance the model’s feature extraction capabilities by introducing adversarial learning between generators and discriminators. Depending on the number of discriminators used, these methods can be classified into single-discriminator and dual-discriminator approaches. Single-discriminator methods [2, 52] tend to favor one modality over the other, potentially leading to information loss and reduced visual quality of the fusion results. To address this, dual-discriminator methods [53–56] are proposed to preserve important features from both source images simultaneously.
However, all of these methods primarily focus on designing effective feature extraction networks to produce high-quality fusion features and images. They overlook how fusion quality impacts downstream task performance, and fail to consider the potential feedback from downstream tasks that could help guide fusion more effectively.
2.3 Pedestrian detection
Pedestrian detection is a fundamental problem in computer vision with a wide range of applications. Cascade R-CNN [57] extends R-CNN [58] into a multi-stage framework, improving the ability to filter hard negative samples. Faster R-CNN [59] introduces a Region Proposal Network (RPN) that shares convolutional features with the detection network, making region proposals nearly cost-free. YOLO [60] reformulates object detection as a regression problem, allowing real-time inference directly on images through a convolutional neural network. SSD [61] uses multi-scale feature maps and predefined anchors for pedestrian detection, addressing YOLO’s limitations in detecting small objects. DETR [62] adopts a Transformer-based encoder-decoder architecture for object detection. BAS [63] learns to represent the whole foreground region by leveraging foreground guidance and domain constraints. CREAM [64] proposes a clustering-based method to enhance activation within target regions. Group R-CNN [65] builds instance groups to perform pedestrian detection from point annotations.
However, most pedestrian detection methods are designed for unimodal images, which often leads to degraded detection performance due to incomplete scene information. In this work, we perform pedestrian detection on fused infrared and visible images, and incorporate task-specific prompts generated by large language models. This not only improves the quality of the fused images but also enhances pedestrian detection performance.
3 Methods
3.1 Overview
As shown in Figure 2, the proposed method consists of two training stages. The first stage is dedicated to training the Fusion Network, enabling it to perform basic infrared and visible image fusion. In the second stage, the parameters of the pretrained fusion network are frozen, and a Text-Driven Feature Harmonization (Text-DFH) module is trained to refine the fusion results so that they better align with the requirements of pedestrian detection.

The fusion network is composed of three main components: an Infrared Image Feature Encoder (IR-Encoder), a Visible Image Feature Encoder (VI-Encoder), and a Fusion Feature Decoder (F-Decoder). The IR-Encoder and VI-Encoder extract features from the input infrared and visible images, respectively, while the F-Decoder reconstructs the fused image from the combined features.

The Text-DFH module adjusts the features extracted by the IR/VI-Encoders based on responses from a Multimodal Large Language Model (MLLM), ensuring that the resulting fused image better satisfies the needs of pedestrian detection. In this work, we adopt LLaVA [66] as the MLLM. LLaVA analyzes the unmodulated fused image and generates suggestions in response to user queries related to the pedestrian detection task (e.g., "To improve the accuracy of pedestrian detection, how can the quality of this image be enhanced?"). More examples of LLaVA answers are shown in Figure 3.
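In practice, the query to the MLLM is a single multimodal prompt that pairs the unmodulated fused image with the detection-oriented question. The snippet below is a minimal sketch of this step, assuming the Hugging Face transformers interface to LLaVA; the checkpoint name, prompt template, and decoding settings are illustrative and not necessarily those used in our experiments.

```python
# Sketch of querying LLaVA with the unmodulated fused image and a
# detection-oriented question. Checkpoint, prompt format, and generation
# settings are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

fused = Image.open("fused_unmodulated.png").convert("RGB")
question = ("To improve the accuracy of pedestrian detection, "
            "how can the quality of this image be enhanced?")
prompt = f"USER: <image>\n{question} ASSISTANT:"

inputs = processor(text=prompt, images=fused, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
suggestion = processor.decode(output_ids[0], skip_special_tokens=True)
print(suggestion)  # e.g., advice on brightness, contrast, and texture clarity
```

The returned text is then encoded into text features and passed to Text-DFH, as described in Section 3.3.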

Figure 2. Overall framework of the proposed method. We use the IR-Encoder and VI-Encoder to extract features from the infrared and visible images, respectively. To ensure that the fused output meets the requirements of the pedestrian detection task, we input both a question related to pedestrian detection (e.g., "To improve the accuracy of pedestrian detection, how can the quality of this image be enhanced?") and the unmodulated fused image into a Multimodal Large Language Model. The model provides suggestions for improving the quality of the fused image. Based on these suggestions, the Text-DFH module refines the output features of the fusion network, so that the final fusion result better aligns with the needs of the pedestrian detection task.
3.2 Feature extraction and fusion
In the first training stage, we train the fusion network to perform the basic task of infrared and visible image fusion. The fusion network primarily consists of three components: the IR-Encoder, VI-Encoder, and F-Decoder, each of which is composed of three feature extraction layers. Each layer is constructed by stacking a convolutional layer (kernel size =
To encourage the fused image to retain as much scene information from the source images as possible, we introduce an intensity loss
Here,
The edge loss
Here,
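For reference, a commonly used instantiation of these two losses in infrared and visible image fusion is sketched below; the exact operators, normalization, and weighting adopted in this work may differ, so the formulas should be read as a representative form rather than our precise definition.

```latex
% Representative forms of the intensity and edge losses (an assumption;
% the exact definitions used in this work may differ).
\mathcal{L}_{\mathrm{int}}  = \frac{1}{HW}\,\bigl\lVert I_f - \max\bigl(I_{ir},\, I_{vi}\bigr)\bigr\rVert_1, \qquad
\mathcal{L}_{\mathrm{edge}} = \frac{1}{HW}\,\bigl\lVert\, |\nabla I_f| - \max\bigl(|\nabla I_{ir}|,\, |\nabla I_{vi}|\bigr)\bigr\rVert_1
```

where I_f, I_ir, and I_vi denote the fused, infrared, and visible images, ∇ is a gradient (e.g., Sobel) operator, H × W is the image size, and the total fusion loss is typically a weighted sum of the two terms.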
3.3 Text-driven feature harmonization
In the second training stage, we freeze the parameters of the pretrained fusion network and focus on training the Text-DFH module to ensure that the fusion results meet the requirements of the pedestrian detection task. Text-DFH refines the features output by the IR/VI-Encoders in the fusion network based on the responses from the multimodal large language model, enabling the fused image to better align with the needs of pedestrian detection. As shown in Figure 4, Text-DFH mainly consists of a dual-branch Cross Attention (CA) module and three feature extraction layers. The dual-branch cross attention computes the cross-attention between the features extracted by the IR/VI-Encoders and the textual features, allowing the model to extract useful information from the text that can help improve pedestrian detection accuracy. Subsequently, the three feature extraction layers integrate this textual information with the image scene features to generate refined features. The structure of the CA module is similar to the Multi-Scale Attention (MSA) module used in DATFuse.
We input the infrared image
Here,
Here,
To ensure that the refined fused image
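To make the structure described above concrete, the following is a minimal PyTorch sketch of a dual-branch cross-attention block in the spirit of Text-DFH: the IR-branch and VI-branch features each attend to the text features, and three convolutional layers then merge the two modulated streams. The channel width, head count, text-feature dimension, and the use of nn.MultiheadAttention are illustrative assumptions rather than our exact implementation.

```python
# Sketch of a dual-branch cross-attention block: text features modulate the
# IR and VI encoder features separately, then three conv layers fuse them.
# All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class DualBranchCrossAttention(nn.Module):
    def __init__(self, img_dim=64, txt_dim=512, heads=4):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, img_dim)           # align text to image channels
        self.ca_ir = nn.MultiheadAttention(img_dim, heads, batch_first=True)
        self.ca_vi = nn.MultiheadAttention(img_dim, heads, batch_first=True)
        self.refine = nn.Sequential(                          # three feature extraction layers
            nn.Conv2d(2 * img_dim, img_dim, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(img_dim, img_dim, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(img_dim, img_dim, 3, padding=1),
        )

    def forward(self, f_ir, f_vi, f_txt):
        # f_ir, f_vi: (B, C, H, W) encoder features; f_txt: (B, L, txt_dim) text tokens
        b, c, h, w = f_ir.shape
        txt = self.txt_proj(f_txt)                            # (B, L, C)
        q_ir = f_ir.flatten(2).transpose(1, 2)                # (B, HW, C) queries, IR branch
        q_vi = f_vi.flatten(2).transpose(1, 2)                # (B, HW, C) queries, VI branch
        a_ir, _ = self.ca_ir(q_ir, txt, txt)                  # IR features attend to text
        a_vi, _ = self.ca_vi(q_vi, txt, txt)                  # VI features attend to text
        a_ir = a_ir.transpose(1, 2).reshape(b, c, h, w)
        a_vi = a_vi.transpose(1, 2).reshape(b, c, h, w)
        return self.refine(torch.cat([a_ir, a_vi], dim=1))    # refined features for F-Decoder
```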
4 Experiments
4.1 Datasets
The proposed method consists of two training stages. In the first and second training stages, we train the fusion network and the text-driven feature harmonization module, respectively, on the publicly available LLVIP dataset [67], in accordance with standard practices in the field [68–70]. Specifically, we randomly select 2,000 pairs of infrared and visible images from the LLVIP dataset as the training set. To enhance the diversity of training samples, we apply random flipping, random rotation, and random cropping as data augmentation techniques, as sketched below. For evaluation, we randomly select 200 pairs of infrared and visible images from each of the LLVIP,
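The paired augmentation can be sketched as follows; the crop size and rotation range are illustrative assumptions, and the key point is that identical random parameters are applied to both modalities so the infrared and visible images remain aligned.

```python
# Sketch of paired augmentation: the same random flip, rotation, and crop are
# applied to both the infrared and visible image. Crop size and rotation range
# are illustrative assumptions.
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def augment_pair(ir, vi, crop=256, max_deg=10):
    if random.random() < 0.5:                                  # random horizontal flip
        ir, vi = TF.hflip(ir), TF.hflip(vi)
    angle = random.uniform(-max_deg, max_deg)                  # random rotation (degrees)
    ir, vi = TF.rotate(ir, angle), TF.rotate(vi, angle)
    i, j, h, w = T.RandomCrop.get_params(ir, output_size=(crop, crop))
    return TF.crop(ir, i, j, h, w), TF.crop(vi, i, j, h, w)    # identical random crop
```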
4.2 Implementation details
The proposed method involves two training stages. In the first stage, the fusion network is trained. In the second stage, the parameters of the fusion network are frozen, and the text-driven feature harmonization module is trained. Both training stages use the Adam optimizer to update the network parameters, with a batch size of 16 and a learning rate of
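The two-stage optimization can be summarized as below; the learning-rate value is a placeholder rather than the setting used in our experiments, and fusion_net and text_dfh are hypothetical names for the two trainable modules.

```python
# Sketch of the two-stage optimization (Adam, batch size 16). The learning
# rate is a placeholder; fusion_net and text_dfh are hypothetical module names.
import torch

def make_optimizer(module, lr=1e-4):                 # lr is an illustrative placeholder
    return torch.optim.Adam(module.parameters(), lr=lr)

# Stage 1: train the fusion network end to end.
#   opt_stage1 = make_optimizer(fusion_net)
# Stage 2: freeze the fusion network and train only Text-DFH.
#   for p in fusion_net.parameters():
#       p.requires_grad = False
#   opt_stage2 = make_optimizer(text_dfh)
```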
4.3 Evaluation metrics
We adopt five commonly used objective evaluation metrics to quantitatively assess the fusion performance of the proposed method. These metrics include Edge Preservation Index
4.4 Comparison with state-of-the-art methods
In this study, we conduct a series of qualitative and quantitative comparisons between the proposed method and eight state-of-the-art (SOTA) methods to verify its superiority in both fusion performance and pedestrian detection performance. These methods include AUIF [78], DATFuse [49], IVFWSR [79], LRRNet [80], MLFusion [81], TIMFusion [82], SwinFusion [16], and Text-IF [29]. The comparative experiments are divided into two distinct groups: In the first group, we compare the fusion performance of our method with that of the SOTA methods. In the second group, we freeze the fusion networks of the compared methods and retrain their pedestrian detection networks using the corresponding fused results. The retrained detection networks are then used to perform pedestrian detection on the fused images. This setup is designed to demonstrate that our proposed method can achieve strong pedestrian detection performance without requiring retraining of the detection network.
4.4.1 Fusion performance comparison
We conduct both quantitative and qualitative comparisons of the proposed method against AUIF, DATFuse, IVFWSR, LRRNet, MLFusion, TIMFusion, SwinFusion, and Text-IF on the LLVIP,

Figure 5. Visual comparison with SOTA methods. The top two rows, middle two rows, and bottom two rows of images are from the LLVIP,

Table 1. Quantitative results on the LLVIP dataset. The best and second-best values for each evaluation metric are highlighted in red and blue, respectively.

Table 2. Quantitative results on the

Table 3. Quantitative results on the MSRS dataset. The best and second-best values for each evaluation metric are highlighted in red and blue, respectively.
4.4.2 Pedestrian detection performance comparison
A common practice to improve the performance of fusion networks in downstream tasks is to freeze the parameters of the fusion network and retrain the downstream task network based on the generated fused results. Such approaches are referred to as “retraining methods.” To evaluate the effectiveness of our proposed method in pedestrian detection, we perform both quantitative and qualitative comparisons against these retraining methods. As shown in Figure 6, the pedestrian detection results of other methods often suffer from issues such as bounding boxes that fail to fully cover the pedestrians’ bodies, or boxes that include large amounts of irrelevant background, indicating insufficient detection accuracy. In contrast, the detection results produced by our method show significantly fewer irrelevant regions within the bounding boxes and more accurate box placement. This advantage is also clearly reflected in the quantitative results, as shown in Table 4. Our method achieves the highest scores in metrics

Figure 6. Qualitative comparison of pedestrian detection performance with “retraining methods.” The first and second columns show the infrared and visible source images, while the third to ninth columns display the pedestrian detection results of the compared methods.

Table 4. Quantitative comparison of pedestrian detection performance with “retraining methods.” The best and second-best values for each evaluation metric are highlighted in red and blue, respectively.
4.4.3 Analysis of application potential in medical image fusion
Furthermore, to validate the effectiveness and application potential of the proposed method in the field of nuclear medical imaging, we deploy it in a medical image fusion task. Specifically, we conduct experiments on the BraTS2020 [83] dataset and perform both qualitative and quantitative analyses of the fusion results. As shown in Figure 7, compared with state-of-the-art methods such as ALMFnet [84, 85] and RMR-Fusion [86], the proposed method preserves more texture details and salient information in the fused medical images. As reported in Table 5, our method ranks first or second across most evaluation metrics. These results demonstrate the promising potential of the proposed method for applications in nuclear medical imaging.

Table 5. Quantitative analysis results on the medical image fusion task. The best and second-best values for each evaluation metric are highlighted in red and blue, respectively.
4.5 Ablation study
The proposed method mainly consists of two core components: the Multimodal Large Language Model (MLLM) and the Text-Driven Feature Harmonization (Text-DFH) module. Within Text-DFH, both the text-guided cross-attention and the image-guided cross-attention play key roles. To validate the effectiveness of these components, we conduct a series of ablation experiments on the LLVIP dataset.
4.5.1 Effectiveness of the multimodal large language model
We utilize the MLLM to analyze the fused images based on user-provided questions related to pedestrian detection performance and generate suggestions for improving image quality. To assess the contribution of the MLLM, we remove it and replace its feedback with a fixed text prompt: “Brighter brightness, higher contrast, and clearer texture details.” As shown in Figure 8, the fusion results from the ablation model without the MLLM are noticeably inferior in visual quality compared to the full model. To further validate this, we perform quantitative analysis as presented in Table 6. The results show that the full model outperforms the ablation model on all evaluation metrics. Additionally, we analyze the performance of pedestrian detection, as shown in Table 7 and Figure 9. Both the quantitative and qualitative results indicate that the fused images produced by the ablation model without the MLLM lead to poorer detection performance. In contrast, the full model achieves better pedestrian detection results. In summary, both qualitative and quantitative analyses confirm the effectiveness of the Multimodal Large Language Model in our method.

Figure 8. Qualitative comparison of fusion performance across different ablation models. The first and second columns show the infrared and visible source images, while the third to seventh columns display the fusion results obtained under different ablation settings.

Table 6. Quantitative comparison of fusion performance across different ablation models. The best and second-best values for each evaluation metric are highlighted in red and blue, respectively.

Table 7. Quantitative comparison of pedestrian detection performance across different ablation models. The best and second-best values for each evaluation metric are highlighted in red and blue, respectively.

Figure 9. Qualitative comparison of pedestrian detection performance across different ablation models. The first and second columns show the infrared and visible source images, while the third to seventh columns display the pedestrian detection results under different ablation settings.
4.5.2 Effectiveness of Text-DFH
Text-DFH refines the output features of the fusion network based on suggestions from the multimodal large language model, enabling the fused image to better meet the requirements of the pedestrian detection task. To verify the effectiveness of Text-DFH, we remove it from the architecture and instead concatenate the text features with the image features to be refined along the channel dimension. The combined features are then processed by CNNs to obtain the refined output. We conduct both quantitative and qualitative analyses of the fusion performance of the model without Text-DFH, as shown in Table 6 and Figure 8. As observed, the ablation model without Text-DFH performs worse than the full model across multiple evaluation metrics, and the visual quality of the fused images is also inferior. In addition, we evaluate pedestrian detection performance both quantitatively and qualitatively, as presented in Table 7 and Figure 9. The full model achieves higher scores compared to the ablation model without Text-DFH. In summary, a series of experiments clearly demonstrate the effectiveness of the Text-DFH module.
4.5.3 Effectiveness of dual-branch cross attention
In the Text-DFH module, we refine image features using text features through a dual-branch cross attention mechanism. To verify its effectiveness, we remove the cross attention from each branch individually, leaving only a single branch to refine the image features. These variants are referred to as CA1 and CA2, respectively. From the quantitative and qualitative results on fusion performance, it is evident that removing either branch of the cross attention leads to a significant drop in performance, as shown in Table 6 and Figure 8. Furthermore, to assess the impact of dual-branch cross attention on pedestrian detection performance, we conduct both quantitative and qualitative analyses. The results demonstrate that pedestrian detection performance is optimal only when both branches of the cross attention are used to refine the image features, as shown in Table 7 and Figure 9. In conclusion, the above experiments confirm the effectiveness of the dual-branch cross attention mechanism.
5 Conclusion
To address the limitation of existing methods that primarily focus on improving fused image quality through network design—while overlooking the potential benefits of enhanced image quality for pedestrian detection—we propose a multimodal large language model (MLLM)-driven infrared and visible image fusion method. This method not only aims to improve the quality of the fused images but also emphasizes enhancing their performance in pedestrian detection tasks. By leveraging a multimodal large language model, we analyze the fused images based on user-provided questions related to improving pedestrian detection performance and generate suggestions for enhancing image quality. To fully utilize the guidance provided by the MLLM, we design a Text-Driven Feature Harmonization (Text-DFH) module, which refines the features output by the fusion network according to the textual suggestions. This ensures improved fusion quality while maintaining strong performance in pedestrian detection. In addition, the proposed method also demonstrates significant application potential in the field of nuclear medical imaging. However, under extreme weather conditions such as rain, fog, and snow, the fusion performance of the current method may degrade, and a similar drop may occur when the method is applied to other types of source images [87–90]. In future work, we plan to extend this research to develop an infrared and visible image fusion framework tailored for extreme weather scenarios, striving to maintain robust downstream task performance even in challenging environments.
Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
Author contributions
KW: Project administration, Writing – original draft, Writing – review and editing, Investigation, Conceptualization, Methodology. DH: Formal Analysis, Writing – review and editing, Data curation, Validation. YaC: Visualization, Supervision, Writing – review and editing, Resources. YkC: Funding acquisition, Project administration, Supervision, Writing – review and editing, Writing – original draft. YL: Validation, Writing – review and editing, Visualization, Formal Analysis. ZJ: Writing – review and editing, Investigation, Data curation, Resources. FC: Formal Analysis, Writing – review and editing, Data curation. WL: Writing – review and editing, Resources, Visualization.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by the Science and Technology Project of China Southern Power Grid Co., Ltd. (No. YNKJXM20240052).
Conflict of interest
Authors KW, DH, YaC, YkC, YL, ZJ, FC, and WL were employed by Yunnan Power Grid Co., Ltd.
Generative AI statement
The author(s) declare that Generative AI was used in the creation of this manuscript. AI was only used to polish the paper.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1. Li H, Wu X-J, Kittler J. Rfn-nest: an end-to-end residual fusion network for infrared and visible images. Inf Fusion (2021) 73:72–86. doi:10.1016/j.inffus.2021.02.023
2. Ma J, Yu W, Liang P, Li C, Jiang J. Fusiongan: a generative adversarial network for infrared and visible image fusion. Inf Fusion (2019) 48:11–26. doi:10.1016/j.inffus.2018.09.004
3. Tang L, Yuan J, Zhang H, Jiang X, Ma J. Piafusion: a progressive infrared and visible image fusion network based on illumination aware. Inf Fusion (2022) 83-84:79–92. doi:10.1016/j.inffus.2022.03.007
4. Xu M, Tang L, Zhang H, Ma J. Infrared and visible image fusion via parallel scene and texture learning. Pattern Recognition (2022) 132:108929. doi:10.1016/j.patcog.2022.108929
5. Du K, Li H, Zhang Y, Yu Z. Chitnet: a complementary to harmonious information transfer network for infrared and visible image fusion. IEEE Trans Instrumentation Meas (2025) 74:1–17. doi:10.1109/TIM.2025.3527523
6. Shi Y, Liu Y, Cheng J, Wang ZJ, Chen X. Vdmufusion: a versatile diffusion model-based unsupervised framework for image fusion. IEEE Trans Image Process (2025) 34:441–54. doi:10.1109/tip.2024.3512365
7. Lv G, Sima C, Gao Y, Dong A, Ma G, Cheng J. Sigfusion: semantic information-guided infrared and visible image fusion. IEEE Trans Instrumentation Meas (2024) 73:1–18. doi:10.1109/tim.2024.3457951
8. Fu H, Wang S, Duan P, Xiao C, Dian R, Li S, et al. Lraf-net: long-range attention fusion network for visible–infrared object detection. IEEE Trans Neural Networks Learn Syst (2024) 35:13232–45. doi:10.1109/tnnls.2023.3266452
9. Li Y, Pang Y, Cao J, Shen J, Shao L. Improving single shot object detection with feature scale unmixing. IEEE Trans Image Process (2021) 30:2708–21. doi:10.1109/tip.2020.3048630
10. Zhao Z-Q, Zheng P, Xu S-T, Wu X. Object detection with deep learning: a review. IEEE Trans Neural Networks Learn Syst (2019) 30:3212–32. doi:10.1109/tnnls.2018.2876865
11. Liu Y, Zeng J, Tao X, Fang G. Rethinking self-supervised semantic segmentation: achieving end-to-end segmentation. IEEE Trans Pattern Anal Machine Intelligence (2024) 46:10036–46. doi:10.1109/tpami.2024.3432326
12. Wu L, Fang L, He X, He M, Ma J, Zhong Z. Querying labeled for unlabeled: cross-image semantic consistency guided semi-supervised semantic segmentation. IEEE Trans Pattern Anal Machine Intelligence (2023) 45:8827–44. doi:10.1109/TPAMI.2022.3233584
13. Zhao S, Zhang Q. A feature divide-and-conquer network for rgb-t semantic segmentation. IEEE Trans Circuits Syst Video Technology (2023) 33:2892–905. doi:10.1109/tcsvt.2022.3229359
14. Zhao S, Liu Y, Jiao Q, Zhang Q, Han J. Mitigating modality discrepancies for rgb-t semantic segmentation. IEEE Trans Neural Networks Learn Syst (2024) 35:9380–94. doi:10.1109/tnnls.2022.3233089
15. Li H, Wu X. Densefuse: a fusion approach to infrared and visible images. IEEE Trans Image Process (2019) 28:2614–23. doi:10.1109/tip.2018.2887342
16. Ma J, Tang L, Fan F, Huang J, Mei X, Ma Y. Swinfusion: cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA J Automatica Sinica (2022) 9:1200–17. doi:10.1109/jas.2022.105686
17. Tang W, He F, Liu Y. Ydtr: infrared and visible image fusion via y-shape dynamic transformer. IEEE Trans Multimedia (2023) 25:5413–28. doi:10.1109/tmm.2022.3192661
18. Li H, Qiu H, Yu Z, Zhang Y. Infrared and visible image fusion scheme based on nsct and low-level visual features. Infrared Phys and Technology (2016) 76:174–84. doi:10.1016/j.infrared.2016.02.005
19. Li H, Wang Y, Yang Z, Wang R, Li X, Tao D. Discriminative dictionary learning-based multiple component decomposition for detail-preserving noisy image fusion. IEEE Trans Instrumentation Meas (2020) 69:1082–102. doi:10.1109/tim.2019.2912239
20. Hou R, Zhou D, Nie R, Liu D, Xiong L, Guo Y, et al. Vif-net: an unsupervised framework for infrared and visible image fusion. IEEE Trans Comput Imaging (2020) 6:640–51. doi:10.1109/tci.2020.2965304
21. Ma J, Zhang H, Shao Z, Liang P, Xu H. Ganmcc: a generative adversarial network with multiclassification constraints for infrared and visible image fusion. IEEE Trans Instrumentation Meas (2021) 70:1–14. doi:10.1109/tim.2020.3038013
22. Xu H, Wang X, Ma J. Drf: disentangled representation for visible and infrared image fusion. IEEE Trans Instrumentation Meas (2021) 70:1–13. doi:10.1109/tim.2021.3056645
23. Zhang Y, Yang M, Li N, Yu Z. Analysis-synthesis dictionary pair learning and patch saliency measure for image fusion. Signal Process. (2020) 167:107327. doi:10.1016/j.sigpro.2019.107327
24. Tang L, Zhang H, Xu H, Ma J. Rethinking the necessity of image fusion in high-level vision tasks: a practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity. Inf Fusion (2023) 99:101870. doi:10.1016/j.inffus.2023.101870
25. Liu J, Liu Z, Wu G, Ma L, Liu R, Zhong W, et al. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (2023). p. 8115–24.
26. Zhang H, Zuo X, Jiang J, Guo C, Ma J. Mrfs: mutually reinforcing image fusion and segmentation. In: 2024 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (2024). p. 26964–73.
27. Wang D, Liu J, Liu R, Fan X. An interactively reinforced paradigm for joint infrared-visible image fusion and saliency object detection. Inf Fusion (2023) 98:101828. doi:10.1016/j.inffus.2023.101828
28. Yang Z, Zhang Y, Li H, Liu Y. Instruction-driven fusion of infrared–visible images: tailoring for diverse downstream tasks. Inf Fusion (2025) 121:103148. doi:10.1016/j.inffus.2025.103148
29. Yi X, Xu H, Zhang H, Tang L, Ma J. Text-if: leveraging semantic text guidance for degradation-aware and interactive image fusion. In: 2024 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (2024). p. 27016–25.
30. Wang H, Zhang H, Yi X, Xiang X, Fang L, Ma J. Terf: text-driven and region-aware flexible visible and infrared image fusion. In: Proceedings of the 32nd ACM international conference on multimedia (2024). p. 935–44.
31. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
32. Team G, Anil R, Borgeaud S, Alayrac J-B, Yu J, Soricut R, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
33. Wu S, Fei H, Qu L, Ji W, Chua T-S. Next-gpt: any-to-any multimodal llm. In: Forty-first international conference on machine learning (2024).
34. Liu Y, Zhang K, Li Y, Yan Z, Gao C, Chen R, et al. Sora: a review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402 (2024).
35. Li K, He Y, Wang Y, Li Y, Wang W, Luo P, et al. Videochat: chat-centric video understanding. arXiv preprint arXiv:2305.06355 (2023).
36. Lin B, Ye Y, Zhu B, Cui J, Ning M, Jin P, et al. Video-llava: learning united visual representation by alignment before projection. arXiv preprint arXiv:2311 (2023).
37. Yang Z, Li Y, Tang X, Xie M. Mgfusion: a multimodal large language model-guided information perception for infrared and visible image fusion. Front Neurorobotics (2024) 18:1521603. doi:10.3389/fnbot.2024.1521603
38. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. In: International conference on machine learning (PmLR) (2021). p. 8748–63.
39. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, Lacroix T, et al. Llama: open and efficient foundation language models. arXiv preprint arXiv:2302 (2023).
40. Xu H, Ma J, Jiang J, Guo X, Ling H. U2fusion: a unified unsupervised image fusion network. IEEE Trans Pattern Anal Machine Intelligence (2022) 44:502–18. doi:10.1109/tpami.2020.3012548
41. Zhang Y, Liu Y, Sun P, Yan H, Zhao X, Zhang L. Ifcnn: a general image fusion framework based on convolutional neural network. Inf Fusion (2020) 54:99–118. doi:10.1016/j.inffus.2019.07.011
42. Zhang H, Ma J. Sdnet: a versatile squeeze-and-decomposition network for real-time image fusion. Int J Computer Vis (2021) 129:2761–85. doi:10.1007/s11263-021-01501-8
43. Li H, Yang Z, Zhang Y, Jia W, Yu Z, Liu Y. Mulfs-cap: multimodal fusion-supervised cross-modality alignment perception for unregistered infrared-visible image fusion. IEEE Trans Pattern Anal Machine Intelligence (2025) 47:3673–90. doi:10.1109/TPAMI.2025.3535617
44. Li H, Zhao J, Li J, Yu Z, Lu G. Feature dynamic alignment and refinement for infrared–visible image fusion:translation robust fusion. Inf Fusion (2023) 95:26–41. doi:10.1016/j.inffus.2023.02.011
45. Xiao W, Zhang Y, Wang H, Li F, Jin H. Heterogeneous knowledge distillation for simultaneous infrared-visible image fusion and super-resolution. IEEE Trans Instrumentation Meas (2022) 71:1–15. doi:10.1109/tim.2022.3149101
46. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst (2017) 30.
47. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
48. Li J, Zhu J, Li C, Chen X, Yang B. Cgtf: convolution-guided transformer for infrared and visible image fusion. IEEE Trans Instrumentation Meas (2022) 71:1–14. doi:10.1109/tim.2022.3175055
49. Tang W, He F, Liu Y, Duan Y, Si T. Datfuse: infrared and visible image fusion via dual attention transformer. IEEE Trans Circuits Syst Video Technology (2023) 33:3159–72. doi:10.1109/tcsvt.2023.3234340
50. Zhao Z, Bai H, Zhang J, Zhang Y, Xu S, Lin Z, et al. Cddfuse: correlation-driven dual-branch feature decomposition for multi-modality image fusion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (2023). p. 5906–16.
51. Zhao Z, Bai H, Zhang J, Zhang Y, Zhang K, Xu S, et al. Equivariant multi-modality image fusion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2024). p. 25912–21.
52. Ma J, Liang P, Yu W, Chen C, Guo X, Wu J, et al. Infrared and visible image fusion via detail preserving adversarial learning. Inf Fusion (2020) 54:85–98. doi:10.1016/j.inffus.2019.07.005
53. Ma J, Xu H, Jiang J, Mei X, Zhang X. Ddcgan: a dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Trans Image Process (2020) 29:4980–95. doi:10.1109/tip.2020.2977573
54. Li J, Huo H, Li C, Wang R, Feng Q. Attentionfgan: infrared and visible image fusion using attention-based generative adversarial networks. IEEE Trans Multimedia (2021) 23:1383–96. doi:10.1109/tmm.2020.2997127
55. Zhou H, Wu W, Zhang Y, Ma J, Ling H. Semantic-supervised infrared and visible image fusion via a dual-discriminator generative adversarial network. IEEE Trans Multimedia (2021) 25:635–48. doi:10.1109/tmm.2021.3129609
56. Zhang H, Yuan J, Tian X, Ma J. Gan-fm: infrared and visible image fusion using gan with full-scale skip connection and dual markovian discriminators. IEEE Trans Comput Imaging (2021) 7:1134–47. doi:10.1109/tci.2021.3119954
57. Cai Z, Vasconcelos N. Cascade r-cnn: high quality object detection and instance segmentation. IEEE Trans pattern Anal machine intelligence (2019) 43:1483–98. doi:10.1109/tpami.2019.2956516
58. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (2014). p. 580–7.
59. Ren S, He K, Girshick R, Sun J. Faster r-cnn: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst (2015) 28.
60. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (2016). p. 779–88.
61. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, et al. Ssd: single shot multibox detector. In: Computer vision–ECCV 2016: 14th European conference, Amsterdam, The Netherlands, october 11–14, 2016, proceedings, Part I 14. Springer (2016). p. 21–37.
62. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: European conference on computer vision. Springer (2020). p. 213–29.
63. Wu P, Zhai W, Cao Y. Background activation suppression for weakly supervised object localization. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR). IEEE (2022). p. 14228–37.
64. Xu J, Hou J, Zhang Y, Feng R, Zhao R-W, Zhang T, et al. Cream: weakly supervised object localization via class re-activation mapping. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2022). p. 9437–46.
65. Zhang S, Yu Z, Liu L, Wang X, Zhou A, Chen K. Group r-cnn for weakly semi-supervised object detection with points. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2022). p. 9417–26.
66. Liu H, Li C, Wu Q, Lee YJ. Visual instruction tuning. Adv Neural Inf Process Syst (2023) 36:34892–916.
67. Jia X, Zhu C, Li M, Tang W, Zhou W. Llvip: a visible-infrared paired dataset for low-light vision. In: Proceedings of the IEEE/CVF international conference on computer vision workshops (ICCVW) (2021). p. 3496–504.
68. Tang L, Huang H, Zhang Y, Qi G, Yu Z. Structure-embedded ghosting artifact suppression network for high dynamic range image reconstruction. Knowledge-Based Syst (2023) 263:110278. doi:10.1016/j.knosys.2023.110278
69. Liu Y, Shi Y, Mu F, Cheng J, Chen X. Glioma segmentation-oriented multi-modal mr image fusion with adversarial learning. IEEE/CAA J Automatica Sinica (2022) 9:1528–31. doi:10.1109/jas.2022.105770
70. Xie M, Wang J, Zhang Y. A unified framework for damaged image fusion and completion based on low-rank and sparse decomposition. Signal Processing: Image Commun (2021) 29:116400. doi:10.1016/j.image.2021.116400
71. Liu J, Fan X, Huang Z, Wu G, Liu R, Zhong W, et al. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) (2022). p. 5802–11.
72. Liu Y, Qi Z, Cheng J, Chen X. Rethinking the effectiveness of objective evaluation metrics in multi-focus image fusion: a statistic-based approach. IEEE Trans Pattern Anal Machine Intelligence (2024) 46:5806–19. doi:10.1109/tpami.2024.3367905
73. Xydeas CS, Petrovic V. Objective image fusion performance measure. Electronics Lett (2000) 36:308–9.
74. Chen H, Varshney PK. A human perception inspired quality metric for image fusion based on regional information. Inf Fusion (2007) 8:193–207. doi:10.1016/j.inffus.2005.10.001
75. Wang Z, Bovik A, Sheikh H, Simoncelli E. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process (2004) 13:600–12. doi:10.1109/tip.2003.819861
76. Liu J, Wu G, Liu Z, Wang D, Jiang Z, Ma L, et al. Infrared and visible image fusion: from data compatibility to task adaption. IEEE Trans Pattern Anal Machine Intelligence (2024) 1–20. doi:10.1109/TPAMI.2024.3521416
77. Zhang X, Ye P, Xiao G. Vifb: a visible and infrared image fusion benchmark. In: 2020 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW) (2020). p. 468–78.
78. Zhao Z, Xu S, Zhang J, Liang C, Zhang C, Liu J. Efficient and model-based infrared and visible image fusion via algorithm unrolling. IEEE Trans Circuits Syst Video Technology (2021) 32:1186–96. doi:10.1109/TCSVT.2021.3075745
79. Li H, Liu J, Zhang Y, Liu Y. A deep learning framework for infrared and visible image fusion without strict registration. Int J Computer Vis (2024) 132:1625–44. doi:10.1007/s11263-023-01948-x
80. Li H, Xu T, Wu X-J, Lu J, Kittler J. Lrrnet: a novel representation learning guided fusion network for infrared and visible images. IEEE Trans Pattern Anal Machine Intelligence (2023) 45:11040–52. doi:10.1109/tpami.2023.3268209
81. Li H, Cen Y, Liu Y, Chen X, Yu Z. Different input resolutions and arbitrary output resolution: a meta learning-based deep framework for infrared and visible image fusion. IEEE Trans Image Process (2021) 30:4070–83. doi:10.1109/tip.2021.3069339
82. Liu R, Liu Z, Liu J, Fan X, Luo Z. A task-guided, implicitly-searched and meta-initialized deep model for image fusion. IEEE Trans Pattern Anal Machine Intelligence (2024) 46:6594–609. doi:10.1109/tpami.2024.3382308
83. Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, et al. The multimodal brain tumor image segmentation benchmark (brats). IEEE Trans Med Imaging (2015) 34:1993–2024. doi:10.1109/tmi.2014.2377694
84. Mu P, Wu G, Liu J, Zhang Y, Fan X, Liu R. Learning to search a lightweight generalized network for medical image fusion. IEEE Trans Circuits Syst Video Technology (2024) 34:5921–34. doi:10.1109/tcsvt.2023.3342808
85. Zhao Z, Bai H, Zhang J, Zhang Y, Zhang K, Xu S, et al. Equivariant multi-modality image fusion. In: 2024 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (2024). p. 25912–21.
86. Zhang H, Zuo X, Zhou H, Lu T, Ma J. A robust mutual-reinforcing framework for 3d multi-modal medical image fusion based on visual-semantic consistency. Proc AAAI Conf Artif Intelligence (2024) 38:7087–95. doi:10.1609/aaai.v38i7.28536
87. Zhang Y, Yang X, Li H, Xie M, Yu Z. Dcpnet: a dual-task collaborative promotion network for pansharpening. IEEE Trans Geosci Remote Sensing (2024) 62:1–16. doi:10.1109/tgrs.2024.3377635
88. Li H, Yang Z, Zhang Y, Tao D, Yu Z. Single-image hdr reconstruction assisted ghost suppression and detail preservation network for multi-exposure hdr imaging. IEEE Trans Comput Imaging (2024) 10:429–45. doi:10.1109/tci.2024.3369396
89. Li H, Wang D, Huang Y, Zhang Y, Yu Z. Generation and recombination for multifocus image fusion with free number of inputs. IEEE Trans Circuits Syst Video Technology (2024) 34:6009–23. doi:10.1109/TCSVT.2023.3344222
Keywords: infrared and visible image fusion, pedestrian detection, multimodal large language models, text-guided, model fine-tuning
Citation: Wang K, Hu D, Cheng Y, Che Y, Li Y, Jiang Z, Chen F and Li W (2025) Infrared and visible image fusion driven by multimodal large language models. Front. Phys. 13:1599937. doi: 10.3389/fphy.2025.1599937
Received: 25 March 2025; Accepted: 06 May 2025;
Published: 22 May 2025.
Edited by:
Yu Liu, Hefei University of Technology, China

Copyright © 2025 Wang, Hu, Cheng, Che, Li, Jiang, Chen and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Yukui Che, 454983185@qq.com