ORIGINAL RESEARCH article

Front. Phys.

Sec. Radiation Detectors and Imaging

Volume 13 - 2025 | doi: 10.3389/fphy.2025.1599937

This article is part of the Research Topic: Multi-Sensor Imaging and Fusion: Methods, Evaluations, and Applications, Volume III

Infrared and Visible Image Fusion Driven by Multimodal Large Language Models

Provisionally accepted
Ke Wang, Dengshu Hu, Yuan Cheng, Yukui Che*, Yuelin Li, Zhiwei Jiang, Fengxian Chen, Wenjuan Li
  • Qujing Power Supply Bureau, Yunnan Power Grid Co., Ltd., Kunming, China

The final, formatted version of the article will be published soon.

Existing image fusion methods focus primarily on extracting high-quality features from the source images to improve the quality of the fused image, often overlooking how that improved quality affects downstream task performance. To address this issue, this paper proposes a novel infrared and visible image fusion approach driven by multimodal large language models (MLLMs), aiming to improve pedestrian detection performance. The proposed method explicitly considers how enhancing image quality can benefit pedestrian detection. Leveraging an MLLM, we analyze the fused images against user-provided questions about improving pedestrian detection performance and generate suggestions for enhancing image quality. To incorporate these suggestions, we design a Text-Driven Feature Harmonization (Text-DFH) module. Text-DFH refines the features produced by the fusion network according to the MLLM's recommendations, enabling the fused image to better meet the needs of pedestrian detection. Compared with existing methods, the key advantage of our approach lies in exploiting the strong semantic understanding and scene analysis capabilities of MLLMs to provide precise guidance for improving fused image quality. As a result, our method enhances image quality while maintaining strong pedestrian detection performance. Extensive qualitative and quantitative experiments on multiple public datasets validate the effectiveness and superiority of the proposed method. Beyond infrared and visible image fusion, the method also shows promising application potential in nuclear medical imaging.
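The abstract does not specify the internals of Text-DFH. As a rough illustration of how text-conditioned feature refinement of this kind is commonly built, the following PyTorch sketch modulates fusion-network features with an embedding of the MLLM's suggestion, FiLM-style. The class name TextDFH, the dimensions, and the FiLM-style design are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TextDFH(nn.Module):
    """Hypothetical sketch of a Text-Driven Feature Harmonization module.

    Assumes the MLLM's image-quality suggestion has already been encoded
    into a fixed-size text embedding (e.g., by a frozen text encoder).
    The embedding predicts per-channel scale and shift parameters that
    refine the fusion network's feature maps (FiLM-style conditioning).
    """

    def __init__(self, feat_channels: int, text_dim: int):
        super().__init__()
        # Map the text embedding to per-channel scale (gamma) and shift (beta).
        self.to_gamma_beta = nn.Linear(text_dim, feat_channels * 2)
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) fusion features; text_emb: (B, text_dim).
        gamma, beta = self.to_gamma_beta(text_emb).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        # Harmonize: normalize, then modulate with text-conditioned affine params.
        return self.norm(feats) * (1 + gamma) + beta

if __name__ == "__main__":
    # Dummy tensors standing in for fusion features and an encoded suggestion.
    module = TextDFH(feat_channels=64, text_dim=512)
    feats = torch.randn(2, 64, 128, 128)
    text_emb = torch.randn(2, 512)
    refined = module(feats, text_emb)
    print(refined.shape)  # torch.Size([2, 64, 128, 128])
```

In such a design, the refinement stays lightweight: the text pathway only predicts affine parameters, so the fusion backbone is unchanged and the module can be dropped between existing fusion and detection stages.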

Keywords: infrared and visible image fusion, pedestrian detection, multimodal large language models, text-guided, model fine-tuning

Received: 25 Mar 2025; Accepted: 06 May 2025.

Copyright: © 2025 Wang, Hu, Cheng, Che, Li, Jiang, Chen and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Yukui Che, Qujing Power Supply Bureau, Yunnan Power Grid Co., Ltd., Kunming, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.