
ORIGINAL RESEARCH article

Front. Phys., 26 May 2025

Sec. Radiation Detectors and Imaging

Volume 13 - 2025 | https://doi.org/10.3389/fphy.2025.1588508

This article is part of the Research Topic: Multi-Sensor Imaging and Fusion: Methods, Evaluations, and Applications, Volume III.

Perceptual objective evaluation for multimodal medical image fusion

Chuangeng Tian1, Juyuan Zhang1 and Lu Tang2*
  • 1School of Information Engineering (School of Big Data), Xuzhou University of Technology, Xuzhou, China
  • 2School of Medical Imaging, Xuzhou Medical University, Xuzhou, China

Multimodal medical image fusion (MMIF) has received widespread attention due to its promising applications in clinical diagnostics and treatment. Because of the inherent limitations of fusion algorithms, the quality of the resulting medical fused images (MFI) varies significantly. An objective evaluation of MMIF can quantify the visual quality differences among fused images and accelerate the development of advanced MMIF techniques, thereby enhancing fused image quality. However, little research has been dedicated to MMIF objective evaluation. In this study, we present a multi-scale aware attention network for MMIF quality evaluation. Specifically, we employ a multi-scale transform structure to decompose fused images into multiple scales and process these multi-scale images simultaneously with an ImageNet pre-trained ResNet34. Subsequently, we incorporate an online class activation mapping mechanism to focus visual attention on the lesion region, enhancing representative discrepancy features closely associated with MFI quality. Finally, we aggregate these enhanced features and map them to the quality difference. Because no dataset exists for this objective evaluation task, we collect 129 pairs of source images from a public dataset, namely, the Whole Brain Atlas, and construct an MMIF quality database containing 1,290 medical fused images generated by MMIF algorithms. Each fused image was annotated with a subjective quality score by experienced radiologists. Experimental results demonstrate that our method achieves satisfactory consistency with subjective perception and outperforms state-of-the-art quality evaluation methods. The source image dataset is publicly available at: http://www.med.harvard.edu/AANLIB/home.html.

1 Introduction

Multimodal medical image fusion (MMIF) is increasingly common in clinical diagnostics. MMIF algorithms aim to generate high-quality fused images from multimodal input images [1–3]. However, most existing MMIF algorithms struggle to achieve optimal fusion due to inherent model limitations. Even worse, image quality may degrade rather than improve during the fusion process, increasing the risk of misdiagnosis. Figure 1 illustrates fusion results from different MMIF algorithms, where the first four images exhibit lower quality than the last one, with the first image being the worst. As observed, low-quality fused images fail to convey the critical information of the original images, contradicting the very purpose of image fusion. Conversely, high-quality fused images provide clinicians with more reliable information, enhancing diagnostic confidence and decision-making. Hence, it is natural to consider how to evaluate these fused images more fairly.


Figure 1. A case of fused images obtained via different multimodal medical image fusion (MMIF) algorithms.

In previous work, researchers have generally compared fusion results using both subjective and objective assessments [4–8]. Subjective quality evaluation refers to the visual judgment of image quality by human observers based on perceptual impressions, typically using scoring or ranking methods to quantify visual performance [9]. While this approach closely reflects clinical perception, it is labor-intensive and not scalable to large volumes of medical data. To address this limitation, objective quality assessment methods have been extensively developed to automatically evaluate fused images through computational models and algorithms [10–16]. These methods avoid human bias and enable large-scale assessment by quantifying image quality with well-defined criteria. Generally, objective evaluation methods can be classified into full-reference, reduced-reference, and no-reference approaches [17–19]. Since no ground-truth fused images exist, the no-reference approach is the most suitable for this task. It is not only more realistic in theory but also more applicable in clinical settings, as physicians are the ultimate beneficiaries of quality evaluation; nevertheless, the results of image quality assessment can vary depending on the scenario (e.g., the presence or absence of lesion regions in the image), leading to potential instability. No-reference evaluation algorithms can be roughly divided into hand-crafted metrics and deep learning-based metrics. For instance, Yang et al. [11] gauged the structural similarity information of fused images. Qu et al. [15] used mutual information to measure fused images. Tang et al. [17] adopted the non-subsampled contourlet transform (NSCT) and a pulse coupled neural network (PCNN) for medical fused image evaluation. However, the hand-crafted features used in these studies have limited representational ability. To alleviate this limitation, deep learning-based metrics have been reported for MMIF quality assessment. Tian et al. [20] exploited a generative adversarial network (GAN) to implement objective evaluation of MMIF. However, such models often face criticism for being “black-box” approaches, making it difficult to gain sufficient trust from radiologists.

In this study, we construct a medical image fusion quality dataset and use it to evaluate the performance of the proposed MS-AAN model for MMIF quality assessment. We first apply a multi-scale transform to capture information from fused images at different scales and feed these multi-scale images into a fine-tuned, ImageNet pre-trained ResNet34. Then, we utilize an online class activation mapping (CAM) mechanism to direct visual attention to the lesion regions, an operation closely aligned with how radiologists make decisions. Finally, by aggregating the multi-scale streams so that they complement each other, we obtain richer, enhanced discrepancy features that are subsequently mapped to the quality differences of the fused images.

The key contributions of the proposed MS-AAN are summarized as follows.

(1) Given the limited research on objective evaluation for MMIF, we propose a no-reference fused image quality assessment method based on a multi-scale aware attention network, termed MS-AAN. MS-AAN not only automatically predicts the quality of fused images but also enhances model interpretability.

(2) To characterize quality discrepancies in fused images, we capture and aggregate multi-scale features by combining a multi-scale transform with an ImageNet pre-trained ResNet34. These multi-scale streams complement each other and yield plentiful details of quality discrepancy-related cues.

(3) To locate lesion cues and enhance feature representation, we propose a CAM attention network, which attends to the lesion regions by generating localization heat maps. This is closely aligned with how radiologists make decisions, helping our MS-AAN earn their trust.

2 Related work

2.1 Objective evaluation of multimodal medical image fusion

Multimodal medical image fusion (MMIF) plays an important role in clinical diagnostics and treatment. For radiologists, high-quality fused images can enhance diagnostic confidence and aid in follow-up treatment planning. Plenty of MMIF quality evaluation algorithms have been reported. For instance, Xydeas et al. [10] used gradient information from source images to evaluate fused images. Yang et al. [11] gauged the structural similarity information of fused images. Li et al. [12] measured the edge information transferred from the source images to the fused image for objective assessment. Zhao et al. [13] proposed phase congruency to evaluate fused images. Zheng et al. [14] designed a perceptual evaluation based on the ratio of spatial frequency error. Qu et al. [15] used mutual information to measure fused images. Liu et al. [16] adopted entropy for fused image objective assessment. Tang et al. [17] adopted the non-subsampled contourlet transform and a pulse coupled neural network for medical fused image evaluation. However, these handcrafted methods often lack the ability to effectively capture complex representative features. As a result, deep learning-based metrics for MMIF evaluation have attracted much attention. Tian et al. [20] introduced a generative adversarial network to implement MMIF evaluation. Wang et al. [21] proposed a no-reference image quality assessment framework that incorporates an adaptive graph attention module to enhance both local and contextual information. Liu et al. [9] developed a CNN-based multi-focus image fusion quality assessment model using hierarchical semantic features to better capture focus-level details. Additionally, Yue et al. [18] introduced a pyramid-based framework for assessing the quality of retinal images, which improves robustness to the various types of distortions commonly found in clinical data. However, such studies struggle to address the “black-box” nature of their models, which limits their ability to earn sufficient trust from radiologists. Moreover, despite the growing interest in MMIF, few studies have focused on its objective evaluation, and high-quality fused images remain scarce. Consequently, a no-reference metric holds significant practical value for this task.

2.2 Multi-scale aware network

In recent years, multi-scale transforms have made progress in the field of multimodal medical image fusion [22, 23]; in particular, the non-subsampled contourlet transform (NSCT) has produced impressive results [24, 25]. Specifically, Huang et al. [25] proposed SPECT and CT image fusion based on NSCT and PCNN. Yin et al. [24] used NSCT and PCNN for medical image fusion. Tang et al. [17] proposed a medical fused image evaluation method based on NSCT and PCNN. The combination of NSCT and PCNN has therefore proven to be a highly effective strategy for both MMIF and MMIF quality evaluation. Inspired by this, can we replace the PCNN with deep learning? Recent advances in CNNs pre-trained on ImageNet have demonstrated their ability to extract richer features [26–28]. Motivated by this, we employ a simple yet effective approach that combines NSCT with a pre-trained CNN to capture richer multi-scale feature representations.

2.3 CAM attention mechanism

Recent years have witnessed CAM become an effective tool for model interpretability. Zhou et al. used CAM to locate class-relevant objects [29]. Subsequently, gradient-weighted CAM was introduced to obtain better localization [30]. Ouyang et al. adopted gradient-weighted CAM to learn chest X-ray abnormality localization [31]. Tang et al. utilized an online CAM mechanism to concentrate on thyroid nodule localization, improving model interpretability [32]. Thus, in this paper, we further extend the CAM attention mechanism to guide the network in focusing on lesion regions and enhancing representative discriminative features, ensuring alignment with radiologists’ decision-making.

3 Methods

The proposed MS-AAN model is designed to comprehensively capture perceptual quality information from multimodal fused medical images. Its architecture comprises three main components: a multi-scale transform module, an ImageNet pre-trained ResNet34 backbone, and a CAM attention mechanism, as illustrated in Figure 2. First, we construct a multi-scale stream network with NSCT by down-sampling the input fused images to generate representations at four different scales. Each scale is processed by its own ResNet34 backbone (four in total), selected for its efficiency and strong feature representation ability. Using a pre-trained model also facilitates robust learning with limited data. To enhance model interpretability and ensure that the network emphasizes diagnostically relevant regions, we incorporate a CAM-based attention mechanism after feature extraction. Finally, the attention-refined features from all scales are concatenated and mapped to a quality score through fully connected layers.
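To make this pipeline concrete, the following is a minimal PyTorch sketch of the architecture described above: four ImageNet pre-trained ResNet34 branches, a CAM-style attention map computed from the concatenated multi-scale features, GAP plus GMP pooling, and a fully connected quality head. The module names, the 1×1 attention head, and the input sizes are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the MS-AAN forward pass (assumed layer names and sizes).
import torch
import torch.nn as nn
import torchvision.models as models

class MSAAN(nn.Module):
    def __init__(self, num_classes: int = 5, num_scales: int = 4):
        super().__init__()
        # One ImageNet pre-trained ResNet34 trunk per scale (avgpool/fc removed).
        self.branches = nn.ModuleList([
            nn.Sequential(*list(
                models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1).children()
            )[:-2])
            for _ in range(num_scales)
        ])
        # 1x1 convolution producing a single-channel CAM-style attention map.
        self.cam_head = nn.Conv2d(512 * num_scales, 1, kernel_size=1)
        self.fc = nn.Linear(512 * num_scales, num_classes)

    def forward(self, scales):
        # `scales`: list of four (B, 3, H, W) tensors, one per decomposition level.
        feats = [branch(x) for branch, x in zip(self.branches, scales)]
        f = torch.cat(feats, dim=1)                       # aggregate multi-scale streams
        att = torch.sigmoid(self.cam_head(f))             # attention map in [0, 1]
        f = f * att                                       # emphasize lesion-related regions
        pooled = f.mean(dim=(2, 3)) + f.amax(dim=(2, 3))  # GAP + GMP
        return self.fc(pooled), att                       # quality logits, attention map

if __name__ == "__main__":
    img = torch.rand(1, 3, 256, 256)
    # Four progressively down-sampled copies (resized back to a common size)
    # stand in for the NSCT sub-bands in this toy example.
    scales = [
        nn.functional.interpolate(
            nn.functional.interpolate(img, scale_factor=1 / 2 ** k, mode="bilinear",
                                      align_corners=False),
            size=(256, 256), mode="bilinear", align_corners=False)
        for k in range(4)
    ]
    logits, att = MSAAN()(scales)
    print(logits.shape, att.shape)  # torch.Size([1, 5]) torch.Size([1, 1, 8, 8])
```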


Figure 2. Flowchart of the proposed MS-AAN.

3.1 Multi-scale aware neural network

We adopt the NSCT to perform multi-scale and multi-directional decomposition on the medical fused image. NSCT is a shift-invariant extension of the contourlet transform that enables a rich representation of image features across different scales and directions, which is particularly beneficial for medical image analysis. Specifically, the medical fused image $F$ is decomposed into multiple sub-bands $F_{m,\alpha}$ at each level $m \in \{1, \dots, 4\}$ and direction $\alpha$. This decomposition allows the network to capture structural details at various resolutions, as formulated in Equation 1:

$F_{m,\alpha} = \mathrm{MST}(F)$  (1)

where $\mathrm{MST}(\cdot)$ represents the multi-scale transform function. Following this transformation, we use an ImageNet pre-trained ResNet34 as the backbone to extract high-level semantic features from the decomposed components. In particular, these sub-bands are input to ResNet34, and we use the Rectified Linear Unit (ReLU) as the activation function, as formulated in Equation 2:

$F^{o}_{m,\alpha} = \mathrm{ReLU}\big(\mathrm{conv}(F_{m,\alpha}, K)\big) = \mathrm{ReLU}\Big(\sum_{n=1}^{t} F^{n}_{m,\alpha} \otimes K + A\Big)$  (2)

where $F^{o}_{m,\alpha}$ stands for the output features, $K$ represents a convolutional kernel, $F^{n}_{m,\alpha}$ is the $n$-th of the $t$ channels of $F_{m,\alpha}$, and $A$ and $\otimes$ represent the bias and the convolution operation, respectively.
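As an illustration of Equations 1–2, the sketch below decomposes a fused image into multi-scale sub-bands and extracts features from each with a pre-trained backbone. Since NSCT has no standard Python implementation, a simple Laplacian-pyramid decomposition stands in for $\mathrm{MST}(\cdot)$, and the convolution-plus-ReLU of Equation 2 is realized by the ResNet34 trunk; all names and sizes are illustrative assumptions.

```python
# Hedged sketch of Eqs. 1-2: multi-scale decomposition + backbone features.
import torch
import torch.nn.functional as Fn
import torchvision.models as models

def laplacian_pyramid(F: torch.Tensor, levels: int = 4):
    """Stand-in for MST(F): returns `levels` band-pass sub-bands F_{m,alpha}."""
    bands, current = [], F
    for _ in range(levels):
        down = Fn.avg_pool2d(current, kernel_size=2)
        up = Fn.interpolate(down, size=current.shape[-2:], mode="bilinear",
                            align_corners=False)
        bands.append(current - up)   # band-pass (detail) component at this scale
        current = down
    return bands

# Conv trunk only; its internal ReLUs realise the ReLU(conv(F, K)) of Eq. 2.
backbone = torch.nn.Sequential(
    *list(models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1).children())[:-2]
)

fused = torch.rand(1, 3, 256, 256)              # a medical fused image F
sub_bands = laplacian_pyramid(fused)            # F_{m,alpha}, m = 1..4
features = [
    backbone(Fn.interpolate(b, size=(224, 224), mode="bilinear", align_corners=False))
    for b in sub_bands
]
print([tuple(f.shape) for f in features])       # per-scale high-level features
```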

3.2 Aggregation of multi-scale feature

Considering the advantages of the multi-scale transform, we aggregate the output features of the multi-scale streams for MMIF quality evaluation. First, we concatenate the outputs of the four multi-scale stages, as shown in Equation 3:

$f^{o} = F^{o}_{1,\alpha} \oplus F^{o}_{2,\alpha} \oplus F^{o}_{3,\alpha} \oplus F^{o}_{4,\alpha}$  (3)

where $\oplus$ denotes the concatenation operation. Then, we compute global average pooling (GAP), as shown in Equation 4:

$I_{G} = \dfrac{1}{W_{I_c} \times H_{I_c}} \sum_{j=1}^{W_{I_c} \times H_{I_c}} I_{c}(j)$  (4)

where $I_{c}(j)$ denotes the $j$-th pixel value of $I_{c}$, and $I_{c}$ is the output of the last layer. $W_{I_c}$ and $H_{I_c}$ represent the width and height of $I_{c}$, respectively. The enhanced feature is passed to a convolution layer, after which we apply GAP and global max pooling (GMP). Finally, a simple addition aggregates GAP and GMP, as formulated in Equation 5:

$f_{\mathrm{total}} = \mathrm{GAP}(f^{o}) + \mathrm{GMP}(f^{o})$  (5)
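A minimal sketch of Equations 3–5 follows; the concatenation realizes $\oplus$ channel-wise, and the tensor shapes are illustrative assumptions.

```python
# Sketch of Eqs. 3-5: concatenate the per-scale outputs, then add GAP and GMP.
import torch

def aggregate(per_scale_feats):
    # Eq. 3: f_o = F^o_1 (+) F^o_2 (+) F^o_3 (+) F^o_4  (channel-wise concatenation)
    f_o = torch.cat(per_scale_feats, dim=1)   # (B, C_total, H, W)
    gap = f_o.mean(dim=(2, 3))                # Eq. 4: global average pooling
    gmp = f_o.amax(dim=(2, 3))                # global max pooling
    return gap + gmp                          # Eq. 5: f_total = GAP(f_o) + GMP(f_o)

feats = [torch.rand(1, 512, 8, 8) for _ in range(4)]  # outputs of the four streams
f_total = aggregate(feats)
print(f_total.shape)                                   # torch.Size([1, 2048])
```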

3.3 CAM attention mechanism

To capture quality discrepancy features of the lesion region from the whole medical fused image, we introduce the CAM attention mechanism. Specifically, we generate the attention feature map $M$ by applying a nonlinear activation function to the final aggregated feature map $f_{\mathrm{total}}$ described in Equation 5. This representation integrates multi-scale semantic information and is well suited to highlighting perceptually important regions. The resulting attention map has a spatial resolution of 1/16 relative to the input image $F_{m,\alpha}$ and guides the network to focus on diagnostically relevant areas during quality assessment. Then, $M$ is normalized to [0, 1]. After that, a sigmoid operation performs soft masking, yielding $SM$, as formulated in Equation 6:

$SM = \dfrac{1}{1 + \exp\big(-\mu (M - \beta)\big)}$  (6)

where $\mu$ and $\beta$ are hyper-parameters. The Dice loss is used as the attention loss function, denoted $L_{a}$, and is defined in Equation 7:

$L_{a} = \mathrm{Dice}(SM, G)$  (7)

where $G$ is the ground-truth lesion mask. Finally, at the fully connected layer, we apply a cross-entropy loss for quality classification, denoted $L_{c}$, as shown in Equation 8:

$L_{c} = -f \log \hat{f}(x) - (1 - f) \log\big(1 - \hat{f}(x)\big)$  (8)

where $f$ stands for the class label and $\hat{f}(x) = \{\hat{f}_{1}, \hat{f}_{2}, \hat{f}_{3}, \hat{f}_{4}, \hat{f}_{5}\}$, $x = 1, 2, \dots, 5$, denotes the predicted scores for the five quality classes of medical fused images.
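The soft masking and the two loss terms of Equations 6–8 can be sketched as follows; the values of $\mu$ and $\beta$, the dice_loss helper, and the use of a multi-class cross entropy over the five quality classes are illustrative assumptions.

```python
# Hedged sketch of Eqs. 6-8: soft mask, Dice attention loss, classification loss.
import torch
import torch.nn.functional as Fn

def soft_mask(M: torch.Tensor, mu: float = 10.0, beta: float = 0.5):
    # Eq. 6: sigmoid soft masking of the attention map M after normalization to [0, 1]
    M = (M - M.min()) / (M.max() - M.min() + 1e-8)
    return torch.sigmoid(mu * (M - beta))

def dice_loss(SM: torch.Tensor, G: torch.Tensor, eps: float = 1e-8):
    # Eq. 7: L_a = Dice(SM, G), implemented as 1 - Dice coefficient
    inter = (SM * G).sum()
    return 1.0 - (2.0 * inter + eps) / (SM.sum() + G.sum() + eps)

M = torch.rand(1, 1, 16, 16)                  # aggregated attention map (1/16 resolution)
G = (torch.rand(1, 1, 16, 16) > 0.5).float()  # ground-truth lesion mask
L_a = dice_loss(soft_mask(M), G)

logits = torch.rand(1, 5)                     # scores for the five quality classes
label = torch.tensor([3])                     # MOS-derived class label f
L_c = Fn.cross_entropy(logits, label)         # Eq. 8 (cross-entropy classification loss)
print(L_a.item(), L_c.item())
```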

3.4 Total loss function

As shown in Figure 2, the total loss function of our MS-AAN comprises the attention loss $L_{a}$ and the classification loss $L_{c}$, as denoted in Equation 9:

$L_{t} = L_{a} + \gamma L_{c}$  (9)
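A toy sketch of Equation 9 shows how the two loss terms are combined in a single backward pass; the value of $\gamma$ and the placeholder loss values are illustrative assumptions.

```python
# Sketch of Eq. 9: total loss as a weighted sum of attention and classification losses.
import torch

gamma = 1.0                                   # trade-off hyper-parameter (assumed value)
L_a = torch.tensor(0.32, requires_grad=True)  # attention (Dice) loss from Eq. 7
L_c = torch.tensor(1.10, requires_grad=True)  # classification loss from Eq. 8
L_t = L_a + gamma * L_c                       # Eq. 9: total loss
L_t.backward()                                # gradients flow to both loss branches
print(L_t.item())
```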

4 Experiments

4.1 Dataset

In this study, we construct a medical fused image dataset to appraise the developed MS-AAN for MMIF quality assessment. Specifically, we collect 129 pairs of source images from a public dataset, i.e., the Whole Brain Atlas, which includes CT and MR, MR-T1 and MR-T2, MR-T2 and PET, and MR-T2 and SPECT pairs, as shown in Figure 3. The selected images span a wide range of anatomical structures and clinical conditions (e.g., tumors, lesions, and degenerative changes), ensuring that the dataset is both diverse and representative of real-world clinical fusion scenarios. We then apply ten representative state-of-the-art MMIF algorithms [16, 24, 33–40], resulting in a total of 1,290 fused images. This dataset construction process is consistent with our previous work, where more technical details of the fusion methods can be found [20, 41]. For subjective quality assessment, each fused image is annotated with a Mean Opinion Score (MOS) ranging from 1 (lowest quality) to 5 (highest quality), as independently rated by two experienced radiologists. To ensure the reliability and consistency of the subjective assessment, a senior radiologist further reviewed and validated the assigned scores.


Figure 3. Some examples of source images.

To rigorously evaluate the effectiveness of the proposed MS-ANN, we adopt four widely recognized quantitative assessment metrics [42, 43]: Pearson’s Linear Correlation Coefficient (PLCC), Spearman’s Rank Correlation Coefficient (SRCC), Kendall’s Rank Correlation Coefficient (KRCC), and Root Mean Square Error (RMSE). These metrics are designed to measure the alignment between the predicted quality scores generated by the model and the ground-truth MOS provided by expert radiologists. Specifically, PLCC, SRCC, and KRCC are used to evaluate the consistency between the predicted quality scores and the ground-truth MOS, with higher values indicating better consistency with human perception. RMSE measures the absolute prediction error, where lower values represent better performance. These metrics are widely used in the field and ensure comparability with previous IQA studies [9, 18, 19, 21, 44].
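For reference, the four metrics can be computed with SciPy and NumPy as sketched below; the MOS and predicted scores are illustrative placeholders rather than values from the experiments reported here.

```python
# Sketch of the four evaluation metrics: PLCC, SRCC, KRCC, and RMSE.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

mos = np.array([4.5, 3.0, 2.5, 5.0, 1.5])    # ground-truth MOS (placeholder values)
pred = np.array([4.2, 3.1, 2.8, 4.8, 1.9])   # model-predicted quality scores

plcc, _ = pearsonr(pred, mos)                # linear correlation with MOS
srcc, _ = spearmanr(pred, mos)               # rank-order correlation
krcc, _ = kendalltau(pred, mos)              # pairwise rank agreement
rmse = np.sqrt(np.mean((pred - mos) ** 2))   # absolute prediction error

print(f"PLCC={plcc:.4f} SRCC={srcc:.4f} KRCC={krcc:.4f} RMSE={rmse:.4f}")
```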

4.2 Performance comparison

To validate the effectiveness of the proposed MS-AAN, we compare it with nine mainstream methods, including the multiple pseudo reference images-based quality metric (MPRI) [44], the Tsallis entropy-based quality metric (TE) [45], the mutual information-based quality metric (MI) [46], the objective evaluation of fusion performance (OEFP) [10], the ratio of spatial frequency error-based quality metric (RSFE) [14], the NSCT-PCNN-based quality metric (NSCT-PCNN) [17], the adaptive graph attention (AGA) method for blind image quality assessment [21], the statistically based approach (SBA) for multi-focus image fusion quality assessment [9], and the pyramid network with quality-aware contrastive loss (PNQC) for retinal image quality assessment [18]. Among these metrics, higher values of MPRI, TE, MI, OEFP, NSCT-PCNN, AGA, SBA, and PNQC indicate better quality, whereas lower values of RSFE denote better quality.

We compute the PLCC, SRCC, KRCC, and RMSE values of the nine mainstream methods and MS-AAN, as shown in Table 1. The highest scores are highlighted in bold. Based on Table 1, our MS-AAN achieves the best performance, significantly outperforming the nine competing models. Specifically, compared to the second-ranked method, our proposed method improves PLCC from 0.8106 to 0.9131, SRCC from 0.8016 to 0.9061, and KRCC from 0.7681 to 0.8560, while reducing RMSE from 0.2119 to 0.1166.


Table 1. Performance comparison of MS-AAN with the other nine metrics.

4.3 Ablation study

We conduct ablation studies to examine the contribution of each important component of MS-AAN. We first train each component independently on the medical fused dataset and then jointly optimize all components of MS-AAN. The results are presented in Table 2.


Table 2. Ablation studies on the proposed MS-AAN.

First, the baseline model refers to ResNet34 without ImageNet pre-training, achieving a PLCC of 0.7971, SRCC of 0.8022, KRCC of 0.7199, and RMSE of 0.2936. Second, we apply a pre-training strategy to enhance the ability to capture features. As shown in the second row of Table 2, performance significantly improves, with PLCC increasing from 0.7971 to 0.8633, SRCC from 0.8022 to 0.8571, and KRCC from 0.7199 to 0.7761, while RMSE decreases from 0.2936 to 0.1696. These results demonstrate that the ImageNet pre-trained model outperforms the baseline model without pre-training. This improvement may be attributed to the effective use of pre-trained knowledge, which helps mitigate the challenge of limited training data. Third, we further introduce NSCT to capture more multi-scale features. With the addition of the multi-scale transform, the results show noticeable improvements when comparing Baseline + Pre and Baseline + Pre + multi-scale: PLCC increases by 2.83% (0.8633 vs. 0.8916), SRCC by 2.40% (0.8571 vs. 0.8811), and KRCC by 4.95% (0.7761 vs. 0.8256), while RMSE decreases by 3.95% (0.1696 vs. 0.1301). Moreover, we integrate the CAM mechanism to guide the model’s attention toward lesion regions, thereby enhancing both feature representation and interpretability. As shown in Table 2, the proposed MS-AAN (Baseline + Pre + multi-scale + CAM) achieves superior performance compared to the variant without CAM (Baseline + Pre + multi-scale). Specifically, PLCC increases from 0.8916 to 0.9131, SRCC from 0.8811 to 0.9061, KRCC from 0.8256 to 0.8560, and RMSE decreases from 0.1301 to 0.1166. These improvements demonstrate that CAM significantly enhances the model’s ability to capture quality-related features. More importantly, the lesion-focused attention maps provide intuitive visual explanations, which can assist radiologists in verifying model predictions and building greater confidence in clinical use. As shown in Figure 4, the CAM-based heatmaps illustrate the model’s ability to concentrate on diagnostically relevant regions, offering visual support for the model’s quantitative superiority.


Figure 4. Generated attention maps of our method on four medical fused images.

4.4 External validation

To further validate the generalization ability of our MS-AAN, we conduct an external independent evaluation using the multimodal medical image fusion database [17]. It is important to note that the performance metrics reported in Table 3 differ from those in Table 2 because they are obtained under different evaluation settings. Specifically, Table 2 reports results from ablation studies conducted on the training dataset to analyze the contribution of each model component, whereas Table 3 presents results on a separate external dataset. As shown in Table 3, our model achieves promising performance, with a PLCC of 0.8591, SRCC of 0.8388, KRCC of 0.7916, and RMSE of 0.1721. These results demonstrate the robustness and effectiveness of MS-AAN in assessing multimodal medical image fusion quality across different datasets.


Table 3. Results of the external validation.

5 Conclusion

In this paper, we develop a quality evaluation metric for multimodal medical image fusion, called the no-reference multi-scale aware attention network (MS-AAN). Specifically, we first apply a multi-scale transform to extract different scale information from fused images and feed these transformed images into an ImageNet pre-trained ResNet34. This multi-scale strategy enables complementary feature extraction, capturing rich details relevant to quality assessment. Then, we propose a CAM attention network, which directs visual attention to the lesion regions to facilitate model interpretability. Finally, we employ a concatenation operation to refine quality discrepancy features and map them to the quality differences of multimodal fused images. However, the dataset used in this study exhibits an imbalance between MRI-PET and MRI-SPECT image pairs, with MRI-SPECT images being more prevalent. Moreover, the diversity of medical conditions and anatomical regions is somewhat limited, which may affect the model’s generalization to other clinical settings or imaging modalities. In future work, we aim to address these limitations by expanding the dataset to cover a broader range of organs and clinical conditions, thereby improving the robustness and generalization capability of the proposed MS-AAN model. Additionally, while our study adopts widely accepted statistical metrics to evaluate image quality prediction, it is important to recognize the potential influence of MMIF quality on downstream clinical tasks such as diagnostic accuracy or treatment decisions. High-quality fused images can provide clearer lesion boundaries, improved structural detail, and more reliable functional information, which are crucial for radiological assessment and therapy planning. In future work, we intend to design user studies or integrate radiologist-in-the-loop evaluations to measure the actual diagnostic utility of images rated by our model. Such assessments would offer a more comprehensive validation of the model’s clinical value and help bridge the gap between objective image quality assessment and practical medical outcomes. Despite these limitations, the proposed MS-AAN shows strong consistency with subjective perception, offering the potential to facilitate clinical diagnosis and guide the development of advanced multimodal medical image fusion techniques.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

CT: Conceptualization, Methodology, Supervision, Validation, Writing – original draft. JZ: Methodology, Validation, Writing – original draft. LT: Conceptualization, Project administration, Supervision, Validation, Writing – original draft, Writing – review and editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by the Xuzhou Science and Technology Program, China (KC22466) and the National Natural Science Foundation of China (82001912).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Birkfellner W, Figl M, Furtado H, Renner A, Hatamikia S, Hummel J. Multi-modality imaging: a software fusion and image-guided therapy perspective. Front Phys (2018) 6:66. doi:10.3389/fphy.2018.00066

2. Azam MA, Khan KB, Salahuddin S, Rehman E, Khan SA, Khan MA A review on multimodal medical image fusion: compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics. Comput Biol Med (2022) 144:105253. doi:10.1016/j.compbiomed.2022.105253

3. Zhou T, Cheng Q, Lu H, Li Q, Zhang X, Qiu S. Deep learning methods for medical image fusion: a review. Comput Biol Med (2023) 160:106959. doi:10.1016/j.compbiomed.2023.106959

4. Cheng S, Liu R, He Y, Fan X, Luo Z. Blind image deblurring via hybrid deep priors modeling. Neurocomputing (2020) 387:334–45. doi:10.1016/j.neucom.2020.01.004

5. Shao W-Z, Lin Y-Z, Liu Y-Y, Wang L-Q, Ge Q, Bao B-K, et al. Gradient-based discriminative modeling for blind image deblurring. Neurocomputing (2020) 413:305–27. doi:10.1016/j.neucom.2020.06.093

6. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process (2004) 13:600–12. doi:10.1109/TIP.2003.819861

7. Shen L, Chen X, Pan Z, Fan K, Li F, Lei J. No-reference stereoscopic image quality assessment based on global and local content characteristics. Neurocomputing (2021) 424:132–42. doi:10.1016/j.neucom.2020.10.024

8. Ma K, Liu W, Zhang K, Duanmu Z, Wang Z, Zuo W. End-to-End blind image quality assessment using deep neural networks. IEEE Trans Image Process (2018) 27:1202–13. doi:10.1109/TIP.2017.2774045

9. Liu Y, Qi Z, Cheng J, Chen X. Rethinking the effectiveness of objective evaluation metrics in multi-focus image fusion: a statistic-based approach. IEEE Trans Pattern Anal Mach Intell (2024) 46:5806–19. doi:10.1109/TPAMI.2024.3367905

10. Xydeas CS, Petrović V. Objective image fusion performance measure. Electron Lett (2000) 36:308–9. doi:10.1049/el:20000267

11. Yang C, Zhang J-Q, Wang X-R, Liu X. A novel similarity based quality metric for image fusion. Inf Fusion (2008) 9:156–60. doi:10.1016/j.inffus.2006.09.001

12. Li S, Kwok JT, Wang Y. Combination of images with diverse focuses using the spatial frequency. Inf Fusion (2001) 2:169–76. doi:10.1016/S1566-2535(01)00038-0

13. Zhao J, Laganiere R, Liu Z. Performance assessment of combinative pixel-level image fusion based on an absolute feature measurement. Int J Innovat Comput Inf Control (2006) 3. doi:10.1109/ICICIC.2006.296

14. Zheng Y, Essock EA, Hansen BC, Haun AM. A new metric based on extended spatial frequency and its application to DWT based fusion algorithms. Inf Fusion (2007) 8:177–92. doi:10.1016/j.inffus.2005.04.003

15. Qu G, Zhang D, Yan P. Information measure for performance of image fusion. Electron Lett (2002) 38:313–5. doi:10.1049/el:20020212

16. Liu Y, Liu S, Wang Z. A general framework for image fusion based on multi-scale transform and sparse representation. Inf Fusion (2015) 24:147–64. doi:10.1016/j.inffus.2014.09.004

17. Tang L, Tian C, Li L, Hu B, Yu W, Xu K. Perceptual quality assessment for multimodal medical image fusion. Signal Process Image Commun (2020) 85:115852. doi:10.1016/j.image.2020.115852

18. Yue G, Zhang S, Zhou T, Jiang B, Liu W, Wang T. Pyramid network with quality-aware contrastive loss for retinal image quality assessment. IEEE Trans Med Imaging (2025) 44:1416–31. doi:10.1109/TMI.2024.3501405

19. Guo Y, Hu M, Min X, Wang Y, Dai M, Zhai G, et al. Blind image quality assessment for pathological microscopic image under screen and immersion scenarios. IEEE Trans Med Imaging (2023) 42:3295–306. doi:10.1109/TMI.2023.3282387

20. Tian C, Zhang L. G2NPAN: GAN-guided nuance perceptual attention network for multimodal medical fusion image quality assessment. Front Neurosci (2024) 18:1415679. doi:10.3389/fnins.2024.1415679

21. Wang H, Liu J, Tan H, Lou J, Liu X, Zhou W, et al. Blind image quality assessment via adaptive graph attention. IEEE Trans Circuits Syst Video Technol (2024) 34:10299–309. doi:10.1109/TCSVT.2024.3405789

22. Duan H, Wang W, Xing L, Xie B, Zhang Q, Zhang Y. Identifying geological structures in the Pamir region using non-subsampled shearlet transform and gravity gradient tensor. Geophys J Int (2025) 240:2125–43. doi:10.1093/gji/ggaf036

23. Ma J, Chen Y, Chen L, Tang Z. Dual-attention pyramid transformer network for no-reference image quality assessment. Expert Syst Appl (2024) 257:125008. doi:10.1016/j.eswa.2024.125008

24. Yin M, Liu X, Liu Y, Chen X. Medical image fusion with parameter-adaptive pulse coupled neural network in nonsubsampled shearlet transform domain. IEEE Trans Instrum Meas (2019) 68:49–64. doi:10.1109/TIM.2018.2838778

25. Huang C, Tian G, Lan Y, Peng Y, Ng EYK, Hao Y, et al. A new pulse coupled neural network (PCNN) for brain medical image fusion empowered by shuffled frog leaping algorithm. Front Neurosci (2019) 13:210. doi:10.3389/fnins.2019.00210

26. Norouzi M, Hosseini SH, Khoshnevisan M, Moshiri B. Applications of pre-trained CNN models and data fusion techniques in Unity3D for connected vehicles. Appl Intell (2025) 55:390. doi:10.1007/s10489-024-06213-3

27. Swamy MR, P V, Rajendran V. Deep learning approaches for online signature authentication: a comparative study of pre-trained CNN models. Eng Res Express (2025) 7:015230. doi:10.1088/2631-8695/ada86d

28. Arnia F, Saddami K, Roslidar R, Muharar R, Munadi K. Towards accurate diabetic foot ulcer image classification: leveraging CNN pre-trained features and extreme learning machine. Smart Health (2024) 33:100502. doi:10.1016/j.smhl.2024.100502

29. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). Las Vegas, NV: IEEE (2016). p. 2921–9. doi:10.1109/CVPR.2016.319

30. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Venice: IEEE: 2017 IEEE International Conference on Computer Vision ICCV (2017). p. 618–26. doi:10.1109/ICCV.2017.74

31. Ouyang X, Karanam S, Wu Z, Chen T, Huo J, Zhou XS, et al. Learning hierarchical attention for weakly-supervised chest X-ray abnormality localization and diagnosis. IEEE Trans Med Imaging (2021) 40:2698–710. doi:10.1109/TMI.2020.3042773

32. Tang L, Tian C, Yang H, Cui Z, Hui Y, Xu K, et al. TS-DSANN: texture and shape focused dual-stream attention neural network for benign-malignant diagnosis of thyroid nodules in ultrasound images. Med Image Anal (2023) 89:102905. doi:10.1016/j.media.2023.102905

33. Min X, Zhai G, Gu K, Yang X, Guan X. Objective quality evaluation of dehazed images. IEEE Trans Intell Transport Syst (2019) 20:2879–92. doi:10.1109/TITS.2018.2868771

34. Liu Y, Chen X, Ward RK, Jane Wang Z. Image fusion with convolutional sparse representation. IEEE Signal Process Lett (2016) 23:1882–6. doi:10.1109/LSP.2016.2618776

35. Das S, Kundu MK. NSCT-based multimodal medical image fusion using pulse-coupled neural network and modified spatial frequency. Med Biol Eng Comput (2012) 50:1105–14. doi:10.1007/s11517-012-0943-3

36. Li S, Kang X, Hu J. Image fusion with guided filtering. IEEE Trans Image Process (2013) 22:2864–75. doi:10.1109/TIP.2013.2244222

37. Shen R, Cheng I, Basu A. Cross-scale coefficient selection for volumetric medical image fusion. IEEE Trans Biomed Eng (2013) 60:1069–79. doi:10.1109/TBME.2012.2211017

38. Du J, Li W, Xiao B, Nawaz Q. Union Laplacian pyramid with multiple features for medical image fusion. Neurocomputing (2016) 194:326–39. doi:10.1016/j.neucom.2016.02.047

39. Tang L, Tian C, Xu K. Exploiting quality-guided adaptive optimization for fusing multimodal medical images. IEEE Access (2019) 7:96048–59. doi:10.1109/ACCESS.2019.2926833

40. Das S, Kundu MK. A neuro-fuzzy approach for medical image fusion. IEEE Trans Biomed Eng (2013) 60:3347–53. doi:10.1109/TBME.2013.2282461

41. Tang L, Hui Y, Yang H, Zhao Y, Tian C. Medical image fusion quality assessment based on conditional generative adversarial network. Front Neurosci (2022) 16:986153. doi:10.3389/fnins.2022.986153

42. Hu B, Wang S, Gao X, Li L, Gan J, Nie X. Reduced-reference image deblurring quality assessment based on multi-scale feature enhancement and aggregation. Neurocomputing (2023) 547:126378. doi:10.1016/j.neucom.2023.126378

43. Sim K, Yang J, Lu W, Gao X. Blind stereoscopic image quality evaluator based on binocular semantic and quality channels. IEEE Trans Multimedia (2022) 24:1389–98. doi:10.1109/TMM.2021.3064240

44. Min X, Zhai G, Gu K, Liu Y, Yang X. Blind image quality estimation via distortion aggravation. IEEE Trans Broadcast (2018) 64:508–17. doi:10.1109/TBC.2018.2816783

45. Sholehkerdar A, Tavakoli J, Liu Z. In-depth analysis of Tsallis entropy-based measures for image fusion quality assessment. Opt Eng (2019) 58:1. doi:10.1117/1.OE.58.3.033102

46. Hossny M, Nahavandi S, Creighton D. Comments on ‘Information measure for performance of image fusion. Electron Lett (2008) 44:1066–7. doi:10.1049/el:20081754

Keywords: multimodal medical image fusion, objective evaluation, multi-scale transform, class activation mapping mechanism, region of interest

Citation: Tian C, Zhang J and Tang L (2025) Perceptual objective evaluation for multimodal medical image fusion. Front. Phys. 13:1588508. doi: 10.3389/fphy.2025.1588508

Received: 06 March 2025; Accepted: 13 May 2025;
Published: 26 May 2025.

Edited by:

Zhiqin Zhu, Chongqing University of Posts and Telecommunications, China

Reviewed by:

Guangcheng Wang, Nantong University, China
Venu Allapakam, VIT University, India

Copyright © 2025 Tian, Zhang and Tang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Lu Tang, xztanglu@xzhmu.edu.cn
