ORIGINAL RESEARCH article

Front. Neurosci., 01 October 2025

Sec. Brain Imaging Methods

Volume 19 - 2025 | https://doi.org/10.3389/fnins.2025.1658776

This article is part of the Research Topic “Exploring Neuropsychiatric Disorders Through Multimodal MRI: Network Analysis, Biomarker Discovery, and Clinical Insights.”

A multi-view multimodal deep learning framework for Alzheimer's disease diagnosis


Jianxin Feng1,2*, Xinyu Zhao1,2, Zhiguo Liu1,2, Yuanming Ding1,2 and Feng Wang3*
  • 1Communication and Network Key Laboratory, Dalian University, Dalian, China
  • 2School of Information Engineering, Dalian University, Dalian, China
  • 3Dalian University Affiliated Xinhua Hospital, Dalian University, Dalian, China

Introduction: Early diagnosis of Alzheimer's disease (AD) remains challenging due to the high similarity among AD, mild cognitive impairment (MCI), and cognitively normal (CN) individuals, as well as confounding factors such as population heterogeneity, label noise, and variations in imaging acquisition. Although multimodal neuroimaging techniques like MRI and PET can provide complementary information, current approaches are limited in multimodal fusion and multi-scale feature aggregation.

Methods: We propose a novel multimodal diagnostic framework, Alzheimer's Disease Multi-View Multimodal Diagnostic Network (ADMV-Net), to enhance recognition accuracy across all AD stages. Specifically, a dual-pathway Hybrid Convolution ResNet module is designed to fuse global semantic and local boundary information, enabling robust three-dimensional medical image feature extraction. Furthermore, a Multi-view Fusion Learning mechanism, which comprises a Global Perception Module, a Multi-level Local Cross-modal Aggregation Network, and a Bidirectional Cross-Attention Module, is introduced to efficiently capture and integrate multimodal features from multiple perspectives. Additionally, a Regional Interest Perception Module is incorporated to highlight brain regions strongly associated with AD pathology.

Results: Extensive experiments on public datasets demonstrate that ADMV-Net achieves 94.83% accuracy and 95.97% AUC in AD versus CN classification, significantly outperforming mainstream methods. The framework also shows strong discriminative capability and excellent generalization performance in multi-class classification tasks.

Discussion: These findings suggest that ADMV-Net effectively leverages multimodal and multi-view information to improve the diagnostic accuracy of AD. By integrating global, local, and regional features, the framework provides a promising tool for assisting early diagnosis and clinical decision-making in Alzheimer's disease. The implementation code is publicly available at https://github.com/zhaoxinyu-1/ADMV-Net.

1 Introduction

Alzheimer's disease (AD) is an irreversible neurodegenerative disorder (Hu et al., 2023) characterized primarily by memory decline, cognitive impairment, and loss of daily living abilities. With the acceleration of global aging, the number of AD patients is rapidly increasing, with projections indicating that the global patient population will reach 139 million by 2050, imposing substantial economic and psychological burdens on both society and individuals (Alzheimer's Disease, 2023). Early diagnosis is crucial for delaying disease progression, improving patients' quality of life, and reducing stress on families and society.

Neuroimaging techniques such as structural magnetic resonance imaging (sMRI) and positron emission tomography (PET) can capture abnormal changes in the brain (Damulina et al., 2020), making them essential tools for AD diagnosis. In recent years, deep learning technologies have achieved breakthroughs in this field, particularly excelling in multimodal data processing. Under end-to-end training (Choudhury et al., 2024), these techniques leverage neural network backpropagation to learn data-driven representations associated with pathology, thereby reducing the need for manual feature engineering. In AD diagnosis research, single-modal approaches suffer from limited information and cannot comprehensively reflect the complexity of the disease, resulting in insufficient sensitivity and specificity for early identification and precise diagnosis (Fjell et al., 2010). Consequently, multimodal approaches have emerged as an effective solution. By integrating information from different modalities such as sMRI and PET, these methods can more comprehensively reflect pathological changes and improve diagnostic accuracy (Chen et al., 2023).

In recent years, multimodal feature fusion has achieved significant progress in neuroimaging analysis. Sparse graph optimization (Liang et al., 2024) and transfer learning (Ramani et al., 2024; Wu et al., 2022) have improved classification performance, while the integration of cognitive tests with genetic factors (Zhang et al., 2024), refined ROI selection (Lei et al., 2024), and volumetric segmentation (You et al., 2023) have enhanced analytical precision. Tang et al. (2024a) proposed CEFM combined with ECSA, which significantly enhanced AD feature recognition capability, and Jia et al. (2024) constructed a multimodal global-local fusion framework that effectively integrated clinical tabular data with MRI information. MLCA (Wan et al., 2023), CBAM (Woo et al., 2018), and slice-level (Chen et al., 2022), multi-patch (Ye et al., 2024), and 3D multi-head attention (Huang et al., 2024) mechanisms have improved model sensitivity to important features by focusing on key regions. Cross-modal long-range dependency modeling based on Transformers (Tang et al., 2024b; Alinsaif, 2025) has enriched multi-scale feature representation, while harmonic wavelet regional pyramids (Liu et al., 2023b), kernel attention fusion (Pei et al., 2022), and the combination of self-attention pooling with graph convolution (Sang and Li, 2024) have enhanced perception of complex patterns. Dynamic balancing strategies [HAMF (Lu et al., 2024), WMCL-HN (Yu et al., 2025)] and multi-scale convolutional ensemble learning (Yan et al., 2025) have optimized model performance, and BiFPN (Tan et al., 2020) has achieved efficient fusion of hierarchical features from different modalities. However, most methods still rely on simple concatenation or weighting and fail to fully exploit complementary information between modalities, leaving room for improvement in fusion depth.

To address the aforementioned issues, this paper proposes a multi-view multimodal Alzheimer's disease diagnostic model—ADMV-Net. This framework first extracts global semantic and local boundary features in parallel from sMRI and PET images through a dual-pathway Hybrid Convolution ResNet (HCNet). Subsequently, we design a Multi-view Fusion Learning (MVFL) mechanism to capture complementary information from global, local, and latent views, significantly enhancing feature representation. Finally, we utilize a Regional Interest Perception Module (RIPM) to construct a brain region weight matrix that identifies key brain regions associated with Alzheimer's disease. The main contributions of this paper include:

1. We propose a novel multi-view multimodal fusion model that effectively integrates three-dimensional imaging data from PET, GM, and WM modalities to improve diagnostic accuracy for AD.

2. We introduce a dual-path feature extraction structure, HCNet, which achieves efficient fusion of global semantic and local boundary information, thereby improving the feature representation of three-dimensional medical images.

3. We design a multi-view fusion learning module, MVFL, which captures diverse features from multiple perspectives through global, local, and latent learning modules, further strengthening feature representation.

4. We employ a brain region weight matrix to learn the importance of different brain regions.

2 Materials and methods

2.1 Dataset and preprocessing

The study utilized paired T1-weighted MRI and PET scan data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) (Petersen et al., 2010) and the Australian Imaging, Biomarkers and Lifestyle Study of Ageing (AIBL) (Selvaraju et al., 2017) databases. The ADNI database included 339 participants with Alzheimer's Disease (AD), 473 participants classified as Cognitively Normal (CN), and 525 participants identified with Mild Cognitive Impairment (MCI), with female ratios of 50.16%, 49.28%, and 52.6%, and average ages of 75.23, 74.98, and 75.11 years, respectively. The AIBL study comprised 82 participants with AD, 105 participants with CN, and 95 participants with MCI, with female ratios of 50.36%, 51.24%, and 49.99%, and average ages of 73.56, 72.26, and 74.41 years, respectively. All subjects underwent both sMRI and PET examinations, with the ADNI dataset specifically including FDG-PET and PIB-PET imaging.

We employed SPM and CAT tools to perform rigorous preprocessing of MRI and PET data to ensure quality consistency. MRI preprocessing included unified voxel resampling, slice timing correction, head motion correction, normalization to standard space, and tissue segmentation (GM, WM, and CSF), followed by extraction of GM and WM. The PET data preprocessing pipeline involved MRI alignment, spatial normalization, skull stripping, and smoothing to optimize signal quality. Based on functional relevance considerations, we used the AAL116 template to divide the whole brain into 116 anatomically and functionally defined ROIs and selected the first 90 ROIs (excluding cerebellar regions) as the final analysis targets.
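As a concrete illustration of the ROI step only, the following sketch computes a per-region summary (here, mean GM intensity) over the first 90 AAL labels using nibabel and NumPy; the file names and the choice of mean intensity as the regional summary are assumptions for demonstration, not the exact pipeline used in this study.

```python
# Hedged sketch: per-ROI summaries with the AAL116 atlas, keeping the first
# 90 (non-cerebellar) regions. Assumes the subject image and the atlas are
# already co-registered in the same standard space; file names are placeholders.
import nibabel as nib
import numpy as np

atlas = nib.load("AAL116_atlas.nii.gz").get_fdata()      # integer labels 1..116
gm_map = nib.load("subject_gm_mni.nii.gz").get_fdata()   # preprocessed GM image

roi_means = np.array([
    gm_map[atlas == label].mean()   # mean GM intensity within one ROI
    for label in range(1, 91)       # labels 1..90 exclude cerebellar regions
])
print(roi_means.shape)  # (90,)
```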

2.2 Experimental setup and evaluation metrics

Our experiments were conducted in a Linux environment equipped with dual NVIDIA RTX 4090 GPUs and 120 GB of memory, implemented using Python 3.10 and PyTorch 2.2.0. Model training employed a stochastic gradient descent optimizer with an initial learning rate of 0.001, dynamically adjusted using a cosine annealing strategy. Network training used a batch size of 16 and a total of 40 training epochs. To assess the statistical significance of performance differences between methods, we conducted paired t-tests for all comparative experiments.
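For reference, a minimal sketch of this optimization setup in PyTorch is shown below; the model, the training loop body, and the momentum value are placeholders rather than the authors' actual training script.

```python
# Minimal sketch of the stated setup: SGD, initial learning rate 0.001,
# cosine annealing over 40 epochs, batch size 16. Placeholders only.
import torch
import torch.nn as nn

model = nn.Linear(128, 2)  # stand-in for ADMV-Net
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # momentum assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40)

for epoch in range(40):
    # ... one pass over mini-batches of size 16 would go here ...
    scheduler.step()  # cosine-annealed learning-rate update once per epoch
```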

Models were trained separately for three classification tasks (AD vs CN, AD vs MCI, CN vs MCI), and ten-fold cross-validation was employed to ensure result reliability. Evaluation metrics included accuracy (ACC), sensitivity (SEN), specificity (SPEC), area under the receiver operating characteristic curve (AUC), and balanced accuracy (BAC).

\( \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \)    (1)
\( \mathrm{SEN} = \frac{TP}{TP + FN} \)    (2)
\( \mathrm{SPEC} = \frac{TN}{TN + FP} \)    (3)
\( \mathrm{BAC} = \frac{\mathrm{SEN} + \mathrm{SPEC}}{2} \)    (4)

Where TP represents the number of correctly identified positive samples, FP represents the number of negative samples incorrectly classified as positive, FN represents the number of positive samples incorrectly classified as negative, and TN represents the number of correctly identified negative samples.
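As a minimal sketch, Equations 1–4 can be computed from binary predictions as follows; AUC is omitted because it requires predicted probabilities rather than hard labels.

```python
# Sketch computing ACC, SEN, SPEC, and BAC (Equations 1-4) with plain NumPy.
import numpy as np

def binary_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)          # sensitivity (recall of the positive class)
    spec = tn / (tn + fp)         # specificity
    bac = (sen + spec) / 2        # balanced accuracy
    return {"ACC": acc, "SEN": sen, "SPEC": spec, "BAC": bac}

print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
```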

2.3 ADMV-Net framework

To address the common challenge of insufficient multimodal fusion in AD diagnosis, this study proposes a multi-view multimodal Alzheimer's disease diagnostic model, ADMV-Net. Existing approaches predominantly rely on simple concatenation or weighted aggregation, which are limited in their ability to simultaneously capture global macroscopic patterns, local key region features, and cross-modal semantic relationships. Moreover, population heterogeneity, label noise, and scanner/site differences further diminish fusion effectiveness and model generalizability. In response, ADMV-Net models inter-modality feature relationships from global, local, and semantic perspectives, incorporating a dynamic brain region weighting mechanism to significantly enhance diagnostic performance. The overall framework is illustrated in Figure 1.

Figure 1
Diagram illustrating a neural network architecture for brain imaging analysis. It uses PET, GM, and WM inputs processed by HCnet layers. The data passes through MLCA, BCAM, and GPM modules before reaching the RIPM module. Outputs include classifications for AD, CN, and MCI. The network involves concatenation and matrix addition steps.

Figure 1. The overall architecture of ADMV-Net. The diagram depicts the primary structural components of the model and the corresponding data processing workflow, illustrating the overall strategy for multimodal feature extraction and fusion.

2.4 Hybrid Convolution ResNet (HCNet)

With the widespread adoption of deep learning in medical image analysis, classic network architectures such as ResNet have become mainstream choices for multimodal feature extraction. However, traditional convolutional networks exhibit notable limitations when handling three-dimensional medical images. On one hand, standard convolutions struggle to capture precise anatomical boundaries. On the other hand, existing network designs fail to effectively balance global semantic information with local structural details. To address these challenges, we propose the Hybrid Convolution ResNet (HCNet). HCNet achieves efficient feature fusion and representation optimization through three parallel pathways.

Specifically, the Standard Convolution Path (SCP) builds a large receptive field by stacking deep multi-scale features, indirectly modeling global semantic information; the Semantic Difference-guided Path (SDP) introduces a deep feature map G as a diffusion guide, regulating local feature propagation around boundaries to enhance the perception of ambiguous structural interfaces. The third, a direct mapping path, preserves the integrity of input features to prevent information loss. The detailed architecture is illustrated in Figure 2.

Figure 2
Diagram illustrating a neural network model with two main components: SCP and SDP. SCP includes convolution (Conv), batch normalization (BN), ReLU activation, and another convolution layer. SDP features deep feature extraction (DF) and explicit-implicit differential convolution. Semantic guidance connects SDP outputs back to SCP. Symbols include 3D convolution, matrix operations, and legends indicating various components.

Figure 2. The workflow of HCNet. The dual-path convolutional structure processes global and boundary features separately and fuses them through an adaptive mechanism.

The SCP is based on 3D convolution combined with Batch Normalization and ReLU activation to extract spatial contextual relationships, responsible for capturing macroscopic features and global semantic information. The SDP focuses on overcoming the limitations of standard convolutions in precisely describing boundary regions, accurately characterizing ambiguous interfaces between anatomical structures. Functioning as a “push” mechanism, the SDP reduces boundary uncertainty between classes. This module simulates a nonlinear anisotropic diffusion process, extracting edge features through explicit and implicit differential kernels, while leveraging deep semantic guidance to enhance boundary representation.

Specifically, the iterative update formula of the SDP is given by:

\( \hat{F}_p^{t+1} = \sum_{p_e \in \delta_p} h\!\left( \left\| G_{p_e} - G_p \right\|_2 \right) \cdot \left( F_{p_e}^{t} - F_p^{t} \right) \)    (5)
\( F^{t+1} = \lambda \cdot F^{t} + \nu \cdot \hat{F}^{t+1} \)    (6)

where \( F^{t} \) denotes the feature map at the \( t \)-th iteration, \( G \) represents the semantic guidance feature from the deep decoder, \( h(\cdot) \) is a learnable nonlinear mapping function, \( \delta_p \) denotes the local neighborhood centered at position \( p \), and \( \lambda \) and \( \nu \) are adjustable weighting parameters.

Finally, the features extracted by SCP and SDP are adaptively fused and combined with those from the direct mapping path, integrating contextual information with boundary-guided cues to enhance the model's representation of ambiguous boundaries.
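To make the diffusion-style update of Equations 5 and 6 concrete, the sketch below applies it over a 6-connected voxel neighborhood built with torch.roll; the form of \( h(\cdot) \) (a 1 × 1 × 1 convolution with a sigmoid) and the scalar values of \( \lambda \) and \( \nu \) are assumptions, so this is an interpretation of the formulas rather than the released implementation.

```python
# Hedged sketch of the SDP boundary-diffusion update (Equations 5-6).
import torch
import torch.nn as nn

def sdp_update(F, G, h, lam=0.9, nu=0.1):
    """F, G: (B, C, D, H, W) feature map and deep semantic guidance map."""
    F_hat = torch.zeros_like(F)
    for dim in (2, 3, 4):            # spatial axes D, H, W
        for shift in (1, -1):        # both neighbours along each axis
            G_n = torch.roll(G, shifts=shift, dims=dim)
            F_n = torch.roll(F, shifts=shift, dims=dim)
            diff = torch.linalg.vector_norm(G_n - G, dim=1, keepdim=True)  # ||G_pe - G_p||_2
            F_hat = F_hat + h(diff) * (F_n - F)                            # Equation 5
    return lam * F + nu * F_hat                                            # Equation 6

h = nn.Sequential(nn.Conv3d(1, 1, kernel_size=1), nn.Sigmoid())  # assumed form of h(.)
F = torch.randn(1, 8, 16, 16, 16)
G = torch.randn(1, 8, 16, 16, 16)
print(sdp_update(F, G, h).shape)  # torch.Size([1, 8, 16, 16, 16])
```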

In ADMV-Net, we employ four HCNet modules to extract feature representations at different hierarchical levels. These multi-level feature maps not only demonstrate the model's capability to progressively capture global semantics and boundary details but also provide a rich foundation of multi-scale representations for subsequent modules. The resulting set of features \( \{ H_g^1, H_w^1, H_p^1, \ldots, H_g^4, H_w^4, H_p^4 \} \) serves as the input to the MVFL module, enabling multi-view fusion modeling.

2.5 Global Perception Module

In the task of multimodal Alzheimer's disease diagnosis, modeling global semantic information is crucial for capturing the brain's overall pathological features. However, the spatial resolution loss in deep features extracted by existing models often leads to blurred macroscopic semantic information. Additionally, global correlation patterns across multimodal data are difficult to model effectively through a single pathway. To address these issues, we propose the Global Perception Module (GPM). This module efficiently models global features across multimodal data by leveraging long-range dependency modeling and an adaptive fusion mechanism.

Specifically, we take the outputs of the last layer of the feature extraction network, \( H_g^4 \), \( H_w^4 \), and \( H_p^4 \), as the input data. The GPM first applies a 3D convolution for shallow feature extraction, resulting in a feature map F. Next, the key and value of F are fed into a Multi-Head Attention layer, where they interact with F's query to perform attention calculations and obtain enhanced features. Inspired by Yuan et al. (2021) and Zeng et al. (2022), we use depthwise separable convolution (DWConv) during feature modeling to capture local features and positional information, while removing explicit positional encoding.

Additionally, after the addition operation, \( H_g^4 \), \( H_w^4 \), and \( H_p^4 \) are fed into the Window Attention module to prevent feature shift between different branches. Here, “feature shift” refers to differences in the distributions of features from different modalities or scales, which can cause one branch to dominate during fusion and thus disrupt overall consistency. The addition operation provides initial alignment, while the sliding-window mechanism of Window Attention adaptively models local interactions, effectively mitigating such shifts and enhancing feature representation with relatively low computational cost.

It is noteworthy that multimodal fusion is influenced not only by biases introduced by confounding factors such as age, sex, scanner manufacturer, and imaging parameters, which can create spurious correlations and reduce model transferability, but also by potential statistical dependencies across different data sources arising from shared biological or pathological mechanisms, measurement procedures, or preprocessing steps. These dependencies further compromise the independence between modalities and the effectiveness of fusion strategies. To mitigate these effects, we introduce RegBN (Ghahremani Boozandani and Wachinger, 2023) prior to feature fusion. RegBN is a regularization-based batch normalization method specifically designed for multimodal data, which removes the need for learnable parameters. This not only simplifies the training and inference pipelines but also helps stabilize feature distributions across modalities. Overall, the data processing workflow of the GPM can be represented as:

\( F_g = \mathrm{FFN}\!\left( \mathrm{WA}\!\left( H_g^4 + H_w^4 + H_p^4 \right) \otimes \hat{H}_g^4 \otimes \hat{H}_w^4 \otimes \hat{H}_p^4 \right) \)    (7)

where \( \mathrm{WA}(\cdot) \) denotes the window attention mechanism, while \( \hat{H}_g^4 \), \( \hat{H}_w^4 \), and \( \hat{H}_p^4 \) represent the features processed through multi-head attention and RegBN, respectively. The processing procedure is described by Equation 8:

\( \hat{H}_g^4 = \mathrm{MHA}\!\left( H_g^4 \right), \quad \hat{H}_w^4 = \mathrm{RegBN}\!\left( \hat{H}_g^4, \mathrm{MHA}\!\left( H_w^4 \right) \right), \quad \hat{H}_p^4 = \mathrm{RegBN}\!\left( \hat{H}_w^4, \mathrm{MHA}\!\left( H_p^4 \right) \right) \)    (8)

Finally, the features fused by the Feed-Forward Network (FFN) are flattened and output as the global fusion feature \( F_g \in \mathbb{R}^{c_4} \). The overall architecture is shown in Figure 3.
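A heavily simplified sketch of this data flow is given below. Each modality's deep feature map is treated as a token sequence, the per-modality multi-head attention of Equation 8 is applied, a second attention block stands in for Window Attention on the summed branches, and RegBN is replaced by LayerNorm purely as a placeholder (RegBN is a separate published method); the element-wise fusion operator in Equation 7 is likewise an assumption.

```python
# Simplified, assumption-laden sketch of the GPM flow (Equations 7-8).
import torch
import torch.nn as nn

class GPMSketch(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)   # shared across modalities for brevity
        self.win = nn.MultiheadAttention(dim, heads, batch_first=True)   # stand-in for Window Attention
        self.norm = nn.LayerNorm(dim)                                    # placeholder for RegBN
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, h_g, h_w, h_p):              # each: (B, tokens, dim)
        a_g, _ = self.mha(h_g, h_g, h_g)           # Equation 8: per-modality MHA
        a_w, _ = self.mha(h_w, h_w, h_w)
        a_p, _ = self.mha(h_p, h_p, h_p)
        a_w, a_p = self.norm(a_w), self.norm(a_p)  # placeholder for RegBN alignment
        s = h_g + h_w + h_p                        # additive alignment of the three branches
        wa, _ = self.win(s, s, s)                  # attention on the summed features
        fused = wa * a_g * a_w * a_p               # assumed element-wise fusion (Equation 7)
        return self.ffn(fused).flatten(1)          # flattened global fusion feature F_g

x = torch.randn(2, 64, 128)
print(GPMSketch()(x, x, x).shape)  # torch.Size([2, 8192])
```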

Figure 3
Diagram of a neural network architecture with three parallel input paths labeled \(H_g^4\), \(H_w^4\), and \(H_p^4\), each undergoing convolution and multi-head attention. Outputs combine via matrix multiplication, then regularization through RegBN. Outputs are flattened and sent through a feedforward network (FFN) for output \(F_g\). Key denotes elements like 3D convolution, feedforward network, depthwise convolution, Gaussian Error Linear Unit, and matrix operations.

Figure 3. The workflow of GPM. Modeling of multimodal global semantics through multi-head attention and window mechanisms.

2.6 Multi-level Local Cross-modal Aggregation Module

In multimodal fusion tasks, MRI and PET modalities exhibit significant complementarity in local detail information. To further enhance the interaction of local information between modalities, we propose the Multi-level Local Cross-modal Aggregation Module (MLCA). This module integrates features from the first three residual blocks to achieve multi-scale semantic fusion. Additionally, it employs a bidirectional pathway and a learnable weighting mechanism to enable deep coupling of local information. The overall architecture is illustrated in Figure 4.

Figure 4
Diagram illustrating a neural network architecture with three layers. Each layer includes sub-components labeled as \( H_g, H_w, H_p \), followed by a convolutional block. Components are connected via max pooling, upsampling, and global average pooling, highlighted in a color-coded legend. Matrix addition is depicted with green circles. The flow ends with a flatten operation producing output \( F_l \).

Figure 4. The workflow of MLCA. It fuses local features from different modalities layer by layer and strengthens important feature regions through a weighting mechanism.

MLCA comprises three main components: (1) Channel alignment, (2) bidirectional feature interaction and reconstruction, and (3) cross-scale fusion and aggregation.

In the Channel alignment stage, the initial fusion features from MRI and PET are denoted as \( H_1, H_2, H_3 \in \mathbb{R}^{d_4 \times h_4 \times w_4 \times c_4} \). These features are projected into a unified 128-channel space via 3D convolutions to eliminate channel dimensional inconsistencies across modalities while preserving the original spatial structures. The aligned feature set is thus expressed as \( H_i \in \mathbb{R}^{d_4 \times h_4 \times w_4 \times 128}, \; i = 1, 2, 3 \).

The bidirectional feature interaction and reconstruction stage follows, wherein a dual-path mechanism–comprising top-down and bottom-up pathways–is employed to propagate and refine features across scales. To facilitate cross-scale integration and detailed enhancement, cross-scale weighted fusion nodes are introduced at each level. The output of each fusion node is formulated as:

\( F_{out} = \frac{\sum_i w_i \cdot U(F_i)}{\sum_i w_i + \epsilon} \)    (9)

where \( F_i \) represents the input features from different scales or directions, \( w_i \) is the learnable positive weight parameter, \( U(\cdot) \) denotes the upsampling or downsampling operation, and \( \epsilon \) is the stability factor.

Finally, in the cross-scale aggregation stage, all the fused local features are normalized to a fixed size of 4 × 4 × 4 through global adaptive pooling and then flattened into a one-dimensional vector to form the final local fusion feature \( F_m \).
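For clarity, a minimal sketch of the cross-scale weighted fusion node in Equation 9 is given below; the use of trilinear interpolation for the resampling operation \( U(\cdot) \) and ReLU to keep the weights positive are assumptions.

```python
# Sketch of a learnable weighted fusion node (Equation 9).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNode(nn.Module):
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))   # learnable per-input weights
        self.eps = eps

    def forward(self, feats, out_size):
        w = F.relu(self.w)                             # keep w_i non-negative
        resampled = [F.interpolate(f, size=out_size, mode="trilinear",
                                   align_corners=False) for f in feats]  # U(.) in Eq. 9
        fused = sum(wi * fi for wi, fi in zip(w, resampled))
        return fused / (w.sum() + self.eps)            # normalised weighted sum

node = FusionNode(n_inputs=2)
f1 = torch.randn(1, 128, 8, 8, 8)
f2 = torch.randn(1, 128, 4, 4, 4)
print(node([f1, f2], out_size=(8, 8, 8)).shape)  # torch.Size([1, 128, 8, 8, 8])
```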

2.7 Bilinear Contextual Attention Module (BCAM)

Although existing multimodal feature fusion methods have shown great potential in Alzheimer's disease detection tasks, they often overlook deep interactions between modalities, ignoring latent information. To address this issue, we propose the BCAM to enhance latent feature representation and improve the model's classification performance, as illustrated in Figure 5.

Figure 5
Flowchart of a neural network process. Three input streams, \( H^4_g \), \( H^4_w \), and \( H^4_p \), undergo Local Average Pooling (LAP), Global Average Pooling (GAP), reshaping, and various layers like convolution and activation (sigmoid). Outputs \( H^{ge}_g \), \( H^{ge}_w \), and \( H^{ge}_p \) are flattened and summed (\( H^1_{ge} \), \( H^2_{ge} \), \( H^3_{ge} \)). These are combined and processed with L2 normalization to produce \( F_b \). The diagram includes symbols for operations and paths.

Figure 5. The workflow of BCAM. Latent features are obtained using the outer product operation.

The BCAM begins by performing local and global average pooling on the input features to capture both regional details and overall modality information. After flattening the pooled features, they pass through 1D convolutions for channel dimension compression and weight allocation, thereby implementing a channel-level attention mechanism.

The features are then reshaped to their original dimensions and normalized to the [0,1] interval using the Sigmoid function, producing both local and global attention weights. BCAM employs a dynamic weight fusion strategy to integrate features from various receptive fields. The fused attention map is upsampled to restore its spatial dimensions and then multiplied element-wise with the original features to enhance the key regional information while suppressing redundant features.

After feature enhancement, BCAM introduces a cross-modal interaction mechanism to capture potential correlation patterns between modalities via an outer product operation, thus revealing deep semantic relationships between the modalities.

To reduce computational complexity, we first perform LAP (Local Average Pooling), GAP (Global Average Pooling), and flatten operations on \( H_p^4 \), \( H_g^4 \), and \( H_w^4 \) to obtain the dimensionality-reduced scale features \( H_p^{ge}, H_g^{ge}, H_w^{ge} \in \mathbb{R}^{c_4 \times d_a^2} \). Subsequently, by fusing these features in pairs, we acquire the interactive feature representations \( H_1^{ge}, H_2^{ge}, H_3^{ge} \in \mathbb{R}^{c_4 \times d_a^2 \times d_a^2} \).

\( H_1^{ge}(t) = F_{outer}\!\left( H_p^{ge}(t), H_g^{ge}(t) \right) \)    (10)
\( H_2^{ge}(t) = F_{outer}\!\left( H_g^{ge}(t), H_w^{ge}(t) \right) \)    (11)
\( H_3^{ge}(t) = F_{outer}\!\left( H_p^{ge}(t), H_w^{ge}(t) \right) \)    (12)

After the outer product operation, the three sets of fused features are summed. Following this, they are pooled and L2-normalized to obtain the latent feature representation that encapsulates cross-modal correlations.

\( F_b = \left\| \mathrm{sumpooling}\!\left( H_1^{ge} + H_2^{ge} + H_3^{ge} \right) \right\|_2 \in \mathbb{R}^{c_4} \)    (13)

Finally, the features \( F_g \), \( F_m \), and \( F_b \) are concatenated and reshaped to form the fused feature representation \( F_c \in \mathbb{R}^{c_4} \).
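The cross-modal outer-product interaction of Equations 10–13 can be sketched as follows; the pooled per-modality descriptors are assumed to already have shape (batch, c4, d) after LAP/GAP and flattening, and the pooling pipeline itself is omitted.

```python
# Hedged sketch of the BCAM latent interaction (Equations 10-13).
import torch
import torch.nn.functional as F

def bcam_latent(h_p, h_g, h_w):
    """h_*: (B, c4, d) pooled modality features; returns F_b of shape (B, c4)."""
    h1 = torch.einsum("bci,bcj->bcij", h_p, h_g)   # Eq. 10: PET x GM outer product
    h2 = torch.einsum("bci,bcj->bcij", h_g, h_w)   # Eq. 11: GM x WM
    h3 = torch.einsum("bci,bcj->bcij", h_p, h_w)   # Eq. 12: PET x WM
    summed = (h1 + h2 + h3).sum(dim=(2, 3))        # sum pooling over the interaction maps
    return F.normalize(summed, p=2, dim=1)         # Eq. 13: L2 normalisation

b = torch.randn(2, 64, 16)
print(bcam_latent(b, b, b).shape)  # torch.Size([2, 64])
```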

2.8 Regional Interest Perception Module (RIPM)

As a typical stage-progressive neurodegenerative disease, AD exhibits distinct patterns of brain region changes at different stages, which are crucial for early diagnosis and intervention (Qiu et al., 2024). To address this, we employ the Regional Interest Perception Module (RIPM) to identify key brain regions, as shown in Figure 6.

Figure 6
Flow chart illustrating a neural network process involving dimension change, matrix multiplication, and gate recurrent units (GRU). It starts with inputs \( y_c \) and \( F_c \), undergoing dimensional change, followed by matrix operations \( W_R^Q \), \( W_R^K \), and \( W_R^V \). Outputs pass through a GRU and softmax function, connecting to a fully connected layer (FC), producing output \( \bar{R} \). A box denotes GRU, another denotes FC, and green circles indicate matrix multiplication points. Labels include “define \( w_0 \)” and “update.”

Figure 6. The workflow of RIPM. The module dynamically adjusts the importance weights of different brain regions through t iterations to identify key brain regions.

In the implementation, we set the number of iterations to \( t \) and initialize the brain region weight matrix \( \omega_i \in \mathbb{R}^{1 \times 90} \). Through \( t \) iterations, the importance weights of different brain regions are dynamically adjusted to continuously optimize the weights. In the feature processing stage, the multimodal fusion feature \( y_c \in \mathbb{R}^{2} \) is first dimensionally transformed to generate the multimodal classification features and the initial Region of Interest (ROI) features \( R_0 \in \mathbb{R}^{1 \times 90} \). Subsequently, \( y_c \) is feature-transformed to generate the hidden feature \( c_R \in \mathbb{R}^{1 \times 90} \).

To deeply capture the interaction information between ROIs, we use the query weight matrix \( W_R^Q \in \mathbb{R}^{90 \times 90} \), key weight matrix \( W_R^K \in \mathbb{R}^{90 \times 90} \), and value weight matrix \( W_R^V \in \mathbb{R}^{90 \times 90} \). By performing linear mappings with \( c_R \) and \( R_0 \), we obtain \( Q_R \in \mathbb{R}^{c_4 \times 90} \), \( K_R \in \mathbb{R}^{c_4 \times 90} \), and \( V_R \in \mathbb{R}^{c_4 \times 90} \), calculated as follows:

\( Q_R = W_R^Q c_R, \quad K_R = W_R^K R_0, \quad V_R = W_R^V R_0 \)    (14)

Subsequently, the ROI weight matrix is continuously updated using a Gated Recurrent Unit (GRU).

\( R_h = \mathrm{softmax}\!\left( \left( \omega_{prev} Q_R \right) \left( \omega_{prev} K_R \right)^{T} \right) \cdot \omega_{prev} V_R \)    (15)
\( \omega_i = F_{GRU}\!\left( R_h^{T}, \omega_{prev}^{T} \right) \)    (16)

After \( t \) rounds of iterative updates, the final salient brain region feature \( R_h \in \mathbb{R}^{1 \times 90} \) is obtained. \( R_h \) is linearly mapped to generate the final ROI feature \( \tilde{R} \in \mathbb{R}^{90} \). Further, \( \tilde{R} \) is dimensionally transformed to obtain \( y_{cr} \in \mathbb{R}^{2} \), which is then weighted and fused with the multimodal classification feature \( y_c \) to form the final classification feature \( \hat{y}_c \in \mathbb{R}^{2} \). By dynamically adjusting the importance weights of different brain regions, this approach provides a more effective strategy for salient brain region extraction and multimodal feature optimization.
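A hedged sketch of this iterative weighting is given below, treating each of the 90 ROIs as a token whose scalar weight is refined by a GRU cell over the attention output; the per-ROI feature dimension and the use of nn.GRUCell are interpretations of Equations 14–16, not the released implementation (the public repository contains the exact code).

```python
# Interpretation-level sketch of the RIPM weight update (Equations 14-16).
import torch
import torch.nn as nn

class RIPMSketch(nn.Module):
    def __init__(self, n_roi=90, dim=64, n_iters=7):
        super().__init__()
        self.Wq = nn.Linear(dim, dim, bias=False)   # W_R^Q
        self.Wk = nn.Linear(dim, dim, bias=False)   # W_R^K
        self.Wv = nn.Linear(dim, dim, bias=False)   # W_R^V
        self.gru = nn.GRUCell(dim, 1)               # refines one scalar weight per ROI
        self.n_iters = n_iters

    def forward(self, c_r, r0):
        """c_r, r0: (n_roi, dim) hidden and initial ROI features for one subject."""
        omega = torch.ones(r0.size(0), 1)           # initial brain-region weights
        for _ in range(self.n_iters):
            q, k, v = self.Wq(c_r), self.Wk(r0), self.Wv(r0)
            attn = torch.softmax((omega * q) @ (omega * k).t(), dim=-1)  # Eq. 15
            r_h = attn @ (omega * v)                # weighted ROI interaction features
            omega = self.gru(r_h, omega)            # Eq. 16: GRU-style refinement
        return omega.squeeze(-1), r_h               # final weights and salient ROI features

m = RIPMSketch()
w, r = m(torch.randn(90, 64), torch.randn(90, 64))
print(w.shape, r.shape)  # torch.Size([90]) torch.Size([90, 64])
```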

3 Results

3.1 Optimal iteration number analysis

To determine the optimal interpretability parameters of ADMV-Net across the three classification tasks, we systematically evaluated the effect of the number of iterations in the RIPM module on both model performance and the stability of brain region weights, with the results presented in Table 1. From a quantitative interpretability perspective, the AD vs CN and MCI vs CN tasks achieved peak performance at seven iterations, indicating that a moderate number of iterations allows RIPM to effectively capture inter-regional brain interaction patterns while maintaining biologically plausible weight distributions. By comparison, the results for the AD vs MCI task indicate that excessive iterations may introduce noise features unrelated to disease pathology, thereby reducing the biological interpretability of the model. Based on these quantitative analyses, we employed the corresponding optimal number of iterations for each task, ensuring that the RIPM module provides stable and reliable weight assignments for each brain region, thus offering clinicians quantitative insights into region-specific importance.

Table 1

Table 1. Performance comparison across different tasks and datasets.

3.2 Multi-fold loss curve analysis

Through analysis of loss curves from multi-fold cross-validation, as shown in Figure 7, we found that the loss change trends across different folds remain consistent on both the large-scale ADNI dataset and the smaller AIBL dataset, strongly validating the stability of the method and reliability of the results.

Figure 7
Six line graphs showing the loss in ADNI and AIBL datasets over epochs for different comparisons: AD vs. CN, MCI vs. CN, and AD vs. MCI. Each graph includes ten folds represented in various colors, depicting a general decrease in loss over time.

Figure 7. Loss Curves for different tasks.

3.3 Ablation experiment results

3.3.1 Feature extraction network ablation

To comprehensively evaluate HCNet's performance in 3D medical image feature extraction, we conducted performance tests for three classification tasks on both ADNI and AIBL datasets. Results in Table 2 show that HCNet achieved optimal performance in AD vs CN tasks across both datasets. In the more challenging MCI vs CN and AD vs MCI tasks, HCNet achieved the highest accuracy (72.77% and 85.81%, respectively) and sensitivity (72.69% and 89.27%, respectively), demonstrating good generalization capability. Although its specificity was slightly lower than CMUNEX, HCNet still maintained significant advantages in overall performance. In contrast, ResNet18 and DWConv showed relatively unstable performance, further highlighting HCNet's advantages in identifying MCI patients. The statistical test results indicate that the improvements of HCNet in the main performance indicators are significant. To more intuitively demonstrate the comprehensive performance of each model across evaluation metrics, we created the radar chart shown in Figure 8. The area size in the chart reflects the overall performance of the model across four metrics: ACC, SEN, SPEC, and AUC, with larger areas indicating better model performance.

Table 2

Table 2. Performance comparison of different feature extraction networks on the three classification tasks.

Figure 8
Six radar charts compare four models: ResNet18, DWConv, CMUNEX, and HCNet across different conditions. The charts evaluate metrics including sensitivity (SEN), specificity (SPEC), accuracy (ACC), and area under the curve (AUC) for Alzheimer's disease (AD) versus cognitively normal (CN) and mild cognitive impairment (MCI) across ADNI and AIBL datasets. Each chart shows overlapping polygonal shapes representing the models' performance metrics.

Figure 8. Radar charts illustrating the performance of HCNet and other comparative models on different classification tasks across the ADNI and AIBL datasets. (a) AD vs CN task; (b) AD vs MCI task; (c) MCI vs CN task.

3.3.2 Module ablation

To validate the effectiveness of each component in ADMV-Net, we conducted systematic ablation experiments, with the results presented in Table 3. Analysis of the individual contributions of each module indicates that the GPM module, through its multi-head self-attention mechanism, establishes long-range dependencies. Combined with the brain region weighting of RIPM, it achieved 92.43% ACC and 94.57% AUC in the AD vs CN task, demonstrating the foundational role of global semantic awareness. The MLCA module, leveraging bidirectional feature interaction for deep cross-scale local information aggregation, further enhanced performance when integrated with RIPM, reaching 93.25% ACC and 94.68% AUC. The BCAM module, which explores latent inter-modality associations via outer product operations, achieved 91.95% ACC with RIPM alone, but its deep semantic modeling capabilities became evident in subsequent module combinations.

Table 3

Table 3. Performance comparison of ablation results in the three classification tasks.

The synergistic effects between modules were more pronounced. The combined configuration of GPM and MLCA (GPM+MLCA+RIPM) increased performance to 94.73% ACC and 94.95% AUC, with SEN reaching 95.56%, fully demonstrating the complementarity between global and local features. Similarly, the combination of GPM and BCAM (GPM+BCAM+RIPM) achieved 93.86% ACC, highlighting the effective integration of global awareness and latent feature learning. Critical ablation experiments showed that removing RIPM from the three core modules (GPM+MLCA+BCAM) led to a decrease in performance to 93.59% ACC and 94.22% AUC, underscoring the key role of brain region weighting.

When the complete model integrated all four components, it achieved optimal performance across all tasks: 94.83% ACC and 95.97% AUC for AD vs CN, 72.77% ACC and 76.84% AUC for MCI vs CN, and 85.81% ACC and 88.93% AUC for AD vs MCI. These results fully validate the effectiveness of the multi-view fusion architecture and the complementary synergistic contributions of its components.

3.4 Comparison experiment

3.4.1 Cross-dataset validation

To validate the cross-dataset generalization capability of ADMV-Net, we designed rigorous cross-validation experiments. After training the model on the ADNI dataset, we tested it on the AIBL dataset to evaluate the model's adaptability and stability across different datasets. To balance potential bias between sensitivity (SEN) and specificity (SPEC) caused by different data distributions, we introduced balanced accuracy (BAC) as a core evaluation metric. Experimental results shown in Table 4 demonstrate that ADMV-Net not only achieved the highest accuracy (ACC) and area under the curve (AUC) across all three tasks (92.37%/94.51%, 71.66%/74.86%, and 90.78%/88.07%, respectively), but also consistently outperformed other comparative methods on the BAC metric, fully demonstrating its strong discriminative capability and consistency across different cognitive impairment classification tasks. A more intuitive representation is shown in Figure 9, which displays a heatmap of BAC for different models, where colors closer to blue-green indicate better performance.

Table 4

Table 4. Cross-dataset validation results.

Figure 9
Heatmap showing BAC performance across methods and tasks, with categories AD vs CN, MCI vs CN, and AD vs MCI. Performance values range from 69.11 to 92.93 with colors shifting from light yellow to dark blue. Methods compared are OLFG, MDLNet, Diamond, and ADMV-Net.

Figure 9. Heatmaps of BAC scores for different models in cross-dataset validation. The closer the color is to blue-green, the better the model's performance.

3.4.2 Model comparison

To comprehensively evaluate the effectiveness of the proposed method, we conducted a systematic comparison with state-of-the-art approaches across three classification tasks. The experimental results are presented in Table 5. In the AD vs CN task, ADMV-Net achieved an accuracy (ACC) of 94.83% and an area under the ROC curve (AUC) of 95.97%, outperforming the current best-performing model, Diamond, which attained 92.37% and 94.53%, respectively. Furthermore, sensitivity (SEN) and specificity (SPEC) reached 94.67% and 93.76%, respectively, indicating that the model not only improves overall classification accuracy but also maintains a low false-positive rate.

Table 5

Table 5. Performance comparison with state-of-the-art methods.

In terms of computational efficiency, ADMV-Net also demonstrated substantial advantages. The model comprises only 11.04 million parameters, markedly fewer than OLFG's 34.25M and Diamond's 24.53M, while requiring 18.95 GFLOPs, far lower than OLFG's 133.47 GFLOPs. This computational efficiency enhances the feasibility of deploying ADMV-Net on standard clinical hardware, enabling support for real-time clinical decision-making. For visual illustration, Figure 10 presents the ROC curves of the models, where a larger area under the curve indicates superior performance.

Figure 10
Three ROC curve graphs compare model performance. (a) Shows high AUC scores: OLFG (0.9028), MDLNet (0.9264), Diamond (0.9353), and ADMV-Net (0.9557). (b) Shows lower scores: OLFG (0.7094), MDLNet (0.7536), Diamond (0.7399), ADMV-Net (0.7684). (c) Shows moderate scores: OLFG (0.8434), MDLNet (0.8239), Diamond (0.8217), ADMV-Net (0.8893). Random chance line (AUC = 0.5) is present in all graphs.

Figure 10. ROC curves of different models for the three classification tasks. (a) AD vs CN task; (b) MCI vs CN task; (c) AD vs MCI task.

In the more challenging MCI vs CN task, ADMV-Net continued to exhibit strong discriminative capability and stability, achieving an ACC of 72.77%, AUC of 76.84%, and SPEC of 76.41%, thereby validating its effectiveness for early screening of mild cognitive impairment. In the AD vs MCI task, the model maintained leading performance with ACC and AUC values of 85.81% and 88.93%, respectively, and achieved a favourable balance between SEN (89.27%) and SPEC (84.56%), further demonstrating its robust performance in distinguishing different stages of cognitive impairment and highlighting its potential for clinical application.

4 Discussion

4.1 Model performance and core advantages

Based on the comprehensive experimental results from Tables 2–5, ADMV-Net demonstrates significant performance advantages in multimodal Alzheimer's disease diagnostic tasks. These advantages are primarily reflected in the following three synergistic improvements.

In feature extraction, traditional 3D convolution methods (Kim et al., 2024; Gao et al., 2023; Pandey et al., 2024) struggle to balance global semantic information with local details in medical image processing. To address this issue, we propose a dual-pathway convolution structure, HCNet. This structure captures global semantic and local edge information through parallel channels, effectively resolving this contradiction and significantly improving the accuracy and completeness of feature representation.

Regarding multimodal fusion strategies, existing research primarily employs simple feature concatenation or weighted averaging methods (Lin et al., 2017; Liu et al., 2023a; Bravo-Ortiz et al., 2024), which fail to fully exploit deep complementary information between different modalities. To address this limitation, we designed the MVFL mechanism, which performs interactive fusion of sMRI and PET features from three views: global (GPM), local (MLCA), and latent association (BCAM). Ablation experiments validated the effectiveness of MVFL in consistently improving model performance and enhancing fine-grained cognitive difference capture capabilities, demonstrating the advantages of multi-view fusion.

Furthermore, to enhance model interpretability, we utilize RIPM to automatically learn brain region weight matrices through a data-driven approach, highlighting key brain regions associated with disease and reducing dependence on traditional population-based statistical region-of-interest methods (Kwon, 2023; Qiao et al., 2022; Cao et al., 2017).

4.2 Performance comparison and method evaluation

Compared to current mainstream methods, ADMV-Net demonstrates significant advantages across all tasks. Specifically, in the AD vs CN task, our model improves accuracy from Diamond's (Li et al., 2024) 92.37% to 94.83%, while AUC also increases from 94.53% to 95.97%. In the more challenging MCI vs CN task, ADMV-Net continues to outperform methods such as OLFG and MDLNet, fully demonstrating the sensitivity of the multi-view fusion strategy to subtle cognitive differences. More importantly, cross-dataset validation shows that under the ADNI training and AIBL testing setup, ADMV-Net maintains leading performance, indicating good adaptability to changes in data distribution. Notably, some studies (Chen et al., 2025; Zhang et al., 2023) indicate that validation on only a single dataset without cross-dataset validation fails to comprehensively assess model generalization capability, thereby limiting application potential in real-world scenarios. In contrast, the cross-dataset validation employed in this paper further highlights ADMV-Net's advantages in model robustness and practicality.

By analyzing the model's misclassification cases, we observed that the primary failure modes of ADMV-Net are concentrated around borderline cases. In the AD vs MCI task, misclassifications predominantly occurred for early-stage AD patients (misidentified as MCI) and late-stage MCI patients (misidentified as AD). In the MCI vs CN task, errors were mainly associated with the identification of mild MCI patients, aligning closely with the known challenges in clinical diagnosis. These findings underscore the inherent difficulty of early Alzheimer's disease detection and provide valuable guidance for future model refinement.

4.3 Clinical implications

ADMV-Net demonstrates considerable potential for clinical application. By leveraging both sMRI and PET data, the method can significantly enhance the accuracy of early Alzheimer's disease diagnosis, particularly in distinguishing AD patients from cognitively normal individuals (CN). This capability enables clinicians to identify high-risk individuals at an earlier stage, allowing timely intervention and potentially improving patient prognosis. Moreover, the RIPM module further highlights the importance of specific brain regions, providing clinicians with a clearer understanding of the model's decision-making rationale, thereby increasing confidence in diagnostic outcomes and supporting informed clinical decision-making. Collectively, these advantages suggest that ADMV-Net is not only suitable for early diagnosis but can also assist in long-term disease monitoring, tracking disease progression, and evaluating treatment efficacy, offering a more comprehensive resource for clinical management.

4.4 Research limitations and future prospects

While this study has achieved promising results, several limitations remain that warrant further improvement. First, the analysis was based solely on sMRI and PET modalities, a choice primarily motivated by data availability and methodological comparability. Nonetheless, other imaging modalities, such as fMRI and DTI, offer valuable insights into functional network dynamics and white matter structural connectivity, which are also critical for the early detection of Alzheimer's disease. Future work will aim to extend multimodal integration and multi-omics fusion to achieve a more precise and comprehensive modelling of disease mechanisms.

Second, the current validation relied exclusively on the publicly available ADNI and AIBL datasets. Although these datasets are of high quality and well-standardized, they may not fully capture the complexity of real-world clinical settings. In future studies, we plan to conduct multicentre clinical validation, incorporating both retrospective analyses and prospective studies to evaluate the model's performance in practical diagnostic workflows, thereby enhancing the robustness and clinical applicability of ADMV-Net.

Finally, this study employed cross-sectional data analysis, focusing on the static differentiation of distinct cognitive states. While effective in distinguishing AD, MCI, and CN conditions, it lacks dynamic modelling of disease progression. Future research will integrate longitudinal data to track multimodal imaging changes over time, enabling the development of predictive models for disease progression and providing guidance on optimal timing for clinical interventions.

5 Conclusion

The ADMV-Net model proposed in this study demonstrates excellent performance in multimodal Alzheimer's disease diagnostic tasks, effectively fusing complementary information from MRI and PET while enhancing feature representation through multi-view mechanisms and accurately identifying key brain regions. Experimental results show that ADMV-Net outperforms existing advanced methods across multiple classification tasks, achieving 94.83% accuracy and 95.97% AUC in AD versus CN classification tasks, with good generalization capability and robustness. This model not only achieves important technical breakthroughs and proposes innovative solutions in multi-view fusion and feature extraction, but also provides strong technical support for early AD diagnosis and clinical applications. In the future, we will continue to deepen related research and promote further development and application of multimodal deep learning in the field of neurodegenerative disease diagnosis.

Data availability statement

Publicly available datasets were analyzed in this study. The data can be found here: the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, accessible via the LONI Image and Data Archive (IDA) at https://adni.loni.usc.edu/data-samples/adnidata/ (Accession number: sa000002 on NIAGADS), and the Australian Imaging, Biomarkers and Lifestyle Study of Ageing (AIBL) dataset, also available through the LONI IDA platform (project “AIBL”) at https://ida.loni.usc.edu/collaboration/access/appApply.jsp?project=AIBL. Both datasets are publicly available and de-identified.

Ethics statement

The studies involving humans were approved by the institutional review boards of the participating sites. The ADNI database was collected under protocols approved by institutional review boards, with ethics oversight and informed consent at each participating site (https://adni.loni.usc.edu/wp-content/uploads/2017/09/ADNID_Approved_Protocol_11.19.14.pdf). The AIBL study protocol received approval from institutional ethics committees including St. Vincent's Health, the University of Melbourne (HREC No. 028/06), Hollywood Private Hospital, Austin Health, and Edith Cowan University, and all participants provided written informed consent (https://pmc.ncbi.nlm.nih.gov/articles/PMC11491991/). The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants' legal guardians/next of kin in accordance with the national legislation and institutional requirements.

Author contributions

JF: Resources, Writing – review & editing. XZ: Validation, Visualization, Writing – original draft. ZL: Software, Writing – review & editing. YD: Resources, Supervision, Writing – review & editing. FW: Resources, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by the Interdisciplinary Project of Dalian University (DLUXK-2023-ZD-001).

Acknowledgments

The authors sincerely thank Dalian University and the Affiliated Xinhua Hospital of Dalian University for their key support and invaluable assistance throughout this work. The authors would like to extend their sincere gratitude to the ADNI and AIBL databases for their invaluable support.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Gen AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Alinsaif, S. (2025). Dca-enhanced Alzheimer's detection with shearlet and deep learning integration. Comput. Biol. Med. 185:109538. doi: 10.1016/j.compbiomed.2024.109538

Alzheimer's Disease (2023). Alzheimer's disease facts and figures. Alzheimers. Dement. 19, 1598–1695. doi: 10.1002/alz.13016

Bravo-Ortiz, M. A., Guevara-Navarro, E., Holguín-García, S. A., Rivera-Garcia, M., Cardona-Morales, O., Ruz, G. A., et al. (2024). SpectroCVT-Net: a convolutional vision transformer architecture and channel attention for classifying Alzheimer's disease using spectrograms. Comput. Biol. Med. 181:109022. doi: 10.1016/j.compbiomed.2024.109022

Cao, P., Shan, X., Zhao, D., Huang, M., and Zaiane, O. (2017). Sparse shared structure based multi-task learning for mri based cognitive performance prediction of Alzheimer's disease. Pattern Recogn. 72, 219–235. doi: 10.1016/j.patcog.2017.07.018

Chen, J., Wang, Y., Zeb, A., Suzauddola, M., Wen, Y., Initiative, A. D. N., et al. (2025). Multimodal mixing convolutional neural network and transformer for Alzheimer's disease recognition. Expert Syst. Appl. 259:125321. doi: 10.1016/j.eswa.2024.125321

Chen, L., Qiao, H., and Zhu, F. (2022). Alzheimer's disease diagnosis with brain structural mri using multiview-slice attention and 3D convolution neural network. Front. Aging Neurosci. 14:871706. doi: 10.3389/fnagi.2022.871706

Chen, Z., Liu, Y., Zhang, Y., Li, Q., and Alzheimer's Disease Neuroimaging Initiative. (2023). Orthogonal latent space learning with feature weighting and graph learning for multimodal Alzheimer's disease diagnosis. Med. Image Anal. 84:102698. doi: 10.1016/j.media.2022.102698

Choudhury, C., Goel, T., and Tanveer, M. (2024). A coupled-gan architecture to fuse MRI and pet image features for multi-stage classification of Alzheimer's disease. Inf. Fusion 109:102415. doi: 10.1016/j.inffus.2024.102415

Damulina, A., Pirpamer, L., Soellradl, M., Sackl, M., Tinauer, C., Hofer, E., et al. (2020). Cross-sectional and longitudinal assessment of brain iron level in alzheimer disease using 3-T MRI. Radiology 296, 619–626. doi: 10.1148/radiol.2020192541

Fjell, A. M., Walhovd, K. B., Fennema-Notestine, C., McEvoy, L. K., Hagler, D. J., Holland, D., et al. (2010). Csf biomarkers in prediction of cerebral and clinical change in mild cognitive impairment and Alzheimer's disease. J. Neurosci. 30, 2088–2101. doi: 10.1523/JNEUROSCI.3785-09.2010

Gao, X., Cai, H., and Liu, M. (2023). A hybrid multi-scale attention convolution and aging transformer network for Alzheimer's disease diagnosis. IEEE J. Biomed Health Inform. 27, 3292–3301. doi: 10.1109/JBHI.2023.3270937

Ghahremani Boozandani, M., and Wachinger, C. (2023). Regbn: batch normalization of multimodal data with regularization. Adv. Neural Inf. Process. Syst. 36, 21687–21701. doi: 10.48550/arXiv.2310.00641

Hu, Z., Wang, Z., Jin, Y., and Hou, W. (2023). VGG-TSwinformer: transformer-based deep learning model for early Alzheimer's disease prediction. Comput. Methods Programs Biomed. 229:107291. doi: 10.1016/j.cmpb.2022.107291

Huang, J., Lin, L., Yu, F., He, X., Song, W., Lin, J., et al. (2024). Parkinson's severity diagnosis explainable model based on 3D multi-head attention residual network. Comput. Biol. Med. 170:107959. doi: 10.1016/j.compbiomed.2024.107959

Jia, N., Jia, T., Zhao, L., Ma, B., and Zhu, Z. (2024). Multi-modal global-and local-feature interaction with attention-based mechanism for diagnosis of Alzheimer's disease. Biomed. Signal Process. Control 95:106404. doi: 10.1016/j.bspc.2024.106404

Kim, S. K., Duong, Q. A., and Gahm, J. K. (2024). Multimodal 3d deep learning for early diagnosis of Alzheimer's disease. IEEE Access 9, 67660–67666. doi: 10.1109/ACCESS.2024.3381862

Kwon, M. J. (2023). Changes of the texture and volume of brain MRI in suspected non Alzheimer pathology and Alzheimer's disease. Alzheimers. Dement. 19:e077646. doi: 10.1002/alz.077646

Lei, B., Liang, Y., Xie, J., Wu, Y., Liang, E., Liu, Y., et al. (2024). Hybrid federated learning with brain-region attention network for multi-center Alzheimer's disease detection. Pattern Recogn. 153:110423. doi: 10.1016/j.patcog.2024.110423

Li, Y., Ghahremani, M., Wally, Y., and Wachinger, C. (2024). Diamond: dementia diagnosis with multi-modal vision transformers using MRI and pet. arXiv [preprint] arXiv:2410.23219. doi: 10.1109/WACV61041.2025.00021

Liang, S., Chen, T., Ma, J., Ren, S., Lu, X., and Du, W. (2024). Identification of mild cognitive impairment using multimodal 3D imaging data and graph convolutional networks. Phys. Med. Biol. 69:235002. doi: 10.1088/1361-6560/ad8c94

Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Honolulu, HI: IEEE), 2117–2125.

Liu, F., Wang, H., Liang, S.-N., Jin, Z., Wei, S., Li, X., et al. (2023a). MPS-FFA: a multiplane and multiscale feature fusion attention network for Alzheimer's disease prediction with structural mri. Comput. Biol. Med. 157:106790. doi: 10.1016/j.compbiomed.2023.106790

Liu, H., Cai, H., Yang, D., Zhu, W., Wu, G., and Chen, J. (2023b). Learning pyramidal multi-scale harmonic wavelets for identifying the neuropathology propagation patterns of Alzheimer's disease. Med. Image Anal. 87:102812. doi: 10.1016/j.media.2023.102812

Lu, P., Hu, L., Mitelpunkt, A., Bhatnagar, S., Lu, L., and Liang, H. (2024). A hierarchical attention-based multimodal fusion framework for predicting the progression of Alzheimer's disease. Biomed. Signal Process. Control 88:105669. doi: 10.1016/j.bspc.2023.105669

Pandey, P. K., Pruthi, J., Khan, S. B., Alkhaldi, N. A., and Saraee, D. (2024). Improved Alzheimer's detection with a modified multi-focus attention mechanism using computational techniques. Recent. Pat. Eng. 19:0118722121312906240913012729. doi: 10.2174/0118722121312906240913012729

Pei, Z., Wan, Z., Zhang, Y., Wang, M., Leng, C., and Yang, Y.-H. (2022). Multi-scale attention-based pseudo-3d convolution neural network for Alzheimer's disease diagnosis using structural MRI. Pattern Recogn. 131:108825. doi: 10.1016/j.patcog.2022.108825

Petersen, R. C., Aisen, P. S., Beckett, L. A., Donohue, M. C., Gamst, A. C., Harvey, D. J., et al. (2010). Alzheimer's disease neuroimaging initiative (ADNI): clinical characterization. Neurology 74, 201–209. doi: 10.1212/WNL.0b013e3181cb3e25

Qiao, J., Wang, R., Liu, H., Xu, G., and Wang, Z. (2022). Brain disorder prediction with dynamic multivariate spatio-temporal features: application to Alzheimer's disease and autism spectrum disorder. Front. Aging Neurosci. 14:912895. doi: 10.3389/fnagi.2022.912895

Qiu, Z., Yang, P., Xiao, C., Wang, S., Xiao, X., Qin, J., et al. (2024). 3D multimodal fusion network with disease-induced joint learning for early Alzheimer's disease diagnosis. IEEE Trans. Med. Imaging. 43, 3161–3175. doi: 10.1109/TMI.2024.3386937

Ramani, R., Ganesh, S. S., Rao, S., and Aggarwal, N. (2024). Integrated multi-modal 3D-CNN and RNN approach with transfer learning for early detection of Alzheimer's disease. Iran. J. Sci. Technol. Trans. Electr. Eng., 1–25. doi: 10.1007/s40998-024-00769-z

Sang, Y., and Li, W. (2024). Classification study of Alzheimer's disease based on self-attention mechanism and DTI imaging using GCN. IEEE Access 12, 24387–24395. doi: 10.1109/ACCESS.2024.3364545

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). “Grad-CAM: visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Venice: IEEE), 618–626.

Tan, M., Pang, R., and Le, Q. V. (2020). “EfficientDet: scalable and efficient object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Seattle, WA: IEEE), 10781–10790.

Tang, C., Xi, M., Sun, J., Wang, S., Zhang, Y., Initiative, A. D. N., et al. (2024a). MACFNet: detection of Alzheimer's disease via multiscale attention and cross-enhancement fusion network. Comput. Methods Programs Biomed. 254:108259. doi: 10.1016/j.cmpb.2024.108259

Tang, Y., Xiong, X., Tong, G., Yang, Y., and Zhang, H. (2024b). Multimodal diagnosis model of Alzheimer's disease based on improved transformer. Biomed. Eng. Online 23:8. doi: 10.1186/s12938-024-01204-4

Wan, D., Lu, R., Shen, S., Xu, T., Lang, X., and Ren, Z. (2023). Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 123:106442. doi: 10.1016/j.engappai.2023.106442

Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. (2018). “CBAM: convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV) (Munich: ECCV), 3–19.

Wu, H., Luo, J., Lu, X., and Zeng, Y. (2022). 3D transfer learning network for classification of Alzheimer's disease with MRI. Int. J. Mach. Learn. Cybern. 13, 1997–2011. doi: 10.1007/s13042-021-01501-7

Yan, F., Peng, L., Dong, F., and Hirota, K. (2025). MCNEL: a multi-scale convolutional network and ensemble learning for Alzheimer's disease diagnosis. Comput. Methods Programs Biomed. 2025:108703. doi: 10.1016/j.cmpb.2025.108703

Ye, J., Zeng, A., Pan, D., Zhang, Y., Zhao, J., Chen, Q., et al. (2024). MAD-Former: a traceable interpretability model for Alzheimer's disease recognition based on multi-patch attention. IEEE J. Biomed. Health Inform. 28, 3637–3648. doi: 10.1109/JBHI.2024.3368500

You, X., Ding, M., Zhang, M., Zhang, H., Yu, Y., Yang, J., et al. (2023). PnPNet: pull-and-push networks for volumetric segmentation with boundary confusion. arXiv [preprint] arXiv:2312.08323. doi: 10.48550/arXiv.2312.08323

Yu, R., Peng, C., Zhu, J., Chen, M., and Zhang, R. (2025). Weighted multi-modal contrastive learning based hybrid network for Alzheimer's disease diagnosis. IEEE Trans. Neural Syst. Rehabil. Eng. 33, 1135–1144. doi: 10.1109/TNSRE.2025.3549730

Yuan, Y., Fu, R., Huang, L., Lin, W., Zhang, C., Chen, X., et al. (2021). HRFormer: high-resolution transformer for dense prediction. arXiv [preprint] arXiv:2110.09408. doi: 10.48550/arXiv.2110.09408

Zeng, W., Jin, S., Liu, W., Qian, C., Luo, P., Ouyang, W., et al. (2022). “Not all tokens are equal: Human-centric visual analysis via token clustering transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (New Orleans, LA: IEEE), 11101–11111.

Zhang, G., Nie, X., Liu, B., Yuan, H., Li, J., Sun, W., et al. (2023). A multimodal fusion method for Alzheimer's disease based on DCT convolutional sparse representation. Front. Neurosci. 16:1100812. doi: 10.3389/fnins.2022.1100812

Zhang, M., Cui, Q., Lü, Y., and Li, W. (2024). A feature-aware multimodal framework with auto-fusion for Alzheimer's disease diagnosis. Comput. Biol. Med. 178:108740. doi: 10.1016/j.compbiomed.2024.108740

Keywords: Alzheimer's disease, multimodal fusion, multi-view learning, cross-modal attention, neuroimaging

Citation: Feng J, Zhao X, Liu Z, Ding Y and Wang F (2025) A multi-view multimodal deep learning framework for Alzheimer's disease diagnosis. Front. Neurosci. 19:1658776. doi: 10.3389/fnins.2025.1658776

Received: 03 July 2025; Accepted: 10 September 2025;
Published: 01 October 2025.

Edited by:

Yuqi Fang, Nanjing University, China

Reviewed by:

Dhivviyanandam Irudayaraj, North Bengal St. Xavier's College, India
Hui Cheng, University of Hertfordshire, United Kingdom

Copyright © 2025 Feng, Zhao, Liu, Ding and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jianxin Feng, fengjianxin863@163.com; Feng Wang, 807218135@qq.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.