Adaptive Multimodal Neuroimage Integration for Major Depression Disorder Detection

Major depressive disorder (MDD) is one of the most common mental health disorders that can affect sleep, mood, appetite, and behavior of people. Multimodal neuroimaging data, such as functional and structural magnetic resonance imaging (MRI) scans, have been widely used in computer-aided detection of MDD. However, previous studies usually treat these two modalities separately, without considering their potentially complementary information. Even though a few studies propose integrating these two modalities, they usually suffer from significant inter-modality data heterogeneity. In this paper, we propose an adaptive multimodal neuroimage integration (AMNI) framework for automated MDD detection based on functional and structural MRIs. The AMNI framework consists of four major components: (1) a graph convolutional network to learn feature representations of functional connectivity networks derived from functional MRIs, (2) a convolutional neural network to learn features of T1-weighted structural MRIs, (3) a feature adaptation module to alleviate inter-modality difference, and (4) a feature fusion module to integrate feature representations extracted from two modalities for classification. To the best of our knowledge, this is among the first attempts to adaptively integrate functional and structural MRIs for neuroimaging-based MDD analysis by explicitly alleviating inter-modality heterogeneity. Extensive evaluations are performed on 533 subjects with resting-state functional MRI and T1-weighted MRI, with results suggesting the efficacy of the proposed method.


INTRODUCTION
Major depressive disorder (MDD) is one of the most common mental health disorders, affecting as many as 300 million people annually (Organization et al., 2017). This disease is generally characterized by depressed mood, diminished interests, and impaired cognitive function (Alexopoulos, 2005;Pizzagalli et al., 2008;Otte et al., 2016). Despite decades of research in basic science, clinical neuroscience and psychiatry, the pathological, and biological mechanisms of major depression remain unclear (Holtzheimer III and Nemeroff, 2006). The traditional diagnosis of MDD mainly depends on criteria from the diagnostic and statistical manual of mental disorders (DSM) and treatment response (Papakostas, 2009), which could be subjective and susceptible. As a robust complement to clinical neurobehavior-based detection, computer-aided diagnosis based on neuroimaging data hold the promise of objective diagnosis and prognosis of mental disorders (Foti et al., 2014;Liu and Zhang, 2014;Bron et al., 2015;Shi et al., 2018;Zhang L. et al., 2020;Buch and Liston, 2021).
Multiple neuroimaging modalities, such as resting-state functional magnetic resonance imaging (rs-fMRI) and structural MRI (sMRI), can provide complementary information in discovering objective disease biomarkers, and have been increasingly employed in automated diagnosis of various brain disorders (Hinrichs et al., 2011). Resting-state fMRI helps capture large-scale abnormality or dysfunction on functional connectivity network (FCN) by measuring boldoxygen-level-dependent (BOLD) signals of subjects (Van Den Heuvel and Pol, 2010;Wang et al., 2019;Zhang Y. et al., 2020;Sun et al., 2021), and thus, can measure hemodynamic response related to neural activity in the brain dynamically. Structural MRI provides relatively high-resolution structural information of the brain, enabling us to study pathological changes in different brain tissues, such as gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF) (Cuadra et al., 2005). It is critical to integrate rs-fMRI and sMRI data to facilitate automated diagnosis of MDD and related disorders.
Existing neuroimaging-based MDD studies usually focus on discovering structural or functional imaging biomarkers, by employing various machine learning approaches such as support vector machines (SVM), Gaussian process classifier (GPC), linear discriminant analysis (LDA), and deep neural networks (Sato et al., 2015;Bürger et al., 2017;Rubin-Falcone et al., 2018;Li et al., 2021). However, these methods generally ignore the potentially complementary information conveyed by functional and structural MRIs. Several recent studies propose to employ functional and structural MRIs for MDD analysis, but they usually suffer from significant inter-modality data discrepancy (Fu et al., 2015;Maglanoc et al., 2020;Ge et al., 2021).
In this article, we propose an adaptive multimodal neuroimage integration (AMNI) framework for automated MDD detection using functional and structural MRI data. As shown in Figure 1, the proposed AMNI consists of four major components: (1) a graph convolutional network (GCN) for extracting feature representations of functional connectivity networks derived from rs-fMRI scans; (2) a convolutional neural network (CNN) for extracting features representations of T1-weighted sMRI scans; (3) a feature adaptation module for alleviating intermodality difference by minimizing a cross-modal maximum mean discrepancy (MMD) loss; and (4) a feature fusion module for integrating features of two modalities for classification (via Softmax). Experimental results on 533 subjects from the RESTmeta-MDD Consortium (Yan et al., 2019) demonstrate the effectiveness of AMNI in MDD detection.
The major contributions of this work are summarized below: • An adaptive integration framework is developed to fuse functional and structural MRIs for automated MDD diagnosis by taking advantage of the complementary information of the two modalities. This is different from previous approaches that focus on only discovering structural or functional imaging biomarkers for MDD analysis. • A feature adaptation strategy is designed to explicitly reduce the inter-modality difference by minimizing a cross-modal maximum mean discrepancy loss to re-calibrate features extracted from two heterogeneous modalities. • Extensive experiments on 533 subjects with rs-fMRI and sMRI scans have been performed to validate the effectiveness of the proposed method in MDD detection.
The rest of this article is organized as follows. In Section 2, we briefly review the most relevant studies. In Section 3, we first introduce the materials and then present the proposed method as well as implementation details. In Section 4, we introduce the experimental settings and report the experimental results. In Section 5, we investigate the effect of several key components in the proposed method and discuss limitations as well as possible future research directions. We finally conclude this article in Section 6.

RELATED WORK
In this section, we briefly introduce the most relevant studies on structural and functional brain MRI analysis, as well as multimodal neuroimaging-based diagnosis of brain disorders.

Brain Structural MR Imaging Analysis
Currently, MRI is the most sensitive imaging test of the brain in routine clinical practice. Structural MRIs can non-invasively capture the internal brain structure and atrophy, assisting us to understand the brain anatomical changes caused by various mental disorders. Conventional sMRI-based MDD analysis is usually performed manually by human beings via visual assessment (Scheltens et al., 1992), which could be subjective and susceptible. To this end, many machine learning methods (Gao et al., 2018), such as support vector machines (SVM), Gaussian process classifier (GPC), and linear discriminant analysis (LDA), have been used for automated MRI-based MDD diagnosis. However, these methods generally rely on handcrafted MRI features and these features may be suboptimal for subsequent analysis, thus significantly limiting their practical utility. In recent years, deep learning methods such as convolutional neural networks (CNNs) have been widely used in the fields of computer vision and medical image analysis (Yue-Hei Ng et al., 2015;Chen et al., 2016;Zhang L. et al., 2020). As a special type of multi-layer neural network, CNN is capable of automatic feature learning, which eliminates the subjectivity in extracting and selecting informative features for specific tasks (Lee et al., 2017). Based on the LeNet5 network, Sarraf and Tofighi (2016)  FIGURE 1 | Illustration of the proposed adaptive multimodal neuroimage integration (AMNI) framework, including (1) a graph convolutional network (GCN) for extracting features of functional connectivity networks derived from resting-state functional MRI (rs-fMRI) data, (2) a convolutional neural network (CNN) for extracting features of T1-weighted structural MRI (sMRI) data, (3) a feature adaptation module for alleviating inter-modality difference by minimizing a cross-modal maximum mean discrepancy (MMD) loss, and (4) a feature fusion module for integrating sMRI and fMRI features for classification. MDD, major depressive disorder; HC, healthy control.
Compared with 2D convolution, 3D convolution on the entire MR image is able to capture the rich spatial information, which is essential for disease classification.

Brain Functional MR Imaging Analysis
Existing studies have revealed that fMRI can capture largescale abnormality or dysfunction on functional connectivity networks by measuring the blood-oxygen-level in the brain (Van Den Heuvel and Pol, 2010;Zhang et al., 2019). With fMRI data, we usually construct a functional connectivity network for representing each subject, where each node represents a specific brain region-of-interest (ROI) and each edge denotes the pairwise relationship between ROIs (Honey et al., 2009;Dvornek et al., 2017). By capturing the dependencies between BOLD signals of paired ROIs, functional connectivity networks (FCNs) have been widely used to identify potential neuroimaging biomarkers for mental disorder analysis. Previous studies often extract handcrafted FCN features (e.g., clustering coefficient and node degree) to build prediction/classification models (Guo et al., 2021;Zhang et al., 2021), but the definition of the optimal FCN features highly relies on expert knowledge, so it is often subjective. Extracting effective feature representations of functional connectivity networks is essential for subsequent analysis.
Recent studies have shown that spectral graph convolutional networks (GCNs) are effective in learning representations of brain functional connectivity networks, where each FCN is treated as a graph (Bruna et al., 2013;Parisot et al., 2018;Bai et al., 2020;Yao et al., 2021). Motivated by breakthroughs of deep learning on grid data, people make efforts to extend CNN to graphs, giving rise to the spectral graph convolutional networks (GCNs) (Bruna et al., 2013). Recent studies have shown that GCNs are effective in learning representations of brain functional connectivity networks compared to traditional machine learning algorithms. For example, Parisot et al. (2018) proposed a GCN-based method for group-level population diagnosis that exploited the concept of spectral graph convolutions. Yao et al. (2021) presented a mutual multi-scale triplet GCN model to extract multi-scale feature representations of brain functional connectivity networks. Bai et al. (2020) developed a backtrackless aligned-spatial GCN model to transitively align vertices between graphs and learn effective features for graph classification. Compared with traditional CNN with Euclidean data, GCN generalizes convolution operations to non-Euclidean data, and helps mine topological information of brain connectivity networks.

Multimodal Neuroimaging-Based Brain Disease Diagnosis
Previous studies have been shown that multimodal neuroimaging data can provide complementary information of individual subjects to improve the performance of computer-aided disease diagnosis (Sui et al., 2013;Calhoun and Sui, 2016;Maglanoc et al., 2020;Guan and Liu, 2021). For example, Sui et al. (2013) developed a machine learning model to enable fusion of three or more multimodal datasets based on multi-set canonical correlation analysis and joint independent component analysis algorithms. Maglanoc et al. (2020) used linked independent component analysis to fuse structural and functional MRI features for depression diagnosis. Even though previous studies have yielded promising performance, they often extract sMRI and fMRI features manually, which requires domainspecific knowledge (Shen et al., 2017). Several deep learning models of multimodal medical image fusion are proposed to employ multimodal neuroimaging data for brain disease diagnosis (Rajalingam and Priya, 2018). However, existing studies usually focus on combining feature representation of multiple modalities and ignore significant inter-modality heterogeneity (Huang et al., 2019). To this end, we propose an adaptive multimodal neuroimage integration (AMNI) framework for automated MDD diagnosis based on restingstate functional MRI and T1-weighted structural MRI data. The proposed method can not only extract high-level feature representations of structural and functional data via CNN and GCN, respectively, but also alleviate the heterogeneity between modalities with the help of a unique feature adaptation module.

MATERIALS AND METHODS
In this section, we first introduce the materials and image preprocessing method used in this work, and then present the proposed method and implementation details.

Data Acquisition
Resting-state fMRI and T1-weighted structural MRI data were acquired from 282 MDD subjects and 251 healthy controls (HCs) recruited from the Southwest University, an imaging site of the REST-meta-MDD consortium (Yan et al., 2019). Restingstate fMRI were acquired through a Siemens scanner with the following parameters: repetition time (TP) = 2, 000 ms, echo time (TE) = 30 ms, flip angle = 90 o , slice thickness = 3.0 mm, gap= 1.0 ms, time point= 242, voxel size = 3.44 × 3.44 × 4.00 mm 3 . More detailed information can be found online 1 . The demographic and clinical information of these studied subjects is summarized in Table 1.

Image Pre-processing
The resting-state fMRI and structural T1-weighted MRI scans were pre-processed using the Diffeomorphic Anatomical Registration Through Exponentiated Lie algebra (DPARSF) software (Yan and Zang, 2010) with a standardized protocol (Yan et al., 2016). For rs-fMRI data, we first discard the first 10 volumes the initial 10 volumes were discarded, and slice-timing correction was performed. Then, the time series of images for each subject were realigned using a six-parameter (rigid body) linear transformation. After realignment, individual T1-weighted images were co-registered to the mean functional image using a 6 degrees-of-freedom linear transformation without re-sampling and then segmented into gray matter (GM), white matter (WM) and cerebrospinal fluid (CSF). Finally, transformations from individual native space to MNI space were computed with the Diffeomorphic Anatomical Registration Through Exponentiated Lie algebra (DARTEL) tool (Ashburner, 2007). After that, the fMRI data were normalized with an EPI template in the MNI space, and resampled to the resolution of 3 × 3 × 3 mm 3 , followed by spatial smoothing using a 6 mm full width half maximum Gaussian kernel. Note that subjects with poor image quality or excessive head motion (mean framewise-displacement >0.2 mm) were excluded from analysis (Jenkinson et al., 2002). 1 http://rfmri.org/REST-meta-MDD Finally, we extracted the mean rs-fMRI time series with bandpass filtering (0.01 − 0.1 Hz) of a set of 112 pre-defined regionsof-interest (ROIs), including cortical and subcortical areas based on the Harvard-Oxford atlas. Each T1-weighted structural MR image was also segmented into three tissues (i.e., GM, WM, and CSF) and transformed into the MNI space with DARTEL tool (Ashburner, 2007), resulting in a 3D volume (size: 121 × 145 × 121). Here, we employ gray matter volume in the MNI space for representing the original sMRI.

Proposed Method
As illustrated in Figure 1, the proposed AMNI consists of four major components: (1) a GCN module to extract features from rs-fMRI, (2) a CNN module to extract features from T1-weighted sMRI, (3) a feature adaptation module to reduce inter-modality discrepancy, and (4) a feature fusion module for classification, with details introduced below.

GCN for Functional MRI Feature Learning
Based on resting-state fMRI data, one usually constructs a functional connectivity matrix/network (FCN) for representing each subject, with each node representing a specific brain ROI and each edge denoting the pairwise functional connection/relationship between ROIs (Honey et al., 2009;Dvornek et al., 2017). That is, FCNs help capture the dependencies between BOLD signals of paired ROIs. Considering the fact that FCNs are non-Euclidean data, we treat each functional connectivity network as a specific graph and resort to spectral graph convolutional network (GCN) for FCN feature learning by capturing graph topology information. Previous studies have shown that GCN is effective in learning graph-level representations by gradually aggregating feature vectors of all nodes . In this work, we aim to learn graph-level representations based on node representations of input FCNs.
(i) Graph Construction. Denote N and M as the numbers of ROIs and time points, respectively, where N = 112 and M = 232 in this work. We assume that the rs-fMRI time-series data for a subject is Y = (y 1 , · · · , y N ) T ∈ R N×M , where each element y n ∈ R M (n = 1, · · · , N) denotes BOLD measurements of the n-th ROI at M successive time points.
As the simplest and most widely used method, Pearson correlation (PC) is usually used to construct functional connectivity networks from raw rs-fMRI time-series data. Denote B = (b ij ) ∈ R N×N as the functional connectivity matrix based on the Pearson correlation algorithm. Each element b ij ∈ [−1, 1] in B represents the Pearson correlation coefficient between the i-th and j-th ROIs, defined as follows: whereȳ i andȳ j are the mean vector corresponding to y i ∈ R M and y j ∈ R M , respectively, and M represents the length of time points of BOLD signals in each brain region. For each subject, we regard each brain FCN as an undirected graph G = {V, E}, where V = {v 1 , · · · , v N } is a set of N nodes/ROIs and b ij ∈ B denotes the functional connectivity between a paired nodes v i and v j . Since spectral GCNs work on adjacency matrices by updating and aggregating node features (Bruna et al., 2013), it is essential to generate such an adjacency matrix A and a node feature matrix X from each graph G.
To reduce the influence of noisy/redundant information, we propose to construct a K-Nearest Neighbor (KNN) graph based on each densely-connected functional connectivity matrix. Specifically, a KNN graph is generated by only keep the top k important edges according to their functional connectivity strength (i.e., PC coefficient) for each node. Then, the topology structure of the graph G can be described by adjacency matrix A = (a ij ) ∈ {0, 1} N×N , where a ij = 1 if there exists an edge between the i-th and the j-th ROIs, and a ij = 0, otherwise. In addition, the node features are defined by the functional connection weights of edges connected to each node, i.e., corresponding to a specific row in the functional connectivity matrix. Thus, the node features of the graph G can be represented by the node feature matrix X = B.
(ii) Graph Feature Learning. In GCN models, the convolution operation on the graph is defined as the multiplication of filters and signals in the Fourier domain. Specifically, GCN model learns new node representations by calculating the weighted sum of feature vectors of central nodes and the neighboring nodes. Mathematically, the simplest spectral GCN layer (Kipf and Welling, 2016) can be formulated as: where H l is the matrix of activations in the l-th layer, and W l is a layer-specific trainable weight matrix. In addition, A = D − 1 2 AD − 1 2 is the normalized adjacency matrix with self loops, and σ (·) is an activation function, such as the ReLU(·) = max(0, ·). In addition, D is the diagonal degree matrix, with the i-th diagonal element defined as d i = i =j A ij .
In the GCN module in our AMNI framework, we stack two graph convolutional layers with the adjacency matrix A and node features matrix X as inputs. The output of this two-layer GCN module is calculated as: Note that the number of neurons in the two graph convolutional layers is set as 64 and 64, respectively.
Given that this is a graph classification task, we employ a simple graph pooling strategy  to generate graphlevel FCN representations. To be specific, we employ both global average pooling and global max pooling that aggregate node features to generate new feature representations. The output feature of the graph pooling layer is as follows: where N is the number of ROIs, z i is the feature vector of ith ROI obtained by the graph convolution operation, and || denotes concatenation. By stacking multiple graph convolution layers and graph pooling layers, GCN can learn higher-order node features from neighboring nodes. In addition, GCN propagates information on a graph structure and gradually aggregates the information of neighboring nodes, which allows us to effectively capture the complex dependencies among ROIs.

CNN for Structural MRI Feature Learning
In recent years, convolutional neural networks (CNNs) have shown much predomination in image recognition and classification (Simonyan and Zisserman, 2014;He et al., 2016). Due to the 3D nature of structural MR images (sMRI), it is important to learn feature representations of all three dimensions from volumetric medical data. Considering that 3D convolutional kernels can encode richer spatial information, we adopt 3D CNN model to extract feature representations of T1-weighted MRIs.
In the AMNI framework, the CNN module consists of four convolution blocks and two fully-connected (FC) layers for local to global sMRI feature extraction. To be specific, each convolution block consists of one convolutional layer, one batch normalization layer, one activation function and one max pooling layer. To capture local patterns, 3D convolution is achieved by convolving a 3D kernel over 3D feature cubes. Formally, the j-th feature map in the i-th layer, denoted as v i,j , is given by where W i,j and b i,j are the kernel weights and the bias for the j-th feature map, respectively, V i−1 are the sets of input feature maps connected to the current layer from the (i − 1)th layer, * is the convolution operation, and f is the non-linear activation function. The size of each convolution filter is 3 × 3 × 3, and the numbers of convolution filters are set to 16, 32, 64, 128, respectively. In addition, max pooling is applied for each 2 × 2 × 2 region which reduces the spatial size of the feature maps and the number of parameters, and ReLU is used as the activation function. Meanwhile, batch normalization technique can promote faster convergence and better generalization of trained networks. For the pooling layer, we use the Global Average Pooling (GAP) operation (Lin et al., 2013), which performs downsampling by computing the mean of the height, width, and depth dimensions of the input. The formula for GAP is as follows: where v h,w,d j is the value at position (h, w, d) of the j-th input feature map, H, W, and D are the height, width, and depth respectively and g j is getting value of the j-th input feature map through GAP. Thus, the sMRI feature g S generated by CNN is given by: where c is the number of input feature map. It can be seen that the GAP layer converts a 4D tensor to a 1-dimensional feature vector, thus significantly reducing the number of network parameters.
The two fully-connected layers have 128 and 64 neurons, respectively. To avoid overfitting, we employ the dropout technique (Srivastava et al., 2014), with a probability of 0.5 after each fully-connected layer. More detailed information about the CNN architecture can be found in Table 2.

Feature Adaptation Module
Due to the heterogeneous nature of multimodal data, it is necessary to reduce the discrepancy between feature representations of different modalities before feature fusion. Inspired by existing studies on domain adaptation (Tzeng et al., 2014), we employ a cross-modal loss based on maximum mean discrepancy (MMD) (Gretton et al., 2012) to re-calibrate channel-wise features extracted from sMRI and fMRI. Denote G F and G S as feature representations of fMRI and sMRI, respectively. The cross-modal MMD loss L M is formulated as follows: where φ(·) denotes the feature map associated with the kernel map, and g F and g S are elements in G F and G S , respectively. During model training, the cross-modal MMD loss will be used as a regularization term to penalize heterogeneity of the features between the two modalities. As shown in Figure 1, this cross-modal MMD loss is applied to features from two fully-connected layers in the proposed CNN and GCN modules. This would enable the feature adaptation module to learn shared and aligned information across modalities by minimizing the distribution difference between two feature representations.

Feature Fusion Module
To enable our AMNI method to capture the complementary information provided by functional and structural MRIs, we also design a feature fusion module for classification/prediction.
Assuming that F 1 and F 2 are two feature representations obtained by feature adaptation module, we first concatenate them to obtain a new representation. The new representation F can be described as follows: After concatenation, the obtained new representation is fed to two fully-connected layers (with 64 and 2 neurons, respectively), and the learned features are further fed into a Softmax layer for classification. During the training stage, we use the cross-entropy loss function to optimize the parameters in our AMINI model. The classification loss L C is defined as: where N is the number of samples, and y i is the true label of the i-th sample, with 1 representing the sample being a MDD patient and 0 denoting the sample being a healthy control. In addition, p is the predicted probability that the sample belongs to the MDD category. In our model, we aim to minimize not only the classification loss, but also the cross-modal loss to reduce the inter-modality difference. Hence, the total loss function L of the proposed AMNI is defined as follows: where λ is a hyperparameter to tune the contributions of two terms in Equation (11).

Implementation Details
We optimize the proposed AMNI model via the Adam (Kingma and Ba, 2014) algorithm, with the learning rate of 0.0001, weight decay rate of 0.0015, training epoch of 100, and minibatch size of 16. The proposed model is implemented based on Pytorch (Paszke et al., 2017), and the model is trained by using a single GPU (NVIDIA Quadro RTX 6000 with 24 GB memory). The hyperparameter λ in Equation (11) is empirically set as 0.01. And we will experimentally investigate its influence in Section 5.

EXPERIMENTS
In this section, we introduce experimental settings and several competing methods, present the experimental results, and visualize feature distributions of different methods. Cn, the n-th convolutional layer; Mn, the n-th max pooling layer; GAP, global average pooling; FC, fully-connected layer.

Experimental Settings
We randomly select 80% samples as training data, and the remaining 20% samples are used as test data. To avoid bias introduced by random partition, we repeat the random partition procedure 10 times independently, and record the mean and standard deviation results. Eight metrics are used to evaluate the performance of different methods in the task of MDD detection (i.e., MDD vs. HC classification), including accuracy (ACC), sensitivity (SEN), specificity (SPE), balanced accuracy (BAC), positive predicted value (PPV), negative predictive value (NPV), F1-Score (F1), and area under the receiver operating characteristic curve (AUC).

Methods for Comparison
In this work, we compare the proposed AMNI method with six traditional machine learning methods and three popular deep learning methods. More details can be found below.
(1) PCA+SVM-s: The PCA+SVM-s method only uses sMRI data. The 3D image of the whole brain is down-sampled from 121 × 145 × 121 to 61 × 73 × 61, and further flattened into a vectorized feature representation for each subject. We use principal component analysis (PCA) (Wold et al., 1987) by keeping the top 32 principal components to reduce feature dimension based on the above feature representations of all subjects. Finally, the support vector machine (SVM) with Radial Basis Function (RBF) kernel is employed for classification.
(2) EC+SVM: The EC+SVM method uses rs-fMRI data. Similar to our AMNI, we first construct a functional connectivity matrix based on Pearson correlation coefficient for each subject. We then extract eigenvector centralities (EC) (Bonacich, 2007), which measure a node's importance while giving consideration to the importance of its neighbors in the FC network, as features of the FCN and feed these 112-dimensional features into an SVM classifier with RBF kernel for disease detection.
(3) DC+SVM: Similar to EC+SVM, the DC+SVM method first constructs a FCN based on Pearson correlation coefficient for each subject, and then extracts degree centrality (DC) (Nieminen, 1974) as FCN features by measuring node importance based on the number of links incident upon a node.
The 112-dimensional DC features are finally feed into an SVM for classification.
(4) CC+SVM: Similar to EC/CC+SVM, this method extracts the local clustering coefficient (CC) (Wee et al., 2012) to measure clustering degree of each node in each FCN. The 112dimensional CC features are fed into an SVM for classification.
(5) PCA+SVM-f: In the PCA+SVM-f method, the upper triangle of a FC matrix is flattened into a vector for each subject after the FC matrix is constructed. Then, we use PCA by keeping the top 32 principal components to reduce feature dimension based on the above feature representations of all subjects. Finally, an SVM is used for classification.
(6) PP+SVM: In this method, we integrate rs-fMRI and sMRI features for classification based on SVM. Specifically, we first employ PCA+SVM-s and PCA+SVM-f to extract features from structural and functional MRIs, respectively. Then, we concatenate features of these two modalities for the same subject, followed by an SVM for classification.
(7) 2DCNN: In this method, we employ the original FC matrix of each subject as input of a CNN model (LeCun et al., 1989). Specifically, this CNN contains three convolutional layers and two fully-connected layers. Each convolutional layer is followed by batch normalization and ReLU activation. The channel numbers for the three convolutional layers are 4, 8, and 8, respectively, and the corresponding size of the convolution kernel is 3 × 3, 5 × 5, 7 × ×7, respectively. The two fully-connected (FC) layers contain 4, 096 and 2 neurons, respectively.
(8) ST-GCN: We also compare our method with the spatiotemporal graph convolutional network (ST-GCN), a state-of-theart method for modeling spatio-temporal dependency of fMRI data (Gadgil et al., 2020). Specifically, the ST-GCN comprises two layers of spatio-temporal graph convolution (ST-GC) units, global average pooling and a fully connected layer. Note that each ST-GC layer produces 64-channel outputs with the temporal kernel size of 11, a stride of 1, and a dropout rate of 0.5.
(9) 3DCNN+2DCNN: In this method, we employ 3DCNN and 2DCNN to extract features from sMRI and fMRI, respectively. We then concatenate features learned from 3DCNN and 2DCNN, and feed the concatenated features to a fully-connected layer and the softmax layer for classification.

Experimental Results
The quantitative results of the proposed AMNI and nine competing methods in the task of MDD vs. HC classification are reported in Table 3. In Figures 2A,B, we also show ROC curves of different methods. From Table 3 and Figures 2A,B, we have the following interesting observations. First, our AMNI and two deep learning methods (i.e., 2DCNN and 3DCNN+2DCNN) generally achieve better performance in terms of eight metrics, compared with six traditional machine learning methods. For example, in terms of ACC values, the AMNI yields the performance improvement of 5.7%, compared with the best traditional machine learning method (e.g., PP+SVM) in MDD detection. These results demonstrate that, deep learning methods that can learn diagnosis-oriented neuroimage features is more effective in MDD detection, compared with traditional machine learning methods that rely on handcrafted features. Second, three multimodal methods (i.e., PP+SVM, 3DCNN+2DCNN, and AMNI) generally outperform their single-modality counterparts (i.e., PCA+SVM-s, PCA+SVM-f, and 2DCNN). For instance, both our AMNI and 3DCNN+2DCNN methods that integrate sMRI and fMRI data are superior to 2DCNN which only use functional data. This implies that taking advantage of multimodal MRIs (as we do in this work) helps promote the diagnosis performance, thanks to the complementary information provided by functional and structural MRIs. Furthermore, our proposed AMNI achieves better performance in terms of most metrics, compared with eight competing methods. These results imply that adaptive integration of multimodal neuroimages helps boost the performance of MDD identification.

Statistical Significance Analysis
We further calculate predicted probability distribution difference on test data between our model and each of eight competing methods by paired sample t-test. Denote u 1 and u 2 as the population mean of predicted probability distributions from our AMNI and one competing method, respectively. The hypotheses can be expressed as follows: where H 0 is the null hypothesis, meaning that our model and the competing method do not have significant difference. And H 1 is the alternative hypothesis, meaning that our model and the Frontiers in Neuroinformatics | www.frontiersin.org competing method have significance difference. The test statistic for the paired samples t-test is as follows: wherex diff is sample mean of the differences, s diff is sample standard deviation of the differences and n is the sample size (i.e., number of pairs). The p-values that corresponds to the test statistic t are shown in Table 4. Table 4, all obtained p-values are less than our chosen significance level (i.e., 0.05). Therefore, H 0 is rejected, which means that our AMNI method differs significantly from each of the eight competing methods.

Feature Visualization
In Figure 3, we visualize the data distributions of features derived from two multimodal methods (i.e., PP+SVM and AMNI) via t-SNE (Van der Maaten and Hinton, 2008). Note that the features of PP+SVM are generated by concatenating handcrafted features from two modalities, while the features of our AMNI are extracted based on an end-to-end deep learning model (see Figure 1). As shown in Figure 3, the feature distributions of two categories (i.e., MDD and HC) generated from our AMNI method have more significant difference, while their feature distribution gap is not evident for the PP+SVM method. This may indicate that our AMNI can learn more discriminative features for MDD detection by explicitly reducing the inter-modality discrepancy, compared with the traditional PP+SVM method.

Ablation Study
To evaluate the effectiveness of each component in the proposed AMNI, we further compare AMNI with its three variants: (1) AMNI-s that only uses CNN branch and feature fusion module of AMNI, without considering functional MRI, (2) AMNI-f that only uses GCN branch and feature fusion module of AMNI, without considering structural MRI, (3) AMNI-w/oMMD that directly feeds concatenated fMRI and sMRI features (via GCN and CNN modules, respectively) into the feature fusion module for classification, without using the proposed feature adaption module. The experimental results are reported in Figures 4, 2C.
It can be seen from Figure 4 that two multimodal methods (i.e., AMNI-w/oMMD and AMNI) generally outperform the single modality methods (i.e., AMNI-s and AMNI-f). This further demonstrates that multimodal data can provide complementary information to help boost the performance of MDD identification. Besides, our AMNI achieves consistently better performance compared with AMNI-w/oMMD that ignores the heterogeneity between the two modalities. These results further validate the effectiveness of the proposed feature adaption module in alleviating the inter-modality discrepancy between different modalities. In addition, Figure 2C suggests that our proposed AMNI achieves good ROC performance and the best AUC value compared with its three variants.

Influence of Hyperparameter
The hyperparameter λ in Equation (11) is used to tune the contribution of the proposed feature adaptation module for re-calibrating feature distributions of two modalities. We now report the classification accuracy of the proposed AMNI with different values of λ in Figure 5. As shown in Figure 5, with λ = 0.01, our AMNI can achieve best performance. But using a too large value (e.g., λ = 1) will yield worse performance. A possible reason is that focusing too much on the reduction of differences between modalities (with a large λ) may lose the specific and unique information of each modality, thereby degrading the learning performance.

Influence of Graph Construction Strategy
In the main experiment, we build a KNN graph to generate an adjacency matrix for each FCN. To investigate the influence of the use of different graph construction strategies, besides KNN, we also construct a fully-connected graph and a threshold graph to generate the adjacency matrix, respectively. For the fullyconnected graph, we directly take A = (|w ij |) as the adjacency matrix, which is an edge-weighted graph. For the threshold graph, we generate the adjacency matrix A by binarizing the FC matrix B to regulate the sparsity of the graph. Thus, the adjacency matrix can be described as A = (a ij ) ∈ {0, 1} N×N , where a ij = 1 if the connection coefficient between i-th and j-th ROI is greater than a threshold q; and a ij = 0, otherwise. The threshold q is set as 0.2 here. The experimental results of our AMNI with three different graph construction strategies are reported in Figure 6. As can be seen from Figure 6, our AMNI model based on KNN graph outperforms its two variants that use fully-connected graph and threshold graph. The underlying reason could be that KNN graph can preserve node-centralized local topology information while removing noisy/redundant information in graph Yao et al., 2021).

Influence of Network Architecture
To explore the influence of different network architectures of AMNI on the experimental results, we adjust the the network depth of two branches of the AMNI model, respectively. On the one hand, with the CNN branch fixed, we vary the number of graph convolutional layers for the GCN branch of AMNI   and report the corresponding results of AMNI in Table 5. This table shows that the AMNI achieves the overall best performances (e.g., ACC=0.6495 and AUC=0.6648) with two graph convolutional layers in the GCN branch. In addition, as the number of graph convolutional layers increases (see AMNI-G3 and AMNI-G4), the performance is not good. This may be due to the over-smoothing problem (that is, Laplacian smoothing makes the node representations more similar as the graph convolutional layer increases; Yang et al., 2020), which may reduce the discriminative compatibility of learned features. On the other hand, we fix the GCN branch and vary the architecture of the CNN in AMNI for performance evaluation. Specifically, we vary the number of convoluational layers in CNN within [3,6] and report the results of AMIN in MDD vs. HC classification in Table 5. This table shows that fine-tuning the network architecture of the CNN branch in AMNI achieves comparable results, which implies that our AMNI is robust to different network architectures. Further, AMNI with five convoluational layers in the CNN branch (e.g., AMNI-G5) achieves better performance in terms of accuracy, sensitivity, balanced accuracy, positive predicted value and F1-Score.
Besides, we also further discuss the influence of network width of each branch on the experimental results. For one thing, with the CNN branch fixed, we change the number of neurons in the graph convolutional layers and then report the corresponding results of AMNI in Table 6. It can be found from Table 6 that the AMINI model using different numbers of neurons in graph convolutional layers achieves comparable experimental results, FIGURE 6 | Results of the proposed AMNI based on three different graph construction methods (e.g.fully-connected graph, threshold graph, and KNN graph) in the task of MDD vs. HC classification, with best results shown in bold. Note that AMNI-Gn contains n graph convolutional layers in the GCN module of AMNI, and AMNI-Cn contains n convolutional layers in the CNN module of AMNI. which means our model is not very sensitive to the change of network width of the GCN branch. For another thing, with the GCN branch fixed, we change the number of filters in each 3D convolutional layer and record the results in Table 6. As shown in Table 6, with the increase of the number of filters in 3D CNN module of AMNI, the model (i.e., AMNI-c3 and AMNI-c4) generally achieves better performance. This may be due to that using more filters in CNN can capture richer features across global and local information of sMRI.

Influence of Multimodality Fusion Strategy
We fuse fMRI and sMRI data at the feature-level (see Figure 1) in the main experiments. We further investigate the influence of different fusion strategies by comparing our AMNI (using feature-level fusion) with its variant (called AMNI_lf) using a decision-level fusion strategy. As shown in Figure 7, in the AMNI_lf, the fMRI feature derived from GCN is fed into two fully connected layers and a Softmax layer for feature abstraction and classification. Similarly, the sMRI feature derived from CNN is fed into three fully connected layers and a Softmax layer. The outputs of these two branches are further fused via a weighted sum operation. We vary the weighted ratio between fMRI and sMRI branches within [ 0.2 0.8 , 0.5 0.5 , 0.8 0.2 ] and denote these three methods as AMNI_lf1, AMNI_lf2, and AMNI_lf3, respectively, with the experimental results shown in Figure 8.
As shown in Figure 8, as the weight of GCN branch increases, the model achieves better performance in terms of most metrics. FIGURE 8 | Experimental results of late fusion method and our AMNI method in MDD vs. HC classification. Note that AMNI_lf1, AMNI_lf2, and AMNI_lf3 denote that the weight ratio between fMRI and sMRI branch is 0.2 0.8 , 0.5 0.5 , and 0.8 0.2 , respectively.
However, the results of AMNI using the decision-level fusion method are generally inferior to that of the feature-level fusion method proposed by this article. This implies that feature-level fusion of functional and structural representations could be more effective.

Limitations and Future Work
Several limitations need to be considered. First, we only integrate T1-weighted MRI and functional MRI data for automated MDD diagnosis. Actually, diffusion tensor imaging (DTI) data can examine and quantify white matter microstructure of the brain, which can further help uncover the neurobiological mechanisms of MDD. Therefore, it is valuable to incorporate DTI data into multimodal research in our future work. Second, we use functional connectivity networks for representing rs-fMRI data and treat them as input of the proposed method. It is interesting to extract diagnosis-oriented fMRI features, as we do for T1weighed MRIs, which will also be our future work. Besides, a feature adaptation module with a cross-modal MDD loss is designed for reducing inter-modality data heterogeneity. Many other data adaptation methods (Ben-David et al., 2007) can also be incorporated into the proposed AMNI framework for further performance improvement.

CONCLUSION
In this article, we propose an adaptive multimodal neuroimage integration (AMNI) framework for automated MDD diagnosis based on functional and structural MRI data. We first employ GCN and CNN to learn feature representations of functional connectivity networks and structural MR images. Then, a feature adaptation module is designed to alleviate inter-modality difference by minimizing the distribution difference between two modalities. Finally, high-level features extracted from functional and structural MRI modalities are integrated and delivered to a classifier for disease detection. Experimental results on 533 subjects with rs-fMRI and T1-weighted sMRI demonstrate the effectiveness of the proposed method in identifying MDD patients from healthy controls.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: REST-meta-MDD Consortium Data Sharing.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by REST-meta-MDD Consortium Data Sharing. The patients/participants provided their written informed consent to participate in this study.