Identifying Imaging Genetics Biomarkers of Alzheimer’s Disease by Multi-Task Sparse Canonical Correlation Analysis and Regression

Imaging genetics combines neuroimaging and genetics to assess the relationships between genetic variants and changes in brain structure and metabolism. Sparse canonical correlation analysis (SCCA) models are well-known tools for identifying meaningful biomarkers in imaging genetics. However, most SCCA models incorporate only diagnostic status information, which poses challenges for finding disease-specific biomarkers. In this study, we proposed a multi-task sparse canonical correlation analysis and regression (MT-SCCAR) model to reveal disease-specific associations between single nucleotide polymorphisms and quantitative traits derived from multi-modal neuroimaging data in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. MT-SCCAR uses complementary information carried by multiple-perspective cognitive scores and encourages group sparsity on genetic variants. In contrast with two other multi-modal SCCA models, MT-SCCAR embedded more accurate neuropsychological assessment information through linear regression and enhanced the correlation coefficients, leading to increased identification of high-risk brain regions. Furthermore, MT-SCCAR identified primary genetic risk factors for Alzheimer’s disease (AD), including rs429358, and found some association patterns between genetic variants and brain regions. Thus, MT-SCCAR contributes to deciphering genetic risk factors of brain structural and metabolic changes by identifying potential risk biomarkers.


INTRODUCTION
Imaging genetics has recently emerged as a method for investigating imaging and genetic biomarkers related to diseases such as Alzheimer's disease (AD) (Bogdan et al., 2017). Identified neuroimaging and genetics biomarkers can provide a complementary understanding of the brain's structure and metabolism (Zhang et al., 2011). Moreover, the vast amounts of diagnostic and neuropsychological information from various perspectives enable the discovery of disease-specific biomarkers. Therefore, it is essential to simultaneously analyze multiple neuroimaging techniques, such as magnetic resonance imaging (MRI), fluorodeoxyglucose positron emission tomography (FDG-PET), genotyping, and clinical diagnostic data. In this study, we aimed to build a model to identify disease-specific biomarkers across multiple imaging modalities, which can be used as an effective clue for disease diagnosis and targeted therapy.
Numerous studies have attempted to identify the associations between genotypic data such as single nucleotide polymorphisms (SNPs) and neuroimaging quantitative traits (QTs) (Rasetti and Weinberger, 2011). Because genotypic data and imaging QTs are multivariate, several bi-multivariate methods have been proposed to better characterize their associations. Liu et al. explored parallel independent component analysis (PICA) to detect the associations between brain function and genetic variants. However, this method cannot restore meaningful SNPs and regions of interest (ROIs), which has led to a lack of reasonable biomarker interpretation (Liu et al., 2009). Sparse canonical correlation analysis (SCCA) has a strong capability for bi-multivariate association identification and interpretable variable selection. Accordingly, many efforts have attempted to apply SCCA to neuroimaging genetics. Boutte et al. introduced an SCCA model with least absolute shrinkage and selection operator (LASSO) constraints on neuroimaging genetics data fusion (Boutte and Liu, 2010). Hao et al. presented a multi-view SCCA model to establish associations between SNPs, QTs, and cognitive outcomes (Hao et al., 2017). However, these multiview SCCA models are a simple extension to conventional SCCA models. The requirement that SNP canonical weight vectors associate with all modal data is too strict, and could result in not making full use of all modal information. To address this limitation, Du et al. developed a multi-task SCCA model that could be used to jointly analyze SNPs and multiple neuroimaging data by treating each association as an individual learning task (Du et al., 2021). However, this model's neglect of diagnostic information means that biomarkers identified by these multipledata models may not be sufficiently disease-specific.
To detect more complex and meaningful associations, studies to date have applied diagnostic information into SCCA methods (Yan et al., 2018;Du et al., 2020). Yan et al. proposed an outcomerelevant SCCA model based on a subject similarity matrix (Yan et al., 2018). Du et al. integrated multi-task SCCA and logistic regression in a sophisticated model to identify robust diseaserelated imaging and genetic patterns by incorporating diagnostic status information (Du et al., 2020). Classified diagnostic information, such as AD, mild cognitive impairment (MCI), and healthy control (HC), facilitates the association between SNPs and QTs; however, roughly dividing the disease stages does not provide any more accurate information than do continuous neuropsychological assessments measured from different angles.
To address the above problems, we proposed a novel SCCA model with the capacity to extract disease-specific biomarkers across multiple neuroimaging modalities. The proposed multitask sparse canonical correlation analysis and regression (MT-SCCAR) model integrates multi-task SCCA and multi-task linear regression in a fused model and uses multiple cognitive scores (CSs) as auxiliary information to induce associations between SNPs and QTs. Multi-task sparse canonical correlation analysis and regression considers the relationships within subjects from different disease courses and can find disease-specific biomarkers. We also considered underlying hierarchical information among SNPs by modeling structural relationships as divided by gene or by linkage disequilibrium (LD) in a group sparsity penalty. To evaluate MT-SCCAR's effectiveness, we performed extensive experiments to find associations between SNPs and two imaging QTs, including gray matter density and standard uptake value ratio (SUVR) extracted from MRI and positron emission tomography (PET), respectively. Compared with the other two multi-modal SCCA models that used real Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort data, MT-SCCAR not only outperformed these models in its ability to identify genetic AD risk factors, but also detected robust AD brain risk regions across multiple neuroimaging modalities. Thus, our proposed model has the potential to understand disease mechanisms from both structural and metabolic perspectives.

Data Sources and Preprocessing
Real neuroimaging and genetic data used in this study were obtained from the ADNI1 database. A total of 305 non-Hispanic Caucasian subjects with genotype, neuroimaging, and cognitive assessment data at the ADNI1 baseline were downloaded from the LONI website, 1 including 83 HC, 148 MCI, and 74 AD subjects. The Mini-Mental State Examination (MMSE) is a numeric scale to test cognitive functions, including attention, calculation, and responsiveness to simple commands (Tombaugh and McIntyre, 1992). The Functional Activities Questionnaire (FAQ) evaluates instrumental activities of daily life, such as financial management and meal preparation (Teng et al., 2010). The Alzheimer's Disease Assessment Scale Cognitive Subscale (ADAS-Cog) mainly measures cognitive ability such as word recall, comprehension of spoken language, and orientation (Cano et al., 2010). Table 1 shows the characteristics of the subjects.

Genotyping Data and Processing
Genotypes for 305 subjects were performed using the Illumina HumanHap610-Quad BeadChips from the ADNI1 database. The SNP data were lifted to hg19 build using lift over tool (Kent et al., 2002). To get pure SNP data, we used a genetic analysis tool PLINK (Purcell et al., 2007) to filter the SNPs using the following quality control criteria: gender check, sibling pair identification, call rate check (<90%) per subject and SNP marker, the Hardy-Weinberg Equilibrium (HWE p < 10-6), and marker removal by the minor allele frequency (MAF < 0.05), SNP data were further imputed using Michigan imputation server to estimate 1 http://adni.loni.usc.edu/   the missing genotypes based on the HRC r1.1 2016 panel (Das et al., 2016). The post-imputation quality control used the rsq > 0.3 and MAF of 0.1 . Since our study focused on the top 20 AD risk genes listed on the AlzGene database 2 and references (Tanzi et al., 2007;Wang et al., 2012a). After imputation, we selected all the SNPs within ± 5k base pairs of the gene boundary using the ANNOVAR annotation (Wang et al., 2010). The above procedures yielded 3793 SNPs belonging to the top 20 risk genes. Figure 1 presents the AD risk genes and the number of preselected SNPs. Moreover, considering the structural relationship among SNPs, we used Haploview (Barrett et al., 2004) to divide the LD block using the LD-Spline algorithm with D > 0.8, resulting in 209 blocks containing 3770 SNPs. A total of 894 tag SNPs were also assigned by Haploview in pairwise mode and an r 2 threshold was set to 0.8. These tagged SNPs represented the genetic variation across a particular region and could facilitate the association study (Montpetit et al., 2006). Furthermore, each SNP value was coded in an additive fashion to reflect the number of minor alleles.

Neuroimaging Data and Processing
The baseline 1.5T MRI scans were aligned to the standard Montreal Neurological Institute (MNI) space, resampled to 2 × 2 × 2 mm 3 voxels, registered by SPM software package (Ashburner and Friston, 2007). Then, we extracted the gray matter tissue from the MRI scans and calculated mean gray matter densities of 116 ROIs based on MarsBar AAL atlas (Tzourio-Mazoyer et al., 2002). After removing 26 ROIs of the cerebellum, mean gray matter densities of 90 ROIs were used as QTs in our study.
The FDG-PET scans were co-registered to each subject's same visit MRI scans and normalized to MNI space by SPM tool. We further excluded white matter regions by masking the PET with gray matter masks obtained by the segmentation of the same subject's co-registered MRI. Then, the PET scans were normalized into the cerebellar gray matter reference region defined on the AAL atlas to generate SUVR images. After this, we used SUVR of 90 ROIs as QTs in our study by removing the 26 ROIs of cerebellum. Moreover, all the QTs were adjusted to exclude the influence of gender, age, and education.

Methods
In this paper, we denote lowercase letters as vectors, uppercase letters as matrices. ||x|| 2 denotes the Euclidean norm, ||X|| 2,1 denotes the sum of the Euclidean norms of the rows of X, and ||X|| 1,1 denotes the absolute sum of all elements of X.

The CS-Related Features Selection Model for Imaging Genetics
Assuming that there are n subjects with pSNPs, qROIs from M imaging modalities, and G different cognitive outcomes. We used X ∈ R n × p ,Y m ∈ R n × q (m = 1, ..., M), and z g ∈ R n × 1 g = 1, ..., G to represent genetic data, multiple imaging data, and cognitive scores, respectively. The basic 2 www.alzgene.org principle of MT-SCCAR is to find U ∈ R p × M and V ∈ R q × M to maximize the correlation between Xu m and Y m u m , where u im indicates the weight of the ith SNP for the mth modality, and v jm indicates the weight of the jth ROI for the mth modality. To identify imaging genetic biomarkers that are relevant to CS and disease, the multi-task linear regression objective was combined with the multi-task SCCA (MTSCCA) objective, which can be formulated as: (1) The above model consists of four parts, L R (V) detects disease-relevant imaging QTs. L SCCA (U, V) captures the bimultivariate associations between SNPs and multiple imaging QTs. (U) and (V) are the regularization terms to enforce sparsity of U and V, so only a small number of interpretable variables can be selected. This model integrates the advantages of MTSCCA and linear regression, which has a certain superiority in using complementary cognitive information. Figure 2 provides a schematic overview of MT-SCCAR. SNPs were classified into the same group by either gene or LD. Accordingly, SNPs with gene or LD information and tagSNPs were input to the SCCA component separately, which was used to establish the relationships between genetic data and multiple imaging data. The linear regression component was used to introduce CSs into the SCCA part. The multi-task modeling method guaranteed the ability to process multiple imaging and CS data. Unlike conventional unsupervised SCCA models, MT-SCCAR is a supervised SCCA model, which considers the relationships within subjects from different disease courses.

The Linear Regression Model for CS-QT Associations
In the proposed model, the associations between CSs and multi-modal neuroimaging QTs were established by multi-task regression. For each task, we built a regression model for revealing CS-related neuroimaging QTs:

Algorithm: MT-SCCAR algorithm
Input: The genetic data X ∈ R n × p , the neuroimaging data Y ∈ R n × q of M modalities, and the CS data Ensure: canonical weights Vand U where M is the number of neuroimaging modalities, C is the number of cognitive assessments, and n is the total amount of subjects. v m is the canonical weight of QTs for the mth modalities,y l m is the neuroimaging data vector of the lth subjects for the mth modalities, and z l c is the score of the lth subjects for the cth cognitive assessments. This multi-task regression model can jointly utilize neuropsychological assessments from different complementary perspectives.

The MTSCCA Model for SNP-QT Associations
Unlike conventional multi-view SCCA models, MTSCCA learns multiple SCCA tasks together by treating each imaging modality association model as a task. This model was proposed by Du et al. (2021) and can be defined as: For canonical weights U and V, each column u m and v m represents an individual learning task for different modalities. The main advantage of this multi-task strategy is that SNP canonical weight vectors do not need to be associated with all imaging modalities simultaneously. Each task focuses on identifying SNPs that are associated with only one imaging modality.

The Regularization Terms
Multiple neuroimaging modalities can provide more comprehensive information in terms of both structural and functional perspectives. In our model, two principal tasks corresponded to two neuroimaging modalities. MT-SCCAR should be able to identify neuroimaging QTs shared among multiple modalities and to enforce individual level sparsity. Hence, (V) was composed of two parts, which can be defined as: where λ v1 and λ v2 are positive parameters and can be tuned via cross-validation. The first penalty was defined as: This term aims to enforce task-consistent (modality-consistent) sparsity on V, which encourages multi-modal imaging QTs to share similar canonical weights.
The second penalty was defined as: This term indicates the absolute sum of all elements of V, which helps to screen the entire ROIs to find the relevant ROIs. Similarly, the regularization terms of U also include the above two penalties, which can help discover SNPs that may affect multiple brain regions. It is common knowledge that some SNPs located in the same gene or LD block often have similar functions and are jointly related to specific ROIs. It is essential to model underlying hierarchical information among SNPs by adding an extra penalty. Therefore, we defined (U) as follows: where λ u1 , λ u2 , and λ u3 are positive parameters, the third penalty (Wang et al., 2012a) can be formulated as: where K denotes the number of groups divided by gene or LD. This penalty penalizes canonical weights as a whole for each task and thus can fully use the structural information.

The Optimization Algorithm
In order to address the problem defined in Equation (1), according to the method that has been well studied previously (Du et al., 2021), we can rewrite Equation (1): We then use the Lagrange multiplier to solve this problem by taking the partial derivatives of Equation (9) regarding u m andv m separately, which can change the formula from nonconvex to convex.
First, we treat U as constant, the Lagrange multiplier of Equation (9) can be simplified as: by dropping the constant terms, and γ v is a positive parameter. For each v m , We further take the partial derivatives of Equation (10) and let the result be zero: where D v1 is a diagonal matrix with the ith element as 1 2||v i,:|| 2 (i ∈ 1, q ), and D v2 is a diagonal matrix with ith element as 1 2||v im || 2 (i ∈ 1, q , and m ∈ [1, M]). Obviously, we can take an Frontiers in Genetics | www.frontiersin.org iterative rule to solve this problem since both D v1 and D v2 are rely on canonical weights V. This rule can be formulated as: Then, we treat V as a constant, the Lagrange multiplier of Equation (9) can be simplified as: by dropping the constant terms, and γ u is also a positive parameter. Similar to v m , for U, we let the partial derivatives of Equation (13) to be zero: where D u1 is a diagonal matrix with the ith element as , I k is an identity matrix of the same size with kth SNP groups, and Hence, the iterative rules can be formulated as: Based on the above analysis, the optimization algorithm of the proposed method is shown in Table 2. We can update V and U alternatively in each iteration until the predefined convergence criterion is satisfied.

Experimental Settings
To comprehensively evaluate the effectiveness of our proposed MT-SCCAR model, two similar models that can analyze multimodal data were compared with MT-SCCAR. They are threeview SCCA (TSCCA) and MTSCCA. Three-view SCCA can process neuroimaging, genetics, and cognitive scores data by extending conventional two-view association to three data types. MTSCCA was used to evaluate the regression part of our proposed model performance.
There are seven parameters in our model. Tuning all these parameters will pay a high cost. In our experiment, we fixed γ u and γ v to 1 since they mainly control the amplitude of V and U (Chen and Liu, 2011). To tune these  parameters to appropriate values, we adopted a nested fivefold cross-validation strategy. Specifically, we tuned them in the range of 10 −3 , 10 −2 , 10 −1 , 1, 10, 10 2 , 10 3 until the highest mean testing canonical correlation coefficients (CCCs) was generated in the inner loop. CCC was defined as the Pearson correlation coefficient between Xu and Yv, and can be used as a quantitative measure of SCCA model performance (Hao et al., 2017). For multi-task learning, CCC can be calculated by corr (X m u m , Y m v m ) for mth task. Also, we terminated the iteration when both max u

Results on Synthetic Data
We generated ten synthetic datasets with the same ground truth of loading vectors but different noise levels. Assuming that X ∈ R n × p ,Y ∈ R n × q , and Z ∈ R n × q denote SNP, MRI, and PET for all synthetic data sets, respectively. X was generated by X = ul + e,Y was generated by Y = vl + e, and Z was generated by Z = wl + e, where u, v, and ware known loading vectors, l is a latent vector with a 3-component Gaussian distribution to simulate the disease course (Yan et al., 2018), and e is derived from the Gaussian distribution N 0, σ 2 e with σ 2 e as the noise variance. In our study, n, p, and q were set to 90, 100, and 90, respectively. All the 90 samples were classed into three groups with centers -5, 0, 5. For neuropsychological assessment data, c was generated by c = l + e. To assess the model performance at various noise levels, we tested different noise variances ranging from 1 to 10, with a step size of 1. The five-fold cross-validation results are shown in Figures 3, 4. Figure 3 plots the testing CCC for three models with changing noise levels. Higher CCC indicates better performance in identifying underlying associations. As expected, the performance decreased with increased noise levels for all models. All three models performed similarly well at low noise levels. Models with the multi-task framework (MTSCCA, MT-SCCAR) performed better than TSCCA at medium noise levels. Then MT-SCCAR outperformed the other two models as the noise level was further increased, suggesting that MT-SCCAR had a strong ability to resist noise. Figure 4 shows the true signal of canonical weights and canonical weights estimated by three models with a noise level of 10. Important features were highlighted in the heatmaps displaying ground truth. We could clearly observe that the weight u estimated by MTSCCA was ambiguous. It was therefore difficult to recognize important features. TSCCA did not identify complete important features. MT-SCCAR estimated the best canonical weights that were consistent with the ground truths. These results implied that the proposed model had the potential to extract important features in real neuroimaging genetics studies.

Results on Neuroimaging and Genetics Data
In real neuroimaging genetics data application, all subjects with SNP, MRI, PET, and three different cognitive information data were inputted into MT-SCCAR.A total of 3793 SNPs with LD or gene group information and 894 tag SNPs were used separately. The group sparsity penalty treated each tagSNP as an individual group. We then averaged the CCCs based on fivefold cross-validation, representing the mean strength of identified associations between SNPs and two imaging QTs.
As illustrated in Table 3, TSCCA achieved the highest training CCCs but performed poorly in testing CCCs. These unreasonable results may be caused by overfitting (Du et al., 2021). Multitask sparse canonical correlation analysis and regression achieved the highest testing CCCs on both MRI and PET. Specifically, MT-SCCAR (LD) and MT-SCCAR (gene) achieved the highest testing CCC on SNP-MRI association and SNP-PET association, respectively. Notably, MT-SCCAR (gene) achieved relatively small testing CCC on SNP-MRI association; MT-SCCAR (LD) achieved a more balanced result than those of MT-SCCAR (gene), which indicates that using LD group information is more beneficial than using gene group information. The training CCCs of MT-SCCAR with tagSNP were higher than those of MT-SCCAR with group information since the different numbers  of SNPs were used. Moreover, MTSCCA also performed better than TSCCA, which means the superiority of multi-task models when dealing with multiple imaging QTs and genetic data.

The Top Selected ROIs
In addition to the CCCs, the canonical weights were also one of the focuses of our study since they can help us find brain regions being highly related to AD. Figure 5 shows the comparison of mean canonical weights of two imaging QTs based on five-fold cross-validation trials. Each row represents an SCCA model. The heatmap color represents the estimated weight of each model, so the selected QTs were highlighted in Figure 5. We can clearly observe that several brain regions were selected by both MRI and PET scans, such as the right hippocampal and the right angular gyrus, indicating that these regions may be modality-consistent. Additionally, TSCCA identified only modality-consistent QTs but failed to identify modality-specific QTs. This was due to the nature of its modeling strategy and may have resulted in crucial biomarkers being ignored. Multi-task models can identify modality-specific and modality-consistent QTs, which also implied the limitations of conventional multi-view SCCA models. In order to more accurately analyze the identified brain regions, using the proposed model with LD group information, the top ten ROIs of each modality were selected and sorted according to the absolute values of canonical weights. As shown in Table 4, ROIs that were jointly selected by two imaging modalities are shown in boldface, all of which are known to be closely related to the pathogenesis of AD according to previous research. The hippocampus is essential for forming new memories and was reported as one of the earliest affected brain regions in AD and MCI (Moreno-Jimenez et al., 2019). Both left and right caudate nucleus have been reported that their volume is significantly different between AD and normal control (Cho et al., 2014;Botzung et al., 2019). The right angular gyrus is considered to be closely related to language ability, and patients with angular gyrus syndrome are often found to have damage in this brain   area (Horwitz et al., 1998). The right parahippocampal gyrus affects the encoding and maintenance of bound information related to working memory (Luck et al., 2010). The metabolic reduction in the posterior cingulate gyrus is a very early sign in AD (Minoshima et al., 1997). Notably, all the remaining brain regions have also been reported to be associated with AD in published literature. These satisfactory results were due to the inclusion of cognitive information into the linear regression to adjust weighting. In order to further thoroughly verify that the neuroimaging biomarkers found by the proposed model are more diseaserelated than those found by the other two models. Selecting the top ten QTs as input features, support vector machine (SVM) with Gaussian radial basis function (RBF) kernel and random forest (RF) were adopted as classification methods. The parameters were tuned with five-fold cross-validation based on the training sets. Figure 6 presents the classification accuracies of the two classifiers. The testing classification results showed that the classifier using the features selected by MT-SCCAR achieved the highest accuracies, thus indicating the superiority of MT-SCCAR in identifying disease-related biomarkers. Notably, the testing classification accuracies were relatively low for both SVM and RF, probably due to inevitable noise during the feature extraction process of brain imaging. These results were also consistent with previous studies (Wang et al., 2012b;Adeli et al., 2017).

The Top Selected SNPs
In addition to neuroimaging biomarkers, SCCA models can also identify genetic biomarkers. We averaged the SNP canonical weights into a single vector and selected the top ten SNPs. As illustrated in Table 5, the proposed model with LD or gene group information yielded meaningful results. For example, rs769449 (APOE) is located in promoter and enhancer areas for multiple brain tissues and is associated with AD . Moreover, the well-known AD risk biomarker rs429358 (APOE) was also identified by the proposed model, demonstrating its strong correlation ability. The remaining five SNPs of the proposed model, i.e., rs7256200 (3.3 kb of APOE), rs10414043 (3kb of APOE), rs4901317 (FERMT2), rs449647 (0.5 kb of APOE), and rs405509 (0.2 kb of APOE), have also been documented to increase the risk of AD in previous studies (Lin et al., 2017;Xiao et al., 2017). However, four selected SNPs have not yet been reported to be related to AD. They still need further research to confirm in the future. Next, we compared the top ten SNPs identified by MT-SCCAR (LD and gene) with the 894 tagSNPs. Interestingly, MT-SCCAR (LD) identified six tagSNPs (rs7256200, rs4901317, rs429358, rs7157639, rs449647, and rs3829947). Multi-task sparse canonical correlation analysis and regression (gene) identified five tagSNPs (rs7256200, rs7157639, rs405509, rs429358, and rs4901317). This implied that using tagSNP will reduce the number of SNPs that need to be analyzed and facilitate identifying significant SNPs. The proposed model with tagSNP also identified some significant SNPs. For example, rs59325138 (3.6 kb of APOE) has been reported to modify the cerebrospinal fluid apolipoprotein E protein levels (Cervantes et al., 2011). The Beta-Amyloid (1-42), an AD biomarker, is associated with rs439401 (1.8kb of APOE) (Xu et al., 2014). The TSCCA identified the rs4292358 and three other SNPs (rs405509, rs7412, and rs7794735) that have been reported previously (Arking et al., 2008;Ma et al., 2016;Zhen et al., 2017). The MTSCCA also identified four SNPs (rs769449, rs7256200, rs10414043, and rs7794735) but cannot identify rs429358. In summary, the proposed model was more accurate for identifying disease-specific genetic biomarkers than the other two models. Alzheimer's disease (AD) usually first affects the hippocampus, resulting in cognitive decline and memory loss (Moreno-Jimenez et al., 2019). Therefore, when selecting the same number of features, the predictive effect of the QTs of the hippocampus can be used to evaluate model performance. Based on this analysis, we built a regression model to predict the QTs of the hippocampus from MRI and PET scans. Different numbers of SNPs were selected from 100 to 1000 with a step of 100. Using a support vector machine (SVR) with RBF kernel, we calculated the average root mean squared error (RMSE) for each model based on five-fold cross-validation. For a fair comparison, we only compared TSCCA, MTSCCA, MT-SCCAR (gene), and MT-SCCAR(LD) since MT-SCCAR (tagSNP) used only 894 tagSNPs. Figure 7 shows the testing RMSE of the left and right hippocampus obtained by different imaging techniques. Smaller RMSE indicates that the selected SNPs are more related to AD. According to Figure 7, the prediction errors were lowest for the proposed model. These results suggested that the proposed model outperformed the other two models on four imaging QTs.

Pairwise Correlation Analyses
Based on the top ten selected ROIs and SNPs obtained by the proposed model with LD group information, we drew heatmaps of pairwise correlation coefficients between SNPs and two imaging QTs. As illustrated in Figure 8, it is clearly observed that the selected SNPs were mainly located in and around the APOE region. APOE is the major genetic risk factor for AD (Munoz et al., 2019). Moreover, the association patterns of SNPs and ROIs selected by MRI and PET were very similar, which indicated the ability of our model to identify modality-consistent biomarkers.
To gain more insight, we further analyzed four undocumented SNPs (rs4904901, rs7157639, rs11629428, and rs3829947) identified by MT-SCCAR with LD group information. The imaging QTs which had the strongest association with these four SNPs were singled out. Consequently, a total of eight SNP-ROI pairs were generated to validate the proposed model. These associations can also allow us to explore relationships from the microscopic molecular level to the macroscopic brain level. Table 6 shows the Pearson correlation coefficients and p-values of eight SNP-ROI pairs. The p-values of all eight pairs were small, indicating a significant correlation within each pair. For rs4904901,it was correlated strongest with the same brain region across both imaging modalities, which suggests it is a modalityconsistent association pattern. For the rest of the SNPs, the heterogeneous association patterns may have great potential to help us understand how changes in molecular level influence brain structure and metabolic.

CONCLUSION
In this paper, we proposed the MT-SCCAR model to investigate potential neuroimaging and genetic biomarkers. Compared with TSCCA and MTSCCA, the proposed model integrated genotype, multiple neuroimaging, and neuropsychological assessments into a single model to analyze multi-modal information. We tested our model on synthetic and ADNI data sets and compared its association results with those of TSCCA and MTSCCA. We found that our model demonstrated higher CCCs of 0.34 ± 0.04 (LD) and 0.39 ± 0.03 (gene) compared with the CCCs of TSCCA (0.23 ± 0.03) and MTSCCA (0.30 ± 0.06). Moreover, MT-SCCAR identified a small number of SNPs from enormous SNPs that were related to AD, wherein all of the top ten selected ROIs were AD brain risk regions. These satisfactory results show that MT-SCCAR outperforms TSCCA and MT-SCCA in detecting disease-specific biomarkers on multi-modal data.
The proposed model incorporates SNPs, neuroimaging measurements, and cognitive scores. However, there are a number of biological pathways that correlate with structural changes in the brain. Therefore, future efforts should aim to integrate data across more levels (i.e., gene expression, cell, and DNA methylation) for a more sophisticated understanding of the biological pathways leading from gene to disease.

DATA AVAILABILITY STATEMENT
The datasets for this article are not publicly available but are available upon request at the following private repository: Alzheimer's Disease Neuroimaging Initiative, http://adni. loni.usc.edu/data-samples/access-data/, https://ida.loni.usc. edu/pages/access/studyData.jsp (The dataset contains the neuropsychological assessment data), and https://ida.loni.usc. edu/pages/access/geneticData.jsp (The dataset contains the genetics data). Requests to access the datasets should be directed to (Alzheimer's Disease Neuroimaging Initiative or catherine.conti@ucsf.edu). The code is available at https://github. com/ftorange/MT-SCCAR.

AUTHOR CONTRIBUTIONS
WK and FK designed the model and analyzed the results. FK prepared data and drafted the manuscript. SW and FK performed the pre-processing with imaging and genetics data. WK helped with data interpretation and manuscript drafting. All authors read and approved the final manuscript.