Associating brain imaging phenotypes and genetic risk factors via a hypergraph based netNMF method

Abstract Alzheimer’s disease (AD) is a severe neurodegenerative disease for which there is currently no effective treatment. Mild cognitive impairment (MCI) is an early disease that may progress to AD. The effective diagnosis of AD and MCI in the early stage has important clinical significance. Methods To this end, this paper proposed a hypergraph-based netNMF (HG-netNMF) algorithm for integrating structural magnetic resonance imaging (sMRI) of AD and MCI with corresponding gene expression profiles. Results Hypergraph regularization assumes that regions of interest (ROIs) and genes were located on a non-linear low-dimensional manifold and can capture the inherent prevalence of two modalities of data and mined high-order correlation features of the two data. Further, this paper used the HG-netNMF algorithm to construct a brain structure connection network and a protein interaction network (PPI) with potential role relationships, mine the risk (ROI) and key genes of both, and conduct a series of bioinformatics analyses. Conclusion Finally, this paper used the risk ROI and key genes of the AD and MCI groups to construct diagnostic models. The AUC of the AD group and MCI group were 0.8 and 0.797, respectively.


Introduction
Alzheimer's disease (AD) is a neurodegenerative disease with insidious onset and progressive development. The most common early symptom is difficulty remembering recent events (Scheltens et al., 2021). As the disease progresses, patients gradually lose their ability to care for themselves and eventually die from complications such as infection (Scheltens et al., 2021). The exact cause of AD is still unknown, but it has a long-standing preclinical feature: mild cognitive impairment (MCI). The typical symptoms of MCI patients are memory loss and there may be damage to one or more cognitive areas (Carlson and Prusiner, 2021). Still, it is not enough to affect the patient's daily life, and the diagnostic criteria for AD have not yet been met. Previous studies show that structural and functional abnormalities of the brain and phenotypic or molecular abnormalities associated with AD are associated (Simrén et al., 2021). However, the source or cause of these abnormalities is unclear.
In recent years, imaging genetic analysis related to brain diseases has attracted much attention. Imaging genetics explores the effect of genetic variation on brain structure, metabolism, and function through association analysis of radionics data (such as sMRI, PET, fMRI; Jiang et al., 2018) and genetic data (such as DNA methylation, SNP, gene expression; Du et al., 2020).
The association mechanism of imaging genetics is gradually revealed through robust association analysis algorithms. By improving sparse canonical correlation analysis (SCCA) with various strategies, Du et al. (2020) proposed various innovative algorithms based on SCCA to analyze brain imaging genetics data. For example, they proposed a dirty multi-task sparse canonical correlation analysis (dirty MT-SCCA) to study imaging genetic problems involving multimodal brain imaging. This method used multi-task learning and parameter decomposition to simultaneously identify patternconsistent and pattern-specific brain regions and SNP loci. Furthermore, they proposed a parametric decomposition-based sparse multi-view canonical correlation analysis (PDSMCCA) method to identify modality-sharing and specific information from multimodal data to gain insight into the complex pathology of brain diseases (Zhang et al., 2022). However, the above two algorithms only supported the analysis of multimodal brain imaging and single-modal genetic data. The sparse multi-view sparse canonical correlation analysis (SMCCA) algorithm can simultaneously perform correlation analysis on data from multiple modalities. However, the direct fusion of multiple SCCA objectives can cause gradient domination problems, resulting in SMCCA being a suboptimal model. Therefore, they proposed two adaptive multi-view canonical correlation analysis algorithms to solve the problem of gradient domination, integrating the underlying relationships between protein expression, SNPs, and neuroimaging (Du et al., 2021).
On the other hand, multi-objective optimization algorithms are also emerging in the field of imaging genetics. Bi et al. (2022) integrated the fMRI and SNP data of Parkinson's patients through multiple optimization methods for multimodal analysis. They used a correlation analysis method to construct fused features from the sequence information and SNPs of regions of interest (ROIs). Then, a weighted evolution strategy was introduced into ensemble learning, and a new weighted evolutionary random forest (WERF) model was constructed to eliminate inefficient features. In addition, they also proposed a clustered evolutionary random forest (CERF) method to detect discriminative genes and brain regions and found some interesting associations between brain regions and genes in Parkinson's patients (Bi et al., 2021). However, the above algorithms cannot interpret the results from the perspective of network regulation. Therefore, this paper used a hypergraph-based netNMF (HG-netNMF) algorithm to explore the relationship between genes and sMRI from both gene network and brain network construction. The algorithm exploited hypergraph regularization to mine higher-order features of both modal data. Further, the key genes and ROIs were explored from the critical network modules. Bioinformatics analysis and diagnostic model construction were performed to provide new insights into the imaging genetic association mechanism of AD and MCI.

Nonnegative matrix factorization
Nonnegative matrix factorization (NMF) algorithm is a classic dimensionality reduction algorithm that were widely used in image processing, audio processing, biological data analysis, and other fields (Kim and Tidor, 2003;Zhang et al., 2012;Yang et al., 2021). It decomposes the matrix X n p ∈ ×  to obtain the basis matrix W n k ∈ ×  and the coefficient matrix H k p ∈ ×  , and its objective function is as follows:

Network non-negative matrix factorization
The traditional NMF algorithm can only decompose single-modal data and cannot analyze the network control module composed of multi-modal data. When X is a symmetric similarity matrix, Chen and Zhang (2018) proposed the netNMF algorithm to decompose multiple matrices simultaneously. It is worth noting that the algorithm can be used to analyze the network modules of various data. For example, their respective correlation networks and mutual correlation networks can be calculated for analysis for the two expression matrices X n p The objective function of the netNMF algorithm is as follows: Among them, R p p 11 ∈ ×  , R q q 22 ∈ ×  are the respective association networks of X 1 and X 2 . R p q 12 ∈ ×  is the cross-correlation network of X 1 and X 2 . G p k 22 ∈ ×  are non-negative factor matrices. The algorithm can be analyzed from the network module level constructed by the data, and R G G T F 12 1 2 2 − is used to identify the one-to-one relationship between the two types of modules.
In fact, R p p are the symmetric similarity matrices corresponding to the two types of features, respectively, and R p q 12 ∈ ×  is the nonnegative similarity matrix between the two types of data. S 11 and S 22 represent the matrix describing the similarity of the two networks obtained from the decomposition of R 11 and R 22 , respectively. Off-diagonal elements in S 11 and S 22 indicate the importance of relationships between rois and between genes.

Hypergraph learning
In graph learning, vertices and edges can describe the relationship between multiple objects. However, simple graphs may not depict complex relationships in practical situations. Therefore, hypergraph theory is widely used to mine higher-order correlations between complex things (Xiao et al., 2020). A hypergraph can connect more than two Further, define degree diagonal matrices D v and D e . Among them, . Then, the similarity matrix S that defines the hypergraph G is as follows.

S HWD H
Similar to the Laplacian matrix definition for simple graphs, the Laplacian matrix for hypergraphs is defined as follows.

Hypergraph-based network non-negative matrix factorization
The previously introduced netNMF algorithm does not consider the correlation within different networks, while the hypergraph Laplacian matrix can integrate high-order correlation information of different modal data. It helps the algorithm identify more biologically meaningful network modules and improves its performance to a certain extent. This paper defines hypergraph regularization as follows.
The following formula can be obtained using (10) to calculate the partial derivatives of S 11 , S 22 , G 1 and G 2 , respectively.
Line graph of the relationship between the number of neighbors and the value of the relative error when building a hypergraph in the two groups. (A,B) Are the cases of AD group and MCI group, respectively.
Frontiers in Aging Neuroscience 04 frontiersin.org Further, through the Karush-Kuhn-Tucher (KKT) condition, the iteration rules of S 11 , S 22 , G 1 and G 2 can be obtained, as shown below.

Network module selection method
Two types of modules, G 1 and G 2 , can be identified from the ROI matrix R 11 and gene network matrix R 22 constructed in this paper. Specifically, zscore is introduced into this paper to calculate the weights of genes and ROIs, and genes and ROIs with weights above the threshold are treated as members of the corresponding community.
In this paper, the threshold was set to 1.

Evaluation indicators for regression analysis and diagnostic model construction
In this paper, regression analysis and diagnostic model construction were carried out on the Top elements in the network module obtained by the HG-netNMF algorithm. For regression analysis, this paper introduced mean absolute error (MAE) and root mean square error (RMSE) to evaluate the regression performance, and their definitions are as follows: where  i y represents the predicted value, and y i represents the true value. In addition, when constructing the diagnostic model, this paper introduced the Receiver Operating Characteristic (ROC) curve to evaluate the classification performance of the algorithm, the abscissa is the false positive rate (FPR), and the ordinate is the true positive rate (TPR). The AUC value is the area covered by the ROC curve. AUC can measure the classification effect of the classifier.

Data acquisition and preprocessing
The real data used in this paper are all from the ADNI database. We collected sMRI imaging and gene expression data from 306 subjects in ADNI. Table 1 gives specific information of the samples.
For sMRI data, this paper realized the segmentation of sMRI based on the CAT toolkit of MATLAB software. Specifically, the CAT toolkit provided a voxel-based morphometric measurement (VBM) function (Veres et al., 2009). Finally, the gray matter volumes of 140 ROIs were extracted as imaging features. Differential expression analysis was performed using the Limma package for gene expression data. Specifically, we used the AD group and the MCI group as the diseased group and the HC group as the control group to experiment. Differentially expressed genes with p values less than 0.01 in the AD and MCI groups were retained (510 genes were retained in the AD group, and 314 genes were retained in the MCI group).
In addition, this paper firstly divided the AD group and the MCI group into the training set and the test set according to the ratio of 8:2 and used the HG-netNMF algorithm to perform network module association analysis on the training set of the two groups of data, and then validate on the test set, and use Top 10 genes to regress Top 10 ROIs, build diagnostic models, etc.

The influence of neighbor size
In this paper, the K-nearest neighbors (KNN) method was used to construct the hyperedge, and the neighbor size of the KNN Frontiers in Aging Neuroscience 05 frontiersin.org algorithm needs to be selected. The size of the neighbors was selected based on the size of the objective function value. Based on the selection experience of previous papers, we divided the parameters of the two groups. Figures 1A,B below give the line graphs of the relationship between the neighbor size of the AD group and the MCI group and the objective function value, respectively. The AD group's smallest relative error was 0.4496 (corresponding to 31 and 5). The MCI group's smallest relative error was 0.4848 (corresponding to 5 and 47).

Parameter selection
Two parameters need to be selected in this paper, namely λ 1 and λ 2 . The objective function value was selected as the selection standard, and parameters were selected from the range of [0.0001, 0.001, 0.01, 0.1]. The schematic diagram of parameter selection was shown in Figure 2.
In addition, according to the selection experience of previous literature, the value of k generally does not exceed one-tenth of the minimum number of samples or features for the number of genes-ROIs network modules k. Therefore, the k value in this paper was set to 7 in the AD group and 22 in the MCI group.

Algorithm comparison
To confirm the superiority of the algorithm, the algorithm was compared on simulated data and the real data. First, on simulated data, this paper compared the objective function values of the two algorithms at different noise levels. The generation method of simulated data was similar to Kim et al. (2019). Specifically, this paper simulated the sMRI numerical matrix X and the gene numerical matrix Y of 300 random samples. By defining a normal distribution N 0 2 ,σ  ( ) , generate an ROI weight vector u with 200 elements and a gene weight vector v with 2000 elements. In addition, this paper created noise  = e, which comes from a normal distribution N e 0 2 ,σ ( ) . Next, correlated and uncorrelated ROI and gene variables were generated similarly: X u e = + ε and X e = and Y v e = + ε and Y e = . The respective and mutual Pearson correlation coefficients were then calculated for the two variables as inputs to the algorithm (R 11 , R 22 , and R 12 ). Next, the objective function values of the proposed method and the netNMF algorithm were compared under the same experimental conditions and noise levels ( Table 2). In addition, we simulated the anti-noise performance of the two algorithms when the sample size is large. Specifically, we set the total number of samples to 1,000, and the remaining parameters are consistent with the above to obtain the A B

FIGURE 2
The line graph of parameter selection and objective function value of AD group and MCI group. (A,B) Are the cases of AD group and MCI group, respectively.  objective function values of the two algorithms under different noise levels (Table 3).

R G S G
In real data, this paper compared the results of the proposed HG-netNMF algorithm with the results obtained by the netNMF algorithm for the AD group and the MCI group. Specifically, we calculated the Pearson correlation coefficients between the reconstructed matrices R 11 ′ , R 22 ′ , and R 12 ′ and the three original matrices R 11 , R 22 , and R 12 , as shown in Table 4. In addition, we give the formulas for R 11 ′ , R 22 ′ and R 12 ′ in the following formula.
corr X Y , ( ) represents the Pearson correlation coefficient between X and Y . It can be seen from Table 3 that the proposed algorithm outperforms the netNMF algorithm in the reconstruction of R 22 and R 12 in the AD group. In the MCI group, the proposed algorithm outperformed the netNMF algorithm for reconstructing the three original matrices. This confirmed that hypergraph regularization contributes to the improvement of algorithm reconstruction performance. In addition, this paper also drew the scatter plots of the two algorithms between the three original matrices and the three reconstructed matrices in the AD group and the MCI group, as shown in Figure 3. As seen from Figure 3, the two algorithms had comparable reconstruction performance when reconstructing R 11 and R 12 . However, the proposed algorithm performed better when reconstructing R 22 in the MCI group. Correlation scatter plots of 11 R , 12 R , and 22 R of the two algorithms and their reconstruction matrices 11 R ′ , 12 R ′ , and 22 R ′ in the AD and MCI groups. (A,E,I) Are three sets of scatter plots obtained by the netNMF algorithm in the AD group. (B,F,J) Are three groups of scatter plots obtained by the HG-netNMF algorithm in the AD group. (C,G,K) Are three sets of scatter plots obtained by the netNMF algorithm in the MCI group. (D,H,L) Are three groups of scatter plots obtained by the HG-netNMF algorithm in the MCI group.

Network module selection
Fourteen network modules were obtained in the AD group, of which module 1 and 10 contained less than five genes, so they were eliminated. In the MCI group, 14 modules were obtained, of which modules 1, 3, 6, 9, and 11 contained less than five ROIs, and modules 2, 3, 4, 7, and 9 contained less than five genes. Therefore, the above modules were eliminated. Figure 4 below shows the number of elements retained in the AD and the MCI group and the respective reconstruction errors and total reconstruction errors of the two modal features.
As shown in Figure 4, module 3 of the AD group and module 10 of the MCI group had minor relative errors, and subsequent analysis of these two modules will be performed later.

Significant network module analysis
First, this paper constructed a PPI network using the ROIs and genes in network module 3 in the AD group, respectively. Specifically, this paper selected elements in module 3 from R 11 and R 22 to form PCC pairs with a one-to-one correspondence between two components. The AD group's mean of the PCC of gene-gene and ROI-ROI was 0.902. Therefore, in this paper, the PCC threshold of ROI and genes was set to 0.9, and uses the relationship pair of PCC > 0.9 to construct the ROI-ROI interaction network and the gene-gene interaction network. Furthermore, the interaction network model was visualized using Cytoscape (version 3.9.1; Shannon et al., 2003). The Matthews Correlation Coefficient metric (MCC) algorithm (Boughorbel et al., 2017) was widely used as a performance metric in bioinformatics. This paper used the MCC algorithm to mine the scores of ROIs and genes related to other network nodes and then arranges the top 10 ROIs and genes as the risk ROI and key genes. The MCI group's mean of the PCC of gene-gene and ROI-ROI was 0.604. Therefore, in this paper, the PCC threshold of ROI and genes was set to 0.60, and used the relationship pair of PCC > 0.6 to construct the ROI-ROI interaction network and the gene-gene interaction network. The PPI networks of the two groups are shown in Figure 5.

Key regions of interests and genes selection
In this paper, the MCC algorithm was used to obtain the scores of the Top nodes in the four network graphs of the AD and the MCI groups, as shown in Tables 5, 6. A higher score in the table represents a more critical role for this ROIs/gene in MCI/AD. So we discussed these ROIs/genes in detail in the subsequent discussion section. The top brain regions in both groups were also visualized ( Figure 6,  Figure 7). We also drew Venn diagrams for ROI and gene nodes in the AD group and MCI group network (Figure 8).   Top 10 ROI visualization of AD group.

FIGURE 7
Top 10 ROI visualization of MCI group.

Analysis of the biological significance of key regions of interests and genes
We first analyzed the biological significance of the Top 10 ROIs in the AD group. A pilot study showed that in AD, Aβ deposition in the inferior temporal gyrus was strongly associated with gray matter atrophy in brain region BF-227 (Maeno, 2019). The left lingual gyrus is associated with changes in functional connectivity at the local network level in AD (Chang et al., 2020). The left angular gyrus is associated with increased brain metabolism in AD patients (Weissberger et al., 2017). Studies on rs-fMRI have shown that the left angular gyrus is the brain functional connectivity region showing the most significant discrimination . The left inferior frontal gyrus is associated with genetic variants of cortical atrophy in AD, contributing to further understanding AD's genetic basis . Maeno (2019) found that Aβ deposition in the anterior cuneiform of AD patients was associated with atrophy of the right occipitotemporal region. The left medial orbital age has a strong negative correlation, which is valid in the age adjustment of AD subjects .
We also analyzed the Top ROIs in the MCI group's biological significance. The left entorhinal cortex decreases in volume and cognitive and motor dysfunction in older adults with mild cognitive impairment (Sakurai et al., 2019). Cerebral blood flow (rCBF) in the left lateral orbital gyrus was decreased in the dizzy MCI group compared with the non-dizzy MCI group (Na et al., 2020). In an ALFF-based fMRI study, researchers found significantly increased ALFF values in the right lingual gyrus of MCI patients compared with HC (Lai et al., 2022). Occipital gray matter volume correlates with neuropsychological performance in patients with amnestic MCI or mild AD (Arlt et al., 2013). The left superior parietal lobule is associated with modulating the amplitude of low-frequency fluctuations (ALFF) in patients with mild cognitive impairment (MCI) (Zhuang et al., 2020). Chen et al. (2020) found atrophy of the Para hippocampus gyrus structure in MCI. In addition, it can be found from Tables 4, 5 that the left inferior frontal gyrus and left inferior frontal gyrus are the intersection ROIs of the two groups. Here we found that these two brain regions were selected in both the AD and MCI groups. Using the left inferior frontal gyrus as a seed, Pistono et al. (2021) found that the functional connectivity of the language network could better discriminate the MCI and AD participants than the executive control network, revealing an increase in connectivity during the MCI phase. To study the relationship between MCI patients and apathy, the researchers divided MCI patients into those with and without "SPECT images suggestive of AD. " They found that apathy Frontiers in Aging Neuroscience 11 frontiersin.org negatively correlated with regional cerebral blood flow in the bilateral fusiform gyrus (Kazui et al., 2016). Next, this paper also performed the same analysis on the Top 10 genes. Two significant features of AD are transcriptome dysregulation and altered RNA-binding protein (RBP) function (Chen et al., 2020). TNS1-related Gene Ontology annotations included RNA binding and actin binding. Variations in actin-involved endocytosis pathways are significant contributors to the overall regulation of genetic risk for AD (Pistono et al., 2021). TRIM58 gene induces E3 ubiquitin ligase in late erythropoiesis (Kazui et al., 2016). Experiments have confirmed that α-synuclein in red blood cells may help differentiate AD from HC. Erythropoiesis is associated with FECH. The protein encoded by FECH is localized to mitochondria (Kazui et al., 2016). GLRX5 encodes a mitochondrial protein (Rybak-Wolf and Plass, Histograms of errors (MAE and RMSE) obtained by regression prediction of Top 10 ROIs using Top10 genes in the AD and MCI groups, respectively. (A,B) Are the regression results of two groups, respectively.
Frontiers in Aging Neuroscience 12 frontiersin.org 2021). E2F2 is involved in the control of the cell cycle, and Zhou et al. (2021) identified vital cell cycle regulators, helping to develop potential pathways for optimal AD treatment. BCL2L1 was identified as a core target of tau pathogenesis (Tesi et al., 2020). TXN is associated with cellular senescence, and Gene Ontology annotations associated with this gene include RNA binding and oxidoreductase activity. Chico et al. (2013) found that superoxide dismutase activity was reduced in MCI compared to controls. SLC15A3 is involved in the innate immune response, and Fiala et al. (2017) found that the innate immune system in MCI patients is highly increased or decreased through the transcription of inflammatory genes. LST1 may regulate immune responses, and peripheral innate immune responses had the highest activation level in the MCI group compared with the subjective memory complaints (SMC) and AD groups (Munawara et al., 2021). The CCDC9 gene is a possible component of the

Analysis of MMSE based on top features
In this paper, the Top 10 ROIs and 10 genes of the AD and the MCI groups were used to predict MMSE (stands for root mean square error) to confirm the correlation between the Top features and the clinical score. Specifically, this paper first used support vector regression (SVR), random forest (RF), and K-Nearest Neighbors (KNN) algorithms to regress the MMSE of the two groups on the training sets of the two groups, respectively. The regression effect was evaluated using MAE and RMSE, as shown in Figure 9.
It can be found from Figures 9A,B

Correlation analysis between top regions of interests and genes
This paper drew a correlation heat map for the Top 10 ROIs and genes on the test set in the two groups obtained above, as shown in Figure 10.
As seen in Figure 10, there is a strong correlation between the Top features of the AD group data. This paper takes the absolute value of PCC and then calculates the average correlation. Among them, the average correlation for the AD group was 0.5411, and the correlation of the control group was 0.1045. The mean correlation was 0.1039 in the MCI group and 0.2619 in the control group.

Regression analysis between top features
In this section, this paper still used three regression algorithms: SVR, RF, and KNN. Specifically, this paper uses the Top 10 genes selected by the HG-netNMF algorithm to regress to the Top 10 ROIs, respectively, trains three models on the training set, and uses MAE and RMSE to evaluate the regression effect on the test set, as shown in Figure 11. It can be found from Figure 11 that using different regression algorithms to predict the left lingual gyrus of the AD group can achieve the smallest root mean square error. Using different regression algorithms to predict the right frontal operculum of the MCI group can achieve the smallest root mean square error.

Classification analysis
This paper's ROC curves were drawn using the Top 10 ROIs and genes of the AD and MCI groups, respectively. In addition, this paper used Logistic regression in IBM SPSS Statistics software to build a joint diagnostic model, as shown in Figure 12. This paper also counted the specific information of the four diagnostic models, as shown in Table 7.
Next, to determine whether the ROIs and genes involved in the construction of the diagnostic model also have diagnostic significance for AD and MCI, this paper drew ROC curves for the ROIs and genes involved in the construction of the diagnostic model in Figure 9 to verify one by one, as shown in Figure 10. In addition, the ROC curve information of a single ROI and gene in the two groups was calculated separately, as shown in Tables 8, 9. It can be seen from Figure 13 that most of the AUCs of ROI/gene involved in the construction of the diagnostic model are more significant than 0.5, which has diagnostic significance for AD and MCI, further confirming the effectiveness of the algorithm.

Conclusion
This paper proposed a hypergraph-based NetNMF method to integrate sMRI and genetic data of AD and MCI patients and mined their respective regulatory networks and interaction information, aiming to explore the risk ROIs and genes associated with AD and MCI. Finally, robust diagnostic models for AD and MCI were constructed, respectively. In future work, we will integrate more data types to analyze the disease-related regulatory mechanisms more comprehensively.

Data availability statement
Publicly available datasets were analyzed in this study. This data can be found at: https://adni.loni.usc.edu.

Author contributions
JT contributed to the conception of the study. JZ and JT performed the experiment. XX and TL contributed significantly to analysis and manuscript preparation. ZC and RC performed the data analyses and wrote the manuscript. JC and XL helped to perform the analysis with constructive discussions. All authors have read and approved the manuscript.