Prioritization of risk genes for Alzheimer’s disease: an analysis framework using spatial and temporal gene expression data in the human brain based on support vector machine

Background: Alzheimer’s disease (AD) is a complex disorder, and its risk is influenced by multiple genetic and environmental factors. In this study, an AD risk gene prediction framework based on spatial and temporal features of gene expression data (STGE) was proposed. Methods: We proposed an AD risk gene prediction framework based on spatial and temporal features of gene expression data. The gene expression data of providers of different tissues and ages were used as model features. Human genes were classified as AD risk or non-risk sets based on information extracted from relevant databases. Support vector machine (SVM) models were constructed to capture the expression patterns of genes believed to contribute to the risk of AD. Results: The recursive feature elimination (RFE) method was utilized for feature selection. Data for 64 tissue-age features were obtained before feature selection, and this number was reduced to 19 after RFE was performed. The SVM models were built and evaluated using 19 selected and full features. The area under curve (AUC) values for the SVM model based on 19 selected features (0.740 [0.690–0.790]) and full feature sets (0.730 [0.678–0.769]) were very similar. Fifteen genes predicted to be risk genes for AD with a probability greater than 90% were obtained. Conclusion: The newly proposed framework performed comparably to previous prediction methods based on protein-protein interaction (PPI) network properties. A list of 15 candidate genes for AD risk was also generated to provide data support for further studies on the genetic etiology of AD.


Introduction
Alzheimer's disease (AD) is a chronic neurodegenerative disorder that is characterized by cognitive impairment and memory loss.It affected approximately 50 million people worldwide in 2020, which is expected to increase to 150 million by 2050 (Breijyeh and Karaman, 2020).Advanced age is the most important risk factor for AD (Knopman et al., 2021).A significant increase in the incidence rate of AD was observed in senior citizens after the age of 65 years (Knopman et al., 2021).Equal incidence rates of AD were identified for males and females after adjusting for age, indicating that sex might not be associated with the risk of AD (Knopman et al., 2021).The pathological features of AD include senile plaques formed by the accumulation of β-amyloid protein and neurofibrillary tangles composed of highly phosphorylated τ proteins.Several hypotheses have been proposed to explain the pathogenesis of AD, including oxidative stress (Yang et al., 2022), inflammation (Yang et al., 2022), and DNA damage (Tanaka et al., 2021).However, no consensus has yet been reached.
Previous studies have indicated that AD is a complex disorder, and its risk is attributed to multiple genetic and environmental factors (Carmona et al., 2018;Bertram and Tanzi, 2019).In the last decade, genome-wide association (GWA) analyses have significantly contributed to the genetic etiology of AD (Bertram and Tanzi, 2019).Jansen et al. confirmed 29 risk loci and several relevant pathways related to AD through a GWA meta-analysis (Jansen et al., 2019).In addition, Celeste et al. reviewed the relationship between several AD risk genes, including ABCA7, BIN1, CASS4, and CD33, and the cellular and neuropathological characteristics of AD (Karch et al., 2014).Nevertheless, a recent study indicated that approximately half of the heritability of AD remains unaccounted (Raybould and Sims, 2021).It is probable that a large number of susceptibility loci for AD have not yet been discovered.However, recent studies have indicated that largerscale GWA studies in the future are less cost effective due to the intrinsic deficiency rooted in the study design of GWA studies; therefore, it might not be a preferable choice for unraveling these hidden genomic regions that contribute to the risk of AD (Escott-Price and Hardy, 2022).In this sense, prioritizing AD risk genes based on evidence gained from different perspectives and then validating these candidate risk genes in subsequent candidate gene-based association studies might be an effective strategy for discovering more relevant genes for AD risk.In a recent study, Cogill et al. applied machine-learning-based methods using brain developmental gene expression data to prioritize high-confidence candidate genes for autism spectrum disorder (Cogill and Wang, 2016).This study established a feasible analysis pipeline for prioritizing candidate risk genes for complex disorders, using spatial and temporal gene expression data.
Multiple lines of evidence have indicated that the expression of AD risk genes has specific spatial and temporal features (Moradifard et al., 2018;Grubman et al., 2019).Extracting and properly synthesizing information from these gene expression features might be an effective way to prioritize the risk genes for AD.In this study, we aimed to construct and evaluate a machine-learningbased model to identify high-confidence risk genes for AD using spatial and temporal gene expression data extracted from a publicly available database.

Materials and methods
The statistical analysis pipeline is shown in Figure 1.In this study, we propose an AD risk gene prediction framework based on spatial and temporal features of gene expression data (STGE).In this analysis framework, the gene expression data of providers of different tissues and ages were utilized as model features.Human genes were classified as AD risk or non-risk sets and randomly split into training and validation sets.Support vector machine (SVM) models were constructed to capture the expression patterns of genes that were believed to contribute to the risk of AD in the training set, which were then applied to the validation set to evaluate model performance.The STGE model was then applied to a gene set with an unknown status for AD risk, and a confidence score was assigned to each gene.
Spatial and temporal expression data for each gene were obtained from the GTEx database.Gene expression data related to tissues of the human brain (including the cerebellum, cortex, anterior cingulate cortex, hippocampus, substantia nigra, caudate, cerebellar hemisphere, frontal cortex, hypothalamus, nucleus accumbens, putamen, spinal cord, and amygdala) were extracted.Data from tissue sample providers under 20 or over 70 years of age were not included, and all these providers were healthy.In addition, we also removed tissue providers who scored 0 or 4 points on the death classification provided by GTEx database basing on the 4point Hardy Scale (Hardy et al., 1985), because those scores represent the death of the provider is associated with chronic disease.Specifically, the score of 0 added by GTEx database stands for ventilator case (all cases on a ventilator immediately before death), and the score of 4 stands for slow death case (death after a long illness, with a terminal phase longer than 1 day; deaths that are not unexpected).Finally, gene expression data in 13 types of brain-related tissues for 14,697 genes were extracted from 317 tissue sample providers of various ages and genders (Supplementary Table S1 and Supplementary Figure S1).
AlzData is a database for scoring correlations between human genes and the risk of AD, based on evidence from highthroughput omics data.The CFG scores ranged from 0 to 5, with a higher score indicating a stronger correlation between the gene and AD.Genes with scores of 4-5 were extracted to form the AD risk gene set ("the right answer").For genes with scores of 0-3, we supplemented the "DISEASE/TRAIT" (we always call it "trait" for short) from the GWAS catalog and excluded genes related to AD to obtain AD non-risk genes.Finally, 3,899 genes comprising 340 AD risk genes and 3,559 non-AD risk genes were identified, and these genes' CFG scores and GWAS traits are shown in Supplementary Table S2.

Model construction and evaluation
The SVM models were constructed based on spatial and temporal gene expression data extracted from relevant databases using the e1071 package of the R project, and both spatial and temporal aspects of the data are contained in the data features, which is going to be used for feature selection.Gene expression data were first grouped by the tissue type and age of the tissue providers.The median expression level of each gene in the tissue type age group was calculated and used as features in the SVM models.A total of 64 brain tissue-related features were obtained for model construction (Supplementary Table S3).The dataset was randomly divided into training and validation sets in a ratio of 7: 3.There were 238 AD risk genes and 2,491 AD non-risk genes in the training set.The SMOTE function in the DMwR package was used to balance gene numbers.Feature selection was conducted using the caret package, and 19 spatial and temporal features were selected based on recursive feature elimination (RFE).Accuracy and Kappa statistics were chosen as the evaluation indicators to estimate the performance of the selected features, and we chose the feature set with both the greatest value and least variance to build the SVM model.Parameter optimization was performed using a grid search strategy.Parameters including model accuracy, specificity, sensitivity, and area under the curve (AUC) were utilized to evaluate the performance of the SVM model.The R packages pROC and ROCR were used to draw the ROC curve and calculate the AUC, respectively.The R package ggplot2 was used for data visualization.

Results validation
After the genes with high confidence were predicted by SVM model, the normalized expression in AlzData database (http://www.alzdata.org/Normalized_differential1.php) will be used for providing these genes' differential expression data.Besides, KOBAS platform (http://kobas.cbi.pku.edu.cn/) will be used to do gene ontology (GO) and KEGG pathway enrichment analysis in all available databases (including OMIM, KEGG Disease and NHGRI GWAS Catalog).

Feature selection based on recursive feature elimination
RFE was used for feature selection.Data for 64 tissue-age features were obtained before feature selection, and this number was reduced to 19 after RFE was performed.SVM models based on each of these 19 features (the gene expression levels were obtained by median values of samples) were built and evaluated for accuracy, specificity, sensitivity, and AUC (Table 1).The feature with the highest AUC was the human tissue of the brain cerebellum at the age of 40-49 (AUC = 0.688).

Comparison and validation of SVM models
SVM models were built and evaluated using 19 selected and full features (Table 2 and Figure 2).The AUC values for the SVM model based on 19 selected features (0.74 [0.690-0.790])and full feature sets (0.730 [0.678-0.769])were very similar.To evaluate model robustness, we also constructed these models based on the mean expression level of each gene in the tissue type age group.In addition, to examine the potential effects of sex, SVM models were constructed based on the expression data from male and female samples.The results are summarized in Supplementary Table S4.There are no significant differences when mean values were utilized compared to median values.The model performance based on males or females was also very similar to that of models constructed using all samples.Finally, we chose the selected feature and median values to construct the SVM model because of its highest AUC.Besides, some known AD risk genes (such as APOE, PICALM and BIN1) were recovered with the final SVM model, and the probabilities of them being classified as AD risk genes are ranged from 0.723-0.783and shown in Supplementary Table S5.

Risk genes of AD predicted by the SVM model
Based on the SVM models constructed using tissue-age-specific gene expression data, the risk contributions to AD onset and development were evaluated for 10,798 genes that were not included in the model construction and evaluations (the external gene set).15 genes predicted to be risk genes for AD with a probability greater than 90% were obtained (Table 3).Among these genes, GUCY1B3 had the highest confidence score as a risk gene for AD (0.93).To further investigate this gene set, we examined the gene expression patterns of these 15 genes in the human brain and made a heatmap showing in Supplementary Figure S2.In addition, 191 risk genes for AD with a probability greater than 80% are shown in Supplementary Table S6.

Differential gene expression analysis and pathway/ontology analysis
After the normalized differential gene expression analysis, there exist 8 genes among 15 candidate genes expressing differentially in AD.The differential expression data of these 8 genes are shown in Table 4.The GO and KEGG pathway enrichment analyses find out 15 pathways that are statistically correlated with candidate genes, which are shown in Supplementary Figures S3, S4.

Discussion
In the present study, we propose a novel machine-learningbased analysis pipeline using data extracted from the GTEx database to prioritize candidate AD risk genes.The performance measured by the AUC of the SVM models was promising, and a list of 15 candidate AD risk genes was presented according to the prediction model.In the last decade, several studies have been published to identify candidate AD risk genes, and most of these studies were based on protein-protein interaction (PPI) networks to identify hub genes using GWA data.The model performance measured by the AUC of these previous studies ranged from 0.63 to 0.84 depending on different settings (Luo et al., 2019;Lagisetty et al., 2022;Wang et al., 2022;Pei et al., 2023).The methods used in these comparative studies and their AUC are shown in the Supplementary Table S7.Unlike these previous studies, the STGE framework was used to predict AD candidate genes based on the spatial and temporal features of AD risk gene expression.The performance of our model (AUC = 0.74) was comparable to that of previous studies.In this sense, the present study proposed and validated an alternative framework for prioritizing risk genes for AD.In the future, an analysis framework integrating information from gene expression features and PPI network properties might be a promising method to further promote the accuracy and effectiveness of prediction models for prioritizing candidate AD risk genes.
Although most patients with AD experience the first symptom in their mid-60s, previous studies have indicated that changes in the molecular levels occur at a much earlier stage (Egan et al., 2019;Vermunt et al., 2019).A previously published family-based longitudinal study has shown that familial AD may have a long prodromal phase of several years (Chiotis et al., 2018).A recent cohort study also indicated that plasma phospho-tau181 levels were much higher from 16 years prior to the onset of AD symptoms in AD patients with specific DNA mutations (Wang et al., 2021;Karikari et al., 2022).The results of the current study offer new evidence at the gene expression level for prodromal changes in AD patients.Although AD is a late-onset disorder, more than half of the selected features were obtained from sample providers before the age of 60 years.Five of the 19 features, including tissues of the anterior cingulate cortex, putamen basal ganglia, caudate basal ganglia, cerebellum, and hypothalamus, were obtained from providers who are 30-39 years old.In accordance with multiple lines of previous evidence, these findings indicate that molecular-level changes might be identified several years before early symptoms appear in patients with AD.Nevertheless, since a couple of the AD risk genes used in this study were extracted from studies focusing on early-onset AD, we need to be cautious in interpreting these results.Future research using longitudinal data might provide more clues for identifying prodromal biomarkers for AD and, in turn, shed light on early screening and prevention of this complex neurodegenerative disorder.
Among the 15 candidate genes identified through STGE, a few are of particular interest.Sine oculis homeobox homolog 3 (SIX3) encodes a type of transcription factor belonging to the sine oculis homeobox transcription factor family (Steinmetz et al., 2010).Multiple lines of evidence based on animal models have linked this locus to brain development (Steinmetz et al., 2010;Schacht et al., Frontiers in Genetics frontiersin.org2020).A recent GWA study associated genetic polymorphisms of SIX3 with math ability, and its weakening was considered a sign of the progression of AD patients (Lee et al., 2018).Actin-related protein 3B (ACTR3B) encodes a member of the actin-related protein (ARP) family, which might regulate and induce cell shape changes and motility (Hu et al., 2018).Several previous studies have linked ACTR3B to brain aging progression, although no direct GWA study has validated the connection between genetic polymorphisms of these loci and AD (Hu et al., 2018;Seefelder and Kochanek, 2021).In addition, multiple animal models and population-based evidence have been published for dopamine receptor D2 (DRD2) and gamma-aminobutyric acid type A receptor subunit alpha 5 (GABRA5) being associated with brain-related disorders and traits, including schizophrenia, bipolar disorder, Parkinson's disorder, and neurotransmission (Prisciandaro et al., 2017;Escamilla et al., 2018;Mundorf et al., 2021;Zhang et al., 2021).
In a recent study, Blum et al. concluded that the DRD2 Taq1A A1 allele might increase the risk of Alzheimer's aging in African Americans by integrating and reviewing previously published data (Blum et al., 2018).Additionally, the genes BAG Cochaperone 3 (BAG3), inositol polyphosphate-5-phosphatase A (INPP5A), seizure related 6 homolog (SEZ6), and intercellular adhesion molecule 5 (ICAM5) are involved in the progression of AD has been proposed in several functional studies using animal models (Hoarau et al., 2011;Paetau et al., 2017;Zhu et al., 2018;Zhou et al., 2020;Zhu et al., 2021).Within these genes, through proteomic study, BAG3 may affect AD by influencing the interpretation of Aβ and tau protein, and patients with AD have much lower levels of SEZ6 in their cerebrospinal fluid than those without dementia (Khoonsari et al., 2016;Gonzalez-Rodriguez et al., 2021).Further in vivo and in vitro studies are needed to validate the functional connections between the risk of AD and the genes on the predicted list.Three of the 15 pathways identified by GO and KEGG pathway enrichment analyses are worthy of attention, including regulation of synapse structural plasticity, branching morphogenesis of a nerve and forced vital capacity.According to a review, synapse structural plasticity is related to the number of spines, and post-mortem reports of Alzheimer's brains showed reduced spine number in the hippocampus and cortex (Chidambaram et al., 2019).One research studying novel compounds' effect on neuronal branching morphogenesis of PC12 cells indicates that branching morphogenesis is one of the entry points for research to promote recovery of nerve regeneration following neurodegenerative diseases, like AD (Katebi et al., 2019).A prospective cohort study of 431,834 individuals shows that per unit decrease in lung function measure was each associated with increased risk for all-cause dementia (including AD).As for forced vital capacity, its hazard ratio (HR) is 1.16 and p-value is 2.04 × 10 −5 (Ma et al., 2023).
The current study has several limitations.First, there is still much space for the promotion of STGE, although the performance of STGE is comparable to that of previous models based on PPI network properties.In addition, as bioinformatics data mining is based on publicly available databases, the completeness of the current work might be limited owing to data availability.The gene expression data in the brain substantia nigra in the age group of 30-39 years were unavailable from the database; therefore, this feature was not included in the model  (Goyal et al., 2018), all candidate AD risk genes are protein-coding genes in the current study.Furthermore, the data we used in our research can only correlate to tissues, so we were unable to associate these genes with specific brain cell types.In summary, in the present study, an efficient analysis framework based on spatial and temporal features of gene expression was proposed to prioritize AD risk genes.The newly proposed framework performed comparably to previous prediction methods based on PPI network properties.A list of 15 candidate genes for AD risk was also generated to provide data support for further studies on the genetic etiology of AD.

FIGURE 2
FIGURE 2 ROC curves of the SVM models constructed based on the median gene expression levels in different tissue-age groups.

TABLE 1
The mean accuracy, sensitivity, specificity and AUC of each model built by each selected feature from the RFE method.

TABLE 3
Genes predicted by the SVM model with their confidence score, location, length (bp) and biotype.

TABLE 4
Differential expression results of candidate genes with FDRs < 0.05.and evaluation.Besides, although the data for training the model contains non-coding RNA, which have been shown to play an important role in the pathogenesis of complex disorders construction