A network-based method for associating genes with autism spectrum disorder

Autism spectrum disorder (ASD) is a highly heritable complex disease that affects 1% of the population, yet its underlying molecular mechanisms are largely unknown. Here we study the problem of predicting causal genes for ASD by combining genome-scale data with a network propagation approach. We construct a predictor that integrates multiple omic data sets that assess genomic, transcriptomic, proteomic, and phosphoproteomic associations with ASD. In cross validation our predictor yields mean area under the ROC curve of 0.87 and area under the precision-recall curve of 0.89. We further show that it outperforms previous gene-level predictors of autism association. Finally, we show that we can use the model to predict genes associated with Schizophrenia which is known to share genetic components with ASD.


Introduction
Autism spectrum disorder (ASD) is a complex neurological and developmental disorder that affects a person's behavior, communication, and learning abilities.About 1 in 44 children is identified with the disorder according to estimates from Centers for Disease Control and Prevention (CDC) Autism and Developmental Disabilities Monitoring (ADDM) Network (Maenner, 2021).It is thought to be caused by a combination of genetic and environmental factors that impacts the structure and function of the brain and nervous system (Flickr, 2023).Identifying the genetic base of ASD is critical to understanding its underlying biological mechanisms.Such knowledge will impact the development of new interventions and treatments for individuals affected by this disorder.
Machine learning based methods offer a new perspective to the problem by learning from known ASD-related genes and building models that provide ways to prioritize the risk associated with previously unknown genes based on their predicted scores (Liu et al., 2014;Krishnan et al., 2016;Duda et al., 2018;Brueggeman et al., 2020;Lin et al., 2020).These methods differ in their training features and prioritization method.(Duda et al., 2018) and (Krishnan et al., 2016) leveraged brainspecific functional interaction networks to produce genome-wide rankings of ASD associated genes.(Liu et al., 2014) clustered evidence for ASD association within a co-expression network in specific brain regions.(Lin et al., 2020) used features extracted from spatiotemporal gene expression patterns in the human brain.Last, the state-of-the-art forecASD (Brueggeman et al., 2020) integrates network-based information from large gene interaction networks with scores of genetic association and brain gene expression information.In detail, novel features are generated from BrainSpan expression and STRING interaction data.These are combined with literature-derived features from DAWN, DAMAGES, and Krishnan (Liu et al., 2014); (Krishnan et al., 2016;Zhang and Shen, 2017) and used to train a random forest classifier for ASD association.
The methods described above have predominantly relied on a single data source, potentially missing relevant information.Conversely, some approaches such as forecASD utilize multiple data sources that are devoid of network context.To address these limitations, our proposed method leverages a network propagation technique to integrate diverse data sources while accounting for their network context.Our model employs network-propagation on ASD associated genes from different data sets to derive predictive gene scores.Features are then combined using a random forest classifier.We evaluate the performance of our model and compare it to previous methods.Finally, we use our model to predict Schizophrenia-associated genes.

Classification pipeline
The computational pipeline has two stages.First, network-based gene features are generated using a network propagation technique.Second, a random forest model is applied to these features to yield the prediction score.The classifier is summarized in Figure 1.Its stages are described in the following sections.

Feature generation
Our starting point is gene lists obtained from the literature as detailed in Table 1.Overall, we use ten gene sets that were suggested to be associated with ASD based on various layers of information.
Each of these ASD related gene lists is used as a seed for a network propagation process that pinpoints other genes with high proximity to the seed set in a protein-protein interaction (PPI) network.The initial value of each seed protein from a list of size s is set to 1/s.We use a human PPI network from (Signorini et al., 2021) which has 20,933 proteins and 251,078 interactions in its main connected component.We run network propagation with default damping parameter ɑ = 0.8.We normalize the results using the eigenvector centrality method (Erten et al., 2011) in order to avoid biases which are caused by the degrees of the proteins.The resulting ten propagation scores for each gene comprise its feature set.

Random forest model
The features of a gene are integrated using a random forest model.To train the model, we use SFARI's Gene Scoring Module (Abrahams et al., 2013) which offers critical evaluation of the strength of the evidence for each gene's association with ASD.The genes are assigned to four categories: "Syndromic" (S), "Category 1" (High Confidence), "Category 2" (Strong Candidate) and "Category 3" (Suggestive Evidence).We label "Category 1" genes as positives (206 in total) and randomly pick 206 negative genes that do not appear in the SFARI database.
The random forest model is trained with the "sklearn" Python package using its default parameters which are a maximum of 100 trees, no maximum tree depth and minimum number of samples required to split an internal node of 2. ASD Classifier Flowchat.Putative ASD-associated genes serve as seeds to a network propagation process.After propagation, the resulting scores yield classification features for a random forest classifier that is used to compute association scores.

Model performance
We tested our classifier using 5-fold cross-validation.Figure 2 depicts ROC and precision-recall graphs showing the final result as the mean between all the five fold scores.The AUROC is 0.87 and the AUPRC is 0.89 indicating the high accuracy of our method.
To facilitate the model's application by potential users, we calculated an optimal classification cutoff of 0.86, which maximizes the product of specificity and sensitivity (Liu, 2012).The full model scores can be found in Supplementary Table S2.
To further support our results, we tested our classifier's predictions on SFARI genes of scores 2 and 3 (which were not used during training), compared to randomly chosen negative genes (Figure 3).The classifier's score distributions of the groups were compared using the Wilcoxon signed-rank test.Reassuringly, SFARI genes with scores 2 and 3 got significantly higher scores than the random negative genes (p-value < 3.62e-34).
After establishing the accuracy of our proposed method, we conducted a comparative analysis with the state-of-the-art forecASD classifier, which was shown to outperform earlier predictors.For this purpose, we employed the same random forest classifier on both our dataset and the features suggested by forecASD, namely BrainSpan Performance evaluation.Performance of our classifier in 5-fold cross validation.The result of each fold and its mean are shown.Left: ROC curves.Right: Precision-recall curves.
and STRING.We additionally assessed the propagation procedure on a random degree-preserving network to act as a negative control.
The results are provided in Figure 4 showing the superiority of our method compared to forecASD and the negative control (AUROC of 0.91 vs. 0.87 and 0.82).Notably, the relatively high AUROC of the negative control testifies to the quality of the gene sets that serve as seeds for the network propagation.

Functional annotation analysis
Next, we wished to analyze the functional roles of the top predicted genes.To this end, we set a prediction threshold of 0.947, which maximizes the sum of precision and recall, and focused on the 84 genes passing this threshold.We conducted functional enrichment analysis using g:Profiler (version e109_ eg56_p17_1d3191d) with Bonferroni corrected p-values and a significance threshold of 0.001 (Raudvere et al., 2019).The analysis is based on several data sources (GO:MF, GO:BP, Human Phenotype Ontology) (Figure 5).
From Human Phenotype Ontology-Autistic Behavior was the highest enriched phenotype.Using GO:BP (Biological Process) and GO:MF(Molecular Function) data sources, yield several highly enriched pathways known to play important roles in autism etiology including chromatin organization and binding (Haddad Derafshi et al., 2022), histone modification (Sun et al., 2016), neuron cell-cell adhesion (Eve et al., 2022) and zinc ion binding (Walsh et al., 2001;Walsh, 2002;Wang and Zhou, 2010;Yasuda et al., 2011).The full list of the functional annotation analysis results can be found in Supplementary Table S1.

Exploiting schizophrenia genes
Given the known phenotypic similarity between ASD and schizophrenia (SCZ) (Hommer and Swedo, 2015) we wished to test whether our classification model could be further improved by adding information on SCZ associated genes.Thus, we collected lists of genes that were associated with SCZ (Table 2) and reapplied our computational pipeline.
Next, we checked if our ASD classifier can also predict schizophrenia associated genes.We ran it on genes which are associated with SCZ from a database for Schizophrenia genetic research, SZDB (Wu et al., 2017).Specifically, we focused on 1622 SCZ-associated genes from SZDB with scores higher than 3, and compared their scores to those of the same number of random genes (Figure 6).Wilcoxon signed-rank test showed that the classifier gave the SCZ-associated genes significantly higher scores than the random genes (p-value of 2.275e-11).

Conclusion
We have presented a classification model for disease association.The classifier uses network propagation which enables us to  Performance comparison against forecASD and a negative control.Shown here are the mean ROCAUC results of each classifier.
combine and to amplify signals from individual genes, and uses machine learning to allow us to learn from known genes in order to classify new ones.In application to ASD, our classifier attained high accuracy, outperforming the state of the art.We have further shown its applicability to SCZ, which benefits from the similarity between these diseases but also shows the generality of our approach.
Functional enrichment analysis of our proposed candidate ASD genes has pointed to several pathways and processes that have been previously linked to ASD.For instance, neuron cell-cell adhesion, which may contribute to neuroinflammation in ASD (Eve et al., 2022) and participates in neurodevelopmental pathways associated with the disorder (Gandawijaya et al., 2020).Moreover (Bonsi et al., 2022),  Indeed, the model based on both ASD, and SCZ, genes obtained improved results with AUC, and AUPRC, of 0.88 and 0.89 respectively.
Frontiers in Bioinformatics frontiersin.orgdemonstrated that genes associated with ASD are frequently involved in the structural organization and functional activity of synapses, as evident in our results indicating enriched pathways such as synapse assembly and presynaptic and postsynaptic membrane assembly.This is interesting, as ASD is sometimes regarded as a disorder of connectivity (Mohammad-Rezazadeh et al., 2016), since its genes have both direct and indirect effect on a range of presynaptic and postsynaptic proteins (Bonsi et al., 2022;Yeo et al., 2022).The fact that our classifier succeeded in predicting Schizophrenia-associated genes may suggest that previously implied association between ASD and SCZ may evolve from connectivity-related issues (Hommer and Swedo, 2015).

FIGURE 5
FIGURE 5Functional annotation results for genes predicted by our classifier.(A) Manhattan plot created with g:Profiler illustrates the enrichment analysis results.The x-axis represents functional terms that are grouped and color-coded by data sources.The y-axis shows the adjusted enrichment p-values in negative log10 scale.Highlighted points in the plot are the terms which got the highest scores, and highlighted driver terms in GO created by g:Profiler algorithm.The algorithm is used for filtering GO enrichment results, providing a more efficient and reliable approach compared to traditional clustering methods.(B) Detailed information about highlighted circles from the Manhattan graph.Detailed information include data source, id and name of the term together with corresponding p-value.

TABLE 1
Lists of ASD associated genes according to different sources.
, (2019) Cross-cortex dup15q-associated genes that reported both significant differential DNA methylation and transcriptional changes 223 post-mortem tissues samples isolated from three brain regions [prefrontal cortex, temporal cortex and cerebellum (CB)] 74 10 Sanders et al., (2015) Analysis of de novo CNVs (dnCNVs) from the full Simons Simplex Collection 65 FIGURE 2

TABLE 2
Lists of SCZ associated genes.