Novel Gene-Based Analysis of ASD GWAS: Insight Into the Biological Role of Associated Genes

Background: Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by its significant social impact and high heritability. The latest meta-analysis of ASD GWAS (genome-wide association studies) has revealed the association of several SNPs that were replicated in additional sets of independent samples. However, summary statistics from GWAS can be used to perform a gene-based analysis (GBA). GBA allows to combine all genetic information across the gene to create a single statistic (p-value for each gene). Thus, PASCAL (Pathway scoring algorithm), a novel GBA tool, has been applied to the summary statistics from the latest meta-analysis of ASD. GBA approach (testing the gene as a unit) provides an advantage to perform an accurate insight into the biological ASD mechanisms. Therefore, a gene-network analysis and an enrichment analysis for KEGG and GO terms were carried out. GENE2FUNC was used to create gene expression heatmaps and to carry out differential expression analysis (DEA) across GTEx v7 tissues and Brainspan data. dbMDEGA was employed to perform a DEG analysis between ASD and brain control samples for the associated genes and interactors. Results: PASCAL has identified the following loci associated with ASD: XRN2, NKX2-4, PLK1S1, KCNN2, NKX2-2, CRHR1-IT1, C8orf74 and LOC644172. While some of these genes were previously reported by MAGMA (XRN2, PLK1S1, and KCNN2), PASCAL has been useful to highlight additional genes. The biological characterization of the ASD-associated genes and their interactors have demonstrated the association of several GO and KEGG terms. Moreover, DEA analysis has revealed several up- and down-regulated clusters. In addition, many of the ASD-associated genes and their interactors have shown association with ASD expression datasets. Conclusions: This study identifies several associations at a gene level in ASD. Most of them were previously reported by MAGMA. This fact proves that PASCAL is an efficient GBA tool to extract additional information from previous GWAS. In addition, this study has characterized for the first time the biological role of the ASD-associated genes across brain regions, neurodevelopmental stages, and ASD gene-expression datasets.


INTRODUCTION
Autism spectrum disorder (ASD) is a neurodevelopmental disorder (NDD) characterized by impaired social interaction and communication together with repetitive and restrictive behaviors (American Psychiatric Association, 2013). The estimated prevalence of ASD stands at approximately 1% in the general population, and this value is influenced by whether adult or child and adolescent populations are considered (The National Academies of Sciences, 2015).
ASD is a multifactorial disorder, meaning that several factors (both environmental and genetic) are involved in the development of this condition. Thus, rare genetic variation, including small deletions and duplications (indels), CNVs (copy number variation) and SNVs (single-nucleotide variation) contribute significantly to ASD liability. In addition, the pattern of inheritance (common or rare) and the origin of these mutations (inherited or de novo) will also differentially contribute to the genetic risk. Thus, rare inherited variation (MAF < 1%) alone only explains 3% of ASD genetic risk (Sanders et al., 2015). However, common variation accounts for a substantial fraction of ASD heritability (50%) (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013; Smoller et al., 2013;Glessner et al., 2014;Anttila et al., 2018). Early GWAS of ASD have failed to detect strong signals which could possibly be due to the presence of phenotypic heterogeneity across collections and the need of much larger sample sizes to achieve an adequate statistical power (Weiss et al., 2009;Anney et al., 2010;Ma et al., 2010). Different associated signals were considered as plausible risk variants for years even when subsequent studies did not replicate most of them. Thus, some examples of associated SNPs (singlenucleotide polymorphisms) are the following: rs10513025 (SEMA5A), rs4141463 (MACROD2), and rs4307059 (MSNP1), among others (Weiss et al., 2009;Kerin et al., 2012;Jones et al., 2013;Torrico et al., 2017). However, the genetic landscape of ASD has evolved over the past 10 years, and the latest ASD GWAS meta-analysis conducted by the Psychiatric Genomic Consortium (PGC) is a proof of that. PGC has made an incredible effort not only to increase sample size up to 10,000 cases and controls but also to develop well-defined quality control and imputation pipelines. The latest ASD GWAS metaanalysis has led to the identification of 93 genome-wide significant markers of which only 53 were replicated in independent cohorts. Top associated SNPs were rs910805 (chromosome 20; p-value: 2.04 × 10 −9 ) and rs10099100 (chromosome 8; p-value: 1.07 × 10 −8 ), among others (Grove et al., 2019). However, GWAS information can be also analyzed following a gene-based analysis (GBA) approach. Thus, p-values for each SNP within a gene are combined to obtain a single statistic for each single gene across the genome (Huang et al., 2011). GBA strategies employ summary statistics from GWAS without the need of individual genotypes. The study of genes as testing units with a biological entity provides the correct framework to carry out secondary approaches. These additional studies help to characterize the molecular and cellular mechanisms involved in the pathogenesis of the disease (Wang et al., 2007).
PASCAL (Pathway scoring algorithm) is one of the most recent GBA approaches. PASCAL generates gene scores by aggregating SNP p-values from GWAS meta-analysis while correcting for LD (linkage disequilibrium) using 1000 Genomes data. Thus, PASCAL creates a pairwise SNP by SNP correlation matrix. MAGMA also allows to correct by LD structure but in a different way. It creates a SNP matrix (principal components) pruning away those SNPs presenting small eigenvalues in the matrix (de Leeuw et al., 2015;Lamparter et al., 2016). Moreover, PASCAL is distinguishable from VEGAS and MAGMA by having more statistical power and by being computationally less demanding (Lamparter et al., 2016).
The main aim of this paper is to further mine the summary data from the meta-analysis of ASD by PASCAL. Although it is expected that the results do not vary much from those genes found by MAGMA, additional analysis will be performed to characterize the functional role of the ASD-associated genes. Therefore, gene-network analysis, enrichment analysis, and metaanalysis of DEG (different expressed genes) in ASD brain will be carried out to gain information about their biological function.

GWAS Meta-Analysis
Summary statistics from the latest ASD GWAS meta-analysis were obtained from the public repository available in the PGC website (http://www.med.unc.edu/pgc/results-and-downloads). The following data set was employed: iPSYCH_PGC_ASD_ Nov2017.gz (Grove et al., 2019) which includes the meta-analysis of ASD by the Lundbeck Foundation Initiative for Integrative Psychiatric Research (iPSYCH) and the Psychiatric Genomics Consortium (PGC) released in November 2017.
The data set comprises a total of 18,381 cases and 27,969 controls. Additional information including sample size and ancestry can be found in Table 1. Additional information about the genotyping and QC methods employed are available at the PGC website.

Gene-Based-Analysis
PASCAL scoring algorithm (https://www2.unil.ch/cbg/index. php?title=Pascal) was applied to the summary statistics from ASD meta-analysis. Individual SNPs from GWAS results were first mapped to genes employing a default ±50 kb window around 5′ and 3′UTRs. The default maximum number of SNPs per gene allowed by PASCAL was 3,000. LD information was retrieved from 1000 Genomes European panel. Bonferroni correction sets the significance cutoff at 2.26 × 10 −6 (0.05/22135 genes).

Regional Plots
LocusZoom tool (http://locuszoom.org/) was employed to construct regional plots for the regions containing PASCAL associated genes. To this aim, meta-analysis data including marker name, p-values, odds ratio (OR), chromosome position (start-end), and index SNP were specified for the analysis. The source of LD information used to construct the r 2 correlation matrix between SNPs in these regional plots was retrieved from hg19/1000 Genomes Nov 2014 EUR (European). The rest of the optional controls were used as default.

Gene-Network Analysis
FunCoup v.4.0 (http://funcoup.sbc.su.se/search/) was employed to expand the list of associated genes (p < 2.26 × 10 −6 ) including its interactors. FunCoup database integrates 10 different types of functional couplings among genes that allow to infer functional association networks: protein interaction (PIN), mRNA co-expression (MEX), protein co-expression (PEX), genetic interaction profile similarity (GIN), shared transcription factor binding (TFB), co-miRNA regulation by shared miRNA targeting (MIR), subcellular co-localization (SCL), domain interactions (DOM), phylogenetic profile similarity (PHP), and quantitative mass spectrometry (QMS). Gene networks for ASD were constructed considering as input six of the eight associated loci due to the lack of information for the remaining two ( Table 2). Gene networks were constructed considering three different parameters. Therefore, expansion parameters include confidence threshold (0.8), a maximum number of 30 nodes per expansion step, and a query depth of 1 (only genes directly linked to the query genes are shown). Network expansion algorithm was settled in order to obtain those strongest interactors for any query gene, without prioritizing common neighbor's links. Moreover, enriched term analyses (KEGG, GO biological function, and GO molecular function) were considered for each gene network constructed with the corresponding p-values. Gene-network representation displays the most significant KEGG pathways according to their q-values after considering a FDR approach. Node sizes scales to emphasize gene importance in the whole network while participating nodes for each KEGG pathway are marked in black.
Functional Annotation GENE2FUNC, a core process of FUMA (Functional Mapping and Annotation of Genome-Wide Association Studies) (http:// fuma.ctglab.nl/), was used to functionally annotate ASD genes and its interactors. Therefore, ASD input for FUMA comprises 36 genes (PASCAL associated genes plus strong interactors from FunCoup) ( Table 2).
Different analyses performed by GENE2FUNC were employed, including a gene expression heatmap and an enrichment analysis of differentially expressed genes (DEG). Gene expression heatmap was constructed employing GTEx v7 (53 tissue types) and BrainSpan RNA-seq data. The average of normalized expression per label (zero means across samples) was displayed on the corresponding heatmaps. Expression values are TPM (Transcripts Per Million) for GTEx v7 and RPKM (Read Per Kilobase per Million) in the case of BrainSpan data set. Heatmaps display normalized expression value (zero mean normalization of log2 transformed expression), and darker red means higher relative expression of that gene in each label, compared to a darker blue color in the same label. DEG analysis was carried out creating differentially expressed genes for each expression data set. In order to define DEG sets, two-sided Student's t-test were performed per gene per tissue against the remaining labels (tissue types or developmental stages). Those genes with a p-value < 0.05 after Bonferroni correction and a log fold change ≥ 0.58 are defined as DEG. The direction of expression was considered. The -log10 (p-value) refers to the probability of the hypergeometric test.

Meta-Analysis of Differentially Expressed
Genes in ASD Studies dbMDEGA (https://dbmdega.shinyapps.io/dbMDEGA/) allows to conduct a meta-analysis using different ASD and brain control expression datasets to test which genes are differentially expressed (DEGs) between both groups. Specifically, this tool uses expression profiles from three human ASD brain gene expression datasets: GSE28475 (Chow et al., 2012), GSE28521 (Voineagu et al., 2011), and GSE38322 (Ginsberg et al., 2012). Thus, dbMDEGA contains about 17,741 human genes as meta-results for querying DEGs. PASCAL genes and FunCoup interactors (N = 36) were submitted in the Meta_summary panel. Meta-analysis p-values (random effects model), FDR (false discovery rate) for each gene, and heterogeneity measures (I 2 ) were obtained.

Gene-Based-Analysis
PASCAL analysis has revealed association of eight loci (p-value < 2.26 × 10 −6 ) ( Table 3). NKX2-2 and NKX2-4 (chromosome 20) were highlighted as associated by PASCAL in comparison with the results obtained by MAGMA (Table 4). Regional association plot around rs910805 shows three different r 2 levels. PASCAL is able to detect NKX2-2 and NKX2-4 as associated, both genes located farther away from the lead SNP. However, SNPs within both loci are in moderate LD with the index SNP (Figure 1). In addition, PASCAL shows association of CRHR1-IT1, LOC644172 (chromosome 17), and C8orf74 (chromosome 8) ( Table 3 and Figure 1). It is worth to note that CRHR1-IT1 is intronic with CRHR1 (previously reported by MAGMA) ( Table 4). However, the vast majority of loci associated by PASCAL were previously reported by MAGMA (Table 4).

Gene Expression Heatmaps and Differential Expression Analysis
Gene expression heatmap based on GTEX v7 RNA-seq data for ASD-associated genes together with FunCoup interactors ( Table 2) has revealed interesting results. Specifically, three genes (NKX2-2, OLIG-2, and KCNN2) display increased expression levels in brain (different regions) in comparison with the remaining tissues included in the analysis. Moreover, another gene set, XRN2 plus some interactors, display the opposite trend showing lower relative expression levels on brain tissues (Figure 3). Interestingly, gene expression heatmap employing BrainSpan data has revealed that the vast majority of genes included on the downregulated geneset have shown a higher relative expression in prenatal stages (early, early-mid, and late-mid). Moreover, the same trend was also observed for NKX2-2 and OLIG2. Therefore, a lower relative expression level for these genes was shown across prenatal stages in comparison with those levels found in adult brain tissues (Figure 3). Thus, there is a clear division in terms of expression levels for those sets of genes expressed across different pcw (post-conception weeks) versus those expressed across the adult lifespan. DEG analysis with GTEx data (adult brain) has shown significantly enriched DEG sets across distinct brain tissues. Moreover, DEG study performed with BrainSpan data has confirmed a significant association of gene sets during early prenatal stages (up-regulation) versus young adulthood (down-regulation). Specifically, up-regulation occurs around 8, 9, and 13 pcw, and down-regulation takes place across different postnatal ages (Figure 4).

DISCUSSION
The latest ASD GWAS meta-analysis has reported the association of individual SNPs across chromosome 20 (rs910805), 5 (rs13188074), 17 (rs142920272), and 8 (rs10099100), among others. In the same study, different genes located near to the topranked SNPs as XRN2, KCNN2, and KIZ (or PLK1S1) have been associated by MAGMA (Grove et al., 2019). PASCAL analysis has shown additional associated genes: NKX2-4, NKX2-2, CRHR1-IT1, C8orf74, and LOC644172. However, these findings should not be considered as completely new. Thus, it is worth noting that lead SNPs located on chromosome 20 and 8 were  associated with ASD when MTAG (multi-trait analysis of GWAS) was done with genetically correlated phenotypes (schizophrenia and educational attainment) but not by MAGMA itself (Turley et al., 2018;Grove et al., 2019). In fact, this is why NKX2-4 and NKX2-2 were reported as "possibly associated" genes when MTAG was carried out (Grove et al., 2019). However, we found that PASCAL is able to rescue these associations when its own statistical approach is applied. It is also worth to note that CRHR1-IT1 partially overlap with CRHR1 even if it encodes a different transcript (previously not reported by MAGMA). Thus, CHR1-IT1 is reported as associated by PASCAL due to the inclusion of the "gene-name" in the list used as input (present in PASCAL and not in MAGMA). c8orf74 could also be a controversial "new" finding because the SNPs located within the gene harbor more significant p-values but are located near to XKR6 (identified in Grove et al, 2019). Even so, it should be considered that PASCAL has helped to unveil the association of additional genes near to those genes reported by MAGMA. Both GBA tools, MAGMA and PASCAL, work in a very similar way: i) they are able to employ summary statistics as input instead of genotypes, ii) gene scores are calculated combining the results for all SNPs located across the gene, and iii) LD correction is made by external information from the 1000 Genomes European panel (de Leeuw et al., 2015;Lamparter et al., 2016). However, the construction of the correlation matrix is slightly different between both methods. This could explain those small differences found between the associated genes reported by both methods. p-values associated to each gene in PASCAL were in general less significant than those reported by MAGMA, even when identical genes were reported by both algorithms. Moreover, the number of genes that remain significant after Bonferroni correction was lower in PASCAL. All these results taken together could indicate that PASCAL is more conservative than MAGMA. In addition, we propose that PASCAL could be used as a complementary GBA approach because it has helped to highlight additional genes located in the same LD region than those reported by MAGMA.
Regarding the insight into the biological role of ASD-associated genes, we would like to focus on the following loci: NKX2-2, NKX2-4, CRHR1-IT1, LOC644172, and c8orf74. NKX2-2 and NKX2-4 are members of the homeobox-transcription factor family. NKX2-2 encodes a transcription factor involved in the morphogenesis of the central nervous system, and it is essential during the differentiation of neural populations located on the hindbrain and spinal cord during early development (Briscoe and Ericson, 1999). ) has a key role in brain development, and its downregulation in the developing forebrain promotes proliferation and inhibits differentiation of neural progenitors which results in impaired neurogenesis (Shen et al., 2018). Gene expression heatmaps (GTEX and BrainSpan data) have revealed two expression clusters in which NKX2-2 is involved. Genes within the first cluster have shown downregulation across early prenatal stages and upregulation during postnatal stages, and genes within the second cluster have shown the opposite trend. NKX2-2 and its interactor OLIG-2 (oligodendrocyte transcription factor 2) are included in the first cluster that displays a lower normalized expression during prenatal stages in comparison with postnatal stages. This seems to point to the existence of a different genetic regulation for NKX2-2 during brain development in comparison with the adult brain. Thus, it was reported that NKX2-2 is initially expressed in differentiating oligodendrocyte precursor cells (OPCs) and then is downregulated. However, it was demonstrated that its expression could also be upregulated again during later neurodevelopmental stages in order to maintain myelin structures (Cai et al., 2010). This is particularly interesting due to previous reports that have linked aberrant expression of genes across the oligodendrocyte lineage with ASD pathology (Li et al., 2014;Zeidán-Chuliá et al., 2016). In addition, enriched terms for NKX2-2 and OLIG-2 were protein, chromatin, and TF binding, all of them biological processes involved in ASD pathogenesis. However, it should be noted that both genes did not show differences in expression between ASD brain tissue and controls when DEG was carried out. These results indicate the need of subsequent functional studies to further characterize their biological role. In addition, a second ASD expression cluster (BrainSpan heatmap) includes XRN2, CRHR1-IT1, and their multiple interactors. First of all, it should be remarked that FunCoup calculates many more interactors for XRN2 than for the remaining genes. This could entail a bias for subsequent analysis but also underlines the relevance of this gene due to its wide PPI network. Thus, it is worth to note that many XRN2 and some of its interactors (CDKN2AIP, ILF3, GNL3 ADAR, LYAR, SNW1, RBM39, SYNCRIP, and PTBP1) have shown significant p-values when dbMDEGA meta-analysis was performed. This fact demonstrates that these genes are differentially expressed between ASD brains and controls and therefore their potential relevance in ASD pathogenesis (Voineagu et al., 2011;Chow et al., 2012;Ginsberg et al., 2012). In addition, some of the enriched terms for XNR2 and their interactors were spliceosome, RNA transport, and nucleic acid binding, all of them essential biological functions with a role during early development. Thus, it was demonstrated that XRN2 might serve a general role in triggering the termination of transcription through the degradation of the 3′ UTR (Eaton et al., 2018). CRHR1-IT1 codifies for a long intergenic non-protein coding RNA which was recently associated with susceptibility to antisocial behavior (Day et al., 2018). CRHR1-IT1 shares part of its sequence with CRHR1 which encodes the corticotropin-releasing hormone receptor 1 (Grove et al., 2019). CRHR1 is a main component of the hypothalamic-pituitary-adrenal pathway, and it has been repeatedly associated with response to stress-related psychopathology (Binder and Nemeroff, 2010). The function of CRHR1-IT1 is not well characterized, but it might play a role in the regulation of CRHR1 expression. Thus, CRHR1 and CRHR1-IT1 could be involved in the modulation of behavior and cognition, both core features of ASD (Smith et al., 2016;Grimm et al., 2017;Day et al., 2018;Zhong et al., 2018). However, the meta-analysis of DEG in ASD brain vs controls has not revealed association for CRHR1 nor CRHR1-IT1.
Finally, C8orf74 and LOC644172 were also associated after PASCAL analysis. LOC644172 is located upstream of CRHR1 and CRHR1-IT1, but it was not recognized by FunCoup nor FUMA. C8orf74 is part of a reading frame that has the ability to be translated, but it is not yet characterized. Thus, it seems that both loci are not included in functional and genetic databases due to their poor biological characterization. However, the meta-analysis of expression profiles revealed that C8orf74 was differentially expressed between ASD brain tissues and controls, which indicates that C8orf74 needs further functional characterization.

CONCLUSIONS
PASCAL algorithm was used to carry out a novel GBA of ASD, employing summary statistics from the latest GWAS metaanalysis. This study has identified several gene associations even though most of them were previously reported by MAGMA. However, PASCAL has been useful to define the association of other genes located in the same LD region than those found by MAGMA. These results indicate that PASCAL should be considered as a complementary GBA approach when it is necessary to further mine GWAS results.
The second part of this study has been focused on the biological characterization of the ASD-associated genes by PASCAL. Thus, gene-network and functional annotation approaches including gene expression heatmaps and DEG were carried out. This has helped to understand the genetic findings into a biological context and to select the most suitable ASD susceptibility genes as candidates for functional biological studies.

DATA AVAILABILITY
All data generated during this study are included in this published article and its supplementary information files. Summary statistics for ASD GWAS are publicly available at https://www. med.unc.edu/pgc/results-and-downloads.

ETHICAL STATEMENTS
The GWAS data employed in this study are publicly available at https://www.med.unc.edu/pgc. These genetic data have been