Gene4PD: A Comprehensive Genetic Database of Parkinson’s Disease

Parkinson’s disease (PD) is a complex neurodegenerative disorder with a strong genetic component. A growing number of variants and genes have been reported to be associated with PD; however, no database integrates the different types of genetic data and supports the analysis of PD-associated genes (PAGs). Through systematic review and curation of multiple lines of public studies, we integrated multiple layers of genetic data in PD (rare variants and copy-number variants identified in patients with PD, associated variants identified in genome-wide association studies, differentially expressed genes, and differentially methylated genes) together with age at onset (AAO). In total, we integrated five layers of genetic data (8302 terms) with different levels of evidence from more than 3,000 studies and prioritized 124 PAGs with strong or suggestive evidence. These PAGs interacted significantly with each other and formed an interconnected functional network enriched in several functional pathways involved in PD, suggesting that these genes may contribute to the pathogenesis of PD. Furthermore, we identified 10 genes associated with juvenile-onset PD (age ≤ 30 years), 11 genes associated with early-onset PD (age of 30–50 years), and another 10 genes associated with late-onset PD (age > 50 years). Notably, the AAO of patients with loss-of-function variants in five genes was significantly lower than that of patients with deleterious missense variants, whereas the opposite was observed for VPS13C (P = 0.01). Finally, we developed an online database named Gene4PD (http://genemed.tech/gene4pd), which integrates published genetic data in PD, the PAGs, and 63 popular genomic data sources, as well as an online pipeline for prioritizing risk variants in PD. In conclusion, Gene4PD provides researchers and clinicians with comprehensive genetic knowledge and an analytic platform for PD, and should also improve the understanding of the pathogenesis of PD.

In addition, genome-wide association studies (GWAS) have provided insight into the genetic basis of PD by identifying and replicating risk loci (Blauwendraat et al., 2019b; Chang et al., 2017; Chung et al., 2012; Foo et al., 2017; Jansen et al., 2017; Nalls et al., 2014). Some differentially methylated genes (Wullner et al., 2016) and differentially expressed genes (Infante et al., 2015) have also been reported to be associated with PD pathogenesis. Most currently known PD-causing genes are involved in several pathogenic pathways, including mitochondrial dysfunction, neuron apoptosis, and the autophagy lysosome ubiquitin-proteasome system (Blauwendraat et al., 2019b). Despite these great advances in the genetics and genomics of PD, the related results are scattered among thousands of studies, making it difficult and laborious for researchers to use the available data.
Parkinson's disease is an age-dependent neurodegenerative condition, and the age at onset (AAO) is a robust phenotypic measure compared to motor and non-motor phenotypes (Kilarski et al., 2012; Nalls et al., 2015; Wickremaratchi et al., 2011). Many previous studies have reported the frequency of gene mutations in PD patients with different AAOs, and most of them focused on the association between AAO and several common PD-causing genes (Searles Nielsen et al., 2013; Tan et al., 2019; Trinh et al., 2018). To date, several genes have been reported to be strongly associated with early-onset PD based on cohorts from specialist clinics or the general population (Lin et al., 2019; Searles Nielsen et al., 2013; Tan et al., 2019; Trinh et al., 2018; Trinh et al., 2019). However, a comprehensive analysis of the association between AAO and all PD-associated genes (PAGs) is still lacking. Furthermore, whether AAO is associated with other genetic components, such as expression patterns and the functional effects of genetic variants, needs to be further investigated.
To overcome this limitation and facilitate studies of the genetic basis of PD, we integrated multiple layers of genetic and genomic data on PD and prioritized PAGs. We then developed an integrative genomic database named Gene4PD, which supports queries and analysis of genetic and genomic data through an intuitive graphical user interface intended to benefit physicians and researchers.

Data Collection
We collected five types of genomic data related to PD from the PubMed database with the following search items: (1) " We then screened articles related to PD according to the following exclusion criteria: (i) studies that focused on molecular mechanisms rather than genetics; and (ii) studies without original genomic data, such as reviews or meta-analyses, which cite data from published studies. The inclusion criteria were: (i) studies reporting genomic information (such as rare variants, SNPs, and CNVs) in PD; (ii) studies reporting differentially expressed genes (DEGs) in PD; and (iii) studies reporting differentially methylated genes (DMGs) in PD. We then manually extracted the available genomic, AAO, and other clinical information.

Data Annotation and PAG Prioritization
To investigate the functional genetic variants identified in patients with PD, ANNOVAR (Wang et al., 2010) was used for comprehensive annotation based on transcript definitions from the RefSeq database. Based on their functional effects and minor allele frequency (MAF) in the gnomAD database, the genetic variants were classified as follows: (1) rare loss-of-function (LoF, including frameshift indels, splicing, stop-gain, and stop-loss) variants with MAF less than 0.01; (2) rare deleterious missense (Dmis) variants with MAF less than 0.01; (3) rare tolerated missense (Tmis) variants with MAF less than 0.01; and (4) all remaining genetic variants. As in previous studies, we used our previously developed program ReVe (Ioannidis et al., 2016) to predict deleterious missense variants, defined as those with scores higher than 0.7.
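The four-way classification above can be sketched in a few lines of code. This is an illustration only, not the authors' pipeline: the `Variant` record, the effect labels, and the function name `classify_variant` are hypothetical, and treating a missense variant without a ReVe score as Tmis is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the text; the effect labels are illustrative placeholders.
MAF_CUTOFF = 0.01
REVE_CUTOFF = 0.7
LOF_EFFECTS = {"frameshift", "splicing", "stopgain", "stoploss"}


@dataclass
class Variant:
    effect: str                   # functional effect label (hypothetical vocabulary)
    maf: float                    # minor allele frequency in gnomAD
    reve: Optional[float] = None  # ReVe score, only meaningful for missense variants


def classify_variant(v: Variant) -> str:
    """Return 'LoF', 'Dmis', 'Tmis', or 'other' following the rules in the text."""
    if v.maf >= MAF_CUTOFF:       # only variants with MAF < 0.01 count as rare
        return "other"
    if v.effect in LOF_EFFECTS:
        return "LoF"
    if v.effect == "missense":
        # Assumption: a missing ReVe score is treated as tolerated.
        if v.reve is not None and v.reve > REVE_CUTOFF:
            return "Dmis"
        return "Tmis"
    return "other"
```

For example, `classify_variant(Variant("missense", 0.001, 0.9))` falls into the Dmis class, while the same variant with a MAF of 0.05 falls into the remaining class.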
To prioritize PAGs, we developed a scoring system (Supplementary Table 2) that combines the different types of genetic evidence as follows. Rare variants were assigned an evidence score of 1-5 according to their functional effects and MAF. CNVs were directly assigned an evidence score of 5 because they disrupt gene function. PD-associated SNPs, DEGs, and DMGs were assigned an evidence score of 1-3 based on their p-values. A combined evidence score for each gene was calculated by summing the evidence scores across all integrated studies. All genes integrated in this study were classified into five grades: high confidence (score ≥ 20), strong associated (score of 10-20), suggestive associated (score of 5-10), minimal evidence (score of 3-5), and uncertain evidence (score < 3) (Figure 1).
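The grading step can be illustrated with a minimal sketch. The per-item evidence scores are defined in Supplementary Table 2 and are not reproduced here, so the sketch takes them as given inputs. The function names are hypothetical, and the handling of boundary values (a score of exactly 10, 5, or 3 falls into the higher grade) is an assumption, chosen to agree with the "score ≥ 5" and "score ≥ 20" cutoffs used in the text.

```python
def combined_score(evidence_scores):
    """Sum the per-study evidence scores collected for one gene."""
    return sum(evidence_scores)


def grade(score):
    """Map a combined evidence score to one of the five grades.

    Assumption: the ranges are half-open, i.e. [10, 20), [5, 10), and [3, 5).
    """
    if score >= 20:
        return "high confidence"
    if score >= 10:
        return "strong associated"
    if score >= 5:
        return "suggestive associated"
    if score >= 3:
        return "minimal evidence"
    return "uncertain evidence"
```

Under these assumptions, a gene supported by one CNV (score 5) and two rare variants scored 4 and 3 has a combined score of 12 and is graded as strong associated.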

The Verification of Functional Correlation and PAG Network
We performed a permutation test to evaluate the interconnectivity and functional correlation among the 124 PAGs (score ≥ 5), as described in our previous study. To ensure a reasonable level of confidence, we constructed a protein-protein interaction (PPI) network using the STRING v11.0 database (Szklarczyk et al., 2019), retaining interactions with a confidence score greater than 0.4. Specifically, we ran 1,000,000 random permutations to evaluate the interconnectivity among high-confidence/strong genes and among suggestive genes, and to determine the connectivity between these two classes of PAGs. The 124 PAGs were then used to construct an interconnected PPI network on the STRING online analysis platform. Additionally, the functional networks were clustered by multiple Gene Ontology (GO) biological processes.
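The idea behind such a permutation test can be sketched as follows. The exact procedure is described in the authors' previous study and is not given here, so this is only one plausible reading: random gene sets of the same size are drawn uniformly from a background gene list, and the number of PPI edges inside each random set is compared with the observed number. The function names, the uniform sampling scheme, and the add-one correction of the empirical p-value are all assumptions.

```python
import random


def count_internal_edges(genes, edges):
    """Count PPI edges whose two endpoints both lie in the gene set."""
    gene_set = set(genes)
    return sum(1 for a, b in edges if a in gene_set and b in gene_set)


def permutation_p_value(genes, background, edges, n_perm=10_000, seed=0):
    """Empirical p-value for the interconnectivity of `genes`.

    `background` is a list of all candidate genes and `edges` is a list of
    (gene_a, gene_b) pairs that passed the confidence cutoff.
    """
    rng = random.Random(seed)
    observed = count_internal_edges(genes, edges)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(background, len(genes))  # random set of equal size
        if count_internal_edges(sample, edges) >= observed:
            hits += 1
    # Add-one correction avoids reporting a p-value of exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

With 1,000,000 permutations and no random set reaching the observed count, this estimate is 1/1,000,001, which is consistent with a reported P < 1.0 × 10⁻⁶.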

Association Between Age at Onset and PAGs
The AAO data and genetic data of each gene were downloaded from Gene4PD. We then analyzed the relationship between the AAO and PAGs from three perspectives. First, we aggregated and compared genes with more than five AAO terms to assess the association between AAO and PAGs. Each gene was assigned to one of three categories according to its median AAO: juvenile-onset (≤30 years), early-onset (30-50 years), or late-onset (>50 years). Second, we compared the AAO between different functional types of variants [loss-of-function (LoF) and deleterious missense (Dmis) variants] in the same gene to investigate the contribution of each variant type to the AAO of PD.
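The median-based grouping described above can be sketched as follows. The function name `onset_category` is hypothetical, and a median of exactly 30 is assigned to the juvenile-onset group, in line with the "≤30 years" definition.

```python
from statistics import median


def onset_category(aao_values, min_terms=6):
    """Classify a gene by the median age at onset of its carriers.

    Genes need more than five AAO terms (i.e. at least six) to be analyzed;
    otherwise None is returned.
    """
    if len(aao_values) < min_terms:
        return None
    m = median(aao_values)
    if m <= 30:
        return "juvenile-onset"
    if m <= 50:
        return "early-onset"
    return "late-onset"
```

For instance, a gene with AAO terms of 35, 40, 42, 45, 48, and 55 years has a median of 43.5 years and is placed in the early-onset group.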

Database Construction and Interface
By integrating all the collected genomic information and the annotations for each variant and gene from 63 data sources (Supplementary Table 3), we developed the Gene4PD database with a user-friendly web interface. The website follows a front-end/back-end separation model. The front end is built with Vue and the Element UI toolkit, and the back end is built with Laravel, a PHP web framework. Gene4PD is compatible with all modern browsers, including Microsoft Edge, Safari, Firefox, and Google Chrome, and with different operating systems such as Windows, Linux, and Mac. The data are stored in a MySQL database.

Genomic Data Integration and PAG Prioritization
Through systematic review and curation of multiple lines of public studies, we reviewed more than 3,000 publications, 487 of which met the quality control and collection criteria. Genetic information such as gene symbols, chromosomes, locations, reference bases, altered bases, and modes of inheritance was collected. Other basic information and clinical data were also integrated, including sample ID, PubMed ID, variant detection method, country, race, gender, AAO, functional study, disease subtype, detailed description of clinical phenotypes, and sporadic/familial type. We cataloged five types of genetic data (8302 terms) related to PD: (1) 2,252 rare variants, including 954 non-redundant rare variants, in 226 genes from 327 available publications; (2) 139 CNVs in 34 genes from 94 publications; (3) 1,237 associated SNPs in 640 genes from 42 studies; (4) 2,926 DEGs from 8 publications; and (5) 657 DMGs from 7 publications (Supplementary Table 1 and Figure 1). Specifically, among the collected genetic variants, we identified 334 rare LoF variants, 1,328 rare Dmis variants, 485 rare Tmis variants, and 105 other variants based on standard annotations.
We then developed a weighted scoring system to prioritize PAGs by combining all of the above genetic evidence. As a result, we prioritized 25 high confidence genes (score ≥ 20), 38 strong associated genes (score of 10-20), 61 suggestive associated genes (score of 5-10), 88 minimal evidence genes (score of 3-5), and 3,791 uncertain evidence genes (score < 3) (Table 1). We found that 19 of 21 known PD-causing genes (Supplementary Table 4) were prioritized as high confidence genes. Six other high confidence genes, including GCH1, RAB39B, CHCHD2, MAPT, TH, and ASNA1, were also regarded as potential PD-causing genes. The remaining two known PD-causing genes were classified as strong associated (HTRA2) or suggestive associated (UCHL1), as they showed lower replication.

PAGs Were Functionally Correlated
We performed a permutation test to assess the functional correlations of the 124 associated genes (score ≥ 5). Most of them were disease-causing genes or risk genes from GWAS. We observed that 48 of the 63 high-confidence or strong PAGs (P < 1.0 × 10⁻⁶, Supplementary Figure 1A) interacted with each other through 203 interconnections (P < 1.0 × 10⁻⁶, Supplementary Figure 1B), which was significantly higher than the random expectation. Similarly, the suggestive associated genes also interacted significantly with each other (Supplementary Figures 1C,D), suggesting that they were functionally correlated. Furthermore, we observed 21 suggestive associated genes (P = 0.098, Supplementary Figure 1E) that interacted with high-confidence or strong PAGs through 74 connections (P = 2.5 × 10⁻⁵, Supplementary Figure 1F), suggesting that the two classes of associated genes were also functionally correlated. These results indicate that the 124 genes were functionally associated with PD, although they require further experimental validation. We then constructed an interconnected functional network containing 88 of the 124 PAGs, which interacted with each other at the protein level through 336 connections (Figure 2A). The functional network contained 25 high-confidence genes, 27 strong associated genes, and 36 suggestive associated genes. All 21 known PD-causing genes were included in this functional network. Additionally, other genes in this PPI network may be associated with PD. GO enrichment analysis of the 88 genes revealed several GO terms associated with PD, which were regarded as critical functional signaling pathways associated with PD (Supplementary Table 5 and Figure 2B). The functional network suggested that the prioritized PAGs shared a common signaling mechanism and were functionally correlated.
FIGURE 1 | The overall roadmap of this study. The upper green dashed box shows the process of data collection and analysis, and the lower purple dashed box shows the overall framework of Gene4PD. Gene4PD supports "Quick search," "Advanced search," "Browse," and "Analysis" services, as shown in the blue section. The annotation information at the variant level and gene level is shown in the lilac section.
[Blauwendraat et al. (2019b). The genetic architecture of Parkinson's disease. Lancet Neurol]. 124 PD-associated genes (score ≥ 5) were classified into three grades, including high confidence (score ≥ 20), strong associated (score of 10-20), and suggestive associated (score of 5-10).

AAO Is Associated With Multiple Genetic Components
The integrated genetic data and AAO data from the Gene4PD database provide an opportunity to comprehensively examine the association between the AAO and PAGs on a large scale. Therefore, we analyzed 31 PAGs with more than five AAO items in each gene (Figure 3A).
To investigate how different types of functional variants (LoF and Dmis) contribute to the AAO of PD, we performed a pairwise comparison of the AAO between patients with LoF variants and patients with Dmis variants for each gene. Ten genes with more than three AAO items for both LoF and Dmis variants were selected for this comparison. We found that the AAO of patients with LoF variants in five genes, including GCH1 (P = 2.63 × 10⁻⁵), PINK1 (P = 2.31 × 10⁻³), PRKN (P = 4.08 × 10⁻³), FBXO7 (P = 0.02), and ATP13A2 (P = 0.03), was significantly lower than that of patients with Dmis variants, whereas the AAO of patients with LoF variants in VPS13C (P = 0.01) was higher than that of patients with Dmis variants. Four other genes, including PARK7 (P = 0.06), SNCA (P = 0.12), RAB39B (P = 0.96), and TMEM230 (P = 0.27), did not show significant differences (Figure 3B). More PD samples with detailed genotypes and phenotypes may be identified in further studies investigating the relationship between AAO and PAGs.
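The per-gene comparison can be illustrated with a small sketch. The text does not state which statistical test was used, so the exact two-sided permutation test on the difference in mean AAO shown here is a stand-in chosen for illustration, not the authors' method. The function name and the toy AAO values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    Every way of splitting the pooled values into groups of the original
    sizes is enumerated, so this is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    return extreme / n_splits


# Toy AAO values (years) for one gene; not taken from Gene4PD.
lof_aao = [20, 22, 25, 28]
dmis_aao = [50, 55, 60, 62]
p_value = exact_permutation_p(lof_aao, dmis_aao)
```

In this toy case only 2 of the 70 possible splits are as extreme as the observed one, giving a p-value of 2/70 (about 0.029). For larger groups, a rank-based test or a sampled permutation test would be more practical than full enumeration.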

Gene4PD: An Integrative Genetic Database and Analytic Platform for Parkinson's Disease
To help researchers query and analyze genetic and genomic data related to PD, we constructed a comprehensive online database and analysis platform named Gene4PD, which integrates rare variants, CNVs, associated SNPs, DEGs, and DMGs related to PD and more than 63 popular genomic and genetic data sources (Figure 1 and Supplementary Table 3). Users can query detailed genetic and genomic information on variants and genes via the quick or advanced search interfaces in Gene4PD (Figure 4A). The quick search supports several common search terms (such as gene symbol, genomic region, and cytoband) and returns a page containing six tables. The first table summarizes the evidence scores across the variant types of the collected genetic data, and the other five tables show more detailed information (Figure 4B). The advanced search supports additional user demands by allowing the user to upload a file or paste a list containing search terms. In addition, to comprehensively evaluate the pathogenicity of genetic variants, three panels were integrated in Gene4PD: (1) predicted damaging scores and functional consequences of missense variants from 24 in silico algorithms (Supplementary Table 3); (2) allele frequencies in different populations based on seven databases, including gnomAD (Lek et al., 2016), ExAC (Karczewski et al., 2017; Lek et al., 2016), 1000 Genomes Project (Genomes Project et al., 2015), ESP6500 (Fu et al., 2013), Kaviar, and Haplotype Reference Consortium; and (3) disease-related information from 11 related databases, including InterVar, COSMIC70 (Forbes et al., 2017), ICGC (International Cancer Genome Consortium, Hudson et al., 2010), dbSNP (Sherry et al., 2001), ClinVar (Landrum et al., 2020), Gene4Denovo (Guihu et al., 2019), InterPro (Finn et al., 2017), OMIM (Amberger et al., 2019), MGI (Eppig et al., 2017), and HPO (Kohler et al., 2020).
Notably, all query results can be copied to the clipboard or exported as Excel spreadsheets.
Gene4PD includes six sections that provide gene-level knowledge of a given gene in a one-stop interface (Supplementary Figure 2). (1) The Basic information section provides primary gene information sourced from NCBI Gene, Entrez Gene, Ensembl, OMIM, GeneCards, and HGNC, together with intolerance scores including the residual variation intolerance score (RVIS), LoFtool, heptanucleotide context intolerance score, GDI, Episcore, and pLI score; (2) The Gene function section includes five subsections: molecular function extracted from UniProtKB, GO terms, domains from InterPro, PPI from InBio Map, and biological pathways from BioSystems; (3) The Phenotype and disease section reports phenotype- and disease-related variants or genes from OMIM (Amberger et al., 2019), ClinVar (Landrum et al., 2020), MGI (Eppig et al., 2017), HPO (Kohler et al., 2020), and Gene4Denovo (Guihu et al., 2019); (4) The Gene expression section shows spatiotemporal expression levels of genes in the brain retrieved from BrainSpan, expression profiles in 31 primary tissues and 54 secondary tissues extracted from GTEx, cell diversity and expression in the human cortex based on single-nucleus RNA-seq data from the Allen Brain Atlas, and the subcellular location retrieved from The Human Protein Atlas; (5) The Variants in different populations section provides the number of variants with functional effects in different populations from gnomAD; (6) The Drug-gene interaction section provides drug-gene interactions and gene druggability data sourced from DGIdb v3.0.2.
One of the main advantages of Gene4PD is that it provides an interface for analyzing genetic data according to the specific needs of the user (Supplementary Figure 3). The analysis process consists of four simple steps: entering an email address, choosing the Trio or Non-trio option for the genetic data, uploading genetic data files (VCF4 format), and entering basic sample information. Gene4PD then analyzes the rare damaging variants using default parameters. Importantly, Gene4PD also provides a flexible control panel so that users can conveniently adjust the quality control values, the annotation data sources, and the parameters for identifying rare damaging variants. The annotation section contains four sub-panels: Basic information annotation, Pathogenicity prediction of missense variants, Allele frequency in variant population, and Clinical-related database. When the analysis is complete, Gene4PD sends the user an email with a link for downloading the results.

DISCUSSION
With the development of next-generation sequencing (NGS), the number of multi-omics studies (spanning genomics, genetics, epigenetics, and transcriptomics) aimed at revealing the pathogenesis of PD has greatly increased, resulting in the identification of numerous associated genes (Jansen et al., 2017; Sandor et al., 2017; Siitonen et al., 2017). The ability to comprehensively investigate these associated genes from different genetic perspectives, such as expression patterns, functional interconnections, and relationships with AAO, would be helpful for studying the pathogenesis of PD and for improving clinical diagnosis, treatment, and drug development efforts (Blauwendraat et al., 2019b). However, most of these results are widely scattered among published articles, making it challenging to integrate these omics data. To facilitate this task, we manually collected rare variants, CNVs, associated SNPs, DEGs, and DMGs related to PD from the literature and prioritized 124 PAGs, including 25 high-confidence, 38 strong associated, and 61 suggestive associated genes. Interestingly, 19 of the 21 known PD-causing genes were among the high-confidence genes, suggesting that the other high-confidence PAGs may also be associated with PD. Therefore, more genetic and experimental studies are needed to validate the genetic mechanisms of PAGs incorporating all of these associations.
We further performed several bioinformatics analyses to validate the functional association of the 124 PAGs based on PPI data. First, we observed significant associations among the high-confidence/strong PAGs, among the suggestive PAGs, and between the two classes of PAGs. Second, we constructed an interconnected functional network that contained 88 PAGs and was enriched in several known PD-associated functional pathways. In the functional network, we highlighted several PAGs that interacted with known PD-causing genes, such as CHCHD2, RAB39B, and RIC3. CHCHD2 (Funayama et al., 2015), which has been widely reported to be associated with PD, showed functional interactions with many PD-causing genes in the network, such as SNCA, PINK1, LRRK2, PARK7, and VPS35. The interaction of RAB39B with the known PD-causing genes SNCA and LRRK2 is consistent with its α-synuclein pathology (Wilson et al., 2014). These results suggest that these three genes are involved in biological functions closely related to those of the disease-causing genes and may increase the risk of PD.
To enable PD researchers to make full use of the reported genetic data, more than six online PD-related databases have been developed, including the variation database MDSGene (Lill et al., 2016), the PDGene database (Nalls et al., 2014), ParkDB (Taccioli et al., 2011), the Parkinson Disease Mutation Database (PDmutDB) (Horaitis et al., 2007), the Mutation Database for PD (MDPD) (Tang et al., 2009), the Parkinson's disease map (PDMap) (Fujita et al., 2014), and PDbase (Yang et al., 2009). However, we found that MDSGene collected mutations of only 12 common PD-associated genes, and that three other databases (PDmutDB, MDPD, and PDbase) were no longer functional. In addition, the PDGene database focuses on GWAS data collection and analysis and does not provide summary and statistical reports, whereas ParkDB is specifically dedicated to gene expression data and PDMap integrates pathways implicated in PD pathogenesis. Furthermore, other human disease-related genetic databases, including DisGeNet (Pinero et al., 2020) and Open Targets Platform (Ochoa et al., 2020), do not specifically integrate PD-related genetic information. Therefore, to establish an integrative genetic database and analytic platform for facilitating PD research, we integrated all collected data to construct Gene4PD as a comprehensive data analysis platform. Compared to existing databases, Gene4PD not only integrates more recorded variants, CNVs, SNPs from GWAS, DMGs, and DEGs, but also supports the systematic analysis of genetic variants. To address the multiple challenges in evaluating the pathogenicity of a variant or determining whether a gene is pathogenic, and based on our experience with the construction of our previous online databases VarCards and Gene4Denovo (Guihu et al., 2019), we integrated more than 63 genomic data resources in Gene4PD, covering comprehensive variant-level and gene-level annotation data. In addition, we will update Gene4PD by collecting published gene-related data every 6 months.
PD has been widely recognized as an age-dependent neurodegenerative condition. The variation in phenotypes and genotypes can be related to the AAO of PD (Kilarski et al., 2012; Nalls et al., 2015; Wickremaratchi et al., 2011). Previous studies suggested that some genes were strongly associated with the AAO, such as ATP13A2 (Ramirez et al., 2006), PLA2G6 (Djarmati et al., 2004; Lucking et al., 2000), PINK1 (Bonifati et al., 2005; Kumazawa et al., 2008), and PARK7 (Djarmati et al., 2004). Based on our large-scale integrated dataset, we not only confirmed the association between these six commonly reported genes and AAO, but also identified another 20 genes associated with AAO. Furthermore, our results suggest that different types of functional variants (LoF and Dmis) contribute differently to the AAO of PD.
There were some limitations to this study. First, we collected as much related literature as possible, subject to quality control. Although all data collectors were researchers or PhD students with a strong background in clinical genetics and were rigorously trained to ensure the consistency of the collected data, there may have been some omissions. We encourage researchers to contact us to fill in missing PD data. We will continue to update the database annually. Second, the scoring system used in this study may not be powerful enough, and thus we encourage users to analyze genetic data with the re-score function in Gene4PD or to download the data for re-analysis. Third, the integrated approach in this study may be biased, and the prioritized genes need to be validated in populations and their pathogenic mechanisms verified in cell or animal experiments. Fourth, we obtained only limited AAO data for PD. We hope to collect more AAO data in the future to make our analysis results more reliable. Last, clinical information besides the AAO must be considered to determine the factors contributing to the genotype-phenotype correlations of PD. However, most articles do not provide detailed phenotypic information, which poses a challenge for a meta-analysis of the relationship between clinical phenotypes and genes. In this regard, we encourage researchers to provide details on clinical phenotypes in the context of genetic studies.

CONCLUSION
In conclusion, we cataloged different types of genetic data from many publications related to PD and prioritized 124 functionally related PAGs with different lines of evidence, which were used to construct the database and analysis tool Gene4PD. Our results suggest that integrating multiple types of genetic data is useful for prioritizing novel associated genes. In addition, we characterized the genetic landscape of the prioritized PAGs, providing insight into the pathology of PD. Gene4PD is expected to provide researchers and clinicians with comprehensive genetic knowledge and an analytic platform for PD, and should also improve the understanding of the pathogenesis of PD.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found here: http://genemed.tech/gene4pd/.