Identification of hub genes of Parkinson's disease through bioinformatics analysis

Parkinson's disease (PD) is a common neurodegenerative disease, and there is still a lack of effective diagnostic and treatment methods. This study aimed to search for hub genes that might serve as diagnostic or therapeutic targets for PD. All the analysis was performed in R software. The expression profile data of PD (number: GSE7621) was acquired from the Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) associated with PD were screened by the “Limma” package of the R software. Key genes associated with PD were screened by the “WGCNA” package of the R software. Target genes were screened by merging the results of “Limma” and “WGCNA.” Enrichment analysis of target genes was performed by Gene Ontology (GO), Disease Ontology (DO), and Kyoto Enrichment of Genes and Genomes (KEGG). Machine learning algorithms were employed to screen for hub genes. Nomogram was constructed using the “rms” package. And the receiver operating characteristic curve (ROC) was plotted to detect and validate our prediction model sensitivity and specificity. Additional expression profile data of PD (number: GSE20141) was acquired from the GEO database to validate the nomogram. GSEA was used to determine the biological functions of the hub genes. Finally, RPL3L, PLEK2, PYCRL, CD99P1, LOC100133130, MELK, LINC01101, and DLG3-AS1 were identified as hub genes of PD. These findings can provide a new direction for the diagnosis and treatment of PD.


Introduction
Parkinson's disease (PD) is one of the common neurodegenerative diseases, second only to Alzheimer's disease (Di Stefano and Marinelli, 2021;Pan et al., 2021). The global prevalence of PD is rising, affecting nearly 2% of people over the age of 65 and 5% of people over the age of 85 (Bloem et al., 2021;Dorszewska et al., 2021). The pathological features of PD are mainly the loss of dopaminergic neurons in the substantia nigra and the formation of Lewy bodies, which is determined by genetic and environmental factors, and related to age, immune-inflammatory mechanisms, mitochondrial dysfunction, oxidative stress, apoptosis, lysosomal dysfunction, etc (Pan-Montojo and Reichmann, 2014;Su and Federoff, 2014;Kalia and Lang, 2015;Mullin and Schapira, 2015;Vivekanantham et al., 2015;Hu and Wang, 2016;Collier et al., 2017;Vascellari and Manzin, 2021). The role of genetic factors in PD is receiving more attention, and dozens . /fnins. . of genes have been found to be related to the incidence of PD, including SNCA, LRRK2, and so on (Manto and Marmolino, 2009;Deng et al., 2018;Poujois and Woimant, 2018). Neurologists' assessments of clinical manifestations, movement disorders, and some routine laboratory tests are the most important diagnostic methods for PD. However, these methods have disadvantages and limitations, such as low sensitivity, selectivity, and high cost (Bindas et al., 2021;Mobed et al., 2021). The treatment of PD is constantly developing, including drug therapy, surgical treatment, gene therapy, rehabilitation, etc. However, drug therapy is still the preferred treatment of Parkinson's disease in the clinic and is the main treatment method. However, drug therapy can only improve symptoms and cannot control the progression of the disease. With the prolongation of the medication duration and the dose increase, there will be a decrease in the efficacy of the drug and complications. Some researchers are developing new gene promoters to control gene expression in different subsets of neurons, stimulating the growth and development of specific neurons (such as dopaminergic neurons), thereby promoting the recovery of Parkinson's disease. Another emerging form of gene therapy is using novel drugs that directly deliver key proteins involved in dopamine metabolism to the basal ganglia region. Using gene therapies to clear disease-causing proteins is another beneficial exploration (Brundin et al., 1987;Lindvall et al., 1988;Ropper and Samuels, 1997). Although many genes are associated with PD, the specific pathogenesis is not clear, and there is still a lack of understanding of the genes that can be used for the diagnosis and treatment of PD. It is important to find new genes related to PD, which can be used for new targets for treatment.
Recently, bioinformatics analysis has been widely used as a new technique to screen for underlying biomarkers for both tumor and non-tumor diseases (Zhang et al., 2022). In this study, we downloaded a microarray dataset and analyzed gene expression to obtain differentially expressed genes from persons with PD and healthy individuals. We combined the data with WGCNA and machine learning algorithms, screened out a total of eight core genes, verified the prediction accuracy by area size (AUC) under the ROC curve, and performed GSEA analysis on each hub gene. We aimed to identify candidate genes that may be used as PD biomarkers.

Data collection and preprocessing
The raw data of the substantia nigra tissue from 16 PD and nine American normal samples in the GSE7621 (Lesnick et al., 2007) dataset, which was sequenced using the GPL570 platform, was obtained from the GEO database.

Identification of DEGs in the substantia nigra of patients with PD
The "Limma" R package was used to screen DEGs between PD and normal samples, and genes with P < 0.05 and |log 2 FC| >1 were regarded as DEGs (Liu et al., 2021).

Screening of key modules and target genes based on WGCNA
To screen potential genes associated with PD, the gene expression matrix of the substantia nigra tissue from 16 PD and nine normal American samples was used to create a weighted gene co-expression network using the "WGCNA" R package (Langfelder and Horvath, 2008;Yu et al., 2012). First, we clustered all samples to guarantee a reliable network. Second, we calculated the Pearson correlation coefficient between each pair of genes to evaluate the expression similarity of genes and acquire a correlation matrix. We also used the soft threshold function to convert the correlation matrix into a weighted neighborhood matrix and used a soft connectivity algorithm to select the optimal soft threshold to ensure that gene correlations were maximally consistent with scale-free distribution. Subsequently, the neighborhood matrix was transformed into a topological overlap matrix (TOM). Furthermore, co-expression modules were obtained based on the criteria of dynamic tree cutting by setting the minimum number of genes in a module as 50. Finally, key modules were selected by correlation analysis, and the key modules' genes were considered key genes. Target genes were obtained by intersecting DEGs with key genes based on WGCNA screening.

Gene Ontology, Disease Ontology, and Kyoto Enrichment of Genes and Genomes enrichment analyses
Biological function enrichment of Gene Ontology (GO), Disease Ontology (DO), and Kyoto Enrichment of Genes and Genomes (KEGG) analyses were performed using the "clusterProfiler" R (Yu et al., 2012) package. GO enrichment analysis was performed to investigate the gene-related biological process (BP), molecular functions (MF), and cellular components (CC). DO enrichment analysis was used to explore genes-related diseases. KEGG enrichment analysis was conducted to explore gene-related signaling pathways. Statistical significance was set at an adjusted P-value < 0.05.

Identification of hub genes of PD based on machine learning algorithms
To begin with, the LASSO logistic regression algorithm was performed to screen potential genes by using the "glmnet" R package (Tibshirani, 1996;Friedman et al., 2010), and receiver operating characteristic (ROC) analysis was selected to test the model reliability by calculating the area under the curve (AUC) value through the "pROC" R package (Robin et al., 2011). Next, the SVM-RFE algorithm was used to screen potential genes using the "e1071" R package (Suykens and Vandewalle, 2004;Huang et al., 2014). In addition, the random forest (RF) algorithm was also conducted to screen potential genes using the "randomForest" R package (Liaw and Wiener, 2007;Cutler et al., 2011). Similarly, the ROC curve was used to test the model reliability by using the "pROC" R package, and the top 10 genes based on %IncMSE ranking were regarded as potential genes (Robin et al., 2011). Finally, overlapping genes among potential genes generated via LASSO, SVM-RFE, and RF algorithms were considered hub genes of PD.

Establishment of a diagnostic nomogram for PD
A diagnostic nomogram was established based on the hub genes by using the "rms" package in R software. The receiver operating characteristic curve (ROC) was used to investigate the efficiency of this diagnostic model. The area under curve >0.7 was considered significant. Additional expression profile data of PD [number: GSE20141 (Zheng et al., 2010)] was acquired from the GEO database to validate the nomogram.

Evaluation of the expression levels and diagnostic implications for the hub genes
Wilcoxon's rank-sum test was used to analyze the expression levels of hub genes. ROC analysis was performed to evaluate whether hub genes could differentiate PD samples from normal samples using the "pROC" R package (Robin et al., 2011).

Biological functions and validation of hub genes
Gene Set Enrichment Analysis (GSEA) was performed using the "clusterProfiler" R package to investigate the biological functions of hub genes by the ordered gene expression matrix based on the Pearson correlation between each hub gene and other genes (Yu et al., 2012).

Identification of DEGs in the substantia nigra of patients with PD
By setting the cut-off value as P < 0.05 and |log 2 FC| >1, a total of 117 DEGs, including 42 upregulated and 75 downregulated genes were identified in the substantia nigra of PD patients compared with normal samples (Table 1). A volcano diagram was constructed for the DEGs ( Figure 1A). The top 60 DEGs are presented using a cluster heatmap ( Figure 1B).

Screening of key modules and genes based on WGCNA
We extracted the expression data of differentially expressed genes in samples of persons with PD for co-expression analysis. First, the soft threshold was selected for subsequent coexpression network construction (Figure 2A). The principle was to make the constructed network more in line with the characteristics of the scale-free network. The R-square .
/fnins. . was set as 0.85 (Figure 2A). WGCNA was used to construct the co-expression network module and visually display the modules' gene correlation. Fourteen co-expression modules were obtained, and the number of genes in each module was at least 50. The results were displayed in a hierarchical clustering diagram ( Figure 2B). Then, a heat map was mapped on moduletrait relationships according to the Spearman correlation coefficient to evaluate the association between each module and the disease ( Figure 2C). Two modules "MEdarkgrey" and "MEdarkorange" had high association with PD and were selected as PD-related modules (MEdarkgrey module: r = 0.91, P = 4e−10; MEdarkorange module: r = 0.89, P = 4e−08). The MEdarkgrey and MEdarkorange modules were positively correlated with PD, 5,801, and 7,763 genes, respectively. Target genes were obtained by intersecting DEGs with key genes based on WGCNA screening ( Figure 2D).

Gene Ontology, Disease Ontology, and Kyoto Encyclopedia of Genes and Genomes enrichment analyses
The GO analyses associated the most enriched biological process (BP) terms with dopamine biosynthetic process, synapse organization regulation, and synapse structure or activity. The most enriched terms for cellular components (CC) were mainly associated with terminal bouton. The most enriched molecular function (MF) terms were associated with G protein-coupled peptide receptor activity and peptide receptor activity (Figures 3A,B). In the DO analysis, the target genes were enriched in neurodegenerative diseases, such as PD, and tumors of the nervous system ( Figures 3C,D). In the KEGG analysis, the target genes were enriched in cocaine addiction and the calcium signaling pathway ( Figures 3E,F).

Identification of hub genes of PD based on machine learning algorithms
Machine learning algorithms were selected and executed to further identify the hub genes of PD from 91 target genes. First, while constructing the LASSO model based on PD and normal samples, λ analysis suggested that the model could accurately predict PD with λ = 11 ( Figure 4A). Thus, RPL3L, PLEK2, PYCRL, ABCA11P, DACH2, CD99P1, SNORD114-3, LOC100133130, MELK, LINC01101, and DLG3-AS1 were identified to build the LASSO module. We acquired the LASSO coefficient spectrum of the potential genes according to λ = 11( Figure 4A). However, SVM-RFE analysis revealed that the SVM model based on one characteristic gene showed an optimum error rate (0.00, Figure 4B). The first 20 genes were identified as potential genes. At the same time, the RF algorithm identified the top 10 genes from 52 potential genes ( Figure 5C).

Establishment of diagnostic nomogram for PD
A diagnostic nomogram was successfully constructed based on the eight genes for predicting the incidence of PD ( Figure 5A). The area under the curve (AUC) of the GSE7621 dataset was 1.000 ( Figure 5B), and that of the GSE20141 dataset was 0.900 ( Figure 5C).

Evaluation of the expression levels and diagnostic implications for the hub genes
To further investigate the role of hub genes in PD, we first observed their expression levels in PD patients. Interestingly, we found that the expression of PYCRL, LOC100133130, MELK, LINC01101, and DLG3-AS1 was downregulated, and the expression of RPL3L, PLEK2, and CD99P1 was upregulated in Frontiers in Neuroscience frontiersin.org . /fnins. . PD patients compared with the healthy samples ( Figure 6A). Moreover, ROC analyses suggested RPL3L, PLEK2, PYCRL, CD99P1, LOC100133130, MELK, LINC01101, and DLG3-AS1 might be used as hub genes of PD patients ( Figure 6B).

Discussion
Parkinson's disease is the second most prevalent neurodegenerative disease and has significantly increased over the past 20 years (Cabreira and Massano, 2019). It is caused by a combination of environmental and genetic factors related to age, sex, etc (Cabreira and Massano, 2019;Cerri et al., 2019). Due to the lack of early diagnostic techniques, PD is usually not detected until later stages of complete neuronal degeneration, which often leads to delayed treatment of patients and affects the prognosis (Lotankar et al., 2017). Therefore, to improve prognosis, biomarkers are needed to detect the onset of the disease in the early stages.
In the present study, we first obtained 117 DEGs from the substantia nigra of PD, and WGCNA screened a total of 13,564 key genes for two key modules. Ninety-one target genes were screened out by intersecting DEGs with key genes. Interestingly, these 91 target genes were mostly related to the regulation of synapse structure or activity, dopamine biosynthetic process, locomotory behavior, and response to nicotine metabolismrelated BPs (Figures 3A,B). Thus, we speculated that these genes might play key roles in PD by regulating the synapse structure or activity, dopamine biosynthetic process, locomotory behavior, and response to nicotine.
Over the past few decades, more and more studies have shown that the biosynthesis of dopamine and the regulation of synaptic activity and structure play an important role in the pathogenesis of PD (Latif et al., 2021;Nachman and Verstreken, 2022). In addition, increasing evidence has revealed that nicotine has protective effects on PD (Jin Jung et al., 2021;Wang et al., 2022). Furthermore, motor symptoms are one of the most important clinical manifestations of PD (Opara et al., 2017). However, we found that 91 target genes were enriched only in the calcium signaling pathway (Figures 3E,F). Calcium signaling pathways play an important role in PD (Calì et al., 2014;Bohush et al., 2021). Therefore, our study may contribute to understanding the molecular mechanisms underlying PD.
Finally, we identified RPL3L, PLEK2, PYCRL, CD99P1, LOC100133130, MELK, LINC01101, and DLG3-AS1 as hub genes using LASSO logistic regression, SVM-RFE, and RF algorithms. RPL3L (ribosomal protein L3-like) is one of the four non-canonical riboprotein genes, and it encodes the 60S ribosomal protein L3-like protein that is highly expressed only in cardiac and skeletal muscles (Ganapathi et al., 2020). Recent research indicates that RPl3l overexpression impairs the growth and myogenic fusion of myotubes, and RPL3L can be used as a potential genetic marker to control neurodegeneration (Chaillou, 2019). Thus, RPL3L may play a critical role in PD by regulating function of skeletal muscle. CD99P1 (CD99 molecule pseudogene 1) is a long noncoding RNA (lncRNA) and has been revealed to be related to myofibroblast differentiation (Huang et al., 2015;Yildirim et al., 2021). Hence, CD99P1 may play a key role in PD by affecting myofibroblast differentiation. PYCRL (pyrroline-5-carboxylate reductase-like) is a pyrroline-5-carboxylate reductase linked to the conversion of ornithine to proline (De Ingeniis et al., 2012). Therefore, PYCRL may play a decisive role in PD by affecting proline biosynthesis. MELK (maternal embryonic leucine zipper kinase) is an AMPactivated protein kinase (AMPK)-related kinase (Seong and Ha, 2019). More and more studies have found that MELK is involved in the occurrence of cancer and cell metabolism by regulating the cell division cycle Seong and Ha, 2019). Therefore, MELK may play a key role in PD by affecting the cell division cycle. PLEK2 (Pleckstrin-2) is a crucial mediator of cytoskeletal reorganization . Studies show that PLEK2 may regulate actin organization and cell spreading (Hu et al., 1999;Hamaguchi et al., 2007). Therefore, PLEK2 may play a key role in PD by regulating actin organization and cell spreading. LINC01101 is a long noncoding RNA (lncRNA) associated with progression and high-risk HPV infection (Iancu et al., 2017). But there is still no description of its pathogenesis. Notably, no studies have reported the role of RPL3L, PLEK2, PYCRL, CD99P1, LOC100133130, MELK, LINC01101, and DLG3-AS1 in PD. LOC100133130 and DLG3-AS1, as newly discovered genes, have no relevant report. Thus, further investigations are necessary.
We investigated the biological functions of RPL3L, PLEK2, PYCRL, CD99P1, LOC100133130, MELK, LINC01101, and DLG3-AS1. Interestingly, GSEA revealed that RPL3L, PLEK2, PYCRL, CD99P1, LOC100133130, MELK, LINC01101, and DLG3-AS1 were mainly involved in nicotine addiction, diabetes, protein export, taste transduction, fatty acid degradation, propanoate metabolism, endocrine and other factor-regulated calcium reabsorption, GABAergic synapse, steroid biosynthesis, synaptic vesicle cycle, carbohydrate digestion and absorption, proteasome, and asthma. Currently, an increasing number of studies have shown that the above factors significantly impact the occurrence and development of PD. For example, it has been suggested that nicotine addiction and diabetes are associated with PD (Cheong et al., 2020;Carvajal-Oliveros et al., 2021). However, their regulatory mechanisms are rarely studied to the best of our knowledge. Thus, further studies are required to explore this in the future.

Conclusion
In all, 117 DEGs were screened between the substantia nigra of PD and healthy samples. Of these, RPL3L, PLEK2, PYCRL, CD99P1, LOC100133130, MELK, LINC01101, and DLG3-AS1 were identified as hub genes of patients with PD based on .
WGCNA and machine learning algorithms. Therefore, our study contributes to the understanding of PD and helps in improving the diagnosis and treatment of PD. However, further studies are needed to investigate the roles of hub genes.

Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Ethics statement
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent from the patients/participants or patients/participants' legal guardian/next of kin was not required to participate in this study in accordance with the national legislation and the institutional requirements.

Author contributions
YY conceived of the presented idea. XH developed the theory, performed the computations, and supervised the findings of this work. CW and YW verified the analytical methods. XX and CL contributed in the revision of the article. All authors discussed the results and contributed to the final manuscript.

Acknowledgments
We are grateful to everyone who has helped us with this paper. Our sincere gratitude and appreciation go first to our supervisor, whose suggestions and encouragement gave us much insight into these translation studies. Studying under his guidance and supervision has been a great privilege and joy. Furthermore, it is our honor to benefit from his personality and diligence, which we will treasure our whole life. Our gratitude to him knows no bounds. We are also extremely grateful to all our friends and classmates who have provided us with assistance and companionship in preparing this paper. In addition, many thanks go to our family for their unfailing love and unwavering support. Finally, we are grateful to all those who devoted much time to reading this thesis and giving us much advice, which will benefit us in our later studies.