Integrative Analysis of Methylation and Copy Number Variations of Prostate Adenocarcinoma Based on Weighted Gene Co-expression Network Analysis

Prostate adenocarcinoma (PRAD) is the most pervasive carcinoma diagnosed in men with over 170,000 new cases every year in the United States and is the second leading cause of death from cancer in men despite its indolent clinical course. Prostate-specific antigen testing, which is the most commonly used non-invasive diagnostic method for PRAD, has improved early detection rates in the past decade, but its effectiveness for monitoring disease progression and predicting prognosis is controversial. To identify novel biomarkers for these purposes, we carried out weighted gene co-expression network analysis of the top 10,000 variant genes in PRAD from The Cancer Genome Atlas in order to identify gene modules associated with clinical outcomes. Methylation and copy number variation analysis were performed to screen aberrantly expressed genes, and the Kaplan–Meier survival and gene set enrichment analyses were conducted to evaluate the prognostic value and potential mechanisms of the identified genes. Cyclin E2 (CCNE2), rhophilin Rho GTPase-binding protein (RHPN1), enhancer of zeste homolog 2 (EZH2), tonsoku-like DNA repair protein (TONSL), epoxide hydrolase 2 (EPHX2), fibromodulin (FMOD), and solute carrier family 7 member (SLC7A4) were identified as potential prognostic indicators and possible therapeutic targets as well. These findings can improve diagnosis and disease monitoring to achieve better clinical outcomes in PRAD.


INTRODUCTION
Prostate adenocarcinoma (PRAD) is one of the most common neoplasms worldwide, ranking 4th among all cancer types in both sexes with an incidence of 7.1% (1). In the United States, PRAD is the most prevalent cancer in men and is estimated to have caused more than 30,000 deaths in 2020 (1,2).
Cancer was previously considered as a genetic disease, but there is considerable evidence that epigenetic changes contribute to tumorigenesis and tumor progression (3)(4)(5). DNA methylation is the most widely studied epigenetic modification in both non-neoplastic and neoplastic diseases including PRAD (6). The methylation of CpG islands, which are often located in the gene promoter, results in transcriptional silencing (7). Recently, methylation of enhancer regions has also been shown to play an important role in regulating gene expression (8,9). DNA methyltransferase 1 (DNMT1), DNMT3a, and DNMT3b are upregulated in PRAD tissue compared to normal benign prostatic hyperplastic tissue, and their expression is elevated in cancerous tissue with a higher Gleason score, suggesting a close association between epigenetic alterations and PRAD development and progression (10). Additionally, epigenetic marks are potential biomarkers for PRAD (11) and targets for next-generation drugs.
Copy number variations (CNVs) are the most common genetic alteration in cancers, and CNV burden is associated with the rates of recurrence and death in multiple neoplasms (12). E26 transformation-specific (ETS) genes, tumor protein 53 (TP53), phosphatase and tensin homolog (PTEN), and androgen receptor (AR) are the most frequently altered genes in primary prostate cancer, which leads to dysregulation of phosphoinositide 3-kinase (PI3K)/protein kinase B (AKT), RAS/RAF, and cell cycle FIGURE 1 | Schematic diagram of the study. After identifying clinically relevant modules with WGCNA, the pink module (M7) was selected for further investigation including differentially methylated genes and frequency of CNVs, afterwards, the Kaplan-Meier survival analysis, and GSEA were used to obtain results and conclusions. WGCNA, weighted gene co-expression network analysis; DFS, disease-free survival; CNV, copy number variation; GSEA, Gene Set Enrichment Analysis. signaling pathways; moreover, alterations in AR and TP53 have been linked to castration resistance (13,14) and worse outcomes (15). Thus, CNVs have prognostic value in PRAD as they can reflect disease progression.
The development and progression of cancers involve genegene interactions within a gene co-expression network. In this study, we carried out weighted gene co-expression network analysis (WGCNA) (16) to identify genes associated with clinical outcomes in PRAD and can thus serve as biomarkers. We also investigated CNV and methylation status of genes in key module of the network and assessed their prognostic value for PRAD.

Data Acquisition
The expression data matrix of The Cancer Genome Atlas (TCGA) PRAD database comprising 497 tumor and 52 normal tissue samples along with CNVs, DNA methylation, and clinical FIGURE 2 | Module-trait relationships. Each column corresponds to one trait, row to one module and every cell contains the correlation coefficient and p-value. The gray module represents genes not classified into any module. BR, biochemical recurrence. GS, Gleason score. pN, pathological N stage. pT, pathological T stage.
information was downloaded from the University of California at Santa Cruz (UCSC) Xena web server (https://xenabrowser.net/).

Identification of Co-expression Module
Unlike ordinary clustering analysis, clustering criteria of WGCNA have biological significance, so the results obtained by this method have higher credibility. WGCNA clusters genes with similar expression patterns into a module and allows analysis of correlations between module and sample features. In this study, WGCNA was carried out to identify gene module closely related to clinical outcomes in PRAD. To minimize computational burden, the top 10,000 genes with the largest variance were selected. The topological overlap matrix (TOM) was performed to measure the correlation between genes and detection of module, which was able to identify not only the similarity of expression between gene A and gene C, but also the effect of gene A on gene C via gene B. A height of 220 in the sample cluster was used to detect outliers, with two outliers as filters. A power β of 8, minimal module size of 30, and branch merge cutoff height of 0.25 were used as the criteria for module construction.

Copy Number Variation Analysis
The TCGA PRAD CNV profiles were originally measured using whole genome microarray at a TCGA genome characterization center, and GISTIC2 method was then conducted to acquire the estimated values to −2, −1, 0, 1, 2, respectively, representing homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification (17). The processed data was obtained from https://xenabrowser.net/. In addition, GISTIC2 was conducted to assess the possibility of CNV events in specific chromosomal regions. Genes with changes in frequency >10% were selected for further analysis. We calculated the Spearman correlation coefficient (r) between CNVs and gene expression levels, with r > 0.4 as the cutoff value, indicating the significant impact on gene expression of CNV.

Methylation Analysis
The DNA methylation profiles of PRAD from TCGA were available at the University of California, Santa Cruz (UCSC) Xena browser (https://xenabrowser.net/), which were measured experimentally based on the Illumina Infinium HumanMethylation450 platform (Illumina, San Diego, CA, USA). DNA methylation values (β values, between 0 and 1) were recorded for every array probe in each sample by virtue of BeadStudio software (Illumina, San Diego, CA, USA), representing the ratio of the intensity of the methylated bead type to the combined locus intensity. The level of methylation evaluated by β values were derived from the Johns Hopkins University and University of Southern California TCGA genome characterization center.

Gene Set Enrichment Analysis and Protein-Protein Interaction Network Analysis
Gene Set Enrichment Analysis (GSEA) (18) is a computational method used to determine whether a predefined set of genes can show significant differences between two biological FIGURE 3 | The level of gene copy number in PRAD samples. The estimated values -2, -1, 0, 1, 2, respectively representing homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification. The horizontal axis represents PRAD tumor samples in TCGA, whereas the vertical axis represents genes from M7 module with CNV > 10%.
statuses, which were performed by the GSEA software obtained from http://www.broad.mit.edu/gsea to assess the enrichment of identified genes with distinct CNVs and methylation levels in PRAD, with false discovery rate (FDR) < 25% and nominal p < 0.05 as the cutoff values. Proteinprotein interaction (PPI) network analysis of identified genes was completed by an online tool available at https:// string-db.org/ to assess possible interactions between their expression products.

Survival and Statistical Analyses
The Kaplan-Meier survival analysis was performed with Prism 7.0 (GraphPad, La Jolla, CA, USA) and the online tool GEPIA (http://gepia.cancer-pku.cn/index.html) (19). Previous study by  Li et al. has established a prognostic model and verified with independent datasets after establishing a prognostic model (20); therefore, we downloaded the independent dataset GSE70769 (21) through the National Center for Biotechnology Information Search database (https://www.ncbi.nlm.nih.gov/) and analyzed the impact of the identified genes on the prognosis of prostate cancer patients. Multivariate analyses were carried out with the cox proportional hazards regression model. All data processing was performed using SPSS v22.0 software (SPSS Inc, Chicago, IL, USA) or R software (x64 3.5.1) (22).
The research process is illustrated in Figure 1.

Identification of Co-expression Module in PRAD
The top 10,000 genes with the largest variations in expression level relative to normal tissue were selected for WGCNA. We generated a module-trait association network with 7 clinicopathologic traits and 17 modules and calculated the Pearson's correlation coefficients and p-values to evaluate the relationship between clinical traits and feature vectors of genes in the module. The module with highest correlation coefficient and module size >30 (pink module, M7, p < 0.01) was selected for further analysis (Figure 2).

Copy Number Variation Analysis
After analyzing CNV profiles of TCGA PRAD data and combining the results with pink module (M7) from the WGCNA, we selected 111 genes with a variation frequency >10% and constructed a heatmap of the CN of genes in the PRAD samples (Figure 3), which allowed us to identify those with abnormal CN. Because our aim was to identify prognostic biomarkers for PRAD, we examined the pathologic stage associated with the CN variants. The 14 genes with the highest CNV and the corresponding clinicopathologic stage are shown in Table 1. Of these, nine genes with a Spearman correlation coefficient >0.4 were selected to evaluate the association between gene CN and expression level. Positive correlations were observed for the cyclin E2 (CCNE2), DNA replication and sister chromatid cohesion 1 (DSCC1), rhophilin Rho GTPase-binding protein (RHPN1), enhancer of zeste homolog 2 (EZH2), RAD54B, TBC1 domain family member 31 (TBC1D31), and tonsoku-like DNA repair protein (TONSL) genes (p < 0.0001), indicating the amplifications of CN events probably correlated with higher gene expression level. However, epoxide hydrolase 2 (EPHX2) and CBFA2/RUNX1 partner transcriptional co-repressor 3 (CBFA2T3) primarily showed deletions of CN events, which leading to the lower level of gene expression (Figure 4). To macro-evaluate the possibility of CNV events in specific chromosomal regions, the deletion and amplification plots based on G scores for CNV were demonstrated in Supplementary Figure 2. The higher G score of a region represents for the greater probability of CNV events in that region.

The Kaplan-Meier Survival Analysis
We performed the Kaplan-Meier analysis to evaluate the relationship between CNV and disease-free survival (DFS). The survival curve indicated that CNV level was significantly associated with the prognosis of patients with PRAD, with lower CNV predicting longer DFS (p = 0.0001; Figure 5). We analyzed the relationship between CNV of the nine above-mentioned genes and patient prognosis and found that lower CNs of CCNE2 [hazard ratio (HR) = 1.6; p < 0.05], RHPN1 (HR = 2; p < 0.05), EZH2 (HR = 2.2; p < 0.001), and TONSL (HR = 1.7; p < 0.05) were associated with better prognosis, whereas the opposite was true for EPHX2 (HR = 0.47; p < 0.001) (Figure 6). A multivariate analysis of CBFA2T3 CN suggested that it may be a protective factor in PRAD, whereas the Kaplan-Meier survival analysis suggested it was not statistically significant in the prognosis of patients with PRAD ( Figure 6I and Table 2). The validation of identified biomarkers for prognosis value revealed the similar results as our former analysis, indicating the explicit prognostic significance of CCNE2, SLC7A4, EZH2, etc. (Supplementary Figures 1B-H).

Gene Set Enrichment Analysis and PPI Network Analysis
To identify enriched gene sets in PRAD samples with high CNV and clarify the mechanisms of CNV in tumorigenesis, we performed GSEA to identify relevant biological pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and Pathway Interaction Database (PID) using FDR < 25% and p < 0.05 as the criteria for significance. For EZH2, TONSL, and CCNE2, the GSEA curves revealed four enriched gene sets including "KEGG-cell cycle, " "KEGG-P53 signaling pathway, " "PID-ataxia-telangiectasia mutated (ATM) pathway, " and "PID-E2F pathway, " which are mainly related to cell cycle regulation, cell apoptosis, and DNA damage repair. Additionally, for EPHX2, two functional gene sets were enriched-namely, "Cell cycle pathway" and "PID-E2F pathway" (Figure 7). The PPI network analysis found that there was a co-expression between EZH2 and CCNE2, both of which play important roles in regulating cell cycle (Supplementary Figure 1A).

Methylation Analysis
After establishing the co-expression module, the DNA methylation level of the genes was examined, and its correlation with gene expression level was evaluated with the Pearson's correlation coefficient. Differentially methylated genes with the Pearson's correlation coefficient >0.4 were identified, including fibromodulin (FMOD), transmembrane protein 220 (TMEM220), histone H2B type 1-H (HIST1H2BH), zinc finger 334 (ZNF334), RIC3 acetylcholine receptor chaperone (RIC3), and solute carrier family 7 member (SLC7A4); these genes were all hypermethylated in tumor samples (n = 336) compared to normal tissue (n = 49) (Figures 8A, 9A). There was a moderate inverse correlation between gene expression and methylation levels (r > 0.4, p < 0.001; Figure 8B). The heatmap of DNA methylation revealed significantly higher levels in tumor tissue compared to normal tissue, especially for SLC7A4. The Kaplan-Meier survival curves showed an association between gene expression level and the prognosis of PRAD for FMOD [HR (high) = 0.37; p < 0.001] and SLC7A4 [HR (high) = 0.44; p < 0.001], with a higher level corresponding to better prognosis (Figures 9B,E), while others were not statistically significant (Figures 9C,D,F,G). The GSEA curves revealed four gene sets that were enriched, including "KEGG-cell cycle, " "PID-E2F pathway, " "Hallmark-E2F target, " and "Hallmark-G2M checkpoint" (Figures 9H-K). These results indicate that FMOD and SLC7A4 are significant genes related to the clinical outcome of PRAD.

DISCUSSION
The number of new cases of PRAD in the United States has shown an increasing trend in the last 3 years, and PRAD is the second leading cause of death in men despite improvements in diagnostic methods and treatments (2,23). Although magnetic resonance imaging and some biomarkers are used for the diagnosis of PRAD, the standard approach is tissue biopsy (24), which may only be performed at later stages of the disease when therapeutic options are limited. Copy number variations occur in 4.8-9.5% of the human genome and play a critical role in tumor recurrence (25); and epigenetic modifications such as DNA methylation are potential biomarkers and targets for treatment in cancer (11). Given the increasing rates of PRAD, there is a need for new diagnostic and prognostic biomarkers with high specificity and sensitivity. In this study, we identified five novel genes with high CNV in PRAD by WGCNA (CCNE2, RHPN1, EZH2, TONSL, and EPHX2) along with two hypermethylated genes (FMOD and SLC7A4) that were significantly correlated with the prognosis of patients with PRAD and may thus be clinically useful biomarkers.
Cyclin E2 encodes cyclin E2, a regulatory subunit of cyclindependent kinase 2 (CDK2), which controls cell cycle entry from quiescence. Although the gene encoding the other subunit  of CDK2, CCNE1, has been linked to poor prognosis in hepatocellular carcinoma (HCC), there is little known about the role of CCNE2 in tumor progression (26). Cyclin E2 was shown to induce the G1-S transition in PC3 prostate cancer cells (27); our results suggest that it may have a similar function in PRAD, given that a lower CCNE2 CN was associated with longer DFS in patients.
Rhophilin Rho GTPase-binding protein is a Rho GTPaseinteracting protein that has not been previously reported in PRAD, but is known to modulate the glomerular filtration barrier and podocyte cytoskeletal architecture (28). The long non-coding RNA RHPN1 antisense RNA 1 (RHPN1-AS1) was found to promote the progression of several tumors, including uveal melanoma, cervical cancer, and HCC (29)(30)(31). Our results provide the first demonstration that overexpression of RHPN1 is associated with poor prognosis in PRAD.
Enhancer of zeste homolog 2, the catalytic subunit of the DNA methyltransferase polycomb repressive complex 2 (PRC2), is overexpressed in hormone-refractory metastatic PRAD and may be correlated with disease progression and prognosis (32). Consistent with our findings, one study showed that an elevated level of EZH2 was associated with over proliferation of tumor cells and worse prognosis and may have clinical utility for distinguishing indolent PRAD from aggressive disease with a fatal course (31). On the one hand, the utility of EZH2 as a biomarker has been demonstrated in patients with intractable PRAD (33). On the other hand, EZH2 inhibitors have been linked to carcinogenesis and treatment resistance in clinical trials (34), though the detailed mechanisms underlying these effects remain to be determined.
Tonsoku-like DNA repair protein promotes homologous recombination during DNA repair in a complex with MMS22like (MMS22L) (35). However, the role of TONSL in prostate cancer is unknown. We found that a high level of TONSL was associated with enrichment of genes related to the ATM and E2F pathways-which mediate DNA repair and negatively regulate the cell cycle-and decreased survival time in patients with PRAD. Thus, TONSL is a potential biomarker for the progression of PRAD. Epoxide hydrolase 2 functions in arachidonic acid and androgen signaling (36)(37)(38) and has been linked to the biosynthesis and metabolism of cholesterol in the regulation of testosterone levels (39). EPHX2 silencing induced apoptosis in PRAD cells and enhanced the antiproliferative effect of flutamide (40). In our study, decreased expression of EPHX2 was associated with the enrichment of genes related to the cell cycle and E2F pathway while patients with an elevated level of EPHX2 had better prognosis, suggesting that EPHX2 has a protective role in PRAD.
Fibromodulin (encoded by FMOD) is thought to be involved in the inhibition of tumorigenesis and apoptosis in hematologic malignancies such as B-cell chronic lymphocytic leukemia and mantle cell lymphoma, like other proteoglycans (40). FMOD was shown to be overexpressed in PRAD cell lines and clinical specimens (41), which is supported by our findings. Our analysis revealed that higher expression of FMOD was associated with a better clinical outcome, highlighting its potential utility as a biomarker for monitoring disease progression.
Solute carrier family 7 member is a cationic amino acid transporter of unknown function; SLC7A4 expressed in the plasma membrane was insufficient to drive amino acid transport (42). Our results showed that SLC7A4 methylation was higher in tumor specimens than in normal tissue, and that higher SLC7A4 expression was associated with better clinical outcome. We speculate that SLC7A4 inhibits tumor formation in PRAD through regulation of the cell cycle. Thus, SLC7A4 likely has clinical value for monitoring PRAD progression and predicting prognosis.

CONCLUSION
In this study, we used WGCNA to identify seven genes that are potential prognostic biomarkers for PRAD based on CNV (CCNE2, RHPN1, EZH2, TONSL, and EPHX2) and DNA hypermethylation (FMOD and SLC7A4), all of which can serve as indicators of PRAD progression and potential therapeutic targets for the PRAD treatment as well. However, further experiments are needed to elucidate the precise roles and mechanisms of these candidate biomarkers in PRAD and validate their clinical applicability.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.