Integrative Analysis of DNA Methylation Identified 12 Signature Genes Specific to Metastatic ccRCC

Background: Abnormal epigenetic alterations can contribute to the development of human malignancies. Identification of these alterations for early screening and prognosis of clear cell renal cell carcinoma (ccRCC) has been a highly sought-after goal. Bioinformatic analysis of DNA methylation data provides broad prospects for discovery of epigenetic biomarkers. However, there is short of exploration of methylation-driven genes of ccRCC. Methods: Gene expression data and DNA methylation data in metastatic ccRCC were sourced from the Gene Expression Omnibus (GEO) database. Differentially methylated genes (DMGs) at 5′-C-phosphate-G- 3′ (CpG) sites and differentially expressed genes (DEGs) were screened and the overlapping genes in DMGs and DEGs were then subject to gene set enrichment analysis. Next, the weighted gene co-expression network analysis (WGCNA) was used to search hub DMGs associated with ccRCC. Cox regression and ROC analyses were performed to screen potential biomarkers and develop a prognostic model based on the screened hub genes. Results: Three hundred and fourteen overlapping DMGs were obtained from two independent GEO datasets. The turquoise module contained 79 hub DMGs, which represent the most significant module screened by WGCNA. Furthermore, a total of 12 hub genes (CETN3, DCAF7, GPX4, HNRNPA0, NUP54, SERPINB1, STARD5, TRIM52, C4orf3, C12orf51, and C17orf65) were identified in the TCGA database by multivariate Cox regression analyses. All the 12 genes were then used to generate the model for diagnosis and prognosis of ccRCC. ROC analysis showed that these genes exhibited good diagnostic efficiency for metastatic and non-metastatic ccRCC. Furthermore, the prognostic model with the 12 methylation-driven genes demonstrated a good prediction of 5-year survival rates for ccRCC patients. Conclusion: Integrative analysis of DNA methylation data identified 12 signature genes, which could be used as epigenetic biomarkers for prognosis of metastatic ccRCC. This prognostic model has a good prediction of 5-year survival for ccRCC patients.


BACKGROUND
Clear cell renal carcinoma (ccRCC) is the major type of renal tumor in the human urinary system (1). Many patients with ccRCC have distant metastasis to lymphoid or other organs (2). The 5-year survival rate was lower in patients with distant metastatic ccRCC than with non-metastatic ccRCC (3). Over the past decade, the therapeutic strategies for advanced ccRCC have evolved rapidly from a nonspecific immune approach to targeted therapy against vascular endothelial growth factor (VEGF), and recently to novel immunotherapy. Multiple agents like immune checkpoint inhibitors (ICIs) and tyrosine kinase inhibitors (TKIs) show promising results in the clinical trials and have been approved as the second-line even the first-line treatment for advanced ccRCC (4)(5)(6)(7). Moreover, targeted combination therapies have been explored as the first-line treatment in pioneer trials (8).
However, to enhance the efficacy and prognosis always underscores a need for new therapeutic targets and drug combinations (9). An increasing number of studies prove that epigenetic alterations such as DNA methylation and histone modifications are associated with cancer progression and occurrence. Therefore, they could be exploited as biomarkers for cancer diagnosis and prognosis (10)(11)(12)(13)(14). However, epigenetic biomarkers underlying metastatic ccRCC remain to be elucidated. Epigenetic biomarkers have been testified in certain malignancies and allowed their potential use in differentiation between metastatic and non-metastatic ccRCC. In this study, we profiled methylation patterns of metastatic and primary non-metastatic ccRCC and identified potential biomarkers associated with the cancer progression. We measured the gene methylation levels at CpG sites throughout the genome and found that a subset of genes presented with aberrant methylation and expression in metastatic ccRCC when compared with primary nonmetastatic ccRCC. We further created the criteria to screen hub DMGs and defined the pathological stages of ccRCC accordingly. Finally, we applied the survival analysis to evaluate the hub DMGs as potential biomarkers for the prognosis of ccRCC patients.

Data Collection
The gene methylation datasets (GSE113501 and GSE105260) were downloaded from the GEO database. GSE113501 and GSE105260 were generated on the platform GPL13534 (Illumina HumanMethylation450 BeadChip). In total, 54 metastatic tumor samples and 96 non-metastatic tumor samples were included in the study. GSE105261 was used for screening of DEGs in non-metastatic and metastatic ccRCC. TCGA-KIRC dataset was used for analysis of correlation between gene expression and methylation levels. All ccRCC-related clinical data in this study were obtained from the GEO and the Cancer Genome Atlas (TCGA) database.

Differential Methylation Analysis
First, the missing methylation values among all samples were removed. To identify differentially methylated CpGs, twosample independent t-test was applied to compare primary non-metastatic ccRCC and metastatic ccRCC data. Bonferroni procedure, together with transformed β values, was used to adjust crude p-values in multiple comparisons. The β value was calculated by the formula β = M/(M + U + a), where M and U represent the methylated and unmethylated signal intensities, respectively. a is usually set as 100 to stabilize β values when M and U are too small (15). β average were used for evaluation of global and regional CpG methylations. The delta β value was calculated to evaluate the methylation difference between metastatic and non-metastatic ccRCC. M and β values were used to identify the DMGs and intergenic CpG sites. The adjusted p < 0.05 and delta β values in 4/10 quantile were used as the cutoff criteria to screen candidate hub genes or intergenic CpG sites. R packages "limma" and "lumi" were used for differential analysis and conversion between β and M values, respectively (16)(17)(18)(19)(20)(21)(22). The DMGs were classified into six types according to the methylation patterns.

GENE SET ENRICHMENT ANALYSIS
Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were implemented by DAVID online tools (23,24). p < 0.05 was considered statistically significant. The GO term enrichment analysis included three categories: biological process (BP), molecular function (MF), and cellular component (CC). R package "Venn Diagram" was used to illustrate the overlapping genes in DMGs and DEGs.

Correlation Between DNA Methylation and Gene Expression
We examined the correlation between DNA methylation and gene expression based on the data from the TCGA database. Spearman correlation coefficients were used to evaluate the expression value and the methylation level at all genes CpG sites. The correlation was regarded significant if the absolute value of correlation coefficient was >0.3 and FDR was <0.05. Only the overlapping genes in DMGs and DEGs were evaluated for their biological functions by gene regulation analysis.
Weighted Co-expression Network Construction Analysis (WCGNA) R package "WGCNA" was used for constructing co-expression networks among genes across microarray samples (25). To begin with, the power of β (soft thresholding) was defined to ensure a standard scale-free network. An adjacency matrix was then constructed with the values of adjacencies between each pair of node genes in the network and their Pearson's correlation coefficients. The adjacency matrix was used to calculate the topological overlap matrix (TOM) and the  corresponding dissimilarity (1-TOM). Finally, based on TOMbased dissimilarity measures, the dendrogram was constructed by average linkage hierarchical clustering. Highly similar modules were clustered and then merged with a height cut-off of 0.25.

Identification of Modules Corresponding to Clinical Traits
By WGCNA, the modules most relevant to the clinical phenotypes of interest could be identified. Here, we differentiated clinical phenotypes by pathological stages. As a principal component of the gene expression matrix, module eigengene (ME) was used to identify modules corresponding to clinically significant traits. Gene significance was evaluated by log p-value of each gene in the linear regression between gene expression and pathological progression. The module significance was used to evaluate the correlation between module expression patterns and clinical traits (age, gender, survival time, and disease stage, etc.). In general, a module was regarded significantly correlating with certain clinical traits if its absolute module significance ranked the highest among others.

Identification of DMGs Associated With Overall Survival Prognosis and Validation of Potential Epigenetic Biomarkers by Receiver Operating Characteristic (ROC) Analysis
To identify module genes associated with prognosis of ccRCC patients, we analyzed gene methylation data in the turquoise module from the Cancer Genome Atlas Kidney Renal Clear Cell Carcinoma (TCGA-KIRC) data collection. Univariate Cox regression was used to evaluate the prognostic power of 79 individual probes at CpG sites. Furthermore, multivariate Cox regression analysis was performed to evaluate whether all genes corresponding to individual CpG sites could be used as independent prognostic factors for patient's survival. p < 0.05 indicates statistical significance. ROC analysis was used to validate the predictive accuracy of potential epigenetic biomarkers, which were remarkably methylated in the multivariable Cox regression model. The area under the curve (AUC) was calculated to measure the quality of the classifier with R package "pROC".

Development and Evaluation of Prognostic Model
The genes with significant p-values by multivariate Cox regression were selected to develop a prognostic model. The risk score was calculated with regression coefficients multiplied by methylation levels. The median risk score was set as the cutoff value, by which 307 ccRCC patients from the TCGA database were divided into the high-risk group and the low-risk group. Clinical traits were used as an independent variable and the overall survival (OS) as the dependent variable to calculate the hazard ratio (HR). ROC curve was plotted with the R package "survivalROC" to confirm the predictive power of the 12-gene prognostic model and assess the probability of 1, 3, and 5-year OS for ccRCC patients.

Identification of DMGs in Metastatic ccRCC
To identify potential DMG biomarkers for prognosis of ccRCC, we performed a DNA methylation profiling on 150 tumor tissues. All the relevant clinical data were collected from the GEO datasets: GSE113501 (including 28 metastatic ccRCC tissues and 87 non-metastatic ccRCC tissues) and GSE105260 (including 26 metastatic ccRCC tissues and 9 non-metastatic ccRCC tissues). In total, 237 CpG sites in GSE113501 and 462 CpG sites in GSE105260 were identified to have differential methylations (Supplementary Tables 1, 2), which corresponded to 198 and 355 genes, respectively (Supplementary Tables 3, 4). GO analysis showed that 552 DMGs were linked to the biological pathways that involved in transcriptional regulation of cancers, cell cycle and metabolism, such as bladder cancer (Supplementary Table 5).

DMGs Involved in Transcriptional Regulation
GO analysis revealed that most DMGs were annotated to biological process (BP) and molecular function (MF) terms indicating transcriptional regulation: positive or negative regulation of transcription for various BPs (Figure 2A and Supplementary Table 6) and transcription factor bindings ( Figure 2B). In addition, the ontology of cell component (CC) pointed to key cellular structures responsible for cellcell interaction and cell activities (Figure 2C). Venn diagram showed that genes representing top BP terms were mostly methylated in the regions of body, 5 ′ UTR, and TSS1500 ( Figure 2D).

Overlapping Genes in DMGs and DEGs
We identified 51 genes overlapping between DEGs from GSE105261 and DMGs from GSE105260 ( Supplementary Figure 1 and Supplementary Table 7), which were classified into four types according to their methylation and expression levels, namely downregulated hypermethylated or hypomethylated, and upregulated hypermethylated or hypomethylated. Noticeably, downregulated hypomethylated genes accounted for a much lower proportion (Figure 3A). The CpG methylation of these genes mainly took place in the body and TSS1500 regions (Figure 3B). GO analysis also revealed that they were related to transcriptional factor assembly and receptormediated signaling (Figure 3C), and key protein bindings ( Figure 3D). Moreover, 10 genes were screened against the TCGA-KIRC database, which showed a significant correlation between methylation level and expression level in ccRCC patients (Supplementary Figure 2).

Identification of DMGs Associated With Metastatic ccRCC Throughout the Genome
Further, we measured all CpG methylations throughout the genome and identified 6,300 genes from GSE113501 (Supplementary Table 8) and 799 genes from GSE105260 (Supplementary Table 9), respectively. Three hundred and fourteen genes were overlapped between the two subsets (Supplementary Figure 3). As previously, these genes were annotated to BPs relating to protein folding and transportation ( Figure 4A and Supplementary Table 10) and CCs to components in nucleus (Figure 4B), as well as MFs to key element bindings ( Figure 4C). When comparing the 314 DMGs with 3,002 DEGs filtered from GSE105261, we found 44 genes were overlapped (Supplementary Figures 4 and  Supplementary Table 11). Most of them were downregulatedhypermethylated genes (Supplementary Figure 5), indicating their domination in the progress of ccRCC. The TCGA-KIRC dataset and Spearman test were applied for correlation analysis, which identified top 10 genes with significant correlation between methylation and expression (Supplementary Table 12 and Supplementary Figure 6).

Identification of Hub DMGs
Next, we performed the WGCNA analysis to identify potential molecular modules that could characterize the pathological stages of ccRCC. The WGCNA network was constructed with 308 DMGs screened out by methylation level changes in all CpG island regions. The DMGs irrelevant to ccRCC clinical features were removed (Supplementary Table 13). The dendrogram and traits of all samples were illustrated ( Figure 5A) and the soft threshold power was set as five to ensure a scale-free network ( Figure 5B). Three WCGNA modules were identified as blue, turquoise, and gray ( Figure 5C). The blue module showed to be negatively correlated with pathological stages while turquoise module was positively. The turquoise module harboring 79 genes was selected for further analysis as it was the most relevant to clinical traits (Supplementary Tables 14 and Data Sheet 2). The correlation coefficients with p-values, as well as heatmap of each module were calculated, respectively (Figures 5D-F) and the 79 relevant genes were considered as hub DMGs.

Identification of 12 Signature DMGs
Among the 79 hub DMGs, 43 CpG sites were further identified as potential DNA methylation biomarkers for prediction of overall survival of ccRCC patients by univariate Cox regression analysis. The genes such as TGDS, TCTE3, TOPBP1, SNX14, and PHIP showed their expression were strongly associated with the overall survival in ccRCC patients (Supplementary Data Sheet 1). By contrast, the multivariate cox analysis revealed 12 DMGs (CETN3, DCAF7, GPX4, HNRNPA0, NUP54, SERPINB1, STARD5, TRIM52, C4orf3, C12orf51, C17orf65, and C21orf45) significantly affected overall survival of ccRCC patients ( Table 1). The module' risk score was significant ( Figure 6A) and survival analysis exhibited hypermethylation signature of the 12 genes significantly correlated with poor prognosis (p < 0.001) ( Figure 6B). Moreover, we examined methylation levels of the 12 genes in normal kidney tissues. The results showed that there was no significant difference between normal kidney tissues and non-metastatic ccRCC kidney tissues, but for metastatic renal cancer. Above all, all the analyses suggested that the multi-gene methylation signature may be associated with ccRCC metastasis (Supplementary Data Sheet 3). Next, the ROC AUC was used to evaluate predictive powers of the prognostic models constructed with single signature gene or multiple signature genes. The ROC curves indicated that C17orf65, HNRNPA0, and STARD5 exhibited better diagnostic accuracy for differentiating metastatic and nonmetastatic ccRCC cases in GSE105260 as the AUC value was >70% (Supplementary Data Sheet 4). By contrast, C12orf51, C17orf65, SERPINBI, and TRIM52 demonstrated better diagnostic accuracy for differentiation of cases in GSE113501 (Supplementary Data Sheet 5). The ROC curve validated that the 12 signature genes as a whole had better diagnostic accuracy than individual signature genes in GSE105260 ( Figure 7A). Also, the 12-gene signature could effectively diagnose the patients with metastatic ccRCC case in GSE113501 ( Figure 7B). The prediction powers for 1-, 3-, and 5-year overall survival were 0.87, 0.84, and 0.81, respectively, indicating that the prognostic model with the 12 signature genes had high sensitivity and specificity ( Figure 7C). The values for predicting the 3-and 5-year survival were lower than that for 1year, suggesting that the model was more accurate for shortterm prediction than for long-term prediction. Additionally, as shown in Figure 7D, the 12-gene prognostic model was more powerful in the prediction than other model based on pathological stages.

DISCUSSION
ccRCC is the most common type of renal carcinoma in the world. Although the treatment strategies for ccRCC have been improved during the last decades, the management of metastatic ccRCC still remain challenging due to lack of eligible molecular targets. Therefore, discovery of specific markers for early diagnosis is essential for control of ccRCC progress. DNA methylation is an important driver of many of the distinct stages of cancer, to study the molecular mechanisms underlying the pathogenesis of ccRCC may be helpful for its diagnosis and treatment. To the best of our knowledge, there is no applicable markers for differentiating clinical stages of ccRCC. In the present study, we used the clinical data from GSE113501 and GSE105260 to profile genes associated with non-metastatic and metastatic ccRCC. Through WGCNA and ROC analysis, 12 genes were finally We first screened 237 differentially methylated CpG sites from the GSE113501 dataset and 462 CpG sites from the GSE105260 dataset, which were mapped to 552 genes. We examined the methylation patterns in specific regions: TSS1500, TSS200, 5 ′ UTR, 1stExon, body, and 3 ′ UTR, of which TSS1500 and body region showed remarkable methylation changes. GO analysis showed these DMGs were annotated to transcription landscape. In fact, DNA methylation and transcription factors (TFs) binding represent two components of the regulatory architecture and their interplay would underlie the clinical outcome of cancer development (26)(27)(28)(29)(30)(31)(32)(33). By the WGCNA analysis, we screened 314 intersected DMGs between metastatic and non-metastatic ccRCC, which were associated with the biological processes and functions that underlined cancer metastasis and invasion. For example, dysregulation of cell cycle was also reported to associate with cancer progress and prognosis in the number of previous studies (33)(34)(35)(36)(37)(38). In view of MFs, these DMGs involved in protein binding, peptidyl-prolyl cis-trans isomerase activity, identical protein binding and poly (A) RNA binding, etc. For example, the etiology of early-onset CRC was linked to ploy (A) RNA binding endorsed by a set of hub genes (39) and the metastasis of breast cancer resulted from LSD1 demethylation (40).
From GSE105261 and GSE105260, we further screened 44 intersected genes between DEGs (in GSE105261) and DMGs (in GSE105260). We classified these intersected genes into two categories: negatively and positively correlated between their methylation and expression levels. The majority of intersected genes fell into the negative correlation category, including downregulated-hypermethylated genes and upregulatedhypomethylated genes. Among these intersected genes, the methylation changes in 10 genes were the most significantly associated to the phenotypes of ccRCC when searching against the TCGA-KIRC dataset. However, some of them have been found to associate with other cancers via diversified mechanisms, for example, nuclear LDHA promoted HPVinduced cervical cancer development using a mechanism of H3K79 hypermethylation (41)(42)(43)(44).
Finally, univariate and multivariate Cox regression analysis identified 43 CpG sites and 12 DMGs (CETN3, DCAF7, GPX4, HNRNPA0, NUP54, SERPINB1, STARD5, TRIM52, C4orf3, C12orf51, C17orf65, and C21orf45) that were highly correlated to patients' survival. Indeed, some genes are reported to contribute to the occurrence and progress of cancers. For example, GPX4 negatively correlated with prognosis of pancancer patients as the low methylation level at the upstream site leads to its higher expression in cancer tissues (45). On the contrary, hypermethylation of SERPINB1 promoter resulted in enhanced neutrophil elastase (NE) activity, which was associated with poor outcome in prostate cancer (46). Other genes like HNRNPA0, DCAF7, and TRIM52 functioned in diverse ways including abnormal changes in alternative splicing or activation of canonical signal pathways in relation with cancer progression (47)(48)(49)(50).
Until recently, two research groups also screened signature genes for prognosis in ccRCC patients based on TCGA database. One group constructed a prognostic model with 7 signature genes (IFI30, WNT5A, IRF9, AGER, PLAUR, TEK, and BID), which were all immunity-related genes (IAGs) (50). The other group built the model with 5 lncRNAs (51). While in present study, we screened the signature genes from the standpoint of methylation alteration. Our model with 12 DMGs has the AUC value of 0.81 for 5-year survival prediction compared to the IGA-based model (0.751) and the lncRNA-based model (0.91). Our model could distinguish metastatic and non-metastatic pathologies, which could accompany traditional pathological tissue typing and provide more precise diagnosis and prognosis of ccRCC patients. Meanwhile, it is necessary to validate our model with more clinical data and the mechanisms by these signature DMGs in ccRCC.

CONCLUSION
Taken together, we performed an integrative DNA methylation profiling and constructed a 12-gene model that signed in metastatic ccRCC. The model could be used for prognosis of patients with ccRCC and diagnosis of non-metastatic ccRCC and metastatic ccRCC. Aberrant methylation of the 12 genes would leave patients with ccRCC at higher risk of a poor prognosis.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
MC and GZ designed the study and were major contributors in editing the manuscript. SQ analyzed and interpreted the data and was major contributor in writing the manuscript. SS, LZ, ST, and KX performed analysis and contributed to writing the manuscript. All authors read and approved the final manuscript.