A Novel Promoter CpG-Based Signature for Long-Term Survival Prediction of Breast Cancer Patients

DNA methylation has been reported as one of the most critical epigenetic aberrations during the tumorigenesis and development of breast cancer (BC). This study explored a novel promoter CpG-based signature for long-term survival prediction of BC patients. We used The Cancer Genome Atlas (TCGA) data as training set, and results were validated in an independent dataset from Gene Expression Omnibus (GEO). First, the differential methylation CpG sites were screened in TCGA dataset, of which the candidate promoter CpG sites were preliminarily identified with the univariate Cox regression analysis and the least absolute shrinkage and selection operator regression analysis. Second, the signature was constructed with stepwise regression analysis and multivariate Cox proportional hazards model, which was validated with the survival analysis of two cohorts each from TCGA and GEO databases. The 10-year receiver operating characteristic curves of risk score presented an area under the curve of over 0.7 for both cohorts. A nomogram was also constructed and released. Moreover, Gene Set Enrichment Analysis was performed to identify the more active pathways in high-risk patients. The CpG sites–target gene correlations and differential methylation regions were further explored. In conclusion, the promoter CpG-based signature exhibited good prognostic prediction efficacy in the long-term overall survival of BC patients.


INTRODUCTION
Breast cancer (BC) has become one of the most concerned public health issues in the worldwide, because of the growing incidence, high mortality, and huge economic burden (1,2). More than 1 million new BC cases were diagnosed in 2002 (3). For women, BC led to <25% of the newly diagnosed cancer cases and caused 14.7% of cancer-associated deaths (4). Further, the treatment costs of BC have been generally escalated with the advance of disease stage at diagnosis (5). BC patients can greatly benefit from early diagnosis, both in therapeutic efficacy and economic burden.
With the advances of molecular diagnosis technology, the heterogeneity and complexity of BC have been revealed (6). BC can be classified into different subgroups based on histopathologic characteristics or gene expression profiles. The molecular characterization of BC would provide much information for understanding the pathogenesis of BC and exploring potential markers for early diagnosis and target therapy (7).
DNA methylation, mainly occurring at cytosine-phosphate-guanosine dinucleotide (also designated as CpG) where cytosine was converted to 5-methylcytosine (5meC), has been considered to make important effects in cancer development (8). The covalent addition of a methyl group was generally observed in cytosine within CpG dinucleotides, which were concentrated in large clusters called CpG islands (9). The abnormal methylation of promoter CpG dinucleotide sites in cancer leads to transcriptional silencing, which would be heritable in progeny cells (10). Because the DNA methylation can be varied with internal and external factors, it has become a research hotspot for investigating the tumorigenesis and cancer development (11). At the same time, DNA methylation has also been intensively explored as a target for epigenetic treatment.
Aberrant DNA methylation makes effects in BC. One study investigated the link between DNA methylation and gene expression in BC. The result identified a transcriptional network regulated by DNA methylation at enhancers, in a cell lineage-specific manner (12). Some BC-associated heritable DNA methylation markers were screened through detecting and comparing DNA methylation levels in BC patients and their family members (13). The endogenous and external factors may make effects via modulating DNA methylation patterns, such as oxidative DNA damage and age-related reproductive factors (14,15).
After verifying the association between DNA methylation and gene expression in BC, some large-scale studies have been performed to comprehensively explore the potential DNA methylation markers with clinical significance (16). In another study, blood-based DNA methylation biomarkers were summarized, which showed significance in risk stratification or the early detection of BC (17).
Infinium HumanMethylation450 Bead Chip contained a total of 485,764 CpG dinucleotide sites in the human genome. The whole genome could be divided into four regions: promoter, body, 3 ′ -UTR (3 ′ -untranslated region) and intergenic region. The promoter region was subdivided into 5 ′ -UTR (5 ′untranslated region), TSS200 (within 200 bp upstream of the SCHEME 1 | Flow diagram of the analysis procedure: data collection, processing, analysis, and validation.  (18).
In this study, we aimed to explore novel survival-associated promoter CpG dinucleotide sites for prognostic prediction in BC patients. The BC methylation 450 K datasets and corresponding clinical features were obtained from both The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases. First, the differential methylation CpG (dmCpG) sites were screened in TCGA dataset, of which the candidate promoter CpG sites were preliminarily identified with the univariate Cox regression analysis and the least absolute shrinkage and selection operator (LASSO) regression analysis. Second, the promoter CpG-based signature was constructed with stepwise regression analysis, which was validated with the survival analysis of two cohorts each from TCGA and GEO databases. A nomogram was also constructed, and the calibration curves showed that it was able to predict 5-, 7-, and 10-year survival accurately. Moreover, Gene Set Enrichment Analysis (GSEA) was performed to identify the more active pathways in high-risk patients and the CpG sites-target gene correlations and differential methylation regions (DMRs) were further explored.

Data Filter and Normalization
For the two datasets, the methylation beta matrix was filtered and normalized with ChAMP R package (20). The batch effect was removed with SVA R package. Probes in two datasets were filtered with the exclusion criteria including: (i) detection p > 0.01; (ii) bead count <3 in at least 5% of samples; (iii) non-CpG sites; (iv) all SNP-related probes and multihit probes; and (v) probes on X and Y chromosomes. The BMIQ method was applied for types I and II probe correction. In particular, probes with SNPs were identified in general 450 K SNP list (21) and filtered by champ.filter() function included in ChAMP R package.

dmCpG Sites
The dmCpG sites were preliminarily screened with ChAMP R package. In this study, CpG sites with | β| > 0.2 and Benjamini-Hochberg adjusted P < 0.05 were identified as dmCpG sites. The dmCpG sites located in the promoter regions (5 ′ -UTR, TSS200, TSS1500 and 1stExon) were further screened and visualized with the pheatmap R package.

The dmCpG Sites Identification in TCGA Dataset
TCGA dataset was applied as training cohort, of which only the patients with overall survival (OS) ranging from 90 to 3,650 days were included for further analysis. The males and patients without OS information were removed, and 715 patients were finally included. The univariate Cox regression analysis was applied for screening OS-associated dmCpG sites preliminarily with survival R package. The hazard ratio (HR) and p-value were provided. The dmCpG sites with P < 0.001 and HR < 10 were screened. A total of 78 survivalassociated dmCpG sites were obtained, corresponding to 63 genes. The metascape (www.metascape.org) was applied for Gene Ontology (GO) function enrichment analysis (22). LASSO regression analysis was further performed to explore the key dmCpG sites with glmnet R package. Finally, 25 dmCpG sites were identified.

The Construction of CpG-Based Prognostic Signature
With the identified 25 OS-associated dmCpG sites, stepwise regression analysis was applied to optimize the model. Variables with P < 0.01 in the stepwise analysis were included in a multivariate Cox proportional hazards model. Under the optimal situations, the dmCpG sites and corresponding coefficients were presented, and the formula for calculating the prognostic index (designated as risk score) based on methylation levels of dmCpG sites was obtained.

The Relationship of Risk Score With Clinical Characteristics
A total of 565 patients with complete clinical information in TCGA dataset were included. The ggpubr R package and t test were involved to explore the relationships between the CpG-based risk score and clinical characteristics, including age and TNM (tumor, node, metastasis). P < 0.05 indicates statistically significant. The results were provided with boxplots.

Independent Prognostic Prediction Analysis
Univariate and multivariate independent prognostic analyses were applied with survival R package. The included clinical variables were age, TNM (tumor, node, metastasis), and risk score. The 565 patients with complete clinical information in the TCGA dataset were included. The HR was calculated and expressed with forest plot. The 5-, 7-, and 10-year receiver operating characteristic (ROC) curves of risk score, age, and TNM (tumor, node, metastasis) were plotted with survival ROC R package. Area under the curve (AUC) of the ROC curve was also provided.

The Validation of dmCpG-Based Signature
The prognostic prediction efficacy of dmCpG-based signature was validated in both TCGA and GEO datasets. The survival analysis was performed on 715 patients in TCGA dataset and 268 patients in GEO dataset, respectively, with survival R package. The risk score was calculated based on the methylation level of included dmCpG sites and corresponding coefficients. The patients could be classified as high-and lowrisk groups, with the median risk score in TCGA dataset as cutoff value. The survival analyses of patients in high-and low-risk groups were performed with survival R package. The 5-, 7-, and 10-year ROC curves were plotted, and AUC of the ROC curve was calculated. Further, the nomogram was constructed for the patients in TCGA dataset with rms R package, and the calibration curves were also plotted with calibrate function.

Analysis of Included dmCpG Sites
The dmCpG sites were further investigated. First, the correlation between dmCpG sites and located gene expression was explored.

The dmCpG Sites Identification
The study flowchart describing the process is shown in Scheme 1.
The datasets from TCGA and GEO databases were preprocessed for further comparison, including filter, normalization, and batch correction. Then the dmCpG sites were screened in TCGA dataset with cutoff of | β| > 0.2 and adjusted P < 0.05. The proportions of dmCpG sites were screened (Figure 1A), and a total of 10,088 dmCpG sites located in the promoter regions (5 ′ -UTR, TSS200, TSS1500, and 1stExon) were finally identified ( Figure 1B). Second, the univariate Cox regression analysis was performed to identify the candidate OS-associated dmCpG sites in TCGA cohort. A total of 78 survival-associated dmCpG sites with P < 0.001 were identified, corresponding to 63 genes ( Table 1). The GO function enrichment of the 63 genes indicated several GO terms, including anterior/posterior pattern specification, norepinephrine transport, intracellular receptor signaling pathway, gland development, regulation of autophagy, etc. (Figure 1C). LASSO regression analysis was further performed to explore the key dmCpG sites. Finally, 25 sites were identified (Figure 2A).

The Construction of dmCpG-Based Prognostic Signature
With the identified 25 survival-associated dmCpG sites, the stepwise regression analysis was further applied to screen the candidate dmCpG sites for constructing the prognostic signature ( Figure 2B). Ten candidate dmCpGs sites were screened in the stepwise analysis (Supplementary Figure 1). Further, four variables with P < 0.01 in the stepwise analysis were included in a multivariate Cox proportional hazards model. These four prognostic dmCpG sites were finally included in the model for calculating the risk score. In the optimal model, the AIC was 901.8, and concordance index was 0.77. The HR and coefficient of each dmCpG was provided (Figure 2C). The risk score was calculated with the following formula: risk score = cg00822495 × 0.9960 + cg03225817 × 2.3621 -cg06884401 × 2.2685 -cg09869811 × 2.1771.

The Relationship of Risk Score With Clinical Characteristics
A total of 565 patients with complete clinical information in TCGA dataset were included. The risk score was significantly higher in patients with older than 65 years ( Figure 3A). Further, the risk score was significantly higher for patients at T4 compared to that of other T stages ( Figure 3B). No differences were found in patients with different lymph node statuses and distant metastases (Figures 3C-D).

Independent Prognostic Prediction Analysis
Univariate and multivariate survival analyses were performed to evaluate whether the risk score was an independent prognostic index irrespective of other clinical features. The clinical information of 565 patients in TCGA dataset was included. The univariate and multivariate analyses indicated that the age, TNM (tumor, node, metastasis), and risk score were an independent prognostic index (P < 0.05 for all, Figures 4A,B). The AUC of 5-, 7-, and 10-year ROC curves indicated that the risk score provided a higher value of AUC compared to that of other clinical features (Figures 4C-E). The risk score was verified as independent prognostic predictor. Further, the integration of risk score and clinical features provided similar AUC of 5-, 7-, and 10-year ROC curves compared to that of individual risk score, indicating comparable prognostic prediction efficacy.

Prognostic Prediction Analysis
The prognostic prediction analysis was performed in TCGA and GEO cohorts, respectively. For patients in the TCGA cohort, risk score was calculated according to the normalized methylation levels of four dmCpG sites. The median risk score of 0.8775 was applied as the cutoff for dividing patients into high-and low-risk groups (Figure 5A). The distribution of survival status of all patients was also presented ( Figure 5B). The variation tendency of the methylation level of dmCpG sites in heatmap was consistent with their coefficients in prognostic signature ( Figure 5C). Kaplan-Meier survival analysis of the risk score indicated the survival probability of patients in highand low-risk groups (P < 0.001, Figure 5D). Further, 5-, 7-, and 10-year ROC curves of risk score were plotted, with the AUCs of 0.723, 0.789, and 0.707, respectively ( Figure 5E). The AUC value indicated good prognostic prediction efficacy. With the risk score of 0.8775 as cutoff, the prognostic prediction ability of risk score was also validated in the patients in the GEO cohort; similar results can be obtained (Figure 6). The 5-, 7-, and 10-year ROC curves of risk score were plotted, with the AUCs of 0.684, 0.622, and 0.711, respectively ( Figure 6E).
The nomogram was plotted for the TCGA cohort to calculate the risk score and predict 5-, 7-, and 10-year OS of BC patients (Figure 7A). The calibration curves were also provided (Figures 7B-D). The observed model presented with black solid line seemed close to ideal prediction model presented with blue dot and light gray line. The dynamic nomogram was released online at the following website: https://cpgsignature-survival-breastcancer.shinyapps.io/ cpgsignature-survival-breastcancer/.

Analysis of Included CpG Sites
The identified dmCpG sites were analyzed for further exploring their function mechanism. First, the characteristics of involved dmCpG sites were extracted, including located gene and regions ( Table 2). Second, RNA-Seq data were obtained in 711 samples from TCGA cohort. The correlations between the four dmCpG sites in the signature and their target gene expression were analyzed (Figure 8A). Positive association was observed in cg00822495-OTX2 (r = 0.17, P = 0.000), whereas negative association was observed between cg06884401-FAM13A (r = −0.28, P = 0.000) and cg03225817-GRIA4 (r = −0.12, P = 0.000). No significant correlation was observed in cg09869811-NUTF2 (r = −0.03, P = 0.49) ( Figure 8A). Third, GSEA was performed to identify the more active pathways in low-risk patients. Thirteen pathways were more obviously enriched, which were antigen processing and presentation, apoptosis, B-cell receptor signaling, cell adhesion molecules cams, chemokine signaling pathway, cytokine-cytokine receptor interaction, FC gamma r-mediated phagocytosis, hematopoietic cell lineage, JAK-STAT signaling pathway, leukocyte transendothelial migration, natural killer cell-mediated cytotoxicity, T-cell receptor signaling pathway, and toll-like receptor signaling pathway ( Figure 8B). Finally, DMRs were explored to further understand the function mechanism of methylation. A total of 1,555 screened DMRs were annotated, and the KEGG pathway enrichment analysis was performed. The 10 most enriched pathways were provided   Figure 8C). In these pathways, PI3K-Akt signaling pathway showed most involved DMRs and frequent interactions with other terms.

DISCUSSION
The epigenetic aberrations have been observed during the tumorigenesis and development of BC (25,26). The DNA methylation located at gene promoter would generally suppress gene transcription, thus down-regulating the expression level of target genes. One study investigated DNA methylation with methylation quantitative trait loci of 30,477 CpG sites in 122,977 BC patients and 105,974 controls of European descent (16). They screened 199 CpG sites with significant association with BC risk. Based on the methylation detection tools and computational analysis, the relationship between DNA methylation and gene expression can be well-demonstrated for identifying new biomarkers for BC (16). Considering the benefit of early and non-invasive diagnosis, blood sample-based DNA methylation was developed. One comprehensive study identified DNA methylation markers in blood for diagnosis or risk evaluation of BC. Variations in DNA methylation profiles, both at overall genomic level and specific loci, have been associated with BC risk (17). Another large scale meta-analysis identified genes consistently associated with prognosis, and their DNA methylation could indicate prognosis and clinical stratification of BC patients (27). Some experimental evidence explored and validated the candidate DNA methylation sites. A fast and accurate methylation marker-based automated cartridge system was developed to detect BC in cells obtained by fine-needle aspiration. The panel consisted of 10 highly methylated markers, presenting an AUC over 0.9 in discriminating BC and benign breast lesions (28). Methylation signatures involving 100 or 30 CpG probes were validated to discriminate patients with or without recurrence (29). Accumulating evidences proved the correlation between global DNA methylation and BC risk, which could be modified by reproductive characteristics (15), oxidative DNA damage (14), and chemotherapeutic agents (30).
In recent years, several prognostic prediction indexes have been developed, based on the expression levels of mRNA, lncRNAs, miRNAs, and so on. One study reported a seven-gene signature for prognostic prediction and treatment guidance in triple-negative BC (TNBC), in which recurrence risk score can be calculated as follows: mRNA signature = 1.108 × TMEM101 -0.213 × KRT5 -0.315 × ACAN -0.464 × LCA5 + 0.446 × RPP40 -0.373 × LAGE3 -0.257 × CDKL2 (31). More studies reported that the integrated lncRNA-mRNA signature provided better prognostic prediction efficacy. The developed response score involved 1 lncRNA and 2 coding genes: response score = 2.595 × BPESC1 -1.09 × WDR72 -1.428 × GADD45A -0.731 (32). Another integrated mRNA-lncRNA signature was developed based on the mRNA species for FCGR1A, RSAD2, CHRDL1, and the lncRNA species for HIF1A-AS2 and AK124454, which may be applied to predict tumor recurrence and the benefit of taxane chemotherapy in TNBC (33). A study was performed to construct mRNA-only signature and integrated mRNA-lncRNA signatures. Both signatures provided similar results for prognostic prediction, while integrated signature had a higher hazard of recurrence (34). Another systematic analysis of lncRNA-miRNA-mRNA competing endogenous RNA network identified four-lncRNA signature as a prognostic biomarker for BC, which were ADAMTS9-AS1, LINC00536, AL391421.1, and LINC00491 (35).
DNA methylation was investigated as novel epigenetic biomarkers for prognostic prediction in BC, because they can increase cancer risk through altering gene expression (36). A mortality risk score based on 10 selected CpGs exhibited strong association with all-cause mortality, cardiovascular diseases, and cancer mortality in BC (37). In our study, based on the in silico analysis, OS-associated promoter CpG sites signature was constructed for prognostic prediction of BC, of which four dmCpG sites were identified, including cg00822495 (OTX2), cg03225817 (GRIA4), cg06884401 (FAM13A), and cg09869811 (NUTF2). In this signature, the higher methylation level of cg00822495 (OTX2) and cg03225817 (GRIA4) indicated higher risk of poor survival, while the higher methylation level of cg06884401 (FAM13A) and cg09869811 (NUTF2) indicated lower risk. Some of the involved target genes were previously reported. OTX2 has been considered as a significantly hypermethylated gene in BC (38). GRIA4 was a methylation-dependent gene involved in neural signaling (39). It was known as an early event in colorectal cancer carcinogenesis (40). One study has reported that rs1059122 (FAM13A) might result in BC susceptibility in Chinese population (41). FAM13A was also reported as a hypoxiainduced gene in non-small cell lung cancer, which was overexpressed under hypoxia conditions (42). NUTF2 was a novel methylation site for BC, which has not been previously reported.
Although the CpG-based signature explored in our study has a better performance than age and TNM stage, there were several limitations to be considered. First, it was only an in silico and retrospective study of publicly available data. The validation of the prediction was performed in only one independent cohort. Adequate validation in a larger population-based prospective cohort should be further performed to strengthen the clinical utility. Second, the integration of risk score with other previously reported markers such as age and TNM might enable the development of more reliable biomarkers. Further study would be necessary. Third, the biological functions of some related target genes should be explored and verified.
In recent years, with the rapid development of genomedetecting technology, we are entering an era of precise treatment. A lot of biomarkers based on gene expression profiles were identified, but very few were learned from CpG dinucleotide sites. The four-CpG signature and nomogram explored in our study could guide clinicians to predict long-term survival of BC patients, accurately identify high-risk patients, and take early intervention in the treatment. It is a fact that the detection of CpG sites is more complex and expensive than gene expression detection now, but hundreds of thousands of CpG sites contain promising diagnostic and prognostic value, which would be explored with the development of detection technology in the future.

CONCLUSIONS
This study explored a novel promoter CpG-based signature that exhibited good prognostic prediction efficacy in the long-term OS of BC patients.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
YG analyzed the data and wrote the manuscript. XM and ZQ helped to collect relevant papers for this research. BC analyzed the data and generated the figures and tables. FJ guided the research process. All authors read and approved the final manuscript.

FUNDING
The project was supported by Natural Science Fund of The First Affiliated Hospital of China Medical University.