Hidden among the crowd: differential DNA methylation-expression correlations in cancer occur at important oncogenic pathways

DNA methylation is a frequent epigenetic mechanism that participates in transcriptional repression. Variations in DNA methylation with respect to gene expression are constant, and, for unknown reasons, some genes with highly methylated promoters are sometimes overexpressed. In this study we have analyzed the expression and methylation patterns of thousands of genes in five groups of cancer and normal tissue samples in order to determine local and genome-wide differences. We observed significant changes in global methylation-expression correlation in all the neoplasms, which suggests that differential correlation events are frequent in cancer. A focused analysis in the breast cancer cohort identified 1,662 genes whose correlation varies significantly between normal and cancerous breast, but whose DNA methylation and gene expression patterns do not change substantially. These genes were enriched in cancer-related pathways and repressive chromatin features across various model cell lines, such as PRC2 binding and H3K27me3 marks. Substantial changes in methylation-expression correlation indicate that these genes are subject to epigenetic remodelling, where the differential activity of other factors break the expected relationship between both variables. Our findings suggest a complex regulatory landscape where a redistribution of local and large-scale chromatin repressive domains at differentially correlated genes creates epigenetic hotspots that modulate cancer-specific gene expression.


Introduction
DNA methylation is a repressive epigenetic phenomenon that occurs at CpG dinucleotides (Saxonov et al., 2006). These dinucleotides frequently form CpG-rich clusters known as CpG Islands, which are located near the promoter of 72% of the genes (Saxonov et al., 2006). The methylation status of the promoter is widely accepted to exert a repressive effect on proximal gene expression (Klipp, 2009;Han et al., 2011;Vanderkraats et al., 2013;Szulwach and Jin, 2014). However, the effect of DNA methylation on gene expression varies with respect to its location and the genotype of the individual (Bell et al., 2011;Lou et al., 2014;Wagner et al., 2014). For example, CpGs inside gene bodies are more positively correlated with gene expression than promoter-associated CpGs (Lou et al., 2014), and CpGs in the vicinity of CpG islands (known as CpG island shores) are enriched in tissue and cancer-specific differentially methylated regions (cDMRs) when compared with CpG islands (Doi et al., 2009).
The effect of DNA methylation on gene expression is an active field of research and various mechanisms have been described. Notably, DNA methylation is coupled to histone modifications through methyl-binding and methyl-insensitive factors (Szulwach and Jin, 2014). Histone deacetylase repressor complexes are targeted to methylated regions, whilst the MLL family of histone methyltransferases and the histone demethylase JHDM1a preferentially bind to demethylated DNA (Szulwach and Jin, 2014). Extensive research on epigenetics has been devoted to disentangle the role that DNA methylation plays in cancer (Ghavifekr et al., 2014). This has led to terms such as "hypermethylated" and "hypomethylated" referring to genes whose methylation changes significantly between tumor and normal tissues. However, there is evidence indicating that genes with heavily methylated promoters can be intensely expressed (Guillaumet-Adkins et al., 2014), whilst partially methylated genomic regions and long-range hypomethylated domains contain silenced genes in various cancer types (Berman et al., 2011;Hansen et al., 2011;Hon et al., 2012). DNA methylation stability is known to be lost in cancer (Hansen et al., 2011), but we found no studies addressing the potential differences and functional FIGURE 1 | Schematic representation of the reasoning behind this study. (A) DNA methylation is an epigenetic modification known to modulate gene expression. (B) In this study, we decided to identify significant differences between tumor and normal tissues in their gene expression correlation with DNA methylation; as well as the epigenetic factors that can explain this behavior.
implications of differential gene-expression correlation with DNA methylation between tumor and normal tissues.
In this research, we studied the global and gene-by-gene patterns of differential gene expression-methylation in a set of publically available tumor-normal datasets (Figure 1). This revealed interesting differential patterns of correlation in cancer. In order to further explore the importance of these differences, we calculated gene-level statistics in normal and cancerous breast, which detected thousands of breast Differentially Correlated Genes (hereafter known as DCGs). Functional annotation and comparison of this list of genes with pathways, ontology and transcription factor (TF) binding regions reported by the Encode Project Consortium (ENCODE Project Consortium, 2012) provided evidence for consistent enrichments in cellular processes and chromatin modifications. The repressive effect of DNA methylation on transcription is a well described phenomenon (Saxonov et al., 2006), and we try to address the nature and implications of these findings on our understanding of epigenetics and cancer biology.

Methods
A pipeline of the methodology used in this section can be consulted in Figure 2.

Data Source and Pre-processing
Five tumor-normal datasets from the Gene Expression Ommnibus (GEO) (Barrett et al., 2013) with gene expression and methylation data available were downloaded ( Table 1) (Kresse et al., 2012;Selamat et al., 2012;Fertig et al., 2013;Kim et al., 2013;Terunuma et al., 2014). Although gene expression arrays from GEO submissions are already normalized, we created box plots to confirm it. Methylation arrays were either Illumina 27K (Bork et al., 2010) or Illumina 450K (Sandoval et al., 2011). We used R for our computations (R Core Team, 2012). Beta values were converted to M-values before normalization. Datasets were quantile normalized with the function normalize.quantiles implemented in the package preprocessCore 1 . 1 Bolstad, M. B. preprocessCore: A Collection of Pre-processing Functions. R package version 1.18.10. In order to obtain a comparable list of correlations, we averaged the CpG intensity methylation for each gene within each experiment and separately for cancer and normal tissue samples. We compared the methylation-expression of those genes represented in all experiments. In the case of gene expression datasets, probes matching to the same genes were averaged so that only one vector of expression intensities was kept per gene. Only those genes that had expression and methylation measures in all the datasets were kept. Differential expression and methylation analysis were performed with Significance Analysis for Microarrays (SAM) (Tusher et al., 2001). SAM was run with 10,000 permutations and significance was selected by tuning the value to approximately a 5% False Discovery Rate (FDR). Methylation Coefficients of Variation (a.k.a. CVs) were calculated with a modified version of the classic formula that uses interquartile ranges (a.k.a. IQR) and the median instead of the standard deviation and the media.

Gene Expression-Methylation Correlation Analyses
We calculated Pearson's and Spearman's correlations between methylation and gene expression intensity separately in each tumor-normal dataset and in both cases and controls. The Wilcoxon Signed Rank Test was used to search significant differences in correlation values between tumor and normal samples at the genome-wide level. Signed Pearson's correlation value is known not to directly bear on the size of the correlation coefficients (Goodwin and Leech, 2006), but in order to analyze differences in absolute correlations we managed potential indirect biases introduced by differences in the sample-size by randomly shuffling the datasets in the biggest group (cancer group) to the length of the smaller group (normal group). In each case we calculated the correlations genome-wide and performed the Wilcoxon Signed Rank Test 1000 times. P-values for differential absolute correlation genome-wide were calculated as the mean of the 1000 P-values obtained by random shuffling, and the mean estimates were calculated similarly. Otherwise, when no sample difference existed (e.g., in the Lung Adenocarcinoma cohort) we calculated differences in absolute correlations as though they were signed correlations. Although about 300 genes from the X chromosome were included, we observed that excluding them from the analysis did not change the results at all. In order to explore the biological plausibility of the analysis, we also tested the significance in correlation differences at the

Determination of Differentially Correlated Genes
DCGs are those genes whose expression and DNA methylation patterns significantly differ in cancer when compared with normal tissues. We calculated DCGs in two datasets (namely lung adenocarcinoma and HNSCC) based on the expression-average methylation correlation values for each gene. Fisher's transformation was applied to Pearson's and Spearman's correlation values and the P-values were calculated accordingly (Mudholkar, 2006;Myers and Sirois, 2006). Since the breast cancer methylation was analyzed with the Illumina 450K platform, this group included a greater number of inspected CpGs per gene. In this case, we calculated the combined Stouffer's P-value for all CpGs inside each gene in order to improve our use of methylation information (see Section Chromatin and Transcription Factor Binding Enrichment). This method discovered a much larger number of DCGs. Gene Ontology (Ashburner et al., 2000), Pathways Commons (Cerami et al., 2011), and Cytoband enrichment analysis were performed with the WEB GESTALT tool (Wang et al., 2013). Transcription Factor Binding Sites (TFBS) and InterPro Protein Domain (Jones et al., 2014) enrichment analysis were performed in the GenCodis web browser (Tabas-Madrid et al., 2012). We used our own custom background, representing all the genes that were studied. Pathways analysis of the KEGG (Kanehisa, 2013) database was performed for significant miRNAs with the DIANA-miRPath tool (Vlachos et al., 2012). We used the "gene union" option, which calculates the union of targeted genes by the chosen miRNAs. This list was then used as input for pathways analysis.

Chromatin and Transcription Factor Binding Enrichment
We wished to determine whether DCGs overlapped with any known chromatin feature. With this purpose we first annotated the genomic coordinates of each gene using the biomaRt package (Durinck et al., 2009). Then, we collected epigenetic data from H1 human Embryonic Stem Cells (H1hESC), B-lymphocytes (Gm12787), hepatocellular carcinoma (HepG2), and human mammary epithelial cells (HMEC) cell lines and uploaded it to The Genomic Hyperbrowser (Sandve et al., 2010). These tracks were composed of the Chip-seq peak calls released following (1) the Encode Uniform Processing Pipeline, (2) the Chromatin State Segmentation based on Hidden Markov Models (HMM), (3) the Uniform DNAse 1 Hypersensitivity tracks, and (4) the histone modification tracks released by the Encode Analysis Working Group (AWG) and the Broad/MGH Encode Group, respectively. Those genomic segments whose expression-methylation changed significantly (at maximum FDR of 5%) between tumor and normal samples were labeled as case, and those whose correlation did not change significantly were chosen as controls. We applied the "Preferential Overlap" function of The Genomic Hyperbrowser, with (1) a minimal number of Monte Carlo samples of 1000, (2) a global FDR value of 0.001, (3) an alternative hypothesis of an overlap greater than expected, and (4) a null model that preserves segments of both tracks and permutes case and control assignment sequences on chip-seq tracks. Duplicate and triplicate tracks (e.g., tracks measuring the genomic binding occupancy of the same transcription factor) were handled by combining their P-values using the Stouffer's Z-score method. This test is a meta-analytic method closely related to the Fisher's combined probability test used to combine the result of several independent tests bearing the sample null hypothesis. Stouffer's test is sensitive to consistent departures from the null hypothesis, while Fisher's method is more sensitive to occasional departures from the null hypothesis (Abelson, 1995). Finally, P-value adjustment for multiple testing was performed on four separate batches of tracks according to their origin: Encode's Uniform Pipeline TF Chip tracks, Encode's Chromatin State Segmentation tracks, AWG DNAse 1 Hypersensibility tracks and Broad/MGH Encode Group histone modification tracks. P-values were corrected with the FDR method.

Generalized Additive Models Regression
In order to determine the relationship between expressionmethylation Spearman's correlations and the distance to the closest TSS, we studied such relationships in the breast cancer cohort. Methylation in this study was performed with Illumina 450k arrays, which are suitable for genome-wide analysis. Briefly, we fitted generalized additive models (GAM) (Wood, 2011) with correlation as the dependent variable. GAM models allow for non-parametric or semi-parametric fits between the variables of interest, thus leading to better local fits. This is especially important when variables are known to have non-linear and changing relationships between them. We applied Fisher's transformation to the correlations, so that they follow approximate normal distributions. Smooth terms were calculated with the smoothing "s" option of the "gam" function in the mgcv R package (Wood, 2011). The thin plate regression spline "tp" was used, which is a low rank isotropic smoother of any number of covariates. Smoothing parameter estimation of the independent variable (absolute TSS distance) was calculated with the Generalized Cross Validation (GCV) method. Penalized regression models gain computational efficiency by choosing a relatively small basis, known as k. By default, we set this value to 20. Although variations in the basis have a small impact on the model, we ensured that k-values were not so small to cause over-smoothing by using the "gam.check" function. P-values were computed by randomly re-shuffling 20,000 times in order to calculate the null distribution of the differencing variance estimator. Low P-values may indicate that the basis dimension is too low. We confirmed that all models had approximately normal residuals and that the values of the estimates divided by the residual variance (a.k.a k-index) were close to 1. Plots were created with the ggtools function implemented in the ggplot2 R package (Wickham, 2009).

Literature Search
We used PubMed (McEntyre and Lipman, 2001) to search for most of the bibliography cited in this paper.

Gene Expression Correlation with Average Methylation Intensity
We evaluated the relationship between DNA methylation and expression across more than 6200 genes in all the tumor and normal tissue samples (Supplementary Table 1). We observed significant differences of absolute Pearson's correlation in all the five study groups (10% FDR; Table 2 and Figure 3). Spearman's correlation reduced the level of significant datasets to 4 of 5 (10% FDR), since in this case the breast cancer dataset was not significant (Q-value = 0.214; Table 2). Notably, most of these absolute values were higher in normal tissues than in cancer, except for the Lung Adenocarcinoma dataset. Similarly, the signed correlation values revealed significant Pearson's and Spearman's differences between tumor and normal samples in all study cohorts (FDR <0.1%; Table 2 and Figure 3). Four of these groups had a more positive correlation in normal samples, and in the Bladder Cancer cohort the contrary holds. These results suggest that common correlation differences exist in cancer tissues.

Differentially Correlated Genes in Lung Adenocarcinoma and HNSCC
By applying Fisher's transformation to Spearman's correlations in the lung adenocarcinoma and HNSCC cohorts we discovered genes whose average methylation-gene expression correlation varies significantly between tumor and normal samples (a.k.a. DCGs). In the case of lung adenocarcinoma, we found 43 DCGs at a 10% FDR (Supplementary Table 1), and 16 at a 5% FDR.

Focused Analysis on Breast Cancer
In order to further determine the potential impact that DCGs have in cancer, we decided to investigate the distribution of expression-methylation Spearman's correlations in the breast cancer dataset, since this dataset is based on a high-density array of CpG methylation probes. For this, we used Fisher's transformed Spearman's correlation values, which are distributed normally and allow calculating P-values for differences in correlations. The Stouffer test was used to combine all Pvalues into a single, gene-level P-value. The presence of significant autocorrelation between CpG correlations can increase the Type I error of the Stouffer test. Using the Mantel test, we observed no significant autocorrelation within 200 and 1000 base pairs from the closest TSS (10% FDR). Similarly, no significant autocorrelation was observed between CpGs falling further than 1000 base pairs upstream and downstream from the closest TSS (10% FDR). Thus, we can reasonably assume that the Type I error is not artificially inflated in this study.

DNA Methylation Correlation with Gene Expression as a Function of Transcription Start Site distance
We created a GAM model to inspect the relationships between CpG methylation intensity and proximal gene expression as a function of the distance to the closest TSS. The model showed that CpGs in the immediate vicinity of the closest TSS showed negative correlations with gene expression in both cancer and normal samples, but this pattern was the inverse in more distant points ( Figure 4A; Supplementary Figures 2, 3).
For example, we found that approximately from 5000 bp to the closest TSS until 400,000 bp the model was consistent with more positive than negative correlations in both cancer and normal tissues, reaching points of high statistical significance. Very curiously, a zoom into the first 5000 bp from the TSS revealed that the two models are similar but the cancer model was slightly more negative during the first 2000 bp ( Figure 4B; Supplementary Figures 4, 5). Although the difference was small, the 95% confidence intervals of both normal and cancer samples do not overlap in this region, and the T-test confirmed a significantly more positive correlation in normal samples (Pvalue = 5.99 × 10 −69 , 95% CI [0.017, 0.021]), whilst outside this 5000 bp region the correlations did not show such a strong effect (P-value = 0.021, 95% CI [7.00 × 10 −4 , 9.00 × 10 −4 ]).
We defined the correlation ratio as the ratio of correlation Z scores (Normal Z score/Cancer Z score). We observed that within the first 5000 base pairs the ratio is significantly more negative than outside this region (Wilcoxon Rank Sum Test P-value = 2.06 × 10 −18 , 95% CI [−0.204, −0.129], mean value within 5000 bp = 0.166, mean outside 5000 bp = 0.39), indicating that correlation values are more divergent and that more correlation sign shifts exist within the 5000 bp than outside (42.00% vs. 39.88%).

Determination of Expression-Methylation Differentially Correlated Genes in Breast Cancer
In order to characterize those genes with changing expressionmethylation correlations, we combined one-tailed Fisher's Pvalues at the gene level and identified groups of genes whose correlation patterns varied between cancer and normal cases significantly. We observed 1662 DCGs at a 5% FDR (Supplementary Table 2). Notably, no correlation was observed between the P-values and the number of CpGs covering each gene (Spearman's correlation = 0.004 and −0.004, P-value = 0.633 for the lists of genes more positively and more negatively correlated in cancer vs. normal, respectively). However, a significant enrichment of DCGs vs. background genes in GC content was observed (P-value = 2.91 × 10 −19 , Wilcox), which is an inherent bias due to the Stouffer test and the characteristics of the methylation arrays. The top DCGs can be consulted in Tables 3, 4.

DCGs Correlation Distribution
We considered it interesting to analyze the patterns of expression-methylation correlation of the DCGs as a function of their distance to the closest TSS in normal and neoplastic breast tissues. GAM regression of the DCGs more positively  correlated in cancer provided evidence for a general difference, which is more abrupt in the TSS region ( Figure 4D,  Supplementary Figures 6, 7). A similar but opposite trend can be observed in the model of the genes more negatively correlated in cancer ( Figure 4C, Supplementary Figures 8, 9). Thus, our pipeline efficiently identifies those genes whose correlation patterns vary constantly at the gene level.

DCGs Methylation and Expression Patterns
In order to inspect possible differences in methylation and expression between cancer and normal samples, we calculated the mean and standard deviation for each CpG, with a special focus on CpGs corresponding to DCGs. The mean methylation intensity was significantly lower in normal samples (P-value < 2. To determine possible promoter-dependent differences in methylation, we split the data into promoter-associated CpGs (at less than 1 Kb from the TSS) and in non-promoter associated CpGs. Average methylation of CpGs was found to be significantly higher in promoter-associated CpGs of DCGs with respect to their counterparts in the rest of the genome in both normal and cancer data (Wilcox P-values < 2.2×10 −16 , 95% CI [0.404, 0.453] and [0.462, 0.515], respectively), whilst average methylation was lower in non-promoter CpGs compared with their genomic homologs in both cases (Wilcox P-values < 2.2 × 10 −16 ; 95% CI [−0.158, −0.099] and [−0.154, −0.097], respectively). Interestingly, promoter associated and non-promoter associated CpGs were significantly hypomethylated in normal compared to cancer breast (P-value < 2.2×10 −16 ; Wilcox). Using SAM we identified a general trend of DCGs promoter hypermethylation in cancer and 138 genes whose average promoter methylation was significantly higher in cancer at a 5% FDR (Supplementary Figure 10A). Similarly, average methylation at non promoter-associated regions showed a trend toward hypermethylation in 112 DCGs (5% FDR), but no significant differences were associated with the remaining 93% of the DCGs (Supplementary Figure 10B). Nevertheless, using the Wilcoxon Signed Rank Test we could not identify a single gene whose average methylation at its promoter or outside it was statistically significant at a 10% FDR. Similarly, we found no CpGs that did so individually, although 61% of the location parameter estimates were preferentially biased toward a higher methylation in the cancer group. Finally, promoter associated and non-promoter associated CpGs were more variable in DCGs than in the rest of the genome both in normal and in cancer (P-values < 2.2 × 10 −16 ; Wilcox).
Average gene expression was significantly higher in cancer genome-wide and in DCGs (Wilcox P-values = 1.75 × 10 −53 and 3.445 × 10 −5 , respectively). However, among all the DCGs no significant difference in expression could be observed at a 5% FDR using SAM; and only 50 genes ranked significant at a 10% FDR, with a clear trend of these to be more expressed in normal breast (Supplementary Figure 10C). The average expression of DCGs was significantly higher compared to the rest of the analyzed genome in normal and cancer tissues (Wilcox P-value = 2.39 × 10 −7 and 1.21 × 10 −6 , respectively; 95% CI [0.131, 0.291] and [0.116, 0.276], respectively). Nevertheless, the expression of DCGs is not significantly more variable than the rest of the genome neither among normal samples (Wilcox P-value = 0.173) nor among cancer tissues (Wilcox P-value = 0.106).

Separate Analysis of DCGs Methylation and Expression Patterns by Correlation Difference Sign
Since the 1662 DCGs contain genes that vary in different directions we analyzed those genes whose correlation was more negative in cancer (1001 in total) separately from the rest (661 genes). By comparing the average methylation for each CpG, we observed that both groups were hypermethylated with respect to the rest of the genome in normal and cancer tissues (P-values < 2.2 × 10 −16 ; Wilcox). Although CpG average methylation was greater in cancer in both groups of DCGs (P-values < 2.2×10 −16 ; Wilcox), this measure was more significant in those genes with more positive correlations in cancer than in those with more negative correlations in cancer (Wilcox P-values = 3.85 × 10 −10 and 2.30 × 10 −7 in cancer and normal samples, respectively). Average expression was greater among genes with significantly higher correlation values in cancer than in the rest of the genome (Wilcox P-values = 0.0025 and 0.0027 for normal and cancer, respectively), but no difference was observed for those genes with lower correlation in cancer (Wilcox P-values = 0.204 and 0.263 for normal and cancer groups, respectively). However, no significant difference in expression was detected for both groups of DCGs between cancer and normal breast tissue (Wilcox P-values = 0.857 and 0.912 for genes with higher and lower correlations in cancer, respectively).

Enrichment of DCGs in Partially Methylated Domains, Epigenetic Islands, Lamina-Associated Domains, Imprinted Genes, HOTAIR Target Genes and Somatic Copy Number Aberrations
Since hypermethylation of promoters and hypomethylation of non-promoter associated regions is a common feature of Partially Methylated Domains (PMDs Hon et al., 2012), we tested whether the overlap between DCGs with previously defined PMDs and heavily methylated domains (HMDs) (Schroeder et al., 2011) was higher than the overlap of background genes. Note that these PMDs and HMDs were common to placenta, neurons and fibroblasts. We found strong evidence supporting an enrichment in PMDs (P-value = 2.04×10 −6 , hypergeometric) but no significant overlap was observed with HMDs (P-value = 1, hypergeometric). The average expression of those DCGs overlapping with known PMDs was nearly significantly lower than the rest of the DCGs in normal breast (P-value = 0.0527, Wilcox), but not in breast cancer (P-value = 0.1904, Wilcox). DCGs overlapping known PMDs are hypermethylated compared with non-overlapping DCGs in normal and cancerous breast (P-values < 2.2 × 10 −16 ; Wilcox). Indeed, PMD-overlapping DCGs tend to be hypermethylated in cancer compared with normal breast (P-value = 5.46 × 10 −150 ; Wilcox). Although non-PMD DCGs are also hypermethylated in cancer, PMD-overlapping DCGs are more strongly methylated than non-PMD DCGs (PMD DCGs 95% CI Similarly, we obtained a list of 93 imprinted human genes from the geneimprint database (Jirtle, 1997), of which 53 were among the list of genes in this study. We discovered a marked enrichment of DCGs in imprinted genes (P-value = 1.06 × 10 −9 ; hypergeometric). Feinberg et al. (Wen et al., 2012) described the existence of euchromatin islands (EIs) within heterochromatin domains, and since these were enriched in CTCF binding and DNAse I hypersensitivity regions, we decided to test for significant overlaps between our DCGs coordinates, their background regions and all the EI coordinates found in 5 different cell lines. Once again, we observed a mildly significant enrichment among DCGs (P-value = 0.017). DCGs are enriched in Homeobox genes and HOTAIR is a non-coding RNA derived from a HOX gene and involved in the aberrant gene targeting of PRC2 during cancer progression (Gupta et al., 2010). By comparing background genes and DCGs with a list of genes bound to PRC2 after HOTAIR induction in breast cancer cells (Gupta et al., 2010), we observed a significant enrichment of these genes on the DCG category (P-value = 6.35 × 10 −5 ; hypergeometric).
Finally, DCGs are also enriched in CpG islands and GC content compared with their background genes (P-value = 9.99 × 10 −5 and 2.2 × 10 −16 , Wilcox). Due to the fact that DNA G Quadruplexes (a.k.a. G4) are known to influence DNA methylation (Halder et al., 2010) and gene expression (Agarwal et al., 2014), we analyzed the overlap between DNA G4 motif sequences reported by Balasubramanian et al. (Huppert and Balasubramanian, 2007) and DCGs, observing a very significant enrichment (P-value < 9.99 × 10 −4 ). Nevertheless, this can be a statistical artifact due to the enrichment of DCGs in GC content and must be interpreted with caution. Finally, we also compared DCGs loci with regions of focal Somatic Copy Number Aberrations (SCNAs) enriched in cancer. These SCNAs were discovered after analyzing 3131 and 1480 cancer and normal tissue samples, respectively (Beroukhim et al., 2010). DCGs were found to be mildly enriched in cancer-specific SCNA coordinates (P-value = 0.0128).

Discussion
The effect of DNA methylation on gene expression is widely known to depend on its relative location with respect to the TSS of each gene (Brenet et al., 2011;Lou et al., 2014). Normally, gene body methylation correlates positively with gene expression, whilst methylation at the TSS does the opposite (Lou et al., 2014). Although the existence of different correlations has been described elsewhere, systematic genome-wide differential trends in correlations between groups have not been systematically described until now. The aim of this study is to help in moving from the idea that "DNA methylation restricts expression" to one more elaborate like "DNA methylation, which is a repressive mark, has varying effects on gene expression as a result of the interaction with many other factors. It is necessary for science to study and quantify these phenomena in order to better understand human biology and health." In this article we have described intriguing differences between cancer and normal samples genome-wide. Although the functional consequences of these findings are still obscure, an important point to consider is that DNA methylation can drive alternative promoter use and the production of different transcript isoforms . It has also been suggested that intragenic methylation suppresses intragenic initiation of transcription, thereby limiting the expression of interfering RNAs transcribed within larger genes (Jjingo et al., 2012). Both mechanisms can have an effect on tumor biology, and they may help to explain the observed differences that we report. Similarly, spurious effects of SCNAs in methylation arrays have been described (Houseman et al., 2009). For example, haploid and aneuploid regions affect the methylation estimates. Thus, the presence of SCNAs in cancer samples can influence the variability observed in correlations. Indeed, we found significant enrichments of breast DCGs in Xq, 5q34, and 18q23 chromosomal regions, which are frequenctly deleted in various types of breast cancer (Huang et al., 1995;Yu et al., 2009). These are probably indicating SCNAs, and although they probably affect a minority of the detected DCGs, future studies should address these effects.
We detected the presence of DCGs in lung adenocarcinoma and HNSCC. In the case of HNSCC, we found 15 DCGs at a 10% FDR. The top genes were IL12RB2, PTHLH, and SCL7A11, which are all involved in cancer biology (Liu et al., 2013;Suzuki et al., 2013;Urosevic et al., 2014). CD40 and CEACAM5 were the top DCGs in the lung adenocarcinoma group, and both play important roles in cancer biology (Blumenthal et al., 2005;Govindan et al., 2009;Zheng et al., 2011;Rakhmilevich et al., 2012;Creelan et al., 2013;Moran et al., 2013). By focusing our research on the case of breast cancer, we discovered the existence of 1662 DCGs, which are enriched in cancer and differentiation-related pathways. The subtle hypermethylation observed in breast DCGs promoters, along with the hypomethylation status outside it, resembles PMDs. Consistently, a significant enrichment of DCGs in previously defined PMDs was observed (Schroeder et al., 2011). PMDs have been described in MCF-7 breast cancer cell lines, where large regions of hypomethylation exist (Shann et al., 2008). Ren et al. (Hon et al., 2012) demonstrated (1) that PMDs in breast cancer are associated with hypomethylated gene bodies and lower transcript abundance; and (2) that PMD-hypomethylated gene bodies overlap with repressive chromatin marks (H3K27me3 and H3K9me3) independently of the methylation status of their promoters. Since breast DCGs are enriched in cancer and cell differentiation pathways, as well as in bivalent and heterochromatic genomic domains, it is tempting to speculate that a PMD reorganization process occurs frequently at DCGs loci during carcinogenesis which mediates subtle regulation of cancer gene expression.
Nevertheless, the proportion of DCGs known to be PMDs is quite small (<10%). We also observed that DCGs are significantly enriched in EZH2 and SUZ12 binding in different cell lines, along with significant enrichments in repressive and bivalent chromatin marks. EZH2 and SUZ12 are the major components of the Polycomb Repressive Complex 2 (PRC2), which mediates transcriptional repression partially through histone methylation (Morey and Helin, 2010). Curiously, DNA methylation and PRC2-depedent histone methylation are mechanistically linked. For example, EZH2 recruits DNA methyltransferases to Polycomb-silenced genes inducing de novo DNA methylation patterns, a fact which is especially important in stable gene silencing of cancer cells (Morey and Helin, 2010). However, additional gene silencing mediated by Polycomb can occur in the absence of DNA methylation (Gieni and Hendzel, 2009). For example, PRC2 and H3K27me3 are known to cooperate with H3K9 methylation to maintain gene silencing by a mechanism that involves HP1α anchorage at chromatin (Boros et al., 2014). Furthermore, although EZH2 methylates H3K27 and is associated with breast cancer progression, the disposition of both factors occurs independently in mammary neoplasia (Bae et al., 2014). This is in line with the activating effect that EZH2 induces on the NOTCH1 gene (Gonzalez et al., 2014), which triggers the expansion of cancer stem cells through a repressive-independent mechanism. Moreover, EZH2 is also involved in the regulation of HOX genes (Wu et al., 2008), which are very enriched within breast DCGs. A relevant effect of EZH2 on DNA methylation was observed precisely at the HOX loci in mantle cell lymphoma (MCL) and chronic lymphocytic leukemia (CLL) (Kanduri et al., 2013). EZH2 was observed to trigger H3K27 methylation in both tumoral samples, but a hypermethylation event that led to long-term repression occurred specifically in MCL after EZH2 over-expression. Finally, an important gene at the HOXC locus involved in PRC2-mediated repression is HOTAIR, which encodes a large intervening noncoding RNA (lincRNA) associated with breast cancer prognosis (Sørensen et al., 2013). HOTAIR targets PRC2 to hundreds of genes involved in the inhibition of cancer progression (Gupta et al., 2010), and its depletion abrogates the capacity of EZH2 to induce cancer matrix invasion (Gupta et al., 2010). DCGs are enriched in PRC2 and HOTAIR target genes, but not in differentially expressed genes. Since a trend of promoter hypermethylation in DCGs exists, it is likely that PRC2 mediates DCG allele-specific long-term repression through DNA methylation-dependent and independent mechanisms.
Another seemingly important factor in DCG events is CTCF, which is a major chromatin "architect." Notably, CTCF haploinsufficient mice are predisposed to cancer and show and increased variability in CpG methylation genome wide (Kemp et al., 2014). CTCF is involved in imprinting regulation through high-order chromatin loops in their insulator regions (Guibert et al., 2012;Liu et al., 2014). Loss of genomic imprinting (LOI) constitutes a hallmark of many cancers and it is an oncogenic mechanism itself (Holm et al., 2005;Gieni and Hendzel, 2009). Particularly, LOI at the IGF2 locus involves demethylation of both alleles and depletion of PRC2 and CTCF binding . An important role of CTCF at imprinted genes is to mediate intrachromosomal looping, which permits PRC2 binding and heterochromatin formation . Since DCGs are enriched in CTCF peaks and in imprinted genes, it is reasonable to think that local and large chromatin reorganization patterns are involved in the epigenetic regulation of DCGs during carcinogenesis, with a special emphasis on regions of allele specific expression.
Finally, this study uses public retrospective data, and it has limitations and concerns inherent to the study design, the heterogeneity of the platforms and the availability of the data. Five datasets were analyzed, one of which was tumor-normal matched. We observed discordant correlation results in this latter case, which can be related to confounding factors or, more probably, to an epigenetic effect on the non-tumoral adjacent microenvironment. The original study didn't find batch effects due to separate plates or chips, and the methylation levels of normal and cancerous lung samples didn't change substantially between smokers and non-smokers at any locus (Selamat et al., 2012). This dataset is composed of approximately equal parts of smokers and non-smokers, but different proportion of tumor stages (58% Stage I, 19% Stage II, 21% Stage III, and 2% Stage IV), ethnicities and sexes exist. Using Combat (Johnson et al., 2007) to adjust the variability associated with these traits lead to exactly the same results. Similarly, we replicated the same observations in the breast cancer cohort after removing the variability associated with age and ethnicity (Pearson's correlations > 0.99; slope ≈ 1), the clearest potential confounders. Although the possible effect of these seems to be minimal, future studies should be carried on data from very homogeneous cohorts in order to validate the formulated hypotheses. For this reason, we propose to develop a new study to corroborate our findings. This proposal would test the correlation changes that exist between cancer tissues and normal-adjacent and non-adjacent ones, along with the chromatin changes accompanying them. The study expects to use sequencing technology instead of arrays in order to inspect events at a base resolution, and it would be accompanied by highresolution cytogenetic information to detect regions of SCNA. Finally, the proposed study would also try to address the role that CTCF and PRC2 play at DCGs, and the possibility of modulating correlations according to their binding kinetics.

Conclusion
This study provides proof-of-concept for the implication of differential expression-methylation correlation in cancer biology. We have demonstrated general trends of differential expressionmethylation in cancer compared with normal tissues, and we identified a group of genes with striking and significant correlation differences in breast cancer, known as DCGs. Breast DCGs FIGURE 5 | A model of the differential epigenetic regulation that occurs at DCGs during carcinogenesis is proposed. This model postulates that PRC2-repressive complexes, CTCF-mediated chromatin loops and G Quadruplex structures cooperate to modulate the chromatin structure toward any of the following three: (A) Euchromatin, (B) Poised chromatin, or (C) Heterochromatin. Note that the potential implication of G4 structures must be interpreted with great caution, since it might be an artifact of the increased GC content found in DCGs. Due to the high enrichment of DCGs in heterochromatin and poised chromatin in model cell lines, these are the most likely states. Indeed, since most of the genes do not show differential expression trends, it is likely that DCGs preferentially indicate bivalent chromatin regions where monoallelic expression switches between cancer and normal cells.
were enriched in cell differentiation and cancer-related pathways, with a special emphasis on the family of Homeobox transcription factors. DCGs indicate hotspots of epigenetic reprogramming in cancer, where other epigenetic or genomic factors exceed or modify the effect of DNA methylation. By integrating DCG data with previous epigenomics studies, we discovered that DCGs are markedly enriched in repressive and bivalent chromatin features. Our data supports a model where DCGs undergo epigenetic reprogramming during carcinogenesis triggered by PRC2 redistribution; which involves large and local chromatin reorganization through CTCF (Figure 5). These locations are likely to be prone to chromosomal instability and loss of imprinting, which can be a consequence of epigenetic aberrations in rapidly proliferating cells. Since cancer cells can spontaneously transform into a stem cell-like phenotype modulated by genes within bivalent chromatin regions (Chaffer et al., 2013), it is tempting to speculate that DCGs undergo an epigenetic reprogramming that facilitates monoallelic gene expression and stemness. Moreover, since breast DCGs were neither substantially differentially expressed nor differentially methylated, this suggests that a greater understanding of biological variability can be achieved by integrating expression and methylation data. Thus, it is relevant to corroborate these findings and to expand our knowledge about the mechanisms behind this phenomenon and its functional role in cancer biology.

Author Note
AMO is a young Spanish researcher specialized in cancer genomics. He is not affiliated to any public or private institution and all his research is self-financed with his (scarce) savings. Information requests and comments can be sent to his email or his LinkedIn profile.

Author Contributions
AMO designed the study, carried out all the research and wrote the manuscript.

Acknowledgments
This research has not received any kind of funding, but the author would like to express his gratitude for the full submission waiver that was awarded to him by Frontiers. The author would like to dedicate this article to Gumersinda Rey Soto and Lourdes Fariña Orgeira, two courageous women who fought breast cancer; and by extension to all those who fight cancer day after day. Finally, the author would like to show his special gratitude to his family and all those who have supported his research projects during the last months.

Ethics Statement
This study used previously published data of human cancer and normal tissue samples. Thus, we did not deal with any type of human sample; and in case of any doubt or question the reader should refer to the team that drove the respective experiment.

Supplementary Material
The Supplementary Figure 6 | DCGs can be more positively or negatively correlated in cancer with respect to normal breast tissues. We performed GAM regression of the correlations in DCGs separately for both groups. Here we report the diagnostic plots of GAM regression in cancer samples of DCGs with higher correlations in breast cancer than in normal breast. Once again, absolute distance to the closest TSS was the independent variable. Plots follow the same order as those listed in Supplementary Figure 2.
Supplementary Figure 7 | DCGs can be more positively or negatively correlated in cancer with respect to normal breast tissues. We performed GAM regression of the correlations in DCGs separately for both groups. Here we report the diagnostic plots of GAM regression in normal breast samples of DCGs with higher correlations in breast cancer than in normal breast. Once again, absolute distance to the closest TSS was the independent variable. Plots follow the same order as those listed in Supplementary Figure 2.
Supplementary Figure 8 | DCGs can be more positively or negatively correlated in cancer with respect to normal breast tissues. We performed GAM regression of the correlations in DCGs separately for both groups. Here we report the diagnostic plots of GAM regression in cancer samples of DCGs with higher correlations in normal tissues than in breast cancer. Once again, absolute distance to the closest TSS was the independent variable. Plots follow the same order as those listed in Supplementary Figure 2.
Supplementary Figure 9 | DCGs can be more positively or negatively correlated in cancer with respect to normal breast tissues. We performed GAM regression of the correlations in DCGs separately for both groups. Here we report the diagnostic plots of GAM regression in normal samples of DCGs with higher correlations in normal tissues than in breast cancer. Once again, absolute distance to the closest TSS was the independent variable. Plots follow the same order as those listed in Supplementary Figure 2.
Supplementary Figure 10 | (A) DCG promoters' differential methylation results. Average promoter methylation was calculated for each DCG and differential methylation was studied using the SAM algorithm. The black dots indicate differences in methylation values between cancer and normal, and the red dots indicate genes whose promoters are significantly hypermethylated in cancer compared with normal breast (5% FDR). Note the clear tendency for almost all promoters to be more methylated in cancer than in normal breast (black dots above the expected black line). (B) DCG non-promoter regions' differential methylation results. The differences between cancer and normal breast samples are less obvious than in the previous case, but a group of DCGs are hypermethylated in cancer at a 5% FDR (red dots). Note that in this case, DCGs do not show a general tendency to be hypermethylated in cancer. (C) SAM results for DCGs' differential expression. Note that although most genes do not show differential expression patterns at a 5% FDR, a small group is down-regulated in cancer (green dots).
Supplementary Figure 11 | Gene Ontology hierarchical diagram that shows significant enrichments of DCGs in breast cancer colored in red (5% FDR).
Supplementary Table 1 | DCG list in HNSCC and Lung Adenocarcinoma. The list of all the genes included in the analysis is also reported.  (1)