Genome-Wide Identification and Analysis of the Methylation of lncRNAs and Prognostic Implications in the Glioma

Glioma is characterized by rapid cell proliferation and extensive infiltration among brain tissues, but the molecular pathology has been still poorly understood. Previous studies found that DNA methylation modifications play a key role in contributing to the pathogenesis of glioma. On the other hand, long noncoding RNAs (lncRNAs) has been discovered to be associated with some key tumorigenic processes of glioma. Moreover, genomic methylation can influence expression and functions of lncRNAs, which contributes to the pathogenesis of many complex diseases. However, to date, no systematic study has been performed to detect the methylation of lncRNAs and its influences in glioma on a genome-wide scale. Here, we selected the methylation data, clinical information, expression of lncRNAs, and DNA methylation regulatory proteins of 537 glioma patients from TCGA and TANRIC databases. Then, we performed a differential analysis of lncRNA expression and methylated regions between low-grade glioma (LGG) and glioblastoma multiform (GBM) subjects, respectively. Next, we further identified and verified potential key lncRNAs contributing the pathogenesis of glioma involved in methylation modifications by an annotation and correlation analysis, respectively. In total, 18 such lncRNAs were identified, and 7 of them have been demonstrated to be functionally linked to the pathogenesis of glioma by previous studies. Finally, by the univariate Cox regression, LASSO regression, clinical correlation, and survival analysis, we found that all these 18 lncRNAs are high-risk factors for clinical prognosis of glioma. In summary, this study provided a strategy to explore the influence of lncRNA methylation on glioma, and our findings will be benefit to improve understanding of its pathogenesis.


INTRODUCTION
Glioma is the most common and highly malignant tumor in the intraparenchymal central nervous system (CNS) tumors (1). It is characterized by the rapid and extensive proliferation among brain tissues (2,3). The high grade glioma subtype, glioblastoma multiform (GBM), could cause the significant mortality that are disproportionate to their relatively rare incidence (4). Even under the best treatment, the median survival time is just over a year, and the few GBM patients survive more than 3 years (1). The etiology and pathogenesis of GBM have been extensively investigated, but the epigenetic mechanisms contributing to its pathogenesis were much less understood (2,3,5).
To date, DNA methylation is the most widely studied epigenetic mechanisms (6). Tremendous evidences shows that the DNA methylation is involved in tumorigenesis and development of the GBM (1,7). For example, the promoter DNA methylation pattern of genes involved in RB1 and TP53 signaling pathways were identified in GBM patients (7). The promoter methylation of the DNA repair enzyme (O6methylguanine-DNA methyltransferase) was discovered as a significant prognostic factor for temozolomide resistance in GBM patients (8).
The long non-coding RNAs (lncRNAs), a kind of non-protein coding transcripts of >200 nucleotides (9)(10)(11), has been reported to be a key regulator in a broad range of biological and cellular processes of GBM, including cell proliferation, motility, hypoxia response, and apoptosis (12)(13)(14). The expression levels and functions of lncRNAs could be significantly affected by the genomic methylation in many complex diseases (15)(16)(17)(18). Moreover, there is increasing evidences that the methyltransferase, demethylase, and binding protein dynamically regulate the methylation level of the lncRNAs, which influences their expression in specific biological processes (18,19). However, to date, no systematic study has been conducted to discovery the methylation of lncRNAs and its influences in the glioma on a genome-wide scale.
Herein, to address this lack of knowledge, we used a cohort of low-grade glioma (LGG) and GBM from The Cancer Genome Atlas (TCGA) database to investigate the contribution of lncRNA methylation to tumorigenesis and development in glioma. Specifically, we first downloaded the expression data of lncRNAs from The Atlas of Noncoding RNAs in Cancer (TANRIC) database, and then implemented a differential expression analysis between the LGG and GBM subjects. Second, we obtained the glioma-related methylation array data and the protein-coding gene expression data of the same samples from TCGA database, and then identified the differentially methylated regions of the differentially expressed lncRNAs according the GENCODE reference annotation for human genomes. Third, we conducted a correlation analysis between methylation level and expression of the lncRNAs and the genes involving in the three kinds of methylation regulatory proteins, and identified the potential key lncRNAs contributing the pathogenesis of glioma. Finally, we conducted the univariate Cox regression, least absolute shrinkage and selection operator (LASSO) regression, clinical correlation, and survival analysis based on the clinical data of these samples to explore the influence of these methylated and potentially disease-related lncRNAs on clinical prognosis of glioma. The flow chart was shown in Figure 1.

Data Collection and Preprocessing
The clinical information and the methylation information of patients with glioma were downloaded from the TCGA database (http://cancergenome.nih.gov), a comprehensive resource for investigating the molecular basis in various cancers. According to TCGA annotation, glioma is classified as the LGG and the GBM. The Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/) was used to access these TCGA data. Particularly, we selected "DNA methylation" in the Data Category, "Illumina human methylation 450" in the Platform, "brain" in the Primary Site, and "gliomas" in the Disease Type to screen out the methylation information of patients. Then, we selected "clinical" in the Data Category, "brain" in the Primary Site, and "gliomas" in the Disease Type to screen out the clinical information of patients. Next, we removed the samples which lack the methylation or clinical information. Finally, the lncRNA expression data of the same patients was downloaded from the TANRIC database, which quantified the expression profiles of lncRNAs in Ensembl using the TCGA data (20).
Moreover, we further searched all the possible studies in PubMed database (http://www.ncbi.nlm.nih.gov/pubmed) using the keywords to "methylase gene," "methyltransferase gene," "binding protein gene," "demethylase gene" to identify the DNA methylation regulatory proteins. The search was performed before the last update of this database on May 13, 2020. The gene expression of the methylation regulatory proteins was obtained from TCGA database. The expression data of lncRNAs and methylation regulatory proteins have been normalized as reads per kilobase of exon model per million mapped reads (RPKM) and RNA-Seq by Expectation-Maximization (RSEM), respectively. The DNA methylation values were normalized using the "betaqn" function of the R package "wateRmelon" (http://bioconductor.org/packages/ release/bioc/html/wateRmelon.html) (21).

Differential Expression Analysis of lncRNAs Between Low-Grade Glioma and Glioblastoma Multiform
To identify the key lncRNAs which are potentially associated with the gliomas progression, we performed a differential expression analysis of all the lncRNAs obtained from TANRIC database between LGG and GBM subjects using the R package "lncDIFF" with its default parameter settings (i.e. link.function = "log," simulated.pvalue = FALSE, permutation = 100) (https:// CRAN.R-project.org/package=lncDIFF). It is a powerful differential analysis tool by the normalized expression data (e.g. RPKM values) as the input, and has high sensitivity to identify the low abundant differentially expressed genes, as commonly observed in lncRNAs. This package adopts the generalized linear model with zero-inflated exponential quasi-likelihood to estimate group effect on normalized counts, and employs the likelihood ratio test to detect differential expressed genes. The proposed method and tool are applicable to data processed with standard RNA-seq preprocessing and normalization pipelines (22). We first removed 21 lncRNAs whose expression is zero in all the LGG or GBM subjects. Then, we set significance level according to the common threshold of the absolute value of fold change (FC) ≥ 2 and false discovery rate (FDR) p < 0.05. The p values are corrected for multiple testing by Benjamini-Hochberg method. Finally, we used a volcano plot to describe the profile of whole lncRNA expression by the R package "ggplot2" (https:// CRAN.R-project.org/package=ggplot2), and used a heatmap to visualize the cluster pattern of the differentially expressed lncRNAs based on Manhattan distance by the R package "gplots" (https://CRAN.R-project.org/package=gplots).

Differential Methylation Analysis and lncRNA Annotation
To identify the glioma-related methylation positions and regions, and the differentially expressed lncRNAs located in these regions, we performed a differential methylation analysis and lncRNA annotation. The differential methylation analysis was conducted by the R package "minfi" which is a specialized tool designed to process the Illumina methylation 450 array data (http:// bioconductor.org/packages/release/bioc/html/minfi.html). It used the Subset-quantile Within Array Normalization method to preprocess data and the bump-hunting algorithm to discover the differential methylation information (23). Firstly, we used the "densityBeanPlot" function of this package to conduct the quality control for each array. The qualified samples should have the characteristic that the methylation levels (beta values) of CpG positions are distributed around 0 and 1, respectively. Then, we used the "dmpFinder" function (type = "categorical") of this package to identify the differentially methylated positions between LGG and GBM subjects based on the methylation array. The significance level was set according to a common threshold of the absolute intercept ≥ 0.2 (i.e. 20% difference on the beta values) and p < 1×10 −3 (24). Next, based on these differentially methylated positions, we further used the "bumphunter" function (B = 10, type = "Beta") of this package to look for the differentially methylated regions between LGG and GBM subjects with the common threshold of average methylation level difference ≥ 0.2 (25,26). The differentially methylated regions are the consecutive genomic locations containing a battery of differentially methylated positions in the same direction. Finally, we download the ensGene annotation file (hg19) from the Ensembl (release 75) which stores the location information of lncRNA transcripts and exons in human genome. Based on this file, the ANNOVAR software was used to perform an lncRNA annotation and identify the differentially expressed lncRNAs located in the differentially methylated regions. ANNOVAR is a Perl command-line tool for rapidly and efficiently annotating the genomic variants, including gene-based, region-based and filterbased annotations on a variant call format (VCF) file generated from human genomes (27).

Correlation Analysis Between Methylation and Expression of lncRNAs
To explore the influence of methylation on the corresponding lncRNAs, and identify the potential key lncRNAs contributing the pathogenesis of glioma, we performed a correlation analysis between methylation and expression of lncRNAs. Particularly, we first selected the differentially methylated positions to be included in each of the identified lncRNAs in the previous step, and calculated the average values of these methylation positions for each lncRNAs, respectively. Then, we calculated the Pearson's correlation coefficient between the expression of these lncRNAs and their average methylation level using the R function "cor.test." The threshold of significance was set at the absolute value of r > 0.6 and FDR p < 0.05. The p values are corrected for multiple testing by Benjamini-Hochberg method. Finally, it is reported that the methylation regulatory proteins (including methyltransferase, demethylase, and binding protein) dynamically regulate the methylation level of lncRNAs, which influences their expression in specific biological processes (18,19). Therefore, to explore which methylation regulatory proteins are involved in the methylation modification of the potential key lncRNAs and further increase the reliability of our findings, we selected the known methylation regulatory proteins and calculated the Pearson's correlation coefficient between their expression and the average methylation level of these lncRNAs using the same significance threshold.

Influence of the Methylated lncRNAs on Clinical Prognosis of Glioma
We further analyzed the Influence of these identified methylated lncRNAs potentially contributing the pathogenesis of glioma on the clinical prognosis of glioma. First, we calculated the average expression of the key lncRNAs obtained above in each patient and get the median of these average expressions. According to the median, the patients were separated into the lncRNAs low and high expression groups. We compared prognosis between the high expression and low expression subjects using a Kaplan-Meier overall survival curves. Then, we performed a univariate Cox regression analysis to assess the association between these methylated lncRNAs and the prognosis of glioma. The threshold of significance was set at 95% confidence interval (CI) of hazard ratio (HR) ⊉ 1 and p < 0.05. The R package "survival" (https:// CRAN.R-project.org/package=survival) was used for these analyses. Next, based on the results of univariate Cox regression analysis, we further used the LASSO regression algorithm to identify the key lncRNAs whose methylation and expression impact on the prognosis of glioma by R package "glmnet." It is a pathwise algorithm for the Cox proportional hazards model, regularized by convex combinations of ℓ1 and ℓ2 penalties (elastic net). The algorithm fits via cyclical coordinate descent, and employs warm starts to find a solution along a regularization path (28). The parameter familiy, maxit, and alpha were set to Cox, 1000 and 1, respectively (others were set by their default values). And then we calculated the risk score of each subject using them through the "survival" package. Finally, we used a receiver operator characteristic (ROC) curve to verify the reliability of the risk score by the R package "survivalROC" (https://CRAN.R-project.org/ package=survivalROC). In addition, we also assessed the association between these lncRNA expressions and other clinical features of the patients (including age at initial pathologic diagnosis, vital status, and gender) using the chi-square test. The threshold of significance was set at p < 0.05.

Methylation, Expression, and Clinical Information of 537 Glioma Samples
After the data collection, we found a total of 537 glioma samples (including 486 LGG and 51 GBM patients from TCGA) with the DNA methylation values, expression levels of protein-coding genes, and clinical information. Particularly, according to the annotation of Illumina human methylation 450 array, a total of 369,531 CpGs methylation positions were quantified after removing the missing values. We normalized the CpGs methylation values for the subsequent analyses. The results were shown in Supplementary Figure S1. The lncRNA expression data of the 537 glioma samples were obtained from TANRIC database. A total of 12,727 lncRNAs of these samples were quantified as RPKM values. Through the keyword search and the title/abstract screening, 23 articles containing genes for methylation-related enzymes were obtained from PubMed.
In total, we identified 32 DNA methylation regulatory proteins (including methyltransferase, demethylase, and binding protein) from the 23 articles (Table 1). We extracted the expression data of these 32 methylation regulatory proteins for each sample (quantified as RSEM values) from the TCGA database. The clinical information of these samples contains age, gender, survival time, and vital status. The summary of these glioma samples was listed in Table 2.

Differential Expression Analysis of lncRNAs Between Low-Grade Glioma and Glioblastoma Multiforme
We used the R package "lncDIFF" to perform the differential expression analysis of lncRNAs between LGG and GBM subjects according to the significance threshold of |FC| ≥ 2 and FDR p < 0.05.
In total, we identified 1,988 significantly differentially expressed lncRNAs, which include 1,284 highly expressed (i.e. FC ≥ 2) and 704 lowly expressed lncRNAs (FC ≤ −2) in the GBM subjects. The details are described in Supplementary Table S1. We used a volcano plot to describe the profile of whole lncRNA expression ( Figure 2A). Then, to verify these findings, we contrasted these identified differentially expressed lncRNAs with another independent study. This study used 19 glioblastoma and 9 control brain samples to perform the differential expression analysis of 30,586 lncRNA transcripts (Arraystar Human lncRNA Microarray V3.0, nearly 30% of them overlap with our study). According to its results, about 71.5% differentially expressed lncRNAs overlapped with our findings (52). Further, we selected the top most 100 differentially expressed lncRNAs to visualize the cluster pattern of their expression by a heatmap. As Figure 2B shows, the GBM and LGG subjects are mainly grouped under a cluster according to high  and low expression of the lncRNAs, respectively, and most of these lncRNAs (83%) are significantly highly expressed in the GBM than LGG subjects. These differentially expressed lncRNAs may contribute to the progress of glioma. Thus, we used these significantly differentially expressed lncRNAs to conduct the subsequent analysis.

Differential Methylation Analysis and lncRNA Annotation
We performed a differential methylation analysis and lncRNA annotation to identify the glioma-related DNA methylation and the differentially expressed lncRNAs located in the regions. The results of array quality control showed that the beta values of DNA methylation positions are mainly distributed around 0 and 1, respectively, for each sample (Supplementary Figure S2). Then, we used these arrays to identify the differentially methylated positions and regions, respectively. The results showed that there are a total of 208,138 positions and 13,227 corresponding regions with a significantly differential methylation level between LGG and GBM subjects (Supplementary Tables S2, S3). The methylation array GSE90496 (contains 347 GBM and 301 LGG subjusts) was used to verify these findings. The results showed that about 71.1% differential methylation positions are consistent with our findings (53). Finally, based on these differentially methylated regions, we performed the location annotation of the differentially expressed lncRNAs. In total, we identified 744 lncRNAs which are located in the differentially methylated regions. According to the results of annotation, these differentially methylated regions are at five categories of different genomic locations, i.e. intergenic, ncRNA_exonic, ncRNA_intronic, upstream and downstream, and the proportion of intergenic areas is significantly increased compared with other types (Supplementary Table S4). We further calculated the proportion of them in the different genomic locations. We found that these lncRNAs are mainly distributed in chromosome 1, 2, 7, and 12 (11.42, 10.48, 8.06, and 8.06%, respectively) ( Supplementary Table S5). Moreover, as the Supplementary Table S1 shown, these identified lncRNAs include 1,284 highly and 704 lowly expressed ones in the GBM subjects. But not all of these highly and lowly expressed lncRNAs have significantly reduced and increased methylation levels in the GBM subjects, respectively, which imply that not all of DNA methylation changes can affect the expression of lncRNAs in the corresponding genomic regions.

Correlation Analysis Between Methylation and Expression of lncRNAs
We first conducted a Shapiro-Wilk normality test for each vector by the R function "shapiro.test." According to the threshold P > 0.05, 11 lncRNAs that do not obey the normal distribution were removed. Then, to identify the differentially expressed lncRNAs affected by the glioma-related DNA methylation, we performed a Pearson's correlation analysis between methylation and expression of lncRNAs. The results revealed that there are a total of 18 lncRNAs (including 16 highly and 2 lowly expressed ones) whose expression is significantly associated with their DNA methylation level, and all of them show a significant negative correlation (r < −0.6 and FDR p < 0.05) ( Table 3). It is consistent with the common understanding that DNA methylation inhibits the corresponding gene expression in a variety of tissues and cell lines (54,55). Next, we further measured the association between the expression of the 32 methylation regulatory proteins and methylation level of these lncRNAs. The results showed that there is a significantly negative correlation between TET1 expression and most of the 18 lncRNAs' methylation level. TET1 is a demethylase which can catalyze the conversion of 5-methylcytosine to 5hydroxymethylcytosine and maintains hypomethylation status of the corresponding regions (37). Besides this gene, the expression of KDM4B and MBD2 also show a significantly negative and positive correlation (r > 0.6 and FDR p < 0.05) with a part of the 18 lncRNAs' methylation level, respectively. KDM4B is also a demethylase of histone lysine by a hypoxiainduced pathway, and an important epigenetic modifier in cancer (46). MBD2 is a methyl-CpG binding protein which binds and maintains methylated gene promoter to repress its transcriptional activity (35). However, this significant correlation is not observed in the other 29 genes, which imply that the DNA methylation of lncRNAs in glioma may be influenced predominantly by some specific methylation regulatory proteins ( Table 3 and Supplementary Table S5). Moreover, the mean differential methylation of the 18 lncRNAs was calculated. We found that all of the lowly expressed lncRNAs have significantly reduced methylation levels and most of them have significantly reduced methylation levels in the GBM subjects. It is also consistent with the understanding that DNA methylation inhibits gene expression (56,57). Therefore, we considered them as the potential key lncRNAs contributing the pathogenesis of glioma involved in methylation modifications. Finally, we further queried PubMed for the functions of the 18 potential key lncRNAs contributing the pathogenesis of glioma. We found that seven of them have been demonstrated to be functionally linked to the pathogenesis of glioma by the previous studies. For example, the overexpression of the ENSG00000222041 (CYTOR) partially reversed the inhibitory effects of UPF1 on proliferation and invasion abilities in glioma (58,59). The more details were described in the Supplementary Table S6.

Influence of the Methylated lncRNAs on Clinical Prognosis of Glioma
We further analyzed the influence of the 18 identified lncRNAs, which potentially contribute the glioma pathogenesis by methylation modifications, on the clinical prognosis of glioma. For the 16 lncRNAs highly expressed in GBM patients, we found that the overall survival curve of the subjects with high lncRNA expression is significantly longer than the subjects with low lncRNA expression (p = 1.38 × 10 −10 ) ( Figure 3A). On the contrary, the overall survival curve of the subjects with low lncRNA expression is significantly longer than the subjects with high lncRNA expression for the two lowly expressed lncRNAs (p = 3.11 × 10 −10 ) ( Figure 3B). It reflects an association between the dysregulation of lncRNA expression and a bad prognosis of glioma patients. To avoid dependence on the tumor grade, we performed the univariate Cox regression analysis of the 18 lncRNAs in GBM and LGG subjects, respectively. We did not find a significant association between lncRNA expression and poor patient outcomes in GBM subjects. However, the results showed that all of the 18 identified methylated lncRNAs are high-risk factors for the prognosis of glioma in LGG subjects (i.e. 95% CI HR ⊉ 1 and p < 0.001) ( Figure 3C). This suggests that both over-expression of those 16 lncRNAs and under-expression of other 2 ones can lead to a poor prognosis in LGG patients, which is also consistent with common sense, given that GBM patients are in advanced stages of the disease and their survival may be affected by other complications or factors. In addition, the univariate Cox regression analysis was further performed for all the 1,988 significantly differentially expressed lncRNAs in LGG patients. We found that only about 39% of the lncRNAs unlikely affected by methylation modifications are associated with poor patient outcomes (Supplementary Table S7). We further applied LASSO regression algorithm to the 18 lncRNAs to identify the key ones for glioma prognosis and calculate the risk score of each subject. As Figures 3D, 3E show, there are four key lncRNAs (i.e. ENSG00000256802, ENSG00000232533, ENSG00000227372, ENSG00000222041) selected when the cross-validated partial likelihood deviance reaches its minimum value, and the coefficients of all these lncRNAs are positive (i.e. increase risk of disease). The area under the curve (AUC) of the ROC is 0.903, which shows the reliability of the risk score ( Figure 3F). According to the median of risk scores, the patients were separated into the low and high-risk groups. We found that the GBM subjects are mainly distributed in high-risk group, while the LGG subjects are mainly distributed in low-risk group. This demonstrates the consistency between the sample risk score by the key lncRNAs and the severity of glioma. Moreover, as Figure 4A shows, the risk classification by the key lncRNAs is significantly associated with the age at initial pathologic diagnosis (p = 1.32 × 10 −2 ) and vital status (p = 1.72 × 10 −8 ). But we observed no association with the gender of the patients (p = 1.97 × 10 −1 ). The similar results were also observed for all the 18 identified methylated lncRNAs (p value of age, vital status and gender is 2.87 × 10 −14 , 2.01 × 10 −9 , and 1.45 × 10 −1 , respectively) ( Figure 4B).

CONCLUSIONS
In this study, we used the TCGA data to identify potential key lncRNAs contributing the pathogenesis of glioma involved in methylation modifications and further explore influence of them on the clinical prognosis of glioma. In total, we identified 18 such lncRNAs which has the following four characteristics: 1) they are significantly differentially expressed between the LGG and GBM subjects; 2) at least one of the differentially methylated regions, which cover the contiguous differentially methylated positions, is located in these lncRNA sequences; 3) there is a strong correlation between the methylation level of these lncRNAs and the expression of methylation regulatory proteins; 4) the expression of these lncRNAs is significantly associated with their methylation level. Further, the results of clinical data analysis show that all these 18 lncRNAs are high-risk factors for the clinical prognosis of glioma, and four of them (i.e. ENSG00000256802, ENSG00000232533, ENSG00000227372 and ENSG00000222041) are most important for the severity of glioma. All in all, we performed a strategy to explore the influence of the lncRNA methylation on the pathogenesis of glioma, and these findings will be benefit to further glioma research in the future.  LGG groups, which are also significantly associated with the age and vital status, but not with the gender.