A Diagnostic Panel of DNA Methylation Biomarkers for Lung Adenocarcinoma

Lung adenocarcinoma (LUAD) is one of the most common cancers and lethal diseases in the world. Recognition of the undetermined lung nodules at an early stage is useful for a favorable prognosis. However, there is no good method to identify the undetermined lung nodules and predict their clinical outcome. DNA methylation alteration is frequently observed in LUAD and may play important roles in carcinogenesis, diagnosis, and prediction. This study took advantage of publicly available methylation profiling resources and a machine learning method to investigate methylation differences between LUAD and adjacent non-malignant tissue. The prediction panel was first constructed using 338 tissue samples from LUAD patients including 149 non-malignant ones. This model was then validated with data from The Cancer Genome Atlas database and clinic samples. As a result, the methylation status of four CpG loci in homeobox A9 (HOXA9), keratin-associated protein 8-1 (KRTAP8-1), cyclin D1 (CCND1), and tubby-like protein 2 (TULP2) were highlighted as informative markers. A random forest classification model with an accuracy of 94.57% and kappa of 88.96% was obtained. To evaluate this panel for LUAD, the methylation levels of four CpG loci in HOXA9, KRTAP8-1, CCND1, and TULP2 of tumor samples and matched adjacent lung samples from 25 patients with LUAD were tested. In these LUAD patients, the methylation of HOXA9 was significantly upregulated, whereas the methylation of KRTAP8-1, CCND1, and TULP2 were downregulated obviously in tumor samples compared with adjacent tissues. Our study demonstrates that the methylation of HOXA9, KRTAP8-1, CCND1, and TULP2 has great potential for the early recognition of LUAD in the undetermined lung nodules. The findings also exhibit that the application of improved mathematic algorithms can yield accurate and particularly robust and widely applicable marker panels. This approach could greatly facilitate the discovery process of biomarkers in various fields.

Lung adenocarcinoma (LUAD) is one of the most common cancers and lethal diseases in the world. Recognition of the undetermined lung nodules at an early stage is useful for a favorable prognosis. However, there is no good method to identify the undetermined lung nodules and predict their clinical outcome. DNA methylation alteration is frequently observed in LUAD and may play important roles in carcinogenesis, diagnosis, and prediction. This study took advantage of publicly available methylation profiling resources and a machine learning method to investigate methylation differences between LUAD and adjacent non-malignant tissue. The prediction panel was first constructed using 338 tissue samples from LUAD patients including 149 non-malignant ones. This model was then validated with data from The Cancer Genome Atlas database and clinic samples. As a result, the methylation status of four CpG loci in homeobox A9 (HOXA9), keratin-associated protein 8-1 (KRTAP8-1), cyclin D1 (CCND1), and tubby-like protein 2 (TULP2) were highlighted as informative markers. A random forest classification model with an accuracy of 94.57% and kappa of 88.96% was obtained. To evaluate this panel for LUAD, the methylation levels of four CpG loci in HOXA9, KRTAP8-1, CCND1, and TULP2 of tumor samples and matched adjacent lung samples from 25 patients with LUAD were tested. In these LUAD patients, the methylation of HOXA9 was significantly upregulated, whereas the methylation of KRTAP8-1, CCND1, and TULP2 were downregulated obviously in tumor samples compared with adjacent tissues. Our study demonstrates that the methylation of HOXA9, KRTAP8-1, CCND1, and TULP2 has great potential for the early recognition of LUAD in the undetermined lung nodules. The findings also exhibit that the application of improved mathematic algorithms can yield accurate and particularly robust and widely applicable marker panels. This approach could greatly facilitate the discovery process of biomarkers in various fields.
Keywords: lung adenocarcinoma, DNA methylation, random forest, HOXA9, KRTAP8-1, CCND1, TULP2, biomarker INTRODUCTION Lung cancer is the leading cause of cancer-related deaths worldwide. As the most frequent histological subtype of nonsmall-cell lung cancer (NSCLC), lung adenocarcinoma (LUAD) accounts for more than 40% of the incidence of lung cancer. It is usually diagnosed at an advanced stage. When diagnosed at an early stage, however, the survival rate is dramatically prolonged. The 1-year survival rate of stage I NSCLC is 81-85%, while it is only 15-19% when diagnosed at stage IV (1). Effective early recognition methods and relevant biomarkers are urgently needed to reduce the mortality caused by LUAD. Traditional screen methods for NSCLC include sputum cytology, chest radiography, and computed tomography (CT). CT, which is simple and with high sensitivity, has been shown to be capable of detecting early stage lung cancers and is beneficial for the treatment and patient outcome. Although high falsepositive rate has been reported to be a major problem in LUAD by CT, several studies have shown that CT-positive findings that are largely based on nodule size and/or volume can reduce false-positive rates and the risk of overdiagnosis (2). In addition, due to the demand for further optimization of CT screening in LUAD, there is still a need for other available biomarkers to support the risk assessment (3). For example, a number of changes in gene expression (4,5), somatic mutations (6), copy number variations (7), differences in methylation (8) or the abundance of plasma proteins (9), and sequencevariations in circulating free DNAs (10, 11) have been extensively studied. In recent years, aberrant changes in DNA methylation have been observed in various cancers and are considered to be a cause of tumorigenesis. Global hypomethylation occurs frequently in repeated DNA sequences and plays an important role in chromosomal instability. The promoter regions of tumor suppressor genes are often hypermethylated, leading to inactivation of corresponding genes in tumors. It has been reported that p16 (12), APC (13), BCL2 (14), BRCA1, and BRCA2 (15) are hypermethylated in NSCLC, but single methylation markers and the bias introduced by experiments often caused these methods to suffer problems of insufficient precision and specificity. To improve the detection performance of methylation biomarkers, many panels of multiple loci were studied (16,17). DNA methylation is a relatively stable biochemical modification carried out by DNA methyltransferases and can be detected in not only DNA molecules from tissue but also the free DNAs in serum and plasma (18), making it a promising biomarker for the early recognition of the undetermined lung nodules.
With the advent of big data areas, publicly available databases such as The Cancer Genome Atlas (TCGA), The Encyclopedia of DNA Elements, and Gene Expression Omnibus (GEO) contain many methylation profiles for a variety of tumors and normal samples. The development of machine learning technology and its application in the biological fields makes it possible to select the most inspective detection markers from a massive number of potential loci and establish prediction models with better performance.
In this study, we utilized methylation array data from publicly available databases and tried to explain lung carcinogenesis from an epigenetic perspective. Because early recognition of LUAD can benefit patients, we took advantage of machine learning algorithms to build a concise but robust prediction model using methylation markers for making predictions on patients with lung cancer. As a result, the methylation status of four CpG loci in homeobox A9 (HOXA9), keratin-associated protein 8-1 (KRTAP8-1), cyclin D1 (CCND1), and tubby-like protein 2 (TULP2) were highlighted as potential biomarkers. Then, we identified four LUAD-specific methylation biomarkers by comparing LUAD tumor samples and matched adjacent samples. The results further confirmed that the hypermethylation of HOXA9 and the hypomethylation of KRTAP8-1, CCND1, and TULP2 were observed in LUAD tumor samples. This study could also pave ways for a more extensive application of non-invasive cancer detection.

Datasets
In the discovery phase, data about 338 LUADs and nonmalignant samples were collected from GEO datasets that are based on analyses with the Illumina Infinium HumanMethylation27 BeadChip for DNA methylation measurements. Detailed dataset descriptions are shown in Table 1. TCGA LUAD validation cohorts profiled by both the Illumina Infinium HumanMethylation27 and HumanMethylation450 platforms were downloaded from Xena Functional Genomics Explorer. These datasets are heavily imbalanced. For analyses with the Illumina Infinium

Identification of Differentially Methylated Probes
The 338 LUADs and non-malignant samples in the GEO database profiled by Illumina Infinium HumanMethylation27 BeadChip were used for differential methylation analysis. The datasets from the five experiments were combined using R package illuminaio by common probe IDs. Array data produced by different studies are confounded by non-biological variables such as different technicians or the environments. These biases are termed as "batch effects" and cannot be eliminated unless all the samples are performed in a single batch. Here, we adjusted the batch effects by applying gene-wise one-way ANOVA adjustment for expression values using the pamr.batchadjust function from the pamr package. Differential methylation analysis was performed using linear model provided in the limma R package (19). Probes that were differentially methylated between LUADs and non-malignant tissues were selected with thresholds of |logFC| > 1 and adjusted p < 0.05. Gene oncology and Reactome pathway enrichment analysis were performed using DAVID version 6.8 (20).

Building of a Prediction Model
Random forest was chosen as the prediction model because of its superior performance for classification with binary features. To make a robust, easy-to-generalize model to predict the sample status, each of the differentially methylated probes was first binary assigned. First K-means clustering with two cores was applied to each probe, and the mean values were calculated for both clusters. Samples in the high/low mean cluster were signed as 1 or 0, respectively. In the training phase, the importance value of each probe to the classification model was evaluated by recursive feature elimination. According to descending importance value, the selected CpGs were added one by one to the random forest model if its Pearson correlation value with any already existing probe in the model was <0.7. Each time a new feature was added to the model, the performance of the model was re-evaluated using 10-fold cross-validation. The final model was chosen when best accuracy and kappa were achieved.

Model Validation
In the validation phase, four candidate CpGs were validated using two TCGA cohorts. LUAD methylation status was profiled by two platforms, namely, the HumanMethylation27 and HumanMethylation450 systems. Since both platforms contain all four selected probes, both the above two datasets were used for model validation. LUAD methylation datasets profiled by Illumina Infinium HumanMethylation27 platform had a total of 151 samples, with 127 from LUADs and 24 from non-malignant samples, and the data from the HumanMethylation450 platform consisted of 492 samples, which contained 458 samples from tumors and only 34 from normal tissues. Owing to the batch effect mentioned above, K-means clustering was recalculated for the four selected CpG probes in the TCGA HumanMethylation27

Statistical Analysis
All statistical analysis was performed with R version 3.0.2, and the graphs were generated using GraphPad PRISM 7.0 (GraphPad Software, Inc., San Diego, CA, USA). The differential methylation levels of tumors and non-malignant samples in patients with LUADs were compared using t test. A p < 0.05 was considered statistically significant.

Datasets and Differential Methylation Analysis
Four GEO datasets produced on Illumina Infinium HumanMethylation27 BeadChips for LUAD methylation profiling with a total of 338 samples (189 tumor samples and 149 matched adjacent lung samples) were included in our study. The analysis identified 62 probes that were differentially methylated between LUADs and adjacent non-malignant lung tissues. Among them, only 3 were hypomethylated in tumor samples, while all other 59 probes showed a hypermethylated status ( Table 2). Gene oncology (GO) and Reactome pathway enrichment analysis were carried out using DAVID24 (adjusted p < 0.05, Table 3). GO enrichment results identified items related to transcription factor activity, indicating that methylation can change gene expression in tumor samples by modulating the activities of transcription factors. Both GO and pathway analysis showed the enrichment in G-protein-coupled receptor signaling pathway, which is often dysregulated in tumor cells to facilitate their proliferation, invasion, and immune system escape.

Feature Selection and Construction of the Prediction Panel
To establish a robust prediction model to identify LUADs, we binarily assigned each differentially methylated probe using Kmeans clustering algorithms. Recursive feature elimination was applied to evaluate the discriminative power of methylation loci to ensure that only the most informative features could be included in the final prediction model. The correlation assessment between loci was also used when adding loci one by one during model construction to ensure that the loci with relatively small effect were set by removing possible redundant information. Each time a new probe was added to the model, the prediction performance was updated with 10fold cross-validation. The eventual model was chosen based on both accuracy and kappa values (Figure 1). Finally, a random forest classification model with an accuracy of 94.57% and a kappa of 88.96% was obtained. The methylation status of four CpG loci of the genes of homeobox A9 (HOXA9), KRTAP8-1, CCND1, and TULP2 were highlighted as predictors in the final model. Methylation levels of the selected probes between adenocarcinomas and normal tissues are shown in Figure 2.

Validation of the Prediction Model
This model was then validated in two independent TCGA LUAD methylation datasets profiled on both the Illumina Infinium HumanMethylation27 platform and HumanMethylation450 platform. The methylation patterns of the four selected probes are the same between the GEO and TCGA datasets (Figure 3). There are 151 samples in the TCGA LUAD HumanMethylation27 dataset, with 127 from LUADs and 24 from adjacent nonmalignant lung tissues. Imbalanced datasets in which the sample To evaluate the relevance of this prediction model further, DNA methylation analysis of these four selected probes was performed in 25 paired samples, including LUAD and matched non-malignant tissues, from patients with LUAD collected by two hospitals using pyrosequencing (Figure 4). Although the sample size in this study was limited, the analysis revealed congruence with the GEO and TCGA data. The methylation level of HOXA9 in tumors was upregulated compared to non-malignant samples (29.96 ± 15.89% vs. 13.64 ± 5.89%, p = 0.00007). On the other hand, the methylation level of KRTAP8-1 (63.40 ± 17.62% vs. 82.76 ± 4.32%, p = 0.00002), TULP2 (81.68 ± 14.63% vs. 90.56 ± 3.19%, p = 0.00837), and CCND1 (75.52 ± 24.55% vs. 93.36% ± 3.89%, p = 0.00012) were downregulated in tumors than that in normal samples.

DISCUSSION
Epigenetic markers such as DNA methylation and histone modifications may be used as candidate biomarkers for classifying individuals with respect to their disease risk. Along with the wide application of high-throughput epigenomic measurement technologies and the resulting publicly available databases, more and more big data-based biomarkers have been proposed. Recently, the incidence of small nodules in the lungs stays high worldwide. How to distinguish between benign and malignant nodules is a clinical problem that needs to be solved urgently. To develop a reliable recognition panel utilizing DNA methylation signals, we selected training samples of tumors and non-malignant tissues from the GEO dataset to build a predication model for LUAD and tested our prediction model in two validation sets derived from the TCGA database. In this study, we systematically analyzed the DNA methylation data of LUAD and found that the hypermethylation of HOXA9 and the hypomethylation of KRTAP8-1, CCND1, and TULP2 were observed in LUAD tumor samples. This study identified that the methylation levels of HOXA9, KRTAP8-1, CCND1, and TULP2 may be helpful as LUAD-specific diagnostic panel in undetermined lung nodules.
Researchers have worked in the field of identifying methylation panels, which combine multiple genes for cancer detection as well as prognosis prediction. Ooki et al. proposed a methylation panel of six genes (CDO1, HOXA9, AJAP1, PTGDR, UNCX, and MARCH11) with a prediction accuracy of 92.2% in the training cohort and 93.0% in an independent testing cohort of stage IA primary NSCLC (21). Another study showed that hypermethylation of five genes (HIST1H4F, PCDHGB6, NPBWR1, ALX1, and HOXA9) was significantly associated with shorter relapse-free survival in stage I NSCLC (22). However, these studies mainly followed the path of directly combining altered methylation loci with known biological functions related to cancer. Such approach ignores the potential correlations between selected features, which will in turn result in marker redundancy. In addition, directly using methylation values as model input makes the model both vulnerable and difficult to generalize. As more and more extensive applications of machine learning techniques enter the fields of biomarker discovery and  prediction model building, we took advantage of unsupervised clustering method for a binary feature transforming and recursive feature elimination to select the most concise but simultaneously informative methylation marker set. The random forest machine learning model, which shows superior performance in the tasks of binary classification, was applied in the present study for model construction. Random forest is a useful algorithm, which can select featured genes differentially expressed among different samples. The advantage of random forest is that it can avoid overfitting effectively although the mechanism is not currently clear (23). In addition, we used a value of bagged trees sufficiently large to further settle down the error rate in the present study. In this study, we have identified a candidate prediction panel of four CpG loci located in HOXA9, KRTAP8-1, CCND1, and TULP2, which achieves an accuracy of 94.57%. The prediction panel could be validated in three independent testing cohorts, including two large datasets from TCGA and one experimental dataset. Remarkably, the results were similar among different datasets and were not affected by the platform used in the analysis. The results demonstrated that the application of random forest machine learning model could greatly improve the ability to discovery cancer biomarkers.
Of the four selected methylation markers, HOXA9 and CCND1 have previously been reported to participate in the occurrence and development of tumors. Nelson et al. identified a 10.3-fold higher level of methylation of HOXA9 in lung tumor than that in the adjacent normal tissues in 146 sample pairs (24). Other studies including 40 pairs of primary lung cancer and normal tissues as well as 185 induced sputum specimens also found that methylation of HOXA9 in lung cancer tissues was significantly higher compared with normal tissues. Further investigation discovered that the HOXA9 gene in lung cancer patients were significantly more hypermethylated compared with patients with benign lung diseases and the healthy group (25). In addition to HOXA9, the methylation of HOXB4, HOXD8, and HOXD9 were also different between tumor and adjacent normal tissues in lung cancer ( Table 2). HOX genes play crucial roles in a wide range of processes, including in lung development and are expressed in the normal adult lungs. Hence, abnormal expression or methylation levels of HOX gene might cause lung cancer (26). In addition to HOXA9, CCND1 has long been described as a prognosis maker for NSCLC. It plays key roles in G1/S phase transition and can promote tumor cell proliferation. Studies on 69 resected NSCLC samples between stages I and IIIA showed that overexpression of CCND1 is significantly positively correlated with lymph nodes metastasis, advanced pathological stages, and poor survival (27). As opposed to HOXA9 and CCND1, there is no report on changes in the methylation status in cancers of KRTAP8-1 and TULP2. The underlying mechanisms, by which methylation of these two genes changes the biology in tumors, need further investigation.
Although the prediction panel demonstrates high accuracy in both TCGA datasets, the present study still has some limitations. First, all data used were from methylation arrays. Nowadays, more and more high-throughput methylation detection methods such as reduced representation bisulfite sequencing (28) and methyl-sensitive cut counting technique (29) have been applied to both scientific and clinical examinations. An adjustment of our model for fitting a wider range of applications may be required. Second, we only studied our model in LUAD tumors and nonmalignant tissues. Thus, further analysis that aims to validate this prediction panel in the body fluids such as blood samples from lung cancer patients for non-invasive diagnostics will be required. Third, markers or marker panels for a more accurate prediction of NSCLC subtypes are also in urgent need to provide guidance for more personal and precise treatment and prognosis. Lastly, the prediction panel in this study was only validated in patients with LUAD. However, standard validation should be performed in the clinical setting of an intended-to-use cohort (3). Therefore, the efficacy of the prediction panel will be investigated in multiple cohorts such as patients with undetermined lung nodules in the future study. In summary, this study presents a process by which robust and accurate diagnostic panels can be obtained. It was applied to methylation data from lung cancer patients and relevant control samples. The study demonstrated the effectiveness of the procedure, which could be applied to different sample cohorts and diseases other than cancer, thereby greatly facilitating the discovery process of biomarkers in various fields. For lung cancer, it was shown that the methylation levels of HOXA9, KRTAP8-1, CCND1, and TULP2 have great potential for the early recognition of LUAD in the undetermined lung nodules. Their suitability for liquid biopsy still needs to be demonstrated.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found here: https://www.ncbi.nlm.nih.gov/gds and https://portal.gdc.cancer.gov.

ETHICS STATEMENT
The study was approved by the Institutional Review Board and the Ethics Committee of the Affiliated Tumor Hospital of Xiangya Medical School of Central South University and Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Chongming Branch, and written informed consent was obtained from each patient.