Identification of the Signature Associated With m6A RNA Methylation Regulators and m6A-Related Genes and Construction of the Risk Score for Prognostication in Early-Stage Lung Adenocarcinoma

Background N6-methyladenosine (m6A) RNA modification is vital for cancers because methylation can alter gene expression and even affect some functional modification. Our study aimed to analyze m6A RNA methylation regulators and m6A-related genes to understand the prognosis of early lung adenocarcinoma. Methods The relevant datasets were utilized to analyze 21 m6A RNA methylation regulators and 5,486 m6A-related genes in m6Avar. Univariate Cox regression analysis, random survival forest analysis, Kaplan–Meier analysis, Chi-square analysis, and multivariate cox analysis were carried out on the datasets, and a risk prognostic model based on three feature genes was constructed. Results Respectively, we treated GSE31210 (n = 226) as the training set, GSE50081 (n = 128) and TCGA data (n = 400) as the test set. By performing univariable cox regression analysis and random survival forest algorithm in the training group, 218 genes were significant and three prognosis-related genes (ZCRB1, ADH1C, and YTHDC2) were screened out, which could divide LUAD patients into low and high-risk group (P < 0.0001). The predictive efficacy of the model was confirmed in the test group GSE50081 (P = 0.0018) and the TCGA datasets (P = 0.014). Multivariable cox manifested that the three-gene signature was an independent risk factor in LUAD. Furthermore, genes in the signature were also externally validated using the online database. Moreover, YTHDC2 was the important gene in the risk score model and played a vital role in readers of m6A methylation. Conclusion The findings of this study suggested that associated with m6A RNA methylation regulators and m6A-related genes, the three-gene signature was a reliable prognostic indicator for LUAD patients, indicating a clinical application prospect to serve as a potential therapeutic target.


INTRODUCTION
Lung adenocarcinoma (LUAD) is a type of non-small cell cancer. In the 2018 Global Cancer Report, lung cancer ranked top 1 with the highest incidence and mortality among all cancers (Bray et al., 2018).
N6-methyladenosine (m 6 A) RNA methylation is the most abundant epigenetic modification in eukaryotic mRNA. M 6 A methylation regulators of each modified RNA require a writer to place, an eraser to erase, and a reader to read. Based on these proteins, m 6 A affected RNA splicing (He et al., 2019), translation, and RNA stability (Wang et al., 2014;He et al., 2019). Evidence is now mounting that m 6 A methylation underlies the progression of tumors and affects specific biological processes through non-coding RNA modification (Xiao et al., 2019). Moreover, the over-expression of YTHDF1 in the reader might affect the prognosis of ovarian cancer patients (Liu et al., 2020). In the writer family, high expression of METTL3 promoted the proliferation of bladder cancer (Cheng et al., 2019) and led to a poor prognosis (Han et al., 2019). Over-expression knockdown of ALKBH5 could effectively reduce cell proliferation in pancreatic cancer in erasers family (Tang et al., 2020). Meanwhile, m 6 A has many functions in cancer (He et al., 2019;Ma et al., 2019;Ma and Ji, 2020), such as reduced m 6 A has a relationship with phenotypes of gastric cancer , KIAA1429 is associated with prognosis of liver cancer (Lan et al., 2019), and FTO could facilitate the development of breast cancer (Niu et al., 2019). However, to our knowledge, there are few studies related to m 6 A methylation in early LUAD, and this may be a novel treatment strategy for patients with early LUAD.
In this study, GEO and TCGA data were used to explore the influence of m 6 A methylation genes and their regulated genes on the prognosis of early lung adenocarcinoma. The signature was conducted for identifying new therapeutic biomarkers and treatment strategy development.

Expression Data
Data was downloaded from Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) public databases. GSE31210 (n = 226) was used as the training set, GSE50081 (n = 128) as the validation set 1, and TCGA (n = 400) data as the validation set 2. Three independent datasets were used for model construction and model verification. Each independent dataset included the clinical characteristics: survival status, survival time, age, sex, and clinical TNM stage. In GEO data, TNM clinical stage was divided into stages I and II, which were shown in Table 1. Besides, the GPL570 chip platform was re-annotated by the probe to get the final expression profile of the GEO data (Fan et al., 2018). Only mRNA probes were selected, and 8,597 mRNA expression profiles were obtained.

Selection of m 6 A RNA Methylation Regulatory Factors and m 6 A-Related Genes
We collected 21 m 6 A methylated genes through literature investigation (Supplementary Table 1) . We found that among these 21 genes, 14 genes were expressed in   the training set (GSE31210). In LUAD, a total of 5,486 m 6 Aregulated genes were downloaded from the m 6 Avar database 1 (Zheng et al., 2018). Among the 5,507 genes, 2,615 genes were expressed in GSE31210.
Discovery of the m 6 A RNA Methylation Regulators and m 6 A-Related Genes and Establishment of the m 6 A Methylation Risk Score Model.
We obtained prognostic-related gene sets through survival analysis [univariate cox and Kaplan-Meier (KM)] and receiver operating characteristic (ROC) curve. In the training set GSE31210, we used the random survival forest (RSF) (Ishwaran et al., 2008) to establish a prognostic model related to patient overall survival (OS). Methods of analyzing survival data were often parametric, nonlinear effects of variables, and modeled by expanding matrix for specialized functions. Identifying multiple variable interactions was also problematic. These difficulties could be effectively solved using RSF (Ishwaran et al., 2008). Its basic formula is: The meanings of the parameters in this formula are: RS is the risk score, N is the number of genes, Exp is the expression amount of genes in the data, and Coef is the coefficient of cox analysis for the genes resulting from the random survival forest. We used gene combinations to select the largest AUC to construct a prognostic model. Based on the median of the risk scores, the data were divided into two groups: the high-risk group and the low-risk group. The KM curve was used to compare the difference between the high and low-risk groups. In the three datasets, using the median score divided two groups.

Estimation of Outcome Signature for Patients' Prognosis and Its Relationship With Clinical Characteristics
To assess the characteristics of the patients' prognosis and its relationship with clinical features, we used chi-square analysis to judge the correlation between the model and clinical data. KM survival curve and log-rank test were used to describe the relationship between the model and OS. Furthermore,  Frontiers in Genetics | www.frontiersin.org multivariate Cox regression analysis was used to study whether the clinical data (age, gender, and pathological stage) were related to OS in the training set and validation set. We used univariate Cox analysis to judge whether the clinical information had prognostic value.

External Validation of the Genes in the Gene Signature
Furthermore, four online tools were used to verify the gene expression levels (Oncomine database 2 ; TIMER database 3 , and GEPIA database, Gene Expression Profiling Interactive Analysis 4 ) and protein levels [The Human Protein Atlas (HPA) database 5 ] in the model. Meanwhile, online public databases (The cBioPortal for Cancer Genomics 6 ) were used to analyze and understand the gene influence on early treatment of LUAD.

Function Notes and Protein-Protein Interaction
The R package "clusterProfiler" was used to annotate the function and select the statistically significant pathways. The relationship between proteins was analyzed by using the online website STRING 7 (Szklarczyk et al., 2019). Cytoscape was used to visualize the network. Then the main networks were chosen by a degree of the gene in the net analysis.

Statistical Analysis
In three independent datasets, all KM and cox analyses were performed using the R package "survival". Cox analysis was used to select prognostic genes and test models. "ROC" and "TimeROC" were available to verify the feasibility of the model. Functional annotations were made using the R package "Clusterprofiler." All of our analyses (besides online website analysis) were performed in the R language. R packages were used as follows: "pROC, " "TimeROC, " "survival, " "clusterProfiler' , " and "randomForestSRC." The P values of the above analyses were all <0.05 as statistically significant.

Patient Population
All 226, 128, and 400 patients diagnosed with LUAD were collected from the GEO (GSE31210 and GSE50081) and TCGA database, respectively. A total of 2,615 m 6 A-related genes out of the genes expressed were identified in the GSE31210 dataset. In Table 1, the median age of the enrolled samples was 61 years. The ratio of male vs. female was 1.15:1, with 191 live cases and 35 death cases. The longest survival was 10 years. Each sample data was only distributed in stages I-II of LUAD. The study workflow is demonstrated in Supplementary Figure 1. Construction of the Risk Score Model, the m 6 A RNA Methylation Regulators, and m 6 A-Related Genes Risk Score After we used univariate cox analysis and ROC curve, 218 prognostic-related genes were selected, and the screening criteria were P < 0.01 and AUC > 0.6 in Supplementary  Table 2. Furthermore, gene screening was performed by the importance scores of the random survival forest analysis. We permuted and combined the eight genes selected from the random survival forest, obtaining 2 8 -1 = 255 prediction models (Figures 1A,B). The 255 models were evaluated by AUC, and the optimal predictive ability was found in the combination of three genes, ZCRB1, ADH1C, and YTHDC2. As a prediction model, the AUC of the three-gene model was 0.762 ( Figure 1C). The risk score of the model was RSF = (1.725151 × ZCRB1) + (−1.964326 × ADH1C) + (−2.015378 × YTHDC2). Each gene name represented its expression level in a certain sample. We used the RSF formula to calculate the risk score of each sample and plotted the heat map of the three genes (Figures 2A-C), finding that in the high-risk group, ADH1C and YTHDC2 basically had low expression, while ZCRB1 obviously had high expression. This was particularly evident in the training group (GSE31210) (Figure 2A).
The results made it clear that ADH1C had high expression in the low-risk group as a protection factor by cox analysis ( Table 2).

The Validation of Performance in Predicting Overall Survival
In the training set, the median risk score divided all patients into two groups: high-risk group (n = 113) and low-risk group (n = 113) ( Figure 3A). The KM survival curve and log rank test showed that our model had an excellent predictive power. In the validation set, the median risk score was also used to divide the patients into two groups in GSE50081 (n = 128), and there were 64 patients in the high-risk group and 64 patients in the low-risk group (Figure 3B). The KM survival curve showed that there was a significant difference between the high-risk group and low-risk group (Log rank P = 0.0018). Grouped by median risk score in TCGA (n = 400), there were 200 patients in the highrisk group and 200 samples in the low-risk group, with log rank P = 0.014 ( Figure 3C).
In the training group (GSE31210), the 5-years survival rate was 53.10% in the low-risk group and 38.05% in the highrisk group (Figure 3A). Additionally, the overall survival rate was 45.58%, indicating that the risk score could differentiate the data correctly. Survival was significantly improved in the two independent validation data (GSE50081 and TCGA). Moreover, in GSE50081, the low-risk group was 56.25% and the high-risk group was 29.69% ( Figure 3B). The overall survival rate was 42.97% and the grouping label was also evident. Meanwhile, in TCGA, we selected a sample data of TNM stage (I+II) (a total of 400 cases) ( Figure 3C).
Five-years survival rate was calculated in the high-and lowrisk groups, and the rates were 14.5 and 8%, respectively. The overall 5-years survival rate was 11.25% in both lowand high-risk groups, and the survival rate in the low-risk group had markedly improved. Using time ROC in 5-years survival circumstances, we found that the label had an excellent prediction effect (Figures 3D-F). In GSE31210 and GSE50081, the AUC was 0.773 and 0.656, respectively, and the AUC was 0.647 in the TCGA.

The Relationship Between the Signature and Clinical Characteristics
The association was demonstrated between the model and clinical information through the chi-square test in Table 3. There was a significant relationship between the pathological stage and the model (P < 0.05) in the GEO independent dataset rather than the TCGA dataset. Besides, there were 401 females in 754 cases, accounting for 53.18% of the total. A multivariate Cox test was utilized to determine if the signature had an independent prognostic value as a factor. The results in Table 4 showed that the signature was a risk factor, and it was statistically significant (high-vs. low-risk, GSE31210, HR = 16.24, 95% CI 3.85-68.58, P < 0.001, n = 226; GSE50081, HR = 2.23, 95% CI 1.24-4.02, P = 0.008, n = 128; TCGA, HR = 1.50, 95% CI 1.03-2.18, P = 0.036, n = 400). Univariate Cox also indicated that the signature was a risk factor.

Functional Annotation and Protein-Protein Interaction
A 218 gene set, obtained from survival analysis and AUC analysis, was used for functional annotation and PPI network analysis. Among the 218 genes, including m 6 A RNA methylation  regulatory factors, ElAVL1, METTL14, and YTHDC2 were significantly associated with OS in LUAD. Top 10 biological processes (BPs) and cellular components (CCs) were selected by functional annotation of 218 genes, among which several results of BPs were related to division (nuclear division, organelle fission) and regulation (regulation of mRNA metabolic process and regulation of chromosome organization). The primary outcome of CCs was linked to the chromosome (Figures 4A,B). PPI network was constructed by STRING, generated and visualized in Cytoscape. The combined score of the PPI criteria was >0.9. The PPI network had 88 relationships, and some genes were removed that were not part of the network (Figure 4C).
Many key genes were observed in the network, such as PLK1, CCNB1, MAD2L1, RHOA, and ACTR2.

External Validation Using Online Database About Genes in the Signature
The results of external validation data were consistent with our results. In LUAD, two genes YTHDC2 and ADH1C were lowly expressed in the three sets of independent data (Figure 5A), which was almost the same in both the TIMER database (Figure 6) and the GEPIA database (Supplementary Figure 2). Interestingly, the aberrant expression of the three genes were frequently observed in various cancers and showed some tissue-dependent patterns. For example, ZCRB1 was overexpressed in lymphoma, and ADH1C in cervical cancer and esophageal cancer, and YTHDC2 in breast cancer, gastric cancer, head and neck cancer, myeloma, and sarcoma ( Figure 5A).
Survival analyses for each gene in the signature (ZCRB1, ADH1C, and YTHDC2) were performed in the cohorts of GSE31210, GSE50081, and TCGA datasets (Figure 7). ZCRB1 low-expression patient group displayed more OS than ZCRB1 high-expression patient group in GSE31210. While, ADH1C and YTHDC2 high-expression patient group displayed more OS than low-expression patient group not only in GSE31210 but also in GSE50081 and TCGA dataset.
We then reviewed the proteomic data and found YTHDC2 protein was reported significantly underexpressed in non-small cell lung cancer (Sun et al., 2020). The representative protein expression of ADH1C and YTHDC2 was explored in the human protein profiles and is shown in Figure 5B. Nevertheless, ZCRB1 was not found in the HPA website. YTHDC2 possessed the most frequent genetic alterations (3%) among the three genes. Meanwhile, amplification mutation, missense mutation, and deep deletion were the most common alterations among the three genes ( Figure 5C). In summary, we further verified the abnormal expression of these three genes in LUAD, and genetic changes may help explain the aberrant expression of these genes to a certain extent.

DISCUSSION
At the post-transcriptional level, more than 160 kinds of chemical modifications were discovered in various RNAs (Roundtree et al., 2017;Boccaletto et al., 2018). Among these modifications, more and more evidence showed that m 6 A modification had an essential effect on some underlying  diseases and prognosis of tumors. Therefore, the identification of m 6 A RNA methylation regulators and m 6 A-related genes in fatal LUAD may offer valuable therapeutic targets to us and clinicians. Doctors usually diagnosed LUAD as advanced, and there was a high death rate with it. Many studies illuminated that the m 6 A process was linked to lung cancer, which made m 6 A RNA methylation regulators and m 6 A-related genes potential biomarkers for clinical practice. According to our research, the classification of m 6 A-related genes in LUAD patients was in association with prognosis. We identified a signature that consisted of one m 6 A RNA methylation regulator (YTHDC2) and two m 6 A-related genes (ZCRB1 and ADH1C) using different statistical and machine learning methods.
Up to now, little is known about the role of YTHDC2 in tumorigenesis. Even less so in LUAD, two studies were found on the role of YTHDC2 in LUAD in recent studies. In a mouse model, low YTHDC2 expression was associated with poor prognosis in LUAD patients, and YTHDC2 improves the prognosis of LUAD patients by inhibiting the independent antioxidant function of SLC7A11 (Ma et al., 2021). In non-small cell lung cancer, a research analyzed a series of publicly available online databases and found that low YTHDC2 expression was associated with lymph node metastasis and poor prognosis (Sun et al., 2020).
ZCRB1 is a zinc finger CCHC-type and RNA binding motif containing 1. A previous study found that it was U12-type splicing playing a pivotal role by RefSeq analysis. However, the function of ZCRB1 was rarely reported in cancer, only in two studies. For example, ZCRB1's high expression can improve viral replication and transcription (Tan et al., 2012). Through genome-wide analysis of lung adenocarcinoma and healthy subjects, it was found that ZCRB1 may encode viral receptors. COVID-19 has infected plenty of people around the world, and ZCRB1 high expression may impact patients' prognosis (Cotroneo et al., 2021).
ADH1C is alcohol dehydrogenase 1C (class I). Many reports showed that drinking had an effect on some diseases. High expression of ADH1C was found to protect patients of nonsmall cell lung cancer (Wang et al., 2018). Using machine learning algorithms, the researchers found that ADH1C could be a prognostic marker (Shen et al., 2019).
For the reversible effect of m 6 A on mRNA expression, we believe that m 6 A-related genes may have different functional patterns and networks when participating in malignant tumors. Thus, m 6 A-related genes may have different expression patterns in LUAD. In previous research, little was known about the interaction of m 6 A-related genes. Moreover, m 6 A RNA methylation regulators (WTAP, RBM15, KIAA1429, YTHDF1, and YTHDF2) were linked with TP53 and highly expressed in TP53 mutant LUAD (Zhuang et al., 2020). However, it is worth nothing that whether the TP53 mutant affects the expression of ZCRB1, ADH1C, and YTHDC2 is still unclear, and more evidence is needed to clarify their mechanism.

CONCLUSION
In conclusion, our study systematically analyzed the expression, prognostic value, protein-protein interaction, and potential function of m 6 A RNA methylation regulators and m 6 Arelated genes. We found that the expression of m 6 A RNA methylation regulators and m 6 A-related genes was closely related to the clinicopathologial characteristics of LUAD. A three-gene signature was identified that might effectively identify new therapeutic targets or strategies for LUAD. In summary, our study provided important clues for further studies on the role of RNA m 6 A methylation regulators and m 6 A-related genes in LUAD.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding authors.

AUTHOR CONTRIBUTIONS
CY, XJ, and WZ: study conception and design. BG, HZ, and CY: manuscript writing. BG, RW, and CY: literature review. All authors: data interpretation, discussion, final editing, and approval of the manuscript in its present form.