DNA Damage Response Gene-Based Subtypes Associated With Clinical Outcomes in Early-Stage Lung Adenocarcinoma

DNA damage response (DDR) pathways play a crucial role in lung cancer. In this retrospective analysis, we aimed to develop a prognostic model and molecular subtype based on the expression profiles of DDR-related genes in early-stage lung adenocarcinoma (LUAD). A total of 1,785 lung adenocarcinoma samples from one RNA-seq dataset of The Cancer Genome Atlas (TCGA) and six microarray datasets of Gene Expression Omnibus (GEO) were included in the analysis. In the TCGA dataset, a DNA damage response gene (DRG)–based signature consisting of 16 genes was constructed to predict the clinical outcomes of LUAD patients. Patients in the low-DRG score group had better outcomes and lower genomic instability. Then, the same 16 genes were used to develop DRG-based molecular subtypes in the TCGA dataset to stratify early-stage LUAD into two subtypes (DRG1 and DRG2) which had significant differences in clinical outcomes. The Kappa test showed good consistency between molecular subtype and DRG (K = 0.61, p < 0.001). The DRG subtypes were significantly associated with prognosis in the six GEO datasets (pooled estimates of hazard ratio, OS: 0.48 (0.41–0.57), p < 0.01; DFS: 0.50 (0.41–0.62), p < 0.01). Furthermore, patients in the DRG2 group benefited more from adjuvant therapy than standard-of-care, which was not observed in the DRG1 group. In summary, we constructed a DRG-based molecular subtype that had the potential to predict the prognosis of early-stage LUAD and guide the selection of adjuvant therapy for early-stage LUAD patients.


INTRODUCTION
Lung cancer is the major cause of global cancer mortality in 2020, with an estimated 1.8 million deaths worldwide (Sung et al., 2021). Non-small cell lung cancer (NSCLC) represents 85% of all lung cancers. Based on histology, NSCLC can be further divided into lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LSCC), large-cell carcinoma, etc. (Bender, 2014). The survival of patients with NSCLC is largely determined by the tumor stage at diagnosis. Only 15% of patients with late-stage disease (stages III-IV) are alive after 5 years, which makes NSCLC one of the cancers with the worst prognosis (Necchi et al., 2017). Although the 5-year survival rate increases to approximately 60% and 40% for stage I and stage II patients, respectively, around 30-55% of them experienced disease recurrence within 5 years after surgery (Howington et al., 2013;Wang et al., 2017). In recent years, the immuno-oncology (IO)-based strategies, such as immune checkpoint inhibitors (ICIs), the combination of different ICIs, or chemotherapies, have achieved evolutionized improvements in the treatment for a subset of patients with lung cancer (Listì et al., 2019;Passiglia et al., 2021). Besides the breakthrough in cancer therapy, it is also important to improve recurrence prediction and clinical management with the increase of early-stage tumors due to the progress of lung cancer screening.
The rapid development of high-throughput technologies, especially DNA microarrays and RNA-sequencing, has facilitated the exploration of several expression-based gene signatures for risk stratification in NSCLC patients. Beer et al. proposed a 50-gene signature to identify low-and high-risk stage I lung adenocarcinomas using microarray analysis (Beer et al., 2002). The Director's Challenge Consortium validated the performance of several such prognostic models in a large multi-site cohort with 442 lung adenocarcinomas (Chen et al., 2007;Shedden et al., 2008;Sun et al., 2008). In addition, a 14-gene expression signature (RT-PCR-based) has been commercialized to stratify different risk groups for resected non-squamous NSCLC patients (Kratz et al., 2012). A 25-immune gene signature and a 31-proliferation gene signature both have shown promising clinical utility for risk stratification and individualized management in NSCLC patients (Wistuba et al., 2013;Li et al., 2017). However, none of these signatures was further analyzed in patients with and without adjuvant therapy to validate the potential clinical utility in the guidance of adjuvant therapy.
Genomic instability is one of the key hallmarks of cancer, and DNA damage response (DDR) plays a significant role in maintaining genomic integrity (Hanahan and Weinberg, 2011). The DDR system is a complex signaling network which involves eight pathways: base excision repair (BER), mismatch repair (MMR), homologous recombination repair (HRR), nonhomologous end joining (NHEJ), checkpoint factors (CPF), Fanconi anemia (FA), nucleotide-excision repair (NER), and DNA translesion synthesis (TLS) (Scarbrough et al., 2016). These pathways operate collectively to detect diverse types of DNA lesions and activate signaling mechanisms to boost the repair machine (Jackson and Bartek, 2009). Previous studies have demonstrated that the DDR pathways play significant roles in cancer progression and the response to cancer therapies. Several prognostic models, based on DDR genes, have been constructed for glioblastoma, ovarian cancer, and low-grade gliomas (Knijnenburg et al., 2018;Gobin et al., 2019;Sun et al., 2019;Pang et al., 2020). However, the DDR genes identified in these prognostic models vary widely between different cancers, suggesting that DDR genes may exert different molecular effects in different cellular environments. The relationships of various DDR genes with prognosis in lung adenocarcinoma are not well-established.
In this study, we aimed to identify and validate a group of DDR genes to stratify early-stage LUAD patients into different subtypes with different prognoses and guide the use of adjuvant therapy.

Molecular and Clinical Data
The LUAD dataset of The Cancer Genome Atlas (TCGA) and six microarray datasets of Gene Expression Omnibus (GEO) were included in the analysis. For the TCGA dataset, RNA-sequencing data (FPKM format), genetic mutations, copy number variant (CNV), and clinical features, including age, sex, tumor stage, histology subtype, adjuvant treatment, and follow-up information, were obtained from the GDC (https://portal.gdc. cancer.gov/). In addition, normalized microarray data and the corresponding clinical characteristics of patients with early-stage (stages I and II) lung adenocarcinoma from six GEO cohorts (GSE31210, GSE37745, GSE68465, GSE30219, GSE72094, and GSE13213) were obtained for further external validation in this study.

DDR Gene-Based Signature Construction
A total of 200 DDR-related genes were curated and analyzed to identify prognosis-related markers (Scarbrough et al., 2016). These genes used in the study are listed in Supplementary  Table S1. Univariable Cox regression and LASSO Cox regression analyses with minimum partial likelihood deviance were performed to select genes associated with OS. We defined the risk score using the following formula riskscore n k 0 coef of gene k p expr of gene k , where n is the number of markers. The nearest neighbor estimation method was applied to identify the best cutoff point of risk score to stratify patients into high-and low-risk subgroups. Kaplan-Meier (KM) analysis and receiver operating characteristic (ROC) curve were used to assess the performance of the signature.

Association Between DDR Signature and Genome Features and Gene Expression
In order to explore the potential molecular mechanisms of the DDRgene-based signature, the associations of the DDR signature with somatic mutation, CNV, genomic scar signature, and gene expression data were analyzed based on TCGA data.

DDR Molecular Subtype Identification and Validation
Unsupervised clustering with the hierarchical cluster algorithm (based on Euclidean distance and Ward's linkage) of the expression profiles of the genes in the DDR signature was performed to identify molecular subtype in early-stage LUAD. The default parameters of the hclust function were used to perform the classification. The cluster number was selected as 2. It was further validated in six GEO datasets.

Statistical Analyses
R software v4.0.2 was used for all the bioinformatics and statistical analyses, including data preprocess, LASSO Cox regression, CNV and mutation visualization, and ROC analysis. The KM method and log-rank test were adopted to generate and evaluate the statistical significance of the survival curves between groups. The specificity and sensitivity of the signature were evaluated using the ROC curve, and the area under the curve (AUC) of distinct survival time was quantified using R-package pROC. The Kappa consistency test was used to analyze the consistency between the two group methods. The Cox proportional hazards model was applied to identify the independence of the signature. The prognostic values of single genes in signatures were accessed using the "szcox" function of the ezcox package. R-package SubgrPlots was used for subgroup analysis, which was visualized using the Forester package. A propensity score matching (PSM) analysis was performed according to a 1:1 ratio between the two subgroups (with or without adjuvant therapy) to adjust for clinicopathologic characteristics bias using the MatchIt package. Heat maps of TCGA-LUAD and GEO datasets were generated using the pheatmap package. The maftools package was used to visualize the mutation landscape in the TCGA-LUAD dataset. Two-sided p < 0.05 was considered to be statistically significant.

Association Between the 16 Genes and Clinicopathological Factors
To further study the underlying mechanism of the DRG, we explored the molecular function and the association with prognosis of genes in the DRG. Most of them had positive coefficients in this regression equation with HR > 1, indicating poor prognostic genes, while genes (REV3L, REV1, NEIL1, and HFM1) had negative coefficients with HR < 1 (Figure 2A). To depict the genomic and expression alterations of the 16 DDR genes, we further described the prevalence of somatic mutations, CNV, and mRNA expression of the 16 genes in LUAD patients ( Figure 2B). Of the 486 LUAD patients, 63 (13.0%) patients harbored at least one mutation of the pattern genes. Among them, HFM1 had the highest mutation frequency (4%) followed by EOX1 and REV3L, while there were no mutations in ERCC1, NEIL1, and PCNA. Meanwhile, CNV analysis showed that EXO1, NBN, and POLD2 had a widespread frequency of CNV gain. Furthermore, the mRNA expressions for these genes were significantly higher in patients with CNV gain, suggesting CNV alteration may be a vital contributor to the altered mRNA expression of these genes. Moreover, protein expression levels of 13 genes were obtained from The Human Protein Atlas (THPA). Representative IHC images revealed that these proteins had upregulated expression in lung adenocarcinoma tissues and downregulated expression in normal lung tissues (Supplementary Figure S1 and Supplementary Table S3).
To identify the biological significance of the genes in the DRG signature, GO analysis was conducted, and the results revealed that these genes were enriched in DNA-dependent DNA replication, nucleotide-excision repair, DNA recombination, and DNA geometric change. Furthermore, the outcomes of KEGG pathway analysis illustrated that these genes were mainly enriched in the Fanconi anemia pathway, base excision repair, and homologous recombination (Supplementary Figure S2).
Next, we investigated the associations between the two groups and various genomic features. The high-risk group was associated with higher aneuploidy score (AS), tumor mutational burden (TMB), SNP, and indel burden than the low-risk group ( Figure 2C-H). Higher mRNA expression of ATM was observed in the low-risk group than in the high-risk group, and expression pattern of TP53, ATM, and ATR between high-risk and low-risk patients in the TCGA dataset. High, gray; low, yellow. (P) Expression profiles of 16 genes between high-and low-risk groups in the stages I and II TCGA-LAUD dataset. p-value of continuous variables was tested using Wilcoxon rank-sum test. Pearson's chisquare test was used to test the categorical variables.
Frontiers in Molecular Biosciences | www.frontiersin.org June 2022 | Volume 9 | Article 901829 5 while not for TP53 and ATR ( Figure 2I-K). We also observed that samples in the high-risk group exhibited higher genomic instability-telomeric allelic imbalance (TAI), large-scale state transitions (LST), loss of heterozygosity (LOH), and an incorporated homologous recombination deficiency (HRD) score ( Figure 2L-O). These results showed the heterogeneity in genomic scar and DDR checkpoint gene expression between the two groups.

Molecular Subtype Identification
As shown in Figure 2P, two expression patterns of the 16 genes were identified from the expression heat map of these signature genes in patients with stages I and II LUAD from the TCGA cohort. Patients in the low-risk group had better clinical outcomes (OS and DFS) and showed significantly higher expressions of REV1, REV3L, HFM1, and NEIL1, while the other genes had significantly lower expressions in this group. Meanwhile, the low-risk group had a higher percentage of patients with stage I than in the high-risk group (Chi test, p = 0.0025). The abovementioned results demonstrated that the DRG-related genes could be used to classify the early-stage LUAD patients.
Unsupervised hierarchical clustering (based on Euclidean distance and Ward's linkage) of the expression profiles of DDR genes was used to identify molecular subtype instead of the formula derived from the TCGA cohort. The expression profile of the 16 genes was used to develop a DRG-related molecular subtype to stratify early-stage LUAD into two subtypes (DRG1 and DRG2) with statistically significant differences in clinical outcomes. A clustering heat map was generated to illustrate that the expressions of DRG-related genes were significantly different between the two subtypes ( Figure 3A). The Kappa consistency test revealed the consistency of the two methods (DRG and molecular subtype, K = 0.61, p < 0.001, Figure 3A). As shown in Figure 3B, 71.4% (142/199) of the low-risk DRG patients were grouped into DRG1 subtype, and 89.9% (169/188) of the high-risk DRG patients were grouped into DRG2 subtype. Similar results were also documented in the Kaplan-Meier analysis (DFS, log-rank p = 0.001; HR = 0.57, 95% CI: 0.40-0.80; OS, log-rank p < 0.001; HR = 0.43, 95% CI: 0.28-0.65, Figure 3C,D). After adjusting for clinical factors, the molecular subtype remained an independent prognostic molecular classifier for DFS and OS (DFS, HR = 0.60, 95% CI: 0.41-0.86, p = 0.006; OS, HR = 0.50, 95% CI: 0.32-0.76, p = 0.011, Figure 3E). These results indicated that DRG-related genes could stratify early-stage LUAD into two molecular subtypes with distinct prognosis.

Validation in GEO Datasets and Meta-analysis
In order to validate the molecular subtype and prognostic prediction of the DRG-related genes, a total of 1,285 stage III LUAD patient RNA expression microarray data were collected. The expression patterns of these genes and the survival status of patients in each GEO dataset are shown in Figure 4 and Supplementary Figure S3 A meta-analysis was performed with a fixed-effects model, and the results indicated that compared with the DRG2 subtype, patients with the DRG1 subtype exhibited higher OS (HR = 0.48, 95% CI: 0.41-0.57, p < 0.01, Figure 4I) and DFS (HR = 0.5, 95% CI: 0.41-0.62, p < 0.01, Figure 4J) in the overall dataset. Heterogeneities were not significant in all pooled analyses (OS, p = 0.63; DFS, p = 0.43).
Then, we tested whether the molecular subtype could serve as an independent prognostic factor for early-stage lung adenocarcinoma. In multivariable analysis, the associations of DDR subtypes and prognosis were still significant (Tables 2, 3), which confirmed that the selected DDR genes could stratify patients with different prognoses.

Subgroup Analysis
A stratification analysis was conducted to assess whether clinical factors had interaction effects on the DRG subtypes. Patients in TCGA and GSE31210 datasets were artificially stratified based on clinical factors, such as age (≤60/>60), sex (female/male), stage (I/ II), smoking (ever/never), and adjuvant treatment (no/yes). As shown in Figure 5A,B and Supplementary Figure S4A-B, patients in the DRG1 subtype had higher OS and DFS than the DRG2 subtype irrespective of their age, sex, and smoking status. Meanwhile, a significant interaction (p = 0.01) between adjuvant treatment and DRG subtypes was observed in earlystage LUAD patients. Furthermore, we examined the association between adjuvant treatment and prognosis in DRG1 and DRG2 subtypes. We found that in the DRG2 subtype, patients with adjuvant treatment tended to have longer OS and DFS than patients without adjuvant treatment (OS, log-rank p = 0.259, HR = 0.33, 95% CI: 0.04-2.5; DFS, log-rank p = 0.105, HR = 0.32, 95% CI: 0.08-1.4), while in the DRG1 subtype, the results were opposite (OS, log-rank p = 0.001, HR = 5.3, 95% CI: 1.7-16; DFS, log-rank p < 0.001, HR = 6.7, 95% CI: 3.0-15). The abovementioned observation was not statistically significant because the sample size was limited (Supplementary Figure  S4C,D and Figure 5C,D). After the patients were matched by propensity score, similar results were observed (Supplementary Figure S5).

DISCUSSION
In this study, we trained and validated 16 DDR genes with prognostic values and classification effects in early-stage lung adenocarcinoma and classified patients into two subtypes, DRG1 and DRG2. Furthermore, we found that the DRG1 patients without adjuvant therapy and the DRG2 patients with adjuvant therapy tended to have prolonged survival than other patients in the corresponding subtypes.
The DDR system comprised eight pathways with diverse biological functions to maintain genomic integrity. In this study, we discovered that the 16 identified DDR genes were mainly involved in TLS, NER, and BER pathways. REV3L and REV1 were involved in TLS whose lower expressions were associated with worse prognoses. In human cells, when the expressions of TLS genes decrease, the DNA replication stress escalates the accumulative fork stalling and double-strand breaks (DSBs), resulting in genome instability and poor survival (Ghosal and Chen, 2013). These two genes are important DNA polymerase and deoxycytidyl transferase, which play significant roles in maintaining genome stability in the advent of DNA damage (Sasatani et al., 2020;Zhou et al., 2020). It has been reported that lower REV3L expression was also shown to be associated with lower DFS and OS (Agulló-Ortuño et al., 2020), which was consistent with our findings.
Furthermore, we discovered that genes with higher expression in the DRG2 subtype were mainly involved in NER and BER pathways. RAD23B, DDB1, and ERCC family genes (ERCC1, ERCC5, and ERCC6) are key genes in the NER pathway. Many studies have reported that they are significantly correlated with prognosis in different cancer types, such as colorectal cancer, pancreatic cancer, gastric cancer, and so on (Luo et al., 2018;Zhang et al., 2019;Li et al., 2021). NEIL3, PCNA, RFC3, and POLD2 play important roles in the BER pathway which are recruited to DNA lesions and cleave and repair the damaged bases cooperatively (Robertson et al., 2009;Hurst et al., 2021;Wang et al., 2021). Zhao et al. found that NEIL3 activated cell cycle progression, leading to poor prognosis (Zhao et al., 2021). Zhang et al. discovered that RFC3 was involved in the epithelial-mesenchymal transition in lung adenocarcinoma, resulting in worse survival (Gong et al., 2019). In tumorigenesis, increased DNA replication stress results in the increased generation of reactive oxygen species (ROS), leading to DNA damage (Jackson and Bartek, 2009). Accumulating evidence supports that NER and BER pathways are involved in the repair of oxidative DNA lesions. Therefore, high expressions of NER and BER genes suggest that more oxidative DNA lesions are being generated, which lead to genome instability and poor prognosis (Melis et al., 2013). Therefore, the imbalance of DNA damage and repair can increase the genome instability and promote tumor cell proliferation which might contribute to worse survival.
The use of adjuvant therapy in early-stage (IA-IIB) lung adenocarcinoma is controversial in NCCN guidelines and mainly depends on the physician's experience (Zheng and Bueno, 2015).
Although several studies have constructed various gene expression signatures to stratify LUAD patients, none of them have provided sufficient evidence about whether the high-risk patients could benefit from adjuvant therapy (Chen et al., 2007;Shedden et al., 2008;Sun et al., 2008). In our study, we classified LUAD patients into DRG1 and DRG2 subtypes and explored the interaction between these subtypes and adjuvant therapy in the GSE31210 dataset, which was a relatively rigorous clinical trial with clear inclusion and exclusion criteria. The patients in GSE31210 received no neoadjuvant therapies before surgery, whose stages were pathologically defined. Based on the GSE31210 cohort, we found that in the DRG2 subtype, the prognosis of patients who received adjuvant therapy had prolonged survival than those who did not, whereas in the DRG1 subtype, the patients without adjuvant therapy had better prognosis. Several previous studies also revealed that the low activity of TLS including low expression of REV3L enhanced the chemosensitivity of cancer Yang et al., 2015;Agulló-Ortuño et al., 2020), which supports our findings that patients in the DRG2 subtype may benefit from adjuvant chemotherapy. In summary, the different clinical benefits of adjuvant therapy in various subtypes suggest that DRG subtypes have the potential to guide the selection of adjuvant therapy for early-stage LUAD patients.
In the present study, we identified novel DDR-gene expression subtypes and explored the association with prognosis and adjuvant therapy. However, there are still some limitations in our study. The current study was a retrospective analysis with a limited sample size in a public database. In addition, the mRNA expression in our study was based on RNA-seq or microarray whose results were less stable than those of RT-PCR or IHC, so the evidence would be more solid if it was validated by RT-PCR or IHC, as well as more cost-effective in clinical application scenarios (Pisapia et al., 2022). However, our findings have been validated in six independent cohorts to reduce false-positive results, and they were further validated in the adjuvant therapy subgroups to confirm the role in guiding therapy selection. In the future, prospective studies with large sample sizes are required to confirm the clinical utility of the 16 DDR-gene expression subtypes on the platform of RT-PCR or IHC. In summary, we explored the association between DDR gene expression and prognosis in patients with stage I or II LUAD. Sixteen DDR gene-related subtypes were constructed to predict prognosis and guide the use of adjuvant therapy. More research studies are warranted to further confirm the clinical utility of the 16 DDR-gene classifiers.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.

AUTHOR CONTRIBUTIONS
Conceptualized and designed by JD and LC; administrative support provided by JD and LC; provision of study materials or patients by YZ, BQ, CX, JZ, YL, PC, GW, SC, and YS; collection and assembly of data by BQ and JZ. Data analysis and interpretation conducted by YZ, CX, and YL. All authors contributed to the writing of manuscript. All authors have provided final approval of the manuscript.

FUNDING
This work was supported by the General Program of National Natural Science Foundation of China (81972905).