DNA Repair–Related Gene Signature in Predicting Prognosis of Colorectal Cancer Patients

Background: Increasing evidence have depicted that DNA repair–related genes (DRGs) are associated with the prognosis of colorectal cancer (CRC) patients. Thus, the aim of this study was to evaluate the impact of DNA repair–related gene signature (DRGS) in predicting the prognosis of CRC patients. Method: In this study, we retrospectively analyzed the gene expression profiles from six CRC cohorts. A total of 1,768 CRC patients with complete prognostic information were divided into the training cohort (n = 566) and two validation cohorts (n = 624 and 578, respectively). The LASSO Cox model was applied to construct a prediction model. To further validate the clinical significance of the model, we also validated the model with Genomics of Drug Sensitivity in Cancer (GDSC) and an advanced clear cell renal cell carcinoma (ccRCC) immunotherapy data set. Results: We constructed a prognostic DRGS consisting of 11 different genes to stratify patients into high- and low-risk groups. Patients in the high-risk groups had significantly worse disease-free survival (DFS) than those in the low-risk groups in all cohorts [training cohort: hazard ratio (HR) = 2.40, p < 0.001, 95% confidence interval (CI) = 1.67–3.44; validation-1: HR = 2.20, p < 0.001, 95% CI = 1.38–3.49 and validation-2 cohort: HR = 2.12, p < 0.001, 95% CI = 1.40–3.21). By validating the model with GDSC, we could see that among the chemotherapeutic drugs such as oxaliplatin, 5-fluorouracil, and irinotecan, the IC50 of the cell line in the low-risk group was lower. By validating the model with the ccRCC immunotherapy data set, we can clearly see that the overall survival (OS) of the objective response rate (ORR) with complete response (CR) and partial response (PR) in the low-risk group was the best. Conclusions: DRGS is a favorable prediction model for patients with CRC, and our model can predict the response of cell lines to chemotherapeutic agents and potentially predict the response of patients to immunotherapy.

Conclusions: DRGS is a favorable prediction model for patients with CRC, and our model can predict the response of cell lines to chemotherapeutic agents and potentially predict the response of patients to immunotherapy.
Keywords: DNA repair-related genes, prognostic, colorectal cancer, immunotherapy, microsatellite instability BACKGROUND With the third highest incidence rate in the world, colorectal cancer (CRC) is a serious threat to human health (Bray et al., 2018). Nowadays, due to lifestyle changes, there is an increasingly high incidence of mortality from CRC (Zheng et al., 2014). As one of the most common gastrointestinal tumors in general surgery, CRC is a multifactorial disease with extremely complex pathogenesis (Migliore et al., 2011). At present, the early diagnosis of CRC has involved epigenetics, genomics, and so on (Marcuello et al., 2019). DNA repair is a series of processes by which a cell recognizes and corrects damage to the DNA molecules that encode its genome (Zinovkina, 2018;Burdak-Rothkamm and Rothkamm, 2021), and it is extremely important for maintaining the stability of the genome and protecting the genome from damage by endogenous and environmental agents (Friedberg, 2001). It is estimated that human cells suffer more than 2 × 10 4 DNA damage events per day (Lindahl and Wood, 1999), but generally speaking, cells can respond to this damage through efficient and highly regulated DNA repair mechanisms (Lindahl and Wood, 1999;Iyama and Wilson, 2013). Repair mechanisms include nuclear excision repair, base excision repair, mismatch repair (MMR), and double-strand break repair (Iyama and Wilson, 2013). As we all know, genomic instability caused by the destruction of DNA damage and repair mechanism can lead to cancer progression, and DNA repair genes are often found to mutate in cancer (Knijnenburg et al., 2018). Recently, Knijnenburg et al. (2018) discovered mutations related to DNA damage response genes by analyzing the TCGA data and found that several mutations in DNA damage response and repair genes occur in the colon adenocarcinoma and rectal adenocarcinoma data sets.
Due to the limited options for capturing the molecular heterogeneity of the disease and the lack of consideration and sufficient validation of other gene expressions, few of the prognostic models of early stage CRC have been applied in clinical practice (Guinney et al., 2015;Phipps et al., 2015). Thus, an accurate method is needed to identify effective prognostic models to assess the disease-free survival (DFS) of patients with CRC. The aim of the present study is to examine the interrelationships between DNA repair-related genes (DRGs) and CRC, to determine an effective prognostic model to evaluate the DFS of patients with CRC and provide guidance for clinicians in early diagnosis and treatment.

Patients
We retrospectively analyzed the gene expression profiles of CRC samples from six public cohorts. Totally, 1,768 samples were available for analysis in the current study. The CIT/GSE39582 (n = 566) was used for training the model, and The Cancer Genome Atlas colorectal cancer (TCGA, n = 624) was selected to serve as a validation-1 cohort. The remaining four microarray data sets (GSE14333, GSE33113, GSE37892, and GSE39084) were merged into a validation-2 cohort (n = 578) ( Table 1). The transcriptome RNA-sequencing data of the CRC samples were from the TCGA data portal, and other microarray data sets were acquired directly from the GEO database. The institutional review board of our hospital approved this study, and data were collected from 12 May to 10 October 2020.

Construction and Validation of DNA Repair-Related Gene Signature
Firstly, a complete list of DRGs was available online from MSigDB (version 6.2, https://www.gsea-msigdb.org/gsea/msigdb). We identified a list of candidate genes differentially expressed between relapsed samples and non-relapsed samples by using the "limma" R package (Diboun et al., 2006). The genes with an absolute log2-fold change of more than 1 and an adjusted p < 0.05 were considered for subsequent analysis. In order to minimize over-fitting risk, we applied a Cox proportional hazards regression model on CRC samples combined with the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1997). The penalty parameter was estimated by 10-fold cross-validation in the training data set at the minimum partial likelihood deviance.
We divided patients into high-risk and low-risk groups by determining the optimal threshold through the time-dependent receiver operating characteristic (ROC) curve (survivalroc, version 1.0.3) at 5 y in the training data set. The ROC curve was estimated by the Kaplan-Meier estimation method. We performed univariate and multivariate Cox regression analyses of the cohort to verify that the 11-DRG signature was independent of other clinical features.

Functional Annotation Analysis
To evaluate the biological functions of the DNA repair-related gene signature (DRGS), enrichment analysis for differentially expressed genes in different groups was applied using the R package "gProfileR." We used the Bioconductor package "HTSanalyzeR" to perform Gene Set Enrichment Analysis (GSEA) to predict significant dysregulated pathways (Subramanian et al., 2005;Wang et al., 2011). Gene sets of cancer hallmarks from MSigDB (Liberzon et al., 2015) were examined.

Validation of Genomics of Drug Sensitivity in Cancer Database, Immunotherapy Database, and Tumor Immune Dysfunction and Exclusion
To further explore the clinical application of our model, we used Genomics of Drug Sensitivity in Cancer (GDSC) to analyze the differences of chemotherapeutic drugs between the high-risk group and the low-risk group.
As known, immunotherapy is a hot topic, and we want to know whether this model can predict immunotherapy. We verified our model by using the data provided in the article "Interplay of somatic alterations and immune infiltration modulates response to PD-1 blockade in advanced clear cell renal cell carcinoma (ccRCC)" published in Nature Medicine (Braun et al., 2020). We constructed the DRGS in the data set of advanced clear cell renal cell carcinoma and divided it into the high-risk and low-risk groups according to the cutoff of our original model. The overall survival (OS) curve was drawn using the Kaplan-Meier method. In addition, we selected some immune-related indicators in the data set and compared the differences of these indicators between the high-and low-risk groups by t-test. Besides, we also analyzed the OS curve of the objective response rate (ORR) of immunotherapy.
The tumor immune dysfunction and exclusion (TIDE) algorithm can be used to predict the tumor response to immune checkpoint inhibition treatment and the function of genes regulating tumor immunity, so as to effectively predict the effect of immune checkpoint inhibition treatment.

Statistical Analysis
All the statistical analyses were performed on R (version 3.4.3, www.r-project.org). The hazard ratios were calculated using the "survcomp" package28 (version: 1.28.4) (Schröder et al., 2011). The LASSO regression was implemented using "glmnet" R package (version: 2.0.16). Cox regression analysis was used for single-factor and multifactor analyses of the results, and the receiver operating characteristic (ROC) curve and C-index were used to evaluate the model. A p-value of less than 0.05 was defined as statistical significance in all tests. A total of 1,768 CRC patients were included in the analysis. The CIT data set (GSE39582, n = 566) was used as the training cohort and genes with relatively high variation were maintained as candidates (Table 1, Figure 1). With median absolute deviation >0.5 and excluding the genes expressed less in the median expression level, 1,286 genes were screened out of 1,376 DRGs measured on all platforms from the data sets. In addition, in order to improve the robustness of the identification for the limited sample size, we further selected DRGs by using the Cox proportional hazards regression against 1,000 randomized trials (80% portion of samples each time) to assess the correlation between each candidate gene and patients' DFS in the training cohort. A total of 46 DRGs were robustly associated with individual patients' DFS. In order to minimize the over-fitting risk, we applied a Cox proportional hazards regression model to the CRC samples combined with the LASSO. By using LASSO Cox regression, 11 prognostic DRGs were selected and combined for the construction of DRGS (Figures 2A,B). The risk scores were calculated by the formula designed by the Cox regression model. The total risk score was imputed as follows (−0.1145 × POLR2B) + (−0.0653 × RAD1) + (0.0370 × CDA) + (0.1711 × NPR2) + (−0.0328 × UBE2D2) + (−0.0992 × BCL2) + (−0.0473 × PLD6) + (0.0896 × ERBB2) + (0.1220 ×ARPC1B) + (−0.1086 × FUT4) + (−0.0765 × PSME2). The time-dependent ROC curve analysis showed that the optimal cutoff to stratify highand low-risk groups was −0.147 ( Figure 2C).

Prognostic Evaluation of the DNA Repair-Related Gene Signature
Six colorectal cancer transcription data sets containing prognostic data were selected to assess the prognostic ability of the DRGS. The GSE39582 data set (n = 566) was used as a training data set ( Figure 2D). The TCGA CRC dataset was enrolled as validation-1 cohort (n = 624), and additional data sets from the GEO were combined as validation-2 cohort (n = 578). Among the patients in the training and validation cohorts, more recurrences were found in the high-risk group than in the low-risk group (Figures 3A,D,G). When applied to a follow-up duration, the promising prognostic values of 2-, 3-, and 5-year AUC were 0.640, 0.664, and 0.653, respectively, in the training cohort. In the validation-1 cohort, the values of 2-, 3-, and 5-year AUC were 0.620,   Figure 1). Compared to the risk scores calculated using the FDAapproved assay Oncotype DX colon algorithm, we found that the DRGS achieved better survival correlation in the training cohort (C-index, 0.78 vs 0.60), validation-1 cohort (C-index, 0.65 vs 0.51), and validation-2 cohort (C-index, 0.66 vs 0.62) ( Table 2).
To further investigate whether the DRGS could serve as an independent predictor of prognosis, univariate and multivariate Cox proportional hazards regression analyses were performed. As expected, age, sex, tumor stage, tumor location, and pathologic gene status were associated with outcomes for CRC patients (Table 3). In the univariate analysis, DRGS, MMR status, and KRAS mutation status were significantly correlated with worse prognosis in the training cohort. After adjusting for clinical features such as age, gender, tumor location, and molecular types, the DRGS remained an independent prognostic factor in the multivariate analyses in both validation cohorts.

Functional Annotation of Genomics of Drug Sensitivity in Cancer
Gene Ontology (GO) analysis revealed that some biological process pathways (extracellular region, cell proliferation, and cell adhesion) were the main enriched pathways in the high-risk group ( Figure 4A). In addition, the GSEA in the high-risk group when compared with the low-risk groups shown that the metastasis-related pathways (i.e., angiogenesis, KRAS signaling, epithelial mesenchymal transit, and myogenesis pathways) were enriched in the high-risk group ( Figure 4B, Supplementary Table S1). Similarly, we obtained consistent results in the TCGA and validation-2 cohorts (Supplementary Figure 2). These findings suggest that the enrichment of pathways provided evidence of molecular mechanisms affected by the DRGS and thus can predict the prognosis of CRC. Validation of Genomics of Drug Sensitivity in Cancer Database and Immunotherapy Database As known, MSI/MMR-deficient (dMMR) is widely considered as a promising biomarker, suggesting greater efficacy for immune checkpoint inhibitor (ICB) (Zhao et al., 2019). In order to further investigate the clinical application of our  model, we used GDSC to analyze the differences of chemotherapeutic drugs between the high-risk and low-risk groups. We selected 48 kinds of cell lines related to CRC. After dividing the cell lines into the high-risk and low-risk groups according to the cutoff of our model, we selected the chemotherapeutic drugs commonly used in clinics to see the IC50 of the cell lines in the high-risk and low-risk groups. We can see that among the chemotherapeutic drugs such as oxaliplatin, 5-fluorouracil, and irinotecan, the IC50 of the cell line in the low-risk group was lower ( Figure 5). It showed that the cell lines in the low-risk group were more sensitive to these three drugs.
To examine whether the DRGS could predict the survival for ccRCC patients, the patients were divided into the high-risk and the low-risk groups according to the cutoff of our original model. The cutoff was still −0.147, and the prognosis data of these patients were analyzed. The OS of the high-risk group was worse than that of the low-risk group in ccRCC patients (HR = 1.45, 95% CI = 1.09-1.92, p = 0.0103) ( Figure 6A). When it comes to the ORR of immunotherapy, we can clearly see the ORR with complete response (CR) and partial response (PR) that had better OS for both the high-risk and low-risk groups (p < 0.001). Notably, the OS of the low-risk group with the CR + PR was the best ( Figure 6B).

Validation of Tumor Immune Dysfunction and Exclusion Database
We applied the TIDE algorithm which can predict the response to immunotherapy. The low-risk group had a lower TIDE score in GSE39582 and TCGA data sets, indicating that this subgroup was most likely to benefit from immunotherapy. Besides, the low-risk group had higher interferon gamma (IFNG), higher microsatellite instability (MSI) score, and lower cancerassociated fibroblasts (CAFs) amount, which confirmed the more activated immune landscape in this subgroup (Figure 7).

DISCUSSION
Colorectal cancer is the leading cause of death among gastrointestinal cancers. The incidence and mortality from colorectal cancer are increasing year by year, and its prognosis is closely related to early diagnosis (Siegel et al., 2016;Siegel et al., 2017). Numerous studies have highlighted the biomarkers that are associated with the pathogenesis and biology of CRC (Shah et al., 2014;De Rosa et al., 2016;Lech et al., 2016;Das et al., 2017), and many multigene prognostic signatures have been developed for CRC (Shah et al., 2014;Kandimalla et al., 2018;Ozawa et al., 2018;Gao et al., 2019;Kandimalla et al., 2019). Unfortunately, the accuracy of their prognosis predictions remains uncertain (Fung et al., 2014). We still need much more effort to achieve good prognostic CRC prediction, which is still considered a challenge.
In recent years, some studies have found some new results in DNA pathway repair and DRGs research. DRGs inactivation may  disrupt genomic integrity, which may increase the risk of accumulation of gene mutations associated with cancer development (Bouwman and Jonkers, 2012). MSI/dMMR is widely considered as a promising biomarker, suggesting greater efficacy for ICB despite some limitations (Zhao et al., 2019). In this study, our purpose was to identify and validate a FIGURE 4 | (A) Gene ontology of the differentially expressed genes between the two risk groups. "GeneRatio" is the percentage of total differential genes in the given GO term. (B) GSEA showed several metastasis-related processes enriched in the high-risk group, including angiogenesis, KRAS signaling, epithelial mesenchymal transit (EMT), and myogenesis signal pathways.
Frontiers in Genetics | www.frontiersin.org April 2022 | Volume 13 | Article 872238 8 reliable DRGS and improve the accuracy of survival prediction for CRC patients.
A total of 1,768 CRC patients from one training cohort and two validation cohorts were included in this study. Our prognostic DRGS can stratify CRC patients into two groups with different survival outcomes. A multivariate analysis suggested that the DRGS remained an independent prognostic factor and was significantly associated with poor prognosis in CRC. Furthermore, the C-index results of the DRGS showed its clinical superiority to Oncotype DX. Thus, it offers a significantly promising prognostic biomarker potential compared to the clinicopathological risk factors that are currently in use. The GSEA revealed that the metastasis-related pathways (i.e., angiogenesis, KRAS signaling, epithelial mesenchymal  transit, and myogenesis pathways) were enriched in the high-risk group, all of which were well known to play a crucial role in the progression and proliferation of CRC in numerous studies (Cooks et al., 2013;De Simone et al., 2015;Lu et al., 2016). Further studies are required to clarify the effects of DNA repair in order to identify more targets and improve the prognosis of CRC patients.
In order to further investigate the clinical application of our model, we divided the CRC cell lines in the GDSC database into the high-risk group and low-risk group according to the DRGS and analyzed the differences in chemotherapy response between the two groups. We can see that among the chemotherapeutic drugs such as oxaliplatin, 5-fluorouracil, and irinotecan, the IC50 of the cell line in the low-risk group was lower. It showed that the cell lines in the low-risk group were more sensitive to these three drugs. On the contrary, the cell lines in the high-risk group were more insensitive to these three chemotherapeutic drugs. This indicated that our model could predict the response of cell lines to chemotherapeutic agents. This may provide some guidance for clinical medication.
We knew that MSI/dMMR was widely considered as a potential biomarker for predicting ICB (Zhao et al., 2019). We wanted to know whether our model can predict immunotherapy, so we verified our model by using the data provided in the article that "Interplay of somatic alterations and immune infiltration modulates response to PD-1 blockade in advanced clear cell renal cell carcinoma (ccRCC)" published in Nature Medicine (Braun et al., 2020). From the OS curve of the high-and lowrisk groups, we could see that the OS of the high-risk group was worse in the ccRCC patients, and it suggested that our model can also well predict the OS of patients with ccRCC. When it comes to the ORR of immunotherapy, we can clearly see the ORR with CR + PR that had better OS in both the high-risk and low-risk groups. Notably, the OS of the low-risk group with CR + PR was the best. This indicated that our model can potentially predict the response of patients to immunotherapy. Our model can be used to further identify cancer patients who are more suitable for immunotherapy.
To further demonstrate that our model can predict the response to immunotherapy, we used the TIDE algorithm for validation in the training and validation data sets. From the results, we can see that the TIDE score of the high-risk group was higher than that of the low-risk group, indicating that the highrisk group was less sensitive to immunotherapy than the low-risk group. That is to say, the low-risk group was more effective for immunotherapy. IFNG, produced by T cells in the immune system and natural killer cells, is a potent viral inhibitor (Jorgovanovic et al., 2020). MSI, caused by defects in MMR genes, is an important molecular marker for prognosis and the development of adjuvant treatment regimens in colorectal and other solid tumors (Boland and Goel, 2010). CAFs are a group of activated fibroblasts with significant heterogeneity and plasticity in the tumor microenvironment, which have significant tumorpromoting functions (Chen and Song, 2019). The low-risk group had higher IFNG, higher MSI score, and lower cancer CAFs amount, which showed that the immune landscape of the lowrisk group was more active. The consistent results of the training and validation data sets not only proved the reliability and robustness of our model but also proved that our model can predict the response to immunotherapy, which may bring some clinical benefits to CRC patients.
As for how to apply the model to the clinic, we can detect these 11 genes for patients. Because it is a small panel of genes, it can avoid the waste of large medical resources and reduce the problem of high diagnostic cost for patients as much as possible. By detecting the 11 small panel genes, we calculated the risk score of patients and grouped them. With the help of the prediction model, not only patients can make more favorable choices for themselves but also doctors can make better clinical decisions according to the patient's risk score.
There are some limitations to our study. First, this is a retrospective study, although we validated the signature in independent data sets. In addition, the samples from primary tumor or metastatic disease may have inconsistent genetic heterogeneity, which could lead to sampling bias (NEJM Group, 2012;Mimori et al., 2018). In addition, systematic errors result from analyzing samples of disparate databases or the influence of measuring instruments, and not all batch effects can be eliminated based on their complexity. In verifying whether the model could predict immunotherapy response to CRC, we used immunotherapy data sets from ccRCC as there is currently a lack of data sets for public immunotherapy response to CRC. However, we also used the TIDE algorithm to further verify that our model can predict the immunotherapy benefit of patients. Therefore, we have sufficient evidence to prove that our model can predict the benefit of immunotherapy for patients. Although we investigated as many genes as possible, further clinical and pharmacological tests are needed to validate our results.

CONCLUSION
In summary, our work provides an accurate prognostic approach for estimating the survival outcomes of CRC patients. Further prospective studies are needed to evaluate the clinical application of this signature for the prognosis of CRC.

DATA AVAILABILITY STATEMENT
The data sets generated and analyzed during the current study are available in the TCGA cohort data downloaded from Broad GDAC Firehose (http://gdac.broadinstitute.org/). Other microarray data sets were acquired directly from the GEO database (https://www.ncbi.nlm.nih.gov/geo/query).

AUTHOR CONTRIBUTIONS
M-YL and WW contributed equally to this study. M-YL, WW, XH and FG contributed to the study concept and design, acquisition, analysis, and interpretation of data and in drafting of the manuscript. M-EZ, DC, DF, C-HL, W-BK, Z-PH, XD, CH and QZ contributed to the data collections and manuscript reviews. All authors read and approved the final manuscript.