The Prognostic Value of the DNA Repair Gene Signature in Head and Neck Squamous Cell Carcinoma

Purpose To construct a prognostic signature composed of DNA repair genes to effectively predict the prognosis of patients with head and neck squamous cell carcinoma (HNSCC). Methods After downloading the transcriptome and clinical data of HNSCC from the Cancer Genome Atlas (TCGA), 499 patients with HNSCC were equally divided into training and testing sets. In the training set, 13 DNA repair genes were screened using univariate proportional hazard (Cox) regression analysis and least absolute shrinkage and selection operator (LASSO) Cox regression analysis to construct a risk model, which was validated in the testing set. Results In the training and testing sets, there were significant differences in the clinical outcomes of patients in the high- and low-risk groups showed by Kaplan-Meier survival curves (P < 0.001). Univariate and multivariate Cox regression analyses showed that the risk score had independent prognostic predictive ability (P < 0.001). At the same time, the immune cell infiltration, immune score, immune-related gene expression, and tumor mutation burden (TMB) of patients with HNSCC were also different between the high- and low-risk groups (P < 0.05). Finally, we screened several chemotherapeutics for HNSCC, which showed significant differences in drug sensitivity between the high- and low-risk groups (P < 0.05). Conclusion This study constructed a 13-DNA-repair-gene signature for the prognosis of HNSCC, which could accurately and independently predict the clinical outcome of the patient. We then revealed the immune landscape, TMB, and sensitivity to chemotherapy drugs in different risk groups, which might be used to guide clinical treatment decisions.


INTRODUCTION
Head and neck squamous cell carcinoma (HNSCC) is a type of tumor that originates from the squamous epithelium of the head and neck areas, including the mucous membranes of the lips, tongue, pharynx, larynx, and others (1). HNSCC is currently one of the most common malignant tumors worldwide, with morbidity and mortality accounting for 3.6 and 3.4% of all malignant tumors in 2020, respectively (2). HNSCC is highly malignant, and there are no specific prognostic-related biomarkers for clinical application. Therefore, prognostic-related biomarkers with clinical applicability are urgently required.
DNA damage and repair play important roles throughout the life of a cell (3). DNA damage affects the expression of a variety of genes, including proto-oncogenes and cancer suppressor genes. Changes in the activity of proto-oncogenes and cancer suppressor genes are crucial in tumorigenesis (4). Several DNA repair genes have been confirmed to play an important role in the development and prognosis of HNSCC (5)(6)(7). Hence, constructing a risk model composed of DNA repair genes may be useful for predicting the prognosis of patients with HNSCC.
In this study, we aimed to establish a prognostic prediction model for HNSCC based on DNA repair genes. We first equally divided all patients with HNSCC into training and testing sets. In the training set, we screened prognostic-related DNA repair genes using univariate proportional hazard (Cox) regression analysis and least absolute shrinkage and selection operator (LASSO) regression analysis to construct a risk model (8). All patients with HNSCC were classified into high-and low-risk groups according to the median value of the training set risk score. Subsequently, we verified the prognostic relevance and prognostic predictive ability of the risk model in the training and testing sets. We also analyzed the tumor-infiltrating immune cells, immune-related gene expression, tumor mutation burden, and drug sensitivity of patients with HNSCC in the high-and low-risk groups. The results showed that the risk model composed of DNA repair genes could effectively distinguish patients with different clinical outcomes and has independent predictive prognostic ability.

Construction of Risk Model
To construct the risk model, we first combined the transcriptome data and clinical information of patients with HNSCC to obtain 499 samples with complete clinical information and transcriptome information, and then randomly divided them into a training set and a testing set on average. Subsequently, LASSO regression analysis was performed to further screen out 13 more representative DNA repair genes for use in constructing the risk model, and the correlation coefficients (Coef) and expression (EXP) of these 13 genes were obtained using the "glmnet" package in R (19). Finally, the risk score of each patient was calculated by the following formula: Risk Score = S n i=1 Expi Â Coefi, where n refers to the number of selected DNA repair genes, Expi indicates the expression levels of gene i in each HNSCC sample, and Coefi is the correlation coefficient of gene i. Finally, we classified all HNSCC samples into high-and low-risk groups based on the median value of the risk score of the training set.

Validation of the Risk Model
We verified the risk model separately in the training and testing sets. To this end, we first performed principal component analysis (PCA) in the training and testing sets to evaluate the discrimination of the risk model for patients in the high-and low-risk groups. We then utilized heat maps to show the expression patterns of the DNA repair genes in the risk model in the training and testing sets. The Kaplan-Meier survival curve was used to distinguish the difference in the clinical outcome of patients in the high-and low-risk groups, and the significant difference P-value was calculated by the log-rank test. The area under the curve (AUC) of the receiver operating characteristic (ROC) curve was used to evaluate the prognostic diagnostic accuracy of the risk score and clinical characteristics. Univariate and multivariate Cox regression analyses of risk score and clinical characteristics were used to evaluate the independent correlation between the risk score and prognosis of patients with HNSCC. We also performed the above verification in all patients with HNSCC. Then we divided all samples into multiple clinical subgroups based on clinical characteristics, and the Kaplan-Meier survival curve was performed in each subgroup to demonstrate the good prognostic ability of the risk score.

Evaluation of the Tumor Immune Microenvironment and Immune-Related Gene Expression
Before analyzing the immune-infiltration situation using the CIBERSORT algorithm, which contains 22 types of immune cells, we first standardized the gene expression data through the "CIBERSORT" package in R (20). The Wilcoxon test was used to compare the different infiltrations of the 22 immune cells in the high-and low-risk groups. The Pearson test was used to analyze the correlation between risk genes and tumor-infiltrating immune cells through Statistical Product and Service Solutions 25.0 (SPSS 25.0) (21). The ESTIMATE (Estimation of STromal and Immune cells in MAlignant Tumors using Expression data) algorithm was used to evaluate the immune score, stromal cell content, and ESTIMATE score of each sample (22). We analyzed the expression of negative regulatory immune genes in the highand low-risk groups using the Wilcoxon test. Finally, as the research on the role of immune checkpoint genes in various tumors is increasing, we analyzed the correlation between these genes and risk scores using the Spearman test and analyzed their differences in expression in the high-and low-risk patients using the Wilcoxon test.

Assessment of Tumor Mutation Burden
We displayed the 30 genes with the highest mutation rate in all HNSCC samples and calculated the tumor mutation burden (TMB) of all samples through the "maftools" package in R (23). We then divided the HNSCC samples into high-and low-TMB groups according to the best cut-off value of the TMB of each sample. The Kaplan-Meier survival curve showed the clinical outcome of the two groups of patients with HNSCC. By combining the TMB groups and the risk groups, we further evaluated the impact of the risk score and tumor mutation burden on the clinical outcome of patients with HNSCC and displayed them with survival curves.

Online Website Verification
We verified the influence of the expression of the 13 DNA repair genes on the Oncolnc website (http://www.oncolnc.org).

Analysis of Drug Sensitivity
To evaluate the model in the clinical treatment of HNSCC, we calculated the half-inhibitory concentration (IC50) of chemotherapeutic drugs for HNSCC. The difference in the IC50 between the high-and low-risk groups was compared by Wilcoxon signed-rank test using the "pRRophetic" package in R (24).

Statistical Analysis
The significance level of the P-value was set to <0.05. All statistical analyses were performed using R 4.0.4 (https://www. r-project.org/).

Development and Validation of the Prognostic Model
The flowchart of this research is shown in Figure 1. After merging the transcriptome data and the clinical data of patients with HNSCC downloaded from TCGA, we obtained 499 samples with complete information. We then divided all HNSCC samples into a training set (n = 251) and a testing set (n = 248). The basic clinical information of the two groups of patients is shown in Table 1. Subsequently, we screened out 82 prognosticrelated genes among 545 DNA repair genes through univariate Cox regression analysis (Table S1, P < 0.05), of which 21 were risk genes (hazard ratio > 1). Subsequently, we screened out a further 13 representative DNA repair genes through LASSO Cox regression analysis, which were used to construct the risk model. The risk score was calculated based on the sum of the product of the expression (Exp) of all genes in the model and its correlation coefficient (Coef). The formula of the risk score was as follow: Risk Score = MORF4L2 * (0.0037) + COPS2 * (0.0063) + USP10 * (0.0255) + WAS * (-0.0123) + UVSSA * (-0.1324) + PRRX1 * (-0.0148) + ZBTB1 * (-0.0632) + DCLRE1C * (-0.0502) + MSH5 * (-0.3824) + DOT1L * (-0.1573) + ZBTB7A * (-0.00610) + POLR2C * (0.0085) + MORF4L1 * (0.0047). A negative correlation coefficient indicated that the gene was a protective factor in patients with HNSCC. In contrary, the gene with a positive correlation coefficient was a risk factor.
After calculating the risk scores of all patients with HNSCC, we divided the training set and testing set samples into high-and low-risk groups according to the median value of the training set risk score, as shown in Figures 2A, G. We found that in both the training and testing sets, the proportion of patients with HNSCC who died in the high-risk group was higher than that in the low-risk group (Figures 2B, C, H, I). The high-and low-risk groups were well distinguished ( Figures 2D, J). Moreover, the DNA repair genes in the risk model showed the same expression pattern in the training and testing sets ( Figures 2E, K). The Kaplan-Meier survival curve showed that the clinical outcomes of patients in the low-risk group were better than those in the high-risk group ( Figure 2L), both in the training set (P = 8.439e-09; Figure 2F) and the testing set (P = 1.161e-04; Figure 2L).
To verify the ability and independence of our model to predict the prognosis of patients with HNSCC, we conducted ROC curves and univariate and multivariate Cox regression analyses in the training and testing sets, respectively. The sensitivity and specificity of the risk score were assessed using the ROC curve. In the training set, the area under the curve (AUC) of the 1-, 3-, and 5-year ROC curves of the risk score were all >0.7 ( Figure 3A). The risk score had the largest AUCs of the 3-year ROC curve, compared to the clinical traits of patients with HNSCC ( Figure 3B). In the testing set, the AUCs of the 1-, 3-, and 5year ROC curves were all >0.65 ( Figure 3E), and the risk score also had the largest AUCs of the 3-year ROC curve ( Figure 3F). The hazard ratio (HR) value of the risk score was the largest in the univariate and multivariate Cox regression analyses of the risk score and multiple clinical features, which showed that the risk score was an independent prognostic factor ( Figures 3C, D). The independence of the risk score for predicting the prognosis of HNSCC was confirmed in the test set ( Figures 3G, H). Table  S2 shows the univariate and multivariate Cox regression analyses of the training and testing sets. Overall, the risk score was an independent prognostic factor for HNSCC. We were unable to find an external validation dataset with transcripts of all risk genes. However, we still verified the predictive ability of other genes except MSH5 in patients with HNSCC in GSE41613. We found that despite the lack of MSH, patients with HNSCC in the low-risk group showed better clinical outcomes than those in the high-risk group in our model (P < 0.05), and the expression pattern of the remaining genes was consistent with the training and testing sets ( Figure S1). And we still verified in GSE117973 without transcript of UVSSA, GSE27020 without transcripts of MSH5 and UVSSA, and GSE65858 without transcripts of UVSSA and ZBTB1. The differences of prognosis of patients with HNSCC in the high-and low-risk groups were not significant (P > 0.05, Figure S2). For these external validation, we did a sensitivity analysis by using only 12 risk genes to recalculate the risk score. And we found that deleting every risk gene had little effect on the Kaplan-Meier survival curves ( Figures S3, S4).
To further verify the accuracy of the model, we divided all samples into clinical subgroups based on different clinical traits, and we analyzed differences in the clinical outcomes of high-and low-risk samples in each clinical subgroup. Before clinical subgroup validation, we conducted a risk model validation for all samples. The Kaplan-Meier survival curve of all patients showed that the clinical outcomes of patients in the low-risk group were significantly better than those in the high-risk group (P = 1.884e-11; Figure 4A). The sensitivity and specificity of the risk scores of all HNSCC samples were assessed using the ROC curve. The AUCs of the ROC curves of risk score for 1-, 3-, and 5-year were all >0.65 ( Figure 4B). The risk score had the largest AUCs of the ROC curves for 3-year compared to the clinical traits of patients with HNSCC ( Figure 4C). PCA showed that patients in the high-and low-risk groups showed    good discrimination ( Figure 4D). The HR value of the risk score was the highest in the univariate and multivariate regression analyses of risk score and clinical characteristics ( Figures 4E, F). Details of the univariate and multivariate Cox regression analyses of the training and testing sets are shown in Table S3.

Evaluation of the Immune Microenvironment and Expression of Immunoregulatory Genes
To reveal the differences in the immune microenvironment of high-and low-risk groups, including immune cell infiltration and expression of immunoregulatory and immune checkpoint genes, we first used the bioinformatics algorithm CIBERSORT to estimate 22 types of tumor-infiltrating immune cells in HNSCC. First, we found that among these 22 cell types, acquired immunerelated immune cells infiltrated to a greater extent in HNSCC samples ( Figure 5A). There were more naïve B cells (P = 5.3e-05, Figure 5B), resting mast cells (P = 1.8e-06, Figure 5C), T cells CD8 (P = 0.0093, Figure 5D), regulatory T cells (Tregs, P = 6.7e-08, Figure 5E), and follicular helper T cells (P = 1.3e-05, Figure 5F) in the low-risk group. In contrary, activated mast cells (P = 0.00053, Figure 5G), M0 macrophages (P = 0.00026,   Figure 5H), and M2 macrophages (P = 0.0013, Figure 5I) showed greater infiltration in the high-risk group. The HNSCC samples in the low-risk group had higher immune scores (P = 1e-06, Figure 5J), stromal scores (P = 0.00033, Figure 5K), and ESTIMATE scores (P = 1.9e-06, Figure 5L) evaluated by ESTIMATE than the high-risk group. In other words, the tumor purity of HNSCC was lower in the low-risk group. Naïve B cells were positively correlated with eight risk genes that had negative correlation coefficients and negatively correlated with MORF4L2, which had a positive correlation coefficient (P < 0.05). CD8+ T cells were positively correlated with five risk genes that had negative correlation coefficients and negatively correlated with eight risk genes that had negative correlation coefficients (P < 0.05). Tregs and follicular helper T cells were positively correlated with all risk genes that had negative correlation coefficients and negatively correlated with all risk genes that had positive correlation coefficients (P < 0.05).
Monocytes and Macrophages M2 were negatively related to most risk genes (P < 0.05). Macrophages M0 were negatively correlated with some risk genes that had negative correlation coefficients and positively correlated with some risk genes that had positive correlation coefficients (P < 0.05). Details are shown in Table S4.

Assessment of Tumor Mutation Burden
To determine the tumor mutation burden (TMB), we first downloaded all the mutation data of HNSCC from TCGA and showed the top 30 mutation rate genes ( Figure 7A). Subsequently, we identified the genes with the top 20 mutation rates in the high-and low-risk groups (Figures 7B, C). The tumor mutation rate of high-risk group samples was slightly higher than that of patients in the low-risk group, and the gene with the highest mutation rate in the high-and low-risk groups samples was TP53. According to the best cut-off point of TMB, all patients with HNSCC were divided into high-and low-TMB groups. The Kaplan-Meier survival curve showed that the clinical outcomes of patients with low TMB were significantly better than those of patients with high TMB (P = 0.003, Figure 7D). To further evaluate the influence of TMB and risk score on the prognosis of patients with HNSCC, we combined the TMB group with the risk group and analyzed the clinical outcomes of different groups using the Kaplan-Meier survival curve. The results showed that patients with low risk and low TMB had the best clinical outcome, followed by patients with low risk and high tumor mutation load, and that patients with high risk and high tumor mutation load had the worst clinical outcome (P < 0.001, Figure 7E). Considering the high mutation rate of TP53, we analyzed the correlation between TP53 and the risk score and its expression in the high-and low-risk groups. As a result, we found that TP53 was negatively correlated with the risk score (R = −0.31, P = 3.9e-12, Figure 7F) and was highly expressed in the low-risk group (P = 2.5e-05, Figure 7G).

Analysis of Drug Sensitivity
To evaluate the possible clinical application of the risk model, we analyzed the sensitivity difference of chemotherapy drugs for HNSCC in the current stage of clinical trials between the highand low-risk groups, with the drug sensitivity expressed by IC50. We showed that patients in the high-risk group were more sensitive to erlotinib (P = 8.3e-16, Figure 9A), gefitinib (P = 0.00056, Figure 9B), paclitaxel (P = 2.9e-05, Figure 9C), docetaxel (P = 2e-10, Figure 9D), and sorafenib (P = 2.7e-05, Figure 9E), whereas patients in low-risk group were more sensitive to methotrexate (P = 6e-07, Figure 9F), vinorelbine (P = 8.3e-05, Figure 9G), and rapamycin (P = 0.00015, Figure 9H), which indicated that the model could be used as a potential predictor of chemotherapy sensitivity.

DISCUSSION
An increasing number of studies have shown that DNA damage and repair play important roles in malignant tumors, including HNSCC (25). DNA repair has been proven to be widely involved in the development, prognosis, and metastasis of HNSCC (26). Further studies on the expression profile of DNA repair genes in HNSCC specimens may provide new ideas to improve the clinical prognosis of patients. A total of 545 DNA repair genes were obtained from the "GO_DNA REPAIR" gene set of the GSEA database for subsequent analysis. Through univariate and LASSO Cox regression analyses in the training set, we constructed a risk model that included 13 DNA repair genes. Patients in high-risk group had worse clinical outcomes than low-risk patients. The AUC of the ROC at 1-, 3-, and 5-year confirmed the good prediction performance of the risk score. In addition, prediction accuracy and independence were verified using univariate and multivariate Cox regression analyses. We also performed clinical subgroup validation in the internal dataset and further validated the model in the online database Oncolnc, which reflected good accuracy and repeatability of the risk model.
We illustrated the immune landscape of patients with HNSCC using CIBERSORT and ESTIMATE, including tumorinfiltrating immune cells, immune score, immune regulatory genes, and immune checkpoint genes, all of which are considered important in HNSCC (27). Comprehensive analysis revealed that the risk score was more negatively related to tumorinfiltrating cells such as naïve B cells, resting mast cells, CD8+ T cells, Tregs, and follicular helper T cells, and positively related to activated mast cells and macrophages. According to Table S4, the correlation between risk score and tumor-infiltrating immune cells was contributed by the influence of all risk genes on tumor-infiltrating immune cells. Tumor-infiltrating immune cells both correlated with eight gene transcripts that have a negative correlation coefficient and five gene transcripts having a positive correlation coefficient. In addition, patients in the lowrisk group had higher immune scores, stromal scores, and ESTIMATE scores, which indicated that their tumor purity was lower.
In this study, some of the DRGs in the risk model have already been identified as having an important role in the immune system while others have not been well studied in the immune system at present. Decreasing the activity of DOT1L (DOT1 like histone lysine methyltransferase) through silencing or an inhibitor preferentially suppressed the production of interleukin 6 (IL-6) and interferon b (IFN-b) but not of tumor necrosis factor a (TNF-a) in macrophages triggered by Toll-like receptor (TLR) ligands or virus infection. DOT1L-mediated selective histone 3 lysine 79 (H3K79me2/3) modifications at the IL-6 and IFN-b1 promoters are required for the full activation of innate immune responses (28). DO1L plays an important role in regulating the differentiation and complete function of CD4+, CD8+T cells and B cells in the process of acquired immunity, while DO1L knockdown or mutation invalidates acquired immunity (29)(30)(31)(32). ZBTB1 (zinc finger and BTB domain containing 1) prevents DNA damage in replicating immune progenitors, allowing the generation of B cells, T cells, and myeloid cells (33). In alveolar macrophages, antigen presentation was ZBTB7A (zinc finger and BTB domain containing 7A)-dependent where alveolar macrophages deficient in ZBTB7A failed to induce antibody production and T cell responses (34). CD8+ T cell infiltration indicates better prognosis of patients with HNSCC (35). Because of the negative correlation between the risk score and tumor-infiltrating cells, we investigated the differential expression of negative immune regulatory genes, CD4+ T cell and CD8+ T cell regulatory genes in different groups. The results showed that almost all of these genes were highly expressed in the low-risk group, potentially due to increased infiltration of immune cells in the low-risk group samples. Subsequently, the correlation between the risk score and the expression of five immune checkpoint genes, CTLA4, LAG3, PD1, PD-L1, and TIM3, indicated that the expression of immune checkpoint genes was negatively correlated with the risk score and was highly expressed in the low-risk group, suggesting that immune checkpoint inhibitors may be beneficial to patients with HNSCC with low risk scores.
In recent years, there has been an increasing number of studies on the TMB of various tumors, including HNSCC, not only in the context of its use as a biomarker, but also in the treatment of immune checkpoint inhibitors (36). In our study, TMB was positively correlated with risk score and poorer clinical outcomes. Because TP53 showed the highest mutation rate, we compared its expression in different groups and found that it was negatively correlated with the risk score and highly expressed in the low-risk group. Our model suggested that patients with HNSCC with high risk scores were more sensitive to biological inhibitors such as erlotinib, gefitinib, and sorafenib, instead of chemotherapeutics like methotrexate. These analyses of drug sensitivity were based on "pRRophetic" package in R (20). Although the authenticity of the difference in drug sensitivity of these drugs among patients with HNSCC in different risk groups needs to be verified by further clinical trials, this model based on DNA repair genes provides the possibility for guiding clinical drug use. We speculated that the effect of immunotherapy on HNSCC would be better than that of traditional chemotherapy.
In this study, some of the DRGs in the process of modeling that have already been identified play an important role in the malignant phenotypes of various cancer types. DOT1L is involved in tumorigenesis and tumor metabolism or metastasis of ovarian cancer (37,38), prostate cancer (39,40), leukemia (41,42), neuroblastoma (43), colorectal cancer (44), and breast cancer (45). PRRX1 (paired related homeobox 1), a homeodomain transcriptional factor, has been demonstrated to be important in pancreatic cancer, especially in the regulation of epithelial-to-mesenchymal transition (EMT) in pancreatic cancer (46)(47)(48)(49). Moreover, UPS10 (ubiquitin-specific peptidase 10), a deubiquitinase, promotes proliferation of hepatocellular carcinoma by deubiquitinating and stabilizing YAP/TAZ, and suppresses lung tumorigenesis by deubiquitinating and stabilizing KLF4 (50,51). ZBTB7A (zinc finger and BTB domain containing 7A) acts as a tumor suppressor through transcriptional repression in several carcinomas (52)(53)(54). Moreover, its mutation or downregulation promotes cancer progression (55,56). Furthermore, its homologous gene, ZBTB1, participates in regulating the treatment effectiveness and resistance to chemotherapy (57,58). At present, other DRGs in the model have not been studied in depth in tumors.
In general, the prognosis model constructed based on the DNA repair gene transcripts and clinical information of patients with HNSCC in TCGA can well predict the prognosis of patients with HNSCC in the high-and low-risk groups. And this model systematically elaborated the molecular characteristics and immune microenvironment of HNSCC. The internal verification established based on the TCGA database also proved the stability of the model and provided reference value for prediction of the clinical outcomes of patients with HNSCC. In addition, the significant differences of multiple immune checkpoint genes between the high-and low-risk groups point out possible directions for the immunotherapy of patients with HNSCC.
However, we recognized that there were limitations to this study. On the one hand, the HNSCC samples involved in this study were not sufficient, and the DNA repair gene transcripts and clinical information of multiple GEO databases were incomplete, which hindered our external verification. On the other hand, the immaturity of the biobank of our institution was not enough to verify. Nevertheless, we still successfully completed external verification with the remaining genes in GSE41613 without MSH5 transcript, which further confirmed the availability and stability of the prognostic model. However, there were no significant differences in the Kaplan-Meier survival curves validated in the GSE117973 (without UVSSA), GSE27020 (without UVSSA and MSH5), and GSE65858 (without ZBTB1 and UVSSA). We assumed that the lack of a relatively important gene would reduce the predictive ability of the model, which might be the reason for the failure of the verification in GSE27020, GSE117973, and GSE65858.

CONCLUSION
In conclusion, this study constructed a 13-DRG signature for the prognosis of HNSCC, which could accurately and independently predict the clinical outcome of the patient. We then revealed the immune landscape, TMB, and sensitivity to chemotherapy drugs in different risk groups, which might be used to guide clinical treatment decisions.

AUTHOR CONTRIBUTIONS
RM and EW designed the study, performed the experiments and plotted the data. RM and JW validated the data. RM and EW drafted the manuscript. HX, SZ, and JS reviewed and edited the manuscript. HX and SZ supervised the project. HX and SZ funded the experiments for the study. All authors contributed to the article and approved the submitted version.

FUNDING
This study was supported by grants from the National Natural Science Foundation of China (Grant numbers 81771002, 82071057).