Prognostic Value of Eight-Gene Signature in Head and Neck Squamous Carcinoma

Head and neck cancer (HNC) is the fifth most common cancer worldwide. In this study, we performed an integrative analysis of the discovery set and established an eight-gene signature for the prediction of prognosis in patients with head and neck squamous cell carcinoma (HNSCC). Univariate Cox analysis was used to identify prognosis-related genes (with P < 0.05) in the GSE41613, GSE65858, and TCGA-HNSC RNA-Seq datasets after data collection. We performed LASSO Cox regression analysis and identified eight genes (CBX3, GNA12, P4HA1, PLAU, PPL, RAB25, EPHX3, and HLF) with non-zero regression coefficients in TCGA-HNSC datasets. Survival analysis revealed that the overall survival (OS) of GSE41613 and GSE65858 datasets and the progression-free survival(DFS)of GSE27020 and GSE42743 datasets in the low-risk group exhibited better survival outcomes compared with the high-risk group. To verify that the eight-mRNA prognostic model was independent of other clinical features, KM survival analysis of the specific subtypes with different clinical characteristics was performed. Univariate and multivariate Cox regression analyses were used to identify three independent prognostic factors to construct a prognostic nomogram. Finally, the GSVA algorithm identified six pathways that were activated in the intersection of the TCGA-HNSC, GSE65858, and GSE41613 datasets, including early estrogen response, cholesterol homeostasis, oxidative phosphorylation, fatty acid metabolism, bile acid metabolism, and Kras signaling. However, the epithelial–mesenchymal transition pathway was inhibited at the intersection of the three datasets. In conclusion, the eight-gene prognostic signature proved to be a useful tool in the prognostic evaluation and facilitate personalized treatment of HNSCC patients.


INTRODUCTION
Head and neck cancer (HNC) includes cancers occurring in the tongue, oral cavity, nasopharynx, oropharynx, larynx, and hypopharynx (1). HNC is one of the most common cancers globally, and in the U.S, 38,380 new oral cavity and pharynx cancer cases were reported in 2020 (2). About 90% of HNCs are classified as head and neck squamous cell carcinoma (HNSCC) (3). The most common risk factors for HNSCC include genetic factors, tobacco, and alcohol consumption, and viral infection such as Epstein-Barr virus (EBV) and the human papillomavirus (HPV) (4,5). Besides, the complexity of its etiology leads to the heterogeneity of HNSCC. HNSCC patients usually present with locally advanced stage and a substantial proportion of them undergo primary surgery (6,7). Despite the availability of combined modality treatment such as surgery, chemotherapy, radiotherapy, and molecular targeted therapies in the treatment of HNSCC patients with metastatic and/or recurrent HNSCC, HNSCC is reported to have a poor prognosis (8). Cancer prevention is an important component of HNSCC than treatment. Nowadays, safe and efficacious HPV vaccines against the HPV types that cause 70% of cervical cancer and other HPV-related cancers or diseases have been introduced in 99 countries and territories (9). Lifestyle interventions are known to be important components of cancer prevention. However, in the post-genomic era, a better understanding of the genetic factors and other associated prevention strategies of HNSCC remains essential.
Recent advances in sequencing technologies especially the "second-generation" sequencing technologies have allowed indepth molecular characterization, stratification of cancer patients, and development of individualized accurate treatment strategies based on specific markers (10,11). Besides, the rapid development of computer technology has also facilitated the development of bioinformatic tools. Based on enormous amounts of biological data and advanced bioinformatic tools, several studies have explored and developed tumor models (12). Among these, deep learning-based survival prediction has been used to guide clinicians in predicting prognosis and devising individualized treatment strategies (13). To preoperatively identify occult peritoneal metastasis in gastric cancer, an individualized nomogram was developed and validated (14). Sanghani et al. developed and validated a nomogram that could predict tumor recurrence in breast cancer patients using two independent population-based datasets (15). Nassiri et al. developed a prediction model of early recurrence risk combining clinical and molecular factors in meningioma (16). Besides, several prognostic prediction models for different cancers have been constructed based on prognosis-related molecular features (17,18). A robust miRNA-based signature for predicting the prognostic outcome of HNSCC with high accuracy has been developed (19). However, additional gene signatures and more accurate models are needed for predicting the prognosis and guiding therapeutic decision-making.
In this study, we performed an integrative analysis to establish an eight-gene signature for the prediction of prognosis in patients with head and neck squamous cell carcinoma (HNSCC). LASSO Cox regression analysis identified eight genes (CBX3, GNA12, P4HA1, PLAU, PPL, RAB25, EPHX3, and HLF) with non-zero regression coefficients in TCGA-HNSC datasets. The nomogram was constructed, and GSVA revealed the associated functional signaling pathways.

Data Collection
We downloaded mRNA sequencing data and clinical information for HNSCC patients from The Cancer Genome Atlas (TCGA) database as the training dataset (https://portal.gdc.cancer.gov/). Gene expression profiles and clinical information of HNSCC patients in the GSE23036, GSE41613, GSE65858, GSE27020, and GSE42743 datasets were downloaded from Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) as the test datasets. Table 1 presents detailed information of the test datasets based on tumor type, sample, and platform.

Identification of Prognostic Gene Signature
The mRNA sequencing data and clinical information from TCGA, GSE41613, and GSE65858 were analyzed by univariate Cox analysis in the "survival" R package to identify genes associated with prognosis. Intersecting genes were identified as prognosis-related genes. To further screen for potential prognostic biomarkers, gene expression profiling interactive analysis (GEPIA, http://gepia. cancer-pku.cn) and gene expression profiling of the GSE23036 dataset were used for differential expression analysis. GEPIA is an online data tool that uses a standard processing pipeline to analyze RNA-seq data of 9,736 tumors and 8,587 normal samples from the TCGA and Genotype-Tissue Expression (GTEx) database (20).

Construction of a Prognostic mRNA Signature
In this study, we included 499 samples with complete clinical survival information from the TCGA-HNSC database. The 'glmnet' R package in R was used to perform LASSO Cox  Prognostic Significance of the Eight-mRNA Prognostic Signature in Independent Validation Datasets The prognostic value of the prognostic mRNA signature in predicting OS and DFS was confirmed in the GSE41613 and GSE65858, and GSE27020 and GSE42743 datasets, respectively. Time-dependent ROC curve analysis was also plotted to prove the survival prediction accuracy.

Construction and Validation of the Prediction Nomogram
A nomogram constructed using several independent indicators can be used for multiple predictions (21)(22)(23). Univariate and multivariate Cox regression analyses were used to screen for independent risk factors, which were used to construct a nomogram for the prediction of HNC prognosis. Forest plots were used to display the results of univariate and multivariate Cox regression analyses. Three independent prognostic factors, M stage, N stage, and the eight-mRNA gene signature, were used to construct the nomogram. The 'regplot' package in R was used to build the nomogram. To further assess the discrimination power and accuracy of the nomogram, calibration curves were used to predict the OS at 3 and 5 years. Besides, decision curve analysis (DCA) was performed to evaluate the clinical utility of the nomogram by calculating the clinical net benefit across the range of decision threshold probabilities in the TCGA-HNSC dataset.

Gene Set Variation Analysis
Here, samples from the TCGA-HNSC, GSE65858, and GSE41613 datasets were divided into high and low-risk groups based on the risk score. The "GSVA" package in R was used to perform GSVA between the high-risk and low-risk groups, using the hallmark gene sets as a reference. We set |log2FC| >2 and P <0.05 as the cut-off values to identify valuable signaling pathways. The intersection between integrated differential pathways from the three datasets was determined to identify the activated and suppressed pathways.

Survival Analysis
To determine if there were significant differences in survival between the high and low-risk groups, Kaplan-Meier survival curves were plotted. The 'survival' package in R was used to perform a two-sided log-rank test and univariate and multivariate Cox regression analyses.

Identification of Prognostic Candidate Genes
The flow chart of data preparation, processing, analysis, and validation is shown in Figure 1A. Cox regression analysis was performed, and prognosis-related genes were identified (with P < 0.05) from the GSE41613, GSE65858, and TCGA-HNSC RNA-Seq datasets. Genes associated with prognosis at the intersection of all the three datasets and differentially expressed genes of HNSC (with | log2FC| > 1 and FDR < 0.05) from GEPIA were screened as candidate prognostic genes. A total of eleven hub genes were identified; PITX1, EPHX3, PPL, RAB25, and HLF were found to be down-regulated, and PLOD3, GNA12, CBX3, P4HA1, SERPINH1, and PLAU were up-regulated ( Figure 1B).

Validation of the Eight-mRNA Prognostic Signature
To determine the predictive accuracy of the eight-mRNA prognostic signature, two datasets (GSE41613and GSE65858) were used for OS validation, and another two external datasets (GSE27020 and GSE42743) were used for DFS validation. Patients in the validation datasets were divided into the high-risk or low-risk groups using the median risk score as the cut-off. Results from survival analysis showed that both the overall survival (OS) of the GSE41613 and GSE65858 datasets and the progression-free survival (DFS)of the GSE27020 and GSE42743 datasets in the low-risk group, exhibited a better trend in survival compared with the high-risk group.

The Prognostic Value of the Eight-mRNA Signature
To evaluate the prognostic significance of the model in different clinicopathological features, all clinical variables, samples were randomly divided into two subgroups based on TNM stage, grade, age, gender, pathological T stage, pathological N stage, and pathological M stage. Patients in different subgroups were further divided into the high-risk and low-risk groups, using the median risk score of the prognostic model as the cut-off value. As shown in Figures 5 and 6, KM survival analysis of different subgroups indicated that in 12 subgroups, including the stage III/ IV, grade I/II, grade III/IV, age <65 y, age ≥65 y, male, female, T3/4 stage, N+ stage, and M0 stage, the prognostic model significantly correlated with the survival outcomes of HNSC patients. However, in the regional-stage subgroups, including stage and N0, the eight-mRNA prognostic model failed to show any positive prognostic value (Figures 5 and 6).

The Establishment and Clinical Application of the Nomogram
We performed univariate Cox regression analysis and multivariate Cox regression analysis to identify independent risk factors associated with the prognosis of HNSC. Patients from the TCGA-THCA dataset with complete clinical data including age, gender, stage, T stage, N stage, M stage, grade, and the risk score were included in the identification of prognostic factors. Results from the univariate and multivariate analysis showed that risk score of the eight-mRNA prognostic signature [univariate analysis: P < 0.  (Figures 7A, B). Therefore, the three independent prognostic factors were selected and used to construct a nomogram for predicting the probability of 3-and 5year OS in HNSC patients ( Figure 7C). Each prognostic parameter had a score, and the sum of the three prognostic parameter scores was used to predict the 3-year and 5-year OS. The higher the total score, the worse the prognosis. Besides, to determine the clinical utility of the nomogram, DCA was used to compare the net benefits of different models, including none, all, risk score, and nomogram. As shown in Figures 7D, E, compared with the other three groups, the nomogram revealed higher net benefits with wider threshold probabilities. The DCA results also indicated that the nomogram had better clinical benefit than the risk score calculated based on the eight-mRNA gene signature alone. Moreover, the 3-and 5-year OS calibration curves for the TCGA-HNSC dataset showed similar performance with the ideal model, indicating that the nomogram has good predictive discrimination power and accuracy ( Figures 7F, G).

GSVA Analysis
To explore the biological function of the prognostic signature between the high-risk and low-risk group, we performed GSVA via "GSVA" package in R and using hallmark gene sets as reference. According to the cut-off criteria (|log2FC| > 2 and P < 0.05), 11 pathways were found to be activated and 15 pathways inhibited in the TCGA-HNSC dataset, 12 pathways were activated and 13 pathways inhibited in the GSE65858 dataset, and 11 pathways were activated and four pathways were inhibited in the GSE41613 dataset (Figure 8Aa-c). Figure 8 shows that six pathways were activated in the intersection of the TCGA-HNSC, GSE65858, and GSE41613 datasets, including estrogen response early, cholesterol homeostasis, oxidative phosphorylation, fatty acid metabolism, bile acid metabolism, and Kras signaling. The epithelial-mesenchymal transition pathway was inhibited at the intersection of the three datasets (Figure 8Bb).

DISCUSSION
Genomic studies of HNSCC have greatly increased our understanding of genetic heterogeneity, disease diversity, and key genes driving tumorigenesis (24). With recent developments in sequencing techniques, such as high-throughput sequencing technologies (25), the emergence of enormous amounts of data has led to a wide variety of methods and tools for data analysis. Based on gene expression profiles and clinical data from public databases, a variety of prognostic models have been constructed. Genes that could serve as therapeutic targets and prognostic biomarkers for head and neck cancer were identified by Fan et al. (26). Li et al. performed a systematic analysis of the immunogenomic landscape and identified IRGs as potential biomarkers of HNSCC (27). Th identified molecules and constructed models can also provide an immeasurable reference for further basic and clinical research.
In this study, differentially expressed genes between the tumor and normal tissues were identified, indicating that multiple genetic abnormalities might be involved in HNSCC tumorigenesis. Univariate Cox analysis and LASSO Cox regression analysis identified eight (CBX3, GNA12, P4HA1, PLAU, PPL, RAB25, EPHX3, and HLF) prognosis-related genes. CBX3 promotes cell proliferation and predicts poor prognosis in glioma, and P4HA1 is a biomarker of unfavorable prognosis in malignant melanomas (28,29). Aberrant expression of RAB25 promotes tumorigenesis of skin squamous cell carcinoma (30), while HLF down-regulation promotes distant metastases in non-small cell lung cancer (31). We constructed the eight-gene prognostic model and calculated the risk score based on the model. Our results revealed that patients with a high-risk score had a worse prognosis than those with a low-risk score. To further assess the predictive accuracy of the model, two datasets were used for OS and DFS validation, respectively. Our results showed that based on the OS and DFS, the low-risk group exhibited a better trend in survival, compared with the high-risk group. Subgroup analysis also revealed that the prognostic model was significantly correlated with the survival of HNSC patients based on 12 subgroups. Our results were consistent with previous studies on CBX3, P4HA1, and RAB25. Limited studies on the function of GNA12, PLAU, PPL, EPHX3, and HLF are available. Our results might serve as a basis for future studies of these genes. Univariate and multivariate Cox regression analyses were performed to screen for independent prognostic factors, and construct a nomogram. Each prognostic parameter had a score, and the sum of the three prognostic parameter scores was used to predict the 3-and 5-year OS. The easy-to-use nomogram has its unique advantages in clinical practice compared with the AJCC staging system (32). A prognostic nomogram for patients with newly diagnosed lower-grade gliomas was constructed and validated in a large-scale Asian cohort and a free online tool for this nomogram is available for use in clinical practice (33). To identify occult peritoneal metastasis in patients with advanced gastric cancer, Han et al. developed and validated an individualized nomogram (14). There are several studies reporting nomogram construction and validation in various cancers, including gallbladder, prostate, and breast cancers (34)(35)(36). However, studies on head and neck cancer are not available. GSAV was performed to gain in-depth insights into the molecular functions of the eight-mRNA signature. A total of six pathways were found to be activated in the intersection of the TCGA-HNSC, GSE65858, and GSE41613 datasets, including early estrogen response, cholesterol homeostasis, oxidative phosphorylation, fatty acid metabolism, bile acid metabolism, and Kras signaling. The epithelial-mesenchymal transition pathway was found to be inhibited in the intersection of the three datasets. The estrogen receptor is closely related to breast cancer (37), and estrogen can significantly promote tumor growth by combining with the estrogen receptor (38). In another study, the cholesterol transporter links cholesterol homeostasis and tumor immunity (39). As one of the cancer signals, oxidative phosphorylation supports the development of a variety of cancers (40,41). Cancer cells have characteristic alterations in metabolisms, such as fatty acid metabolism and bile acid metabolism. Therefore, limiting the availability of fatty acids can control cancer cell proliferation (42). Kras signaling is essential for tumorigenesis and specific targeting of tumours using mutant KRAS has been used in clinical practice (43). Our results showed that the prognostic model is associated with the activation of the identified six pathways, which is consistent with previous findings. The phenomenon of epithelial-mesenchymal transition refers to when epithelial cells loosen cell-cell adhesion structures and promote tumor progression (44). Numerous studies have shown that epithelial-mesenchymal transition plays important roles in tumor initiation and malignant progression (45,46). Our eight-mRNA signature was reported to be associated with poor prognosis in HNSCC patients, which is also consistent with the inhibition of EMT. This study also has several limitations. First, in our current study, we presented bioinformatic evidence suggesting that the eight-mRNA signature can accurately predict the prognosis of HNSCC and constructed a nomogram. Data were generated from public databases, which lack experimental validation. Secondly, a prognostic model consisting of eight mRNAs was constructed and validated in our study. Several studies have investigated the functions of CBX3, P4HA1, and RAB25, but few studies on the function of GNA12, PLAU, PPL, EPHX3, and HLF are available. Therefore, more studies on the functions of these genes are needed in the future. Thirdly, to our knowledge, viruses have been implicated in head and neck cancers (47). In this study, we analyzed the correlations between the prognostic model and common clinical features, including TNM stage, grade, age, gender, pathological T stage, pathological N stage, and pathological M stage. However, EBV or HPV status was not included in the analysis and which is also very important when stratifying prognosis and in clinical decision making. Therefore, there is a need to explore the correlation between virus status and the 8-mRNA signature in HNSCC in the future.

CONCLUSION
A better understanding of the genetic factors and models for predicting prognosis and personalizing therapeutic decisionmaking in HNSCC is needed. In this study, LASSO Cox regression analysis is conducted to identify the eight-mRNA signature (CBX3, GNA12, P4HA1, PLAU, PPL, RAB25, EPHX3, and HLF) predicting the prognosis of HNSCC. Then, the constructed nomogram has been identified clinical utility. GSVA results show that six pathways are activated, while the epithelialmesenchymal transition pathway is inhibited in our gene signature.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://portal.gdc.cancer.gov/; https://www. ncbi.nlm.nih.gov/geo/. contributed to analysis of the data and drafted the manuscript. XH contributed to the conception, design, and analysis of the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
This work was supported by the Nature Science Foundation of Bengbu Medical College (BYKY18177) and Postdoctoral Innovation Project of Shandong Province in 2020 (no.20203068).