Identification and Validation of an Immune-Related RNA Signature to Predict Survival of Patients With Head and Neck Squamous Cell Carcinoma

Head and neck squamous cell carcinoma (HNSCC) is a heterogeneous disease characterized by different molecular subgroups and clinical features. Therefore, it is important to uncover reliable molecular biomarkers for distinguishing different risk patient subgroup. Here, we conducted a multi-omics analysis to examine the joint predictive power of a multi-type RNA signature in the prognosis of HNSCC patients through integration analysis of mRNA, miRNA, and lncRNA expression profiles and clinical data in a large number of HNSCC patients. A multi-type RNA signature (15SigRS) was constructed which can classify patients into the high-risk group and low-risk group with the significantly different outcome [hazard ratio (HR) = 2.718, 95% confidence interval (CI), 2.258–3.272, p < 0.001] in the discovery data set, and subsequently validated in the Cancer Genome Atlas (TCGA) testing data set (HR = 1.299, 95% CI, 1.170–1.442, p < 0.001) and another independent GSE65858 data set (HR = 1.077, 95% CI, 1.016–1.143, p = 0.013). Further multivariate Cox regression analysis and stratification analysis demonstrated the independence of predictive performance of the 15SigRS relative to conventional clinicopathological factors. Furthermore, the 15SigRS has a prior performance in prognostic prediction than other single RNA type-based signatures. Functional analysis suggested that the 15SigRS are involved in immune- or metabolism-related KEGG pathways. In summary, our study demonstrated the potential application of mixed RNA types as molecular markers for predicting the outcome of cancer patients.


INTRODUCTION
Head and neck squamous cell carcinoma (HNSCC), the most frequent histological type of head and neck cancers, is the sixth most common cancers worldwide and account for nearly 5% of all malignancies worldwide (Marur and Forastiere, 2016).
Smoking tobacco, drinking alcohol, and human papillomaviruses (HPV) are important risk factors and have been implicated in the pathogenesis of HNSCC (Kobayashi et al., 2018). Surgery combined with radiation therapy, chemotherapy, and targeted therapy is the main treatment option. Although TNM stage has been considered as an important clinical prognostic factor for guiding treatment options, some patients with the same clinical features may have different prognosis because of molecular heterogeneity. Therefore, there is an urgent need to identify reliable biomarkers for predicting prognosis of HNSCC patients With advances in high-throughput omics technique, increasing efforts have been made to meet this urgent need. Some previous studies used gene expression data and identified some mRNAbased signatures. For example, Bai and colleagues identified a 12gene signature for predicting progression and prognosis (Bai et al., 2019) Another six-mRNA signature was identified by Tian et al. to predict the death risk of HNSCC patients using gene expression profiles in the Cancer Genome Atlas (TCGA) (Tian et al., 2019). Recently, non-coding RNAs (ncRNAs) have been found to be an important class of RNA molecules and are involved a wide range of biological processes (Fatica and Bozzoni, 2014;Bracken et al., 2016). The dysregulation of ncRNAs has been implicated in various human diseases including cancers (Esteller, 2011), demonstrating the role of ncRNAs as a potential biomarker in cancer diagnosis, prognosis, and treatment (Li et al., 2014;Gonzalez et al., 2015;Jiang et al., 2016;Zhou et al., 2017;Zhou et al., 2018a;Zhou et al., 2018b;Zhou et al., 2019). For HNSCC, recent studies have revealed the altered expression of ncRNAs in the development and progression of HNSCC (Salyakina and Tsinoremas, 2016;Sannigrahi et al., 2018), and several miRNAor lncRNA-related signatures were identified to improve clinical outcome (Irani, 2016;Wong et al., 2016;Cao et al., 2017;Liu et al., 2018;Diao et al., 2019). However, previous signatures often focus on one type of RNAs, and the joint predictive power of multiple types of RNAs was not investigated yet.
In this study, we tried to investigate the joint predictive power of multi-type RNAs as novel prognostic biomarkers by integrating mRNA expression profiles, miRNA expression profiles, lncRNA expression profiles, and clinical data in a large number of HNSCC patients.

Patient Data Set
RNA-Seq data (HTSeq), miRNA expression data (Illumina HiSeq), and corresponding clinical data were derived from the TCGA database (https://cancergenome.nih.gov/). Ensembl gene id of mRNAs, miRNAs, and lncRNAs were derived from HUGO Gene Nomenclature Committee (HGNC) database (https://www. genenames.org/). After cross-referenced by Ensembl gene id and tumor barcodes and removing patient samples without survival information and genes with zero expression values in more than 10% samples, a total of 19,163 mRNAs, 3,931 lncRNAs, and 1,854 miRNAs in 489 patients were obtained. All patients were randomly split into two equal patient cohorts: discovery data set (n = 245) and validation data set (n = 244). Another independent validation data set including 270 HNSCC patients was obtained from the Gene Expression Omnibus (GEO) database under the accession number GSE65858 (https://www.ncbi.nlm.nih.gov/ geo/query/acc.cgi?acc=GSE65858). Clinical features of HNSCC patients used in this study can be seen in Table 1.

Identification of Survival-Related a Multi-Type RNA Prognostic Signature
To identify survival-related genes, univariate Cox proportional hazards analyses were used to identify candidate prognostic mRNAs, miRNAs, and lncRNAs. Candidate prognostic mRNAs, miRNAs, and lncRNAs were retained only if they have significant p values (p < 0.05). Then these candidate prognostic mRNAs, miRNAs, and lncRNAs were fitted in a multivariable Cox regression analysis to identify independent survival-related genes. Finally, multi-type RNA prognostic signature was constructed as the linear combination of expression values of each independent survival-related mRNAs, miRNAs, and lncRNAs, weighted by their estimated regression coefficients in the multivariate Cox regression analysis according to previous studies (Zhou et al., 2015a;Zhou et al., 2015b).

Statistical Analysis
Kaplan-Meier survival curve analysis and a log-rank test were used to compare differences in overall survival (OS) time between the high-risk group and low-risk group. Univariate and multivariate Cox regression analyses were performed on the individual clinical variables with and without the multi-type RNA prognostic signature in each data set. Hazard ratios (HRs) and 95% confidence intervals (CIs) were calculated. The timedependent receiver operating characteristic (ROC) curve at 3 and 5 years was then calculated to compare the sensitivity and specificity of survival prediction. Hierarchical clustering of the expression values of independent prognostic gene biomarkers was performed using the metric of Euclidean distance and complete linkage. The chi-square test was used to test the significance of survival status between two groups. All statistical analyses were performed using the R/Bioconductor (version 3.0.2).

Functional Enrichment Analysis
GO and KEGG functional enrichment analysis was performed using Bioconductor package "clusterProfiler" (Yu et al., 2012).

Identification of Independent Survival-Related mRNAs, miRNAs, and lncRNAs
To identify survival-related mRNAs, miRNAs, and lncRNAs, we performed univariate Cox regression analysis to evaluate the association between expression of each type of RNA and OS in the discovery data set. A total of 23 mRNAs, 15 lncRNAs, and 1 miRNAs were found to be significantly associated with OS, and were considered as candidate prognostic mRNAs, miRNAs, and lncRNAs. Then all these candidate prognostic mRNAs, miRNAs, and lncRNAs were fitted into multivariate Cox regression analysis, 15 of 39 genes were identified as independent prognostic gene biomarkers. Hierarchical clustering of the expression values of 15 independent prognostic gene biomarkers revealed two distinctive sample clusters in the discovery data set ( Figure 1A). The survival status of two distinctive sample clusters is significantly different (dead 57.8% vs. 23.5%, p = 9.349e -08 , chi-square test). Survival analysis suggested that the OS time between the two sample clusters was significantly different ( Figure 1B, p < 0.001, logrank test). Similar results also were observed in the validation data set. Two distinctive sample clusters also were obtained using hierarchical clustering analysis ( Figure 1C). These two distinctive sample clusters have significantly different survival status (dead 55.5% vs. 34.1%, p = 0.002, chi-square test) and survival time ( Figure 1D, p < 0.001, log-rank test). These results revealed the potential of these 15 candidate independent prognostic genes as biomarkers in the prognosis of HNSCC patients.

Establishment and Evaluation of a Multi-Type RNA Prognostic Signature in Predicting Survival in the Discovery Data Set
To establish a multi-type RNA prognostic signature for survival prediction, these 15 candidate independent prognostic genes were fitted in a multivariate Cox regression analysis in the discovery data set. Then a multi-type RNA prognostic signature (15SigRS) were constructed according to the expression of 15 prognostic genes and multivariate Cox regression coefficient as the weight using risk scoring method as described previously, as follows: 15SigRS = (0.5344*CDH6)+(1.0462*CYP19A1)+(0.4723* TRPA1)+(0.2764*PPARG)+(0.0068*KRT84)+(−0.2291*FGD3)+ (0.3113*ADGRE1)+(−0.7948*SLC25A45)+(0.4878*OXCT2)+ (−4.0659*OTUD7A)+(1.2231*FAM198B-AS1)+(−0.3978* LINC00968)+(1.8352*LINC01123)+(0.1240*ZBED5-AS1)+ (−0.0602*MIR4664). We computed a 15SigRS for each HNSCC patient and classified patients into the high-risk group or low-risk group with the cutoff point of median risk score (−0.04) in the discovery data set. Using the 15SigRS, 245 patients in the discovery data set were divided into high-risk (n = 123) and low-risk groups (n = 122). We found that the survival time of the high-risk group is significantly shorter than the low-risk group (Figure 2A, p < 0.001, log-rank test). The time-dependent ROC curves analysis for the15SigRS achieved an area under the ROC curve (AUC) of 0.781 at 3 years and 0.768 at 5 years ( Figure 2B). The distribution of risk scores and survival status of patients and expression patterns of 15 prognostic genes in the 15SigRS were shown in Figure 2C.

Independent Confirmation of the 15SigRs for Survival Prediction in the Validation Data Set and TCGA Data Set
To evaluate the robustness of prognostic performance of the15SigRS, the 15SigRS was tested in the independent validation data set. With 15SigRS and cutoff derived from the discovery data set, all 244 patients in the validation data set also were classified into the high-risk group (n = 119) and low-risk group (n = 125). As shown in Figure 3A, patients in the low-risk group showed a better outcome than those in the high-risk group ( Figure 3A, p < 0.001, log-rank test). The time-dependent ROC curves analysis for the15SigRS achieved an AUC of 0.658 at 3 years and 0.663 at 5 years ( Figure 3B). In univariate analysis, the HRs of high-risk group versus low-risk group for OS were 1.299 (p < 0.001, CI, 1.170-1.442) ( Table 2). A similar analysis also was performed in the TCGA data set. The patients of TCGA data set were segregated into a high-risk group (n = 242) and low-risk group (n = 247) with significantly different OS ( Figure 3C, p < 0.001, log-rank test). The timedependent ROC curves analysis for the15SigRS achieved an AUC of 0.681 at 3 years and 0.649 at 5 years ( Figure 3D). In univariate analysis, the HRs of high-risk group versus low-risk group for OS were 1.496 (p < 0.001, CI, 1.393-1.606) ( Table 2).

Further Confirmation of the 15SigRs for Survival Prediction in GEO Data Set With Microarray Platform
Further validation of the 15SigRS for survival prediction was performed using another independent data set (GSE65858) of 270 patients with microarray platform (Illumina HumanHT-12 V4.0). Finally, expression value of 9 mRNAs of the 15SigRS can be obtained from GSE65858. With the same score model, the 15SigRS could distinguish between patients with high and low risks of death ( Figure 4A, p = 0.021, log-rank test). The OS rate of patients in the low group were 69.9% at 3 years and 59.2% at 5 years, respectively, which is significantly higher than that (60.6% at 3 years and 37% at 5 years) in the high-risk group. The AUC of time-dependent ROC curves analysis is 0.581 at 3 years and 0.595 at 5 years ( Figure 4B). In univariate analysis, the HRs of high-risk group versus low-risk group for OS were 1.077 (p = 0.013, CI, 1.016-1.143) ( Table 2).
We next performed a stratification analysis of smoking and alcohol. A total of 279 patients with smoking information were firstly divided into two patient data sets: smoking-light data set (n = 119) and smoking-heavy data set (n = 160). Using the 15SigRS, patients in the smoking-light data set could be subdivided into a high-risk group and low-risk group with the significantly different outcome ( Figure 5A, p = 0.005, log-rank test). Similar results were observed when the 15SigRS was tested in the smoking-heavy data set ( Figure 5B, p < 0.001, log-rank test). Then 478 patients with alcohol information were divided into two patient data sets: alcohol-no data set (n = 154) and alcohol-yes data set (n = 324). Using the 15SigRS, patients in the alcohol-no data set could be subdivided into the high-risk group (n = 83) and low-risk group (n = 71) with the significantly different outcome ( Figure 5C, p < 0.001, log-rank test). Similar results were observed when the15SigRS was tested in the alcoholyes data set ( Figure 5D, p < 0.001, log-rank test). Multivariate and stratification analysis shows that the predictive power of the 15SigRS was independent of other clinicopathological factors for survival prediction in a patient with HNSCC.

Performance Comparison of the 15SigRs With the Single RNA Type-Based Signatures
We then performed a comparative analysis for predictive performance of the 15SigRS with other single RNA type-based signatures. We performed ROC analysis and computed AUCs for 15SigRS and the other three types of RNA signatures in three data sets, respectively. As shown in Figure 6A, the15SigRS achieved a better prediction performance with an AUC value of 0.79 in the discovery data set, which is higher than other three types RNA signatures (mRNA-based signature AUC = 0.777, lncRNA-based signature AUC = 0.574, and miRNA signature AUC = 0.539). The 15SigRS also performed well in the validation data set and TCGA data set compared with other three types of RNA signatures (Figures 6B, C). Taken together, the 15SigRS generated by our approach has a prior performance in prognostic prediction than other single RNA type-based signatures.

Functional Characteristics of the 15SigRs
To further explore the potential function of the 15SigRS, we first calculated the Pearson correlation coefficient between expression levels of mRNAs and lncRNAs in the 15SigRS and identified ranking top 5% mRNAs as lncRNA-related mRNAs. Then we performed GO and KEGG functional enrichment analysis for these lncRNA-related mRNAs. Results of GO enrichment analysis suggested that these lncRNA-related mRNAs are enriched in immune-or cell differentiation-related GO terms ( Figure 7A). Results of KEGG enrichment analysis suggested that these lncRNA-related mRNAs are enriched in immune-or metabolism-related KEGG pathways ( Figure 7B).

DISCUSSION
The molecular landscape has highlighted that HNSCC is a heterogeneous disease characterized by different molecular subgroups and clinical features (Leemans et al., 2018). Despite improvements in diagnosis and treatment for HNSCC patients, different patient subgroups with different molecular features and same TNM stage might benefit from effective personalized treatment options. Therefore, it is critical to identify reliable molecular biomarkers for distinguishing different risk patient subgroup. Although increasing efforts have been made to meet this need, previously reported gene signatures involved in only one type RNA such as mRNAs, lncRNAs, and miRNAs. Cooperative roles among different RNA molecules have been unveiled in cancer development and progression (Zhou et al., 2016a;Zhou et al., 2016b;Pan et al., 2019;Zhu et al., 2019). Therefore, in this study, we performed a systematic analysis to examine the joint predictive power of a multi-type RNA signature in the prognosis of HNSCC patients through integration analysis of mRNA, miRNA, and lncRNA expression profiles and clinical data in a large number of HNSCC patients.
Because of the limitation in available HNSCC patient data with paired mRNA profiles, miRNAs, lncRNA profiles, and clinical data, TCGA HNSCC patient data were first split randomly into two independent patient data sets for the purpose of discovery and independent validation. Then we identified 15 RNA genes (including 10 mRNAs, 4 lncRNAs, and 1 miRNA) as independent biomarkers and constructed a 15-RNA signature (15SigRS) which can classify patients into the high-risk group and low-risk group with a significantly different outcome in the discovery data set. Furthermore, the 15SigRS was further validated in the independent patient data set which revealed the performance robustness in survival prediction. Further multivariate Cox regression analysis and stratification analysis demonstrated the independence of predictive performance of the 15SigRS relative to conventional clinicopathological factors, such as age, gender, stage, grade, race, smoking, and drinking, both in discovery data set and validation data set.
Among 15 RNAs in the signature, several RNAs have been reported to be associated with cancer development and prognosis. For example, ADGRE1 encodes F4/80 antigen which was expressed in immune cells and used as a monocytemacrophage marker in mice (Waddell et al., 2018). KRT84 has been reported to be up-regulated in squamous cell carcinoma and involved in metabolic pathways (Koringa et al., 2016).   Sancisi found that CDH6 was highly expressed in thyroid tumor patients and could be as a regulator of invasiveness in thyroid tumors (Sancisi et al., 2013). The pan-cancer analysis suggested that hsa-mir-4664 was over-expressed in eight cancers (Hu et al., 2018). Low LINC00968 expression has recently reported associated with poor prognosis in breast cancers by attenuating drug resistance (Xiu et al., 2019). To gain a global view for the biological function of the 15SigRS, we performed a GO and KEGG function enrichment analysis which indicated that the 15SigRS may be involved in immune-or metabolism-related biological function. These are several limitations in our study that need to be noted. First, only some of 15 RNAs in the 15SigRS have been experimentally studied, and other remaining RNAs should be investigated in further experiments which may provide new therapeutic target in HNSCC. Second, the 15SigRS was validated in only one independent patient data set because of data limitations, and more patient data sets were expected to validate the performance of the 15SigRS for accelerating the clinical application. Taken together, our study identified a novel multi-type RNA signature associated with the clinical outcome of HNSCC patients. This signature may be a novel independent molecular prognostic marker for selecting high-risk patients which may benefit from more individualized treatment.

AUTHOR CONTRIBUTIONS
SW conceived and designed the experiments. SW, XD, and DX performed the experiments and analyzed the data. SW wrote the paper. All authors read and approved the final manuscript.