Genome-Wide Identification of a Novel Eight-lncRNA Signature to Improve Prognostic Prediction in Head and Neck Squamous Cell Carcinoma

Objectives: LncRNAs are essential survival prognostic indicators with important biological functions in tumorigenesis and tumor progression. This study aimed to establish a long non-coding RNA (lncRNA) signature that can effectively predict the prognosis of patients with head and neck squamous cell carcinoma (HNSCC) and explore the potential functions of these lncRNAs. Materials and Methods: We re-annotated RNA sequencing and obtained exhaustive RNA-seq data of 269 patients with comprehensive clinical information from the GEO database. Then an 8-lncRNA signature capable of predicting the survival prognosis of HNSCC patients and a nomogram containing this signature were established. Weighted Co-expression Network Construction (WGCNA), Gene Set Enrichment Analysis (GSEA), and Gene Ontology (GO) enrichment were then applied to predict the possible biological functions of the signature and each individual lncRNA. Results: Eight lncRNAs associated with survival in HNSCC patients, including AC010624.1, AC130456.4, LINC00608, LINC01300, MIR99AHG, AC008655.1, AC055758.2, and AC118553.1, were obtained by univariate regression, cox LASSO regression, and multivariate regression. Functionally, patients with high signature scores had abnormal immune functions via GSEA. AC010624.1 and AC130456.4 may participate in epidermal cell differentiation and skin development, and MIR99AHG in the formation of cellular structures. Other lncRNAs in the signature may also participate in important biological processes. Conclusions: Therefore, we established an 8-lncRNA signature that can effectively guide clinical prediction of the prognosis of patients with HNSCC, and individuals with high signature scores may have abnormal immune function.


INTRODUCTION
Head and neck squamous carcinoma (HNSCC), the most common and malignant carcinoma affecting the head and neck region, is also the sixth common cancer worldwide (1,2). In the past few decades, multidisciplinary therapy based on various combinations of surgery, chemotherapy, and radiotherapy has been applied for the management of HNSCC. In addition to the approved immune checkpoint inhibitors such as the second-line treatment (2016), no further progress has been made, resulting in ∼50% of HNSCC patients still dying from the disease (3)(4)(5). Meanwhile, the prognostic model currently used for HNSCC patients is based on clinicopathological parameters (CPPs), but many cases with the same clinical stage show different outcomes (5,6). Prognostic signatures based on mRNAs could not be applied clinically (7,8). Therefore, for HNSCC patients, an efficient prognostic model that can predict the survival prognosis of patients is urgently required.
Long non-coding RNAs (lncRNAs) are non-coding RNAs with more than 200 nucleotides in length. LncRNAs are important biomarkers for the diagnosis and prognosis of tumors because of higher tissue specificity and increased ease of detection compared with mRNAs (9)(10)(11)(12). Accumulating evidence indicates lncRNAs play vital roles in the progression and tumorigenesis of tumors, including HNSCC (12)(13)(14)(15)(16). For example, LncRNA MIR31HG promotes cell proliferation and tumorigenesis by targeting HIF1A and P21. NEAT1 promotes laryngeal squamous cell cancer by regulating the miR-107/CDK6 pathway. Overexpression of lncRNA H19 promotes the occurrence of HNSCC (17)(18)(19). With the development of the high-throughput sequencing technology, more and more lncRNAs have been discovered, and lncRNA signatures associated with HNSCC prognosis have been established, but the functions of lncRNAs in most signatures remain unknown (20)(21)(22)(23). Therefore, establishing an integrated lncRNA model associated with prognosis in HNSCC and predicting the functions of respective lncRNAs is of high importance for both patients and clinicians.
To identify lncRNAs associated with prognosis in HNSCC and guide clinical application, we integrated RNA-seq and clinical survival information of 269 patients from the GEO dataset and established a nomogram containing an 8-lncRNA signature. Functional enrichment was performed to predict the potential functions of the hub lncRNAs.

Patient Information Collection and Study Design
All HNSCC patients were collected from GSE65858 public database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc= GSE65858). Here are two criteria used to select desired samples: (1) patients with mRNA expression data and clinical data were selected; (2) survival time of patients was more than 10 days. The platform of mRNA expression analysis of GSE65858 was Illumina HumanHT-12 V4.0 expression beadchip (GPL10558). All selected expression datasets were log2transformed, then standardized.

Construction of lncRNA Expression Through Re-annotation
The Illumina probe sequences were obtained from the annotation file GPL10558 and uniquely mapped to the human genome (hg38) by NCBI blast without mismatch. Specific probes of lncRNAs were obtained by matching the chromosomal position of probes to the chromosomal position of lncRNA genes based on annotations from GENCODE (Release 29).
For the case where different probes correspond to the same gene, the expression of the gene is taken as the median. LncRNAs were selected based on the following criteria according to the expression values and the calculated median and standard deviation (SD). First, lncRNAs with nonzero values in more than 75% of the cases were included. Second, the median and SD of the lncRNA was required to be larger than 1.

FIGURE 2 |
The seed lncRNAs were extracted by 1,000 times Cox LASSO regression. (A) Highly consistency was demonstrated in the lncRNAs among the 8 extracted lncRNA sets. The left ordinate indicates the seed lncRNA set and the number of seeds lncRNAs found by every single iteration of LASSO. The right ordinate is the frequency of the seed lncRNA set disclosed through the 1,000 times Cox LASSO regression. The horizontal ordinate is the lncRNA name. The yellow block represents the occurrence of the particular lncRNA in the specific lncRNA set; (B) Totally 28 seed lncRNAs with >990 occurrences in the most common lncRNA set were filtered out for further analysis. The blue column indicates the frequency of each lncRNA occurs in the most common lncRNA set.

Cox Survival Analysis and Least Absolute Shrinkage and Selection Operator (LASSO) Regression With 10-fold Cross-Validation
The prognostic value of each lncRNA was firstly calculated in the univariate Cox analysis using R/survival package, and the lncRNAs with P < 0.01 were selected as seed lncRNAs for Cox LASSO regression with 10-fold cross-validation (CV). To identify the prognostic value of the lncRNAs, multivariate Cox regression was further performed using R/survival package based on each "significant" lncRNA disclosed in the above steps. A lncRNA with P < 0.05 was defined as significant. The corresponding hazard ratio (HR), 95% confidence interval (CI), and P-value were collected.

Development of an Individualized Prediction Model
The OS and Hazard ratios (HRs) were calculated by Kaplan-Meier algorithm and univariate Cox regression analysis, respectively. The log-rank method tested the differences between the survival curves. We used the following formula to construct a prognostic risk score model: risk score = expGene1 × βGene1 + expGene2 × βGene2 + exp Genen × βGenen (exp, prognostic gene expression level; β, multivariate Cox regression model regression coefficients). The gene signature score as a predictor for HNSCC patients was analyzed in the model. We find out the significant variables through the univariate Cox regression analysis. Candidate variables with a P-value < 0.2 on univariate analysis were included in multivariable model. Finally, multivariable Cox regression model began with the clinical candidate predictors as follows: T stage, M stage, and Score. The nomogram model was built by the coefficients of the multivariable Cox regression model.

Clinical Use
The R-script stdca was (https://www.rdocumentation.org/ packages/DecisionCurve/versions/1.4) used to do decision curve analysis (DCA) which can assess the clinical net benefit of different probability thresholds.

Gene Set Enrichment Analysis (GSEA)
The expression profile data were ranked according to the signature score, and the data were divided into high-risk group and low-risk group by the mean score. Then, we downloaded the h.all.v6.2.symbols.gmt (24) from the GSEA website, and analyzed our data using GSEA version 3.0.

Weighted Gene Co-expression Network Analysis
The "WGCNA" package in R was applied to performed coexpression network using the expression values of mRNA and lncRNAs screened above. Briefly, we constructed the weighted adjacency matrix using a power function based on a soft-thresholding parameter β. After that, the adjacency was transformed into topological overlap matrix (TOM), and average linkage hierarchical clustering was performed according to the TOM-based dissimilarity measure. In this study, we chose a minimum size (gene group) of 30 for the genes dendrogram and a cut-line (0.25) for module dendrogram and merged some modules.

Hub LncRNA and Module Function Annotation
To predict the molecular of each candidate lncRNA, lncRNArelated mRNAs were filtered out by WGCNA. The network visualization was performed by Cystoscope software version 3.5.1 (https://cytoscape.org/). An appropriate cutoff of p-values <0.05 and FDR <0.05 was used. The statistics and data visualization were performed by ClusterProfiler Package in RStudio.

Validation of MIR99AHG
We compared the expression of MIR99AHG and the relationship with overall survival in HNSCC and normal tissues with the available data from the gene expression profiling interactive analysis (GEPIA). GEPIA is an online tool that provides expression analysis functions for TCGA and GTEx data.

Patient Characteristics
The analysis procedure of the current study is shown in Figure 1.
The basic characteristics of the patients are listed in Table S1.
A total of 269 patients with HNSCC were retrieved from the GSE658585 dataset for further analysis. The male to female ratio was 4.72:1, and average age was 60.14 years. There were T3-4 (72.8%) and M0 (97.4%) cases; median survival time was 28 months, and 72.9% of all individuals had no HPV infection.

Identification of Eight lncRNAs for Predicting HNSCC Patient Survival
Using array re-annotation analysis, 1,506 lncRNAs were identified for prognostic significance in univariate Cox survival analysis, and 95 with P < 0.01 were filtered out and applied to 1,000 times Cox Lasso regression with 10fold CV. A total of 8 lncRNA groups were disclosed, and high consistency among the lncRNA sets was demonstrated (Figure 2A). In the most common lncRNA set, 28 lncRNAs were uncovered to show >990 occurrences and extracted for further analysis (Figure 2B).   An optimal cutoff value was selected to separate patients into low-risk and high-risk groups using the pROC package in R. Figure 3A shows that patients in the low-risk group had longer OS (p < 0.0001) than those of the high-risk group. Multivariate cox regression analysis included Score, Uicc_stage, T_category, N_category, M_category, and HPV_status as independent predictors. The results showed that factors with P < 0.05 were included in the prediction model (Table 3). Then, the model was also presented as a nomogram (Figure 3B). Figure 3C shows that calibration of the new prediction model was fitted; the results predicted by the model were basically consistent with the ideal results of the model incorporating the gene signature (C-index 0.75, 95% CI 0.73-0.78), which was more predictive than models not including it (C-index 0.68, 95% CI 0.65-0.71; Table 3). Figure 3D shows the DCAs for the prognostic prediction model and the model without the gene signature. The results showed that using the prognostic prediction model with the gene signature to predict the OS of patients could be of more benefit in the current model than the treat-all patients-or treat-none scheme. Compared with this model without the gene signature, the prognostic prediction model with the gene signature could bring greater benefits to patients.

GSEA of the Signature Score
The GSEA data showed that samples with high signature scores were mainly enriched in the hallmark of IL6 JAK SATA3 signaling, HALLMARK COMPLEMENT, and HALLMARK ALLOGRAFT REJECTION (P < 0.05; FDR < 0.2; Figure 4; Table 4).

Weighted Co-expression Network Construction (WGCNA)
A co-expression network was constructed using GSE65858, including 269 HNSCC samples with complete clinical data. The expression amounts of 5,000 genes including 4,992 mRNAs and 8 lncRNAs were analyzed for co-expression network constructing the "WGCNA" package. In the current study, to ensure a scale-free network, we selected β = 5 as the softthresholding power (Figure 5A) and identified a total of 13 modules (Figure 5B).

Identification of Hub lncRNAs in Modules and Function Annotation
Identifying modules most significantly related to clinical features is difficult. We found that the blue module was correlated with N category, T category, and Stage; N category showed the highest correlation (P = 5.1e-62; r = 0.58, Figure 5C).
To assess the functional involvement of the hub lncRNAs, their co-expression modules were determined via the WGCNA algorithm. We found AC010624.1 and AC130456.4 in the black module, LINC00608 and LINC01300 in the blue module, MIR99AHG in the brown module, AC008655.1 in the tan module, and AC118553.1 in the magenta module. All mRNAs in the modules, as well as mRNAs associated with the hub lncRNAs   assessed by the ClusterProfiler Package in the R software, are shown in Figures 6-10.

Validation of Hub lncRNAs
All hub lncRNAs were selected for validation using the TCGA and GTEx datasets. In the TCGA database, MIR99AHG expression was associated with overall survival (Figure 11A), with a difference in expression between cancer and adjacent tissues ( Figure 11B).

DISCUSSION
Using the mRNAs and lncRNAs selected after re-annotating the GSE65858 dataset of the GEO   curve analysis confirmed the prognostic significance and prediction superiority of the signature and nomogram. Further, GSEA of the signature score indicated that samples with high scores were mainly enriched in IL6/JAK/SATA3 signaling, complement, and allograft rejection, indicating HNSCC patients with poor prognosis might have dysfunctional immune systems.
To identify the potential functions and involved biological processes of each lncRNA in the signature, WGCNA was performed. The black module containing AC010624.1 and AC130456.4, the blue comprising LINC00608 and LINC01300, the brown containing MIR99AHG, the tan encompassing AC008655.1, and the magenta containing AC118553.1 were obtained. GO enrichment analysis of all genes and lncRNA associated genes in modules was performed. Compared with previously established signatures (20)(21)(22)(23), the current signature was more effective in predicting the prognosis of patients with HNSCC; more importantly, it indicated that the poor prognosis of high-score patients may be associated with changes in immune function. Furthermore, the possible biological functions of the lncRNAs contained in the signature were assessed, which provides new insights into possible treatment directions for HNSCC patients.
Previous studies have shown that the IL6/JAK/SATA3 pathway plays an important role in HNSCC (25)(26)(27), corroborating the current GSEA data. Drugs targeting the IL6/JAK/SATA3 pathway with low side effects are still under investigation (25)(26)(27). It can be inferred from the above signature that patients with high scores may benefit more from targeted drugs against the IL6/JAK/SATA3 pathway, which has a high guiding significance in future clinical applications. In addition, immune cells in the microenvironment such as M1/M2 macrophage, CD8+ T lymphocyte, NK cells, etc. can affect the tumor growth, progression, and cachexia of HNSCC by secreting IL-6. M2 macrophage can promote the proliferation of HNSCC, while CD8+ T lymphocyte, M1 macrophage, and NK cells can inhibit the progression of HNSCC via IL-6 (28). Therefore, we predicted the immune-related cells in the microenvironment of HNSCC through CIBERSORT and explored the relationship between lncRNAs in signature and these immune cells. It was found that lncRNA associated with good prognosis of HNSCC, such as AC010624.1 and MIR99AHG, was negatively correlated with NK cells resting and M2 macrophage (Figures S1A-D). Similarly, LINC01300 related to poor prognosis of HNSCC was negatively correlated with CD8+ T lymphocyte and M1 macrophage (Figures S1E,F). It provides a basis for the functional study of these lncRNAs in the IL6/JAK/STAT3 pathway and immune microenvironment, but the mechanism among lncRNAs and IL-6 remains to be explored. Besides, it had been reported HNSCC patients with high tumor mutational burden (TMB) appears worse prognosis (29). Our results showed AC010624.1 was negatively related with TMB, while LINC01300 was positively related with TMB (Figures S1G,H). LncRNAs in the signature play an important role in maintaining immune function in HNSCC, and the mechanisms need to be explored in the future.
WGCNA results showed that the black module containing AC010624.1 and AC130456.4 was closely related to EGFR mutation in clinical phenotypes, while GO enrichment data indicated that these two lncRNAs may both participate in epidermal cell differentiation and skin development, which is consistent with the potential function of the black module. EGFR is elevated in ∼90% of HNSCC patients and considered a negative prognostic factor for patients with HNSCC (30)(31)(32). Combining these findings, we hypothesized that Similarly, the brown module containing MIR99AHG was closely related to EGFR mutation in clinical phenotype. GO enrichment results indicated that the brown module was involved in the formation of cellular structures, such as the apical part of the cell, actin cytoskeleton, and apical plasma membrane, and MIR99AHG might also participate in this biological function. Previous studies have reported that MIR99AHG, also known as MONK, is a good prognostic indicator of breast cancer, lung squamous cell carcinoma, and colorectal cancer (33)(34)(35)(36). Our results also supported these conclusions. We further investigated the potential biological process by which MIR99AHG acts and may affect EGFR gene mutation. There is a great need to further explore the role of MIR99AHG in the prognosis evaluation and treatment of HNSCC and other cancers. Finally, we used the TCGA database for validation. Patients with high MIR99AHG expression had better OS, and MIR99AHG was differentially expressed between HNSCC and adjacent tissues. In the future, the biological significance of MIR99AHG in HNSCC should be assessed.
The blue module containing LINC00608 and LINC01300 was associated with T and N stages, and GO enrichment results showed the blue module may be involved in protein processing and presentation. The potential functions of both LINC00608 and LINC01300 were consistent with the blue module. The potential function enriched for AC008655.1 was the same as which the tan module, both of which might be related to the response to xenobiotics. The magenta module and AC118553.1 may be involved in the regulation of the extracellular matrix. LINC01300, AC008655.1, and AC118553.1 have not been previously reported. For the first time, we predicted the biological processes that may involve these lncRNAs, which deserve further investigation for predicting the prognosis of patients with HNSCC.
However, the limitations of this study should be mentioned. First, we did not apply an additional data set to validate the novel signature, which needs to be validated in large cohorts. Secondly, alcohol and tobacco consumptions are known prognostic factors of HNSCC (37)(38)(39), but the present signature did not fully consider the impact of these factors. This should be investigated in the future. Thirdly, HNSCC originated from different tissues, but the dataset we applied does not contain this information. Our signature lacked information of origination. Finally, we did not perform molecular biology experiments and clinical specimen validation to further explore the biological functions of hub lncRNAs involved in HNSCC.
In conclusion, we established an 8-lncRNA signature and a nomogram for predicting the prognosis of patients with HNSCC, and speculated that patients with a high signature score may have dysfunctional immune regulation, which may be a new direction of treatment for patients with HNSCC. Functional enrichment was performed to predict the potential functions of the lncRNAs contained in the signature, and for the first time, various biological functions and processes potentially altered by multiple lncRNAs were revealed. The 8-lncRNA signature is of great significance in evaluating patient prognosis and exploring new therapeutic targets for patients with HNSCC.

DATA AVAILABILITY
Publicly available datasets were analyzed in this study. This data can be found here: https://www.ncbi.nlm.nih.gov/geo/query/acc. cgi?acc=GSE65858.

AUTHOR CONTRIBUTIONS
ZL conceived the study. BY conducted all statistical analyses. JS reviewed relevant literature and drafted the manuscript. LX, YC, XC, XQ, YL, and YT provided guidance to the study. All authors read and approved the final manuscript.