The Identification of Stemness-Related Genes in the Risk of Head and Neck Squamous Cell Carcinoma

Objectives This study aimed to identify genes regulating cancer stemness of head and neck squamous cell carcinoma (HNSCC) and evaluate the ability of these genes to predict clinical outcomes. Materials and Methods The stemness index (mRNAsi) was obtained using a one-class logistic regression machine learning algorithm based on sequencing data of HNSCC patients. Stemness-related genes were identified by weighted gene co-expression network analysis and least absolute shrinkage and selection operator analysis (LASSO). The coefficient of LASSO was applied to construct a diagnostic risk score model. The Cancer Genome Atlas database, the Gene Expression Omnibus database, Oncomine database and the Human Protein Atlas database were used to validate the expression of key genes. Interaction network analysis was performed using String database and DisNor database. The Connectivity Map database was used to screen potential compounds. The expressions of stemness-related genes were validated using quantitative real‐time polymerase chain reaction (qRT‐PCR). Results TTK, KIF14, KIF18A and DLGAP5 were identified. Stemness-related genes were upregulated in HNSCC samples. The risk score model had a significant predictive ability. CDK inhibitor was the top hit of potential compounds. Conclusion Stemness-related gene expression profiles may be a potential biomarker for HNSCC.


INTRODUCTION
Head and neck squamous cell carcinoma (HNSCC) was one of the most common cancers worldwide, with~870,000 new cases and~440,000 deaths in 2020 (1). It involves cancers of the oral cavity (40%), pharynx (25%), and larynx (15%). The major risk factors for HNSCC include betel nut chewing, alcohol consumption, tobacco consumption, and HPV infection (2).
Cancer stemness (CS) postulates that the growth of a tumor is fueled by limited numbers of dedicated stem cells capable of self-renewal. It represents the degree of oncogenic dedifferentiation (3,4). Many studies have revealed the molecular mechanisms of CS. Redox and JAK/STAT3 signaling pathways were found to influence breast cancer stem cell state (5,6). The Osteopontin-CD44 signaling pathway was found to enhance cancer stem cell phenotypes in glioma (7). Prostaglandin E2 was found to promote colorectal cancer stem cell expansion (8). Epithelial-mesenchymal transition (EMT) induced mitochondrial fusion through regulation of the miR200c-PGC1a-MFN1 pathway, directing the asymmetric division of mammary stem cells (9). Previous studies have shown that Oct4, Sox2 and Nanog were associated with oral cancer stem-like status (10,11). Increasing evidence points to the significance of stemness-related biomarkers in solid tumors, but their role in the risk of HNSCC has not been evaluated. It is essential to identify stemness-related biomarkers and underlying mechanisms associated with CS of HNSCC.
A one-class logistic regression (OCLR) machine learning algorithm published recently was a useful method to quantify cancer stemness index. In this algorithm, a predictive model was constructed and trained in stem cell samples, which can be applied to tumor samples (4). Previous studies have shown the application of the OCLR algorithm in some cancers (12)(13)(14). Weighted gene co-expression network analysis (WGCNA) identifies gene modules related to clinical traits through clustering thousands of genes, then uses intramodular connectivity and gene significance based on the correlation of a gene expression profile with a sample trait to identify stemnessrelated genes for further research. Hence WGCNA identifies significant genes in the biological process more efficiently (15). The least absolute shrinkage and selection operator (LASSO) is a regression analysis method that performs both variable selection and regularization. It enhances the prediction accuracy and interpretability of the resulting statistical model (16,17).
In this study, a systematic analysis of the relationship between the stemness-related genes and the risk of HNSCC was provided. A flow chart outlining the whole study is shown in Figure 1A.

Study Subjects
Some 44 newly diagnosed head and neck squamous cell carcinoma (HNSCC) patients without any treatment before the operation were enrolled from the Stomatology Hospital of Nanjing Medical University (Nanjing head and neck cancer cohort, NJHNCC). Surgically resected tumors and adjacent normal oral epithelial tissues were obtained from these patients. Clinical information was obtained through medical records while demographic information including age, gender, smoking, alcohol and Betel nut chewing history was collected from each patient using a standard questionnaire. Frozen tumors and matched normal specimens were confirmed by two independent pathologists. Samples were frozen in liquid nitrogen. HNSCC tissues with the malignant cell purity of over 70% were selected for RNA extraction and sequencing. The study was approved by the institutional review board of Nanjing Medical University. Clinical and demographic information is provided in Table 1.

Total RNA Extraction and RNA Sequencing
Total RNA was extracted from 44 matched tumor-normal samples using the RNeasy Mini Kit (Qiagen, Hilden, Germany). The quality and quantity of extracted RNA were assessed using the NanoDrop 2000 (Thermo Fisher Scientific, Wilmington, DE, USA), Qubit 2.0 Fluorometer (Life Technologies, CA, USA) and 1% agarose gel electrophoresis. RNA integrity was assessed using the RNA Nano 6000 Assay Kit (Agilent Technologies, CA, USA).
A total of 3 mg of high-quality RNA per sample was used for ribosomal RNA removal by the Epicentre Ribo-zero rRNA Removal Kit (Epicentre, USA) and the sequencing library was prepared using the rRNA-depleted RNA by the NEBNext Ultra Directional RNA Library Prep Kit for Illumina (NEB, USA)

Deriving Stemness Index (mRNAsi) Using Machine Learning
Briefly, the original algorithm was obtained from the previously published study (4). We derived stemness signatures using the one-class logistic regression (OCLR) machine learning algorithm trained on gene expression profiles of stem cell samples from the PCBC database. We applied the stemness signatures to RNA-seq data of NJHNCC and obtained stemness index (mRNAsi). The higher mRNAsi, the higher degree of stemness. The workflow to generate stemness index was as follows: a) download gene expression profiles of stem cell samples from PCBC database (syn2701943); b) map Ensembl IDs to HUGO; c) identify stem cell sample; d) train a one-class logistic regression model; e) perform leave-one-out cross-validation of the one-class logistic regression model; f) export stemness signatures; g) load and preprocess NJHNCC-RNAseq data; h) calculate raw stemness index using Spearman's correlation; i) standardization of raw stemness index; and j) export stemness index in the interval (0,1). The process was done in R 3.6.1.
The flowchart of deriving stemness index (mRNAsi) was shown in Supplemental Figure 1. The R code was shown in Supplemental File R code. The stemness signatures of the predictive model were shown in Supplemental Table 1.
Wilcoxon test was used to determine the significance of mRNAsi differences between tumor and normal samples. The representative stemness markers of HNSCC were obtained from a published study (21). The correlations between mRNAsi and HNSCC stemness markers were calculated based on the Linear regression (21).

Differential Gene Expression Analysis
The R limma package was used to screen differentially expressed genes between normal and tumor samples. The Wilcoxon signed-rank test was used to identify the differentially expressed genes. The RNA-seq data was transformed by log 2 (x + 1) for normalization. The screening criteria was |log 2 fold change (FC)| >1 and the false discovery rate (FDR) <0.05.

Weighted Gene Co-Expression Network Analysis
WGCNA was conducted with differentially expressed genes. Differentially expressed gene profiles were screened to delete outliers. A correlation matrix was constructed with the absolute values in transcriptome data. An adjacency matrix was constructed by the soft threshold. A topological overlap matrix was then constructed. The average linkage hierarchical clustering was conducted on TOM-based dissimilarity. Clusters were detected by the dynamic hybrid cut method.
Module eigengene was regarded as the dominant component in the principal component analysis of gene expression for the module by eigengene network clustering. In general, the correlations between the modules and mRNAsi were evaluated by Pearson's correlation. The correlation between genes and mRNAsi was evaluated by gene significance (GS), which was defined as the Pearson's correlation between gene expression and mRNAsi. The correlation between genes and module was evaluated by module membership (22), which was defined as the Pearson's correlation between gene expression and module eigengenes. The candidate genes were identified with the threshold: correlation coefficients of GS >0.5, correlation coefficients of MM >0.8.

Least Absolute Shrinkage and Selection Operator Analysis
We constructed a diagnostic risk score model for HNSCC by candidate genes using the LASSO analysis. The diagnostic risk model was constructed with the following formula: risk score = expression level of Gene 1 × b1 + expression level of Gene 2 × b2 + …+ expression level of Gene n × bn. b is the coefficient calculated by LASSO analysis. The receiver operating characteristic (ROC) curves were used to assess the diagnostic value.
The characteristic of LASSO regression is that when the generalized linear model is established, it includes onedimensional continuous dependent variable, multi-dimensional continuous dependent variable, non-negative number dependent variable, binary discrete dependent variable and multiple discrete dependent variable. In addition, whether the dependent variable is continuous or discrete, LASSO regression can solve it. In general, LASSO regression has a wide range of applications. In addition, LASSO regression can also perform variables screen and reduce the complexity of the model, which refers to controlling the complexity of the model through a series of parameters to avoid overfitting. The complexity of LASSO regression is controlled by l. The larger the l, the greater the penalty for the linear model with variables. Thus, a model with fewer variables is finally obtained.

Analysis of Gene Expression
The expressions of stemness-related genes between normal and tumor samples in RNA-seq data of NJHNCC were evaluated. The validation of stemness-related genes in the TCGA and GEO database was performed. The validation of stemness-related genes in the Oncomine database (http://www.oncomine.org) was performed with threshold: P = 0.05, fold change = 2 and gene rank = 10%. The expression of the proteins encoded by the stemness-related genes was analyzed using clinical specimens from the Human Protein Atlas database (https://www. proteinatlas.org) (22).

Network Analysis of Co-Expressed Genes
Co-expressed genes were obtained from the WGCNA gene module with the threshold: GS >0.5 and MM >0.8. The relationships of co-expressed genes were evaluated in RNA-seq data of NJHNCC using Pearson's correlation. The proteinprotein interactions of co-expressed genes were evaluated in the STRING database (https://string-db.org/). The causal interactions of co-expressed genes were evaluated in the DisNor database (https://disnor.uniroma2.it/). The neighbor genes indicated possible upstream or downstream genes.

Functional Annotation of Co-Expressed Genes
Co-expressed genes obtained from the WGCNA gene module were used to perform biological function analysis. Gene Ontology (GO) (18) and Kyoto encyclopedia of genes and genomes (KEGG) enrichment analyses were conducted. Biological process (BP), cellular components (CC) and molecular function (MF) were included in GO enrichment analysis. The threshold was P <0.05.

Screening of Potential Compounds
The connectivity map (CMap) database (https://clue.io/cmap) was a comprehensive resource to explore the underlying associations among genes, chemicals, and biological conditions. We utilized the CMap to screen potential compounds that targeting HNSCC stemness genes. We evaluated co-expressed genes and obtained relative compounds.

Quantitative Real-Time Polymerase Chain Reaction (qRT-PCR) Assay
The primary HNSCC tissues and adjacent normal oral epithelial tissues were obtained from 35 HNSCC patients who received radical surgery in the Stomatology Hospital of Nanjing Medical University from January 2019 to June 2020. Frozen tumors and matched normal specimens were confirmed by two independent pathologists. Patients who received chemotherapy, radiotherapy, or any other medical intervention were excluded. Tissue specimens were stored in the −80°C freezer. The study was approved by the institutional review board of Nanjing Medical University. We extracted the total RNA using TRIzol reagent (Invitrogen, USA) and reagent Kit (TAKARA, Japan) according to the manufacturer's protocol. The cDNA was reversed by PrimeScript ™ RT reagent Kit (TAKARA, Japan). The qRT-PCR was performed on a 7300HT system (ABI, USA) using SYBR Premix Ex Taq II kit (TAKARA, Japan). GAPDH was the reference gene. The relative expression level was calculated by the 2 −DDCt method. The sense and antisense primers used are listed below:

Obtaining and Evaluation of Stemness Index (mRNAsi)
After constructing the predictive model in PCBC stem cell samples and applying it in NJHNCC, we obtained the stemness index (Supplemental Table 2). The difference of mRNAsi between 44 tumor and 44 matched normal samples from NJHNCC was evaluated ( Figure 1B). The mRNAsi of 44 matched normal samples were significantly lower than 44 tumor samples (Paired t-test, P = 0.0399). Representative stemness markers (CD44, SOX2) were selected. The good fitness between mRNAsi and stemness markers was observed ( Figures 1C, D).

Identification of Candidate Stemness-Related Genes by WGCNA
Some 2,925 differential expressed genes were obtained by differential gene expression analysis of RNA-seq data from NJHNCC (Wilcoxon test, FDR <0.05), including 1,294 upregulated genes and 1,631 downregulated genes (Figure 2A and Supplemental Table 3).
To identify biologically significant gene modules related to stemness, WGCNA was performed. Unsupervised hierarchical clustering showed that gene modules were obtained ( Figure 2B with soft threshold b = 10 and scale-free R 2 = 0.9 (Supplemental Figure 2). Focusing on cancer stemness, the correlations between modules and mRNAsi were evaluated ( Figure 2C). The black module turned out to be the most significant module positively correlated with mRNAsi (cor = 0.43, P = 0.01). The tan module exhibited a negative correlation with mRNAsi (cor = −0.69, P <0.01). Thus, the link between gene module and cancer stemness was established.

Identification of Stemness-Related Genes by LASSO and Construction of a Predictive Model for Diagnosis
Some four genes from candidate stemness-related genes were identified using LASSO in RNA-seq data from NJHNCC: TTK, KIF14, KIF18A, and DLGAP5. The coefficients of stemness genes and the minimize l method screening out four genes were shown ( Figures 2E, F). The diagnostic risk scores for HNSCC were calculated using the following formula: Risk score = −1.081 × TTK exp + 1.364 × KIF14 exp + 1.651 × KIF18A exp + 0.965 × DLGAP5 exp . The coefficients were obtained from LASSO.

Evaluation of Stemness-Related Genes' Expression and Risk Score Model
The expression of stemness-related genes was evaluated in gene expression profiles from NJHNCC ( Figure 3A), GEO database ( Figure 3B), TCGA database ( Figure 3C). Four stemness-related genes were significantly high-expressed in tumor samples (P <0.01). In the Oncomine database, the stemness-related genes were upregulated not only in HNSCC but also in other cancers ( Figure 3D). All of them were in the top 1% of gene rank and each of them had evidence from multiple cancers.

Detected Expression of Stemness-Related Genes in HNSCC Tissues
We analyzed the expression of the proteins encoded by the stemness-related genes using clinical specimens from the HPA database. TTK ( Figure 4A) was moderately positive and KIF14 ( Figure 4B) was strongly positive in HNSCC metastatic lymph nodes. KIF18A ( Figure 4C) was moderately positive and DLGAP5 ( Figure 4D) was strongly positive in HNSCC tissue relative to their expression levels in normal tissue.

Network Analysis of Co-Expressed Genes
We explored the correlation between eight co-expressed genes by Pearson's correlation in RNA-seq data of NJHNCC. The result showed that the correlations between the expression of coexpressed genes were significant (P <0.05). Each of the stemness genes was highly correlated with other genes ( Figure 5A). Also, we explored the relationship of co-expressed genes in the STRING database and it showed that they had significant correlations with other genes ( Figure 5B and Supplemental Figure 3).
To investigate the causal interactions of co-expressed genes, we inferred neighbors of them in the DisNor database. The result revealed that PLK1, CENPE and TP53 might be downstream genes. DNA damage might lead to the upregulation of TTK. BUB1 might upregulate mitotic activity ( Figure 5E).

Biological Function Analysis of Co-Expressed Genes
To explore the biological function of co-expressed genes, GO and KEGG analyses were performed. The results showed that co-expressed genes were enriched in biological processes including chromosome segregation, nuclear division and organelle fission in GO analysis and cell cycle pathway in KEGG analysis (Figures 5C, D).
In addition, genes in the module that negatively related to stemness were also analyzed and the result showed that they were related to extracellular matrix organization (Supplemental Figures 4A, B).

Screening of Potential Compounds for Stemness
We screened potential compounds targeting the stemness of HNSCC in the CMap database. The top compounds were exhibited, along with mechanisms of action ( Figure 6A). The top mechanism of action was the CDK inhibitor ( Figure 6B).

The Expression of Stemness-Related Genes in Tissues
The expressions of stemness-related genes in HNSCC and adjacent normal oral epithelial tissues were evaluated by qRT-  Figure 5). GAPDH was the reference gene. The expressions of TTK, KIF14, KIF18A and DLGAP5 were upregulated in HNSCC tissues compared to normal tissues (P <0.001), which were consistent with our findings.

DISCUSSION
A recent study has reported that cancer stem cell plays an important role in cancer growth, progression and therapy resistance (23). It indicates that the role of cancer stemness in HNSCC is worth investigating. In this study, we used the machine learning algorithm to learn the gene signature of stem cells and applied it in RNA-seq data of NJHNCC. Then, we identified stemness-related genes of HNSCC and constructed a diagnostic risk score model.
From normal tissue to precancerous lesions and cancer, one essential signature is the process of oncogenic dedifferentiation. The outcome that the stemness of tumor samples was higher than normal samples corroborated this. Some certain mutations contribute to the oncogenesis of dedifferentiated cells rather than mature cells, indicating that cell dedifferentiation status could be one of the oncogenic factors (23). In stemness index analysis, it was associated with stemness marker CD44 and SOX2. The results showed that the stemness index had the ability to distinguish the degree of cancer stemness efficiently.
Next, we obtained candidate stemness-related genes using WGCNA and stemness index. Then we constructed the diagnostic risk score model for HNSCC using LASSO. The predictive model was externally validated and it indicated a robust predictive ability.
With regard to the stemness-related genes, TTK was reported to regulate colon cancer progression via PI3K/AKT pathway (24). Kinesins are a family that plays many roles in intracellular transport or cell division. KIF14 and KIF18A were involved in the progression from prometaphase to metaphase stage of mitosis and cytokinesis (25). DLGAP5 was involved in the progress of centrosome-independent mitotic spindle assembly, which is essential for the survival and proliferation of SMARCA4/BRG1 mutant cells. This mechanism was reported in non-small cell lung cancer (26). Among the stemness-related genes, TTK was the cancer driver gene with a significant driver level according to the OncoVar database (27). Though the majority of somatic mutations are harmless, somatic alterations in specific cancer driver genes confer a selective growth advantage for the cells and can lead to cancer (28). It indicates the possibility of investigating cancer stemness in terms of somatic alterations and related cancer driver genes.
With regard to the co-expression network of stemness-related genes, PLK1 promotes EMT and metastasis in gastric carcinoma (29). PLK1 has been identified as a potential target to enhance the therapeutic sensitivity of paclitaxel-resistant prostate cancer (30). CENPE is the largest kinesin protein. The role of CENPE in microtubule kinetochore capture contributes to chromosome congression and alignment (25). Cell division cycle protein was reported to impair ionizing radiation-induced cell apoptosis and promote EMT in nasopharyngeal carcinoma. The expression of CDC6 was upregulated steadily after acute ionizing radiation exposure (31). It indicates that ionizing radiation could lead to oncogenic dedifferentiation, but the mechanism remains unclear.
The signaling pathway that stemness-related genes and coexpression network are most involved in is PI3K/AKT pathway. PI3K/AKT pathway has been reported to regulate EMT (32). Also, the oncogenesis of hepatocellular carcinoma could be activated via PI3K/AKT pathway (33). With regard to cancer stemness, research on PI3K/AKT pathway is still lack. Signaling pathways including Hedgehog, Notch, and Wnt pathways have been reported to maintain cancer stemness status (34). Among stemness-related genes and their co-expression network, TTK and PLK1 have been reported to involve in these pathways. In B-cell acute lymphoblastic leukemia, activation of Notch signaling leads to cell-cycle arrest and  suppresses PLK1 (35). PLK1 inhibition or depletion enhances the level of cytosolic and nuclear b-catenin in human prostate cancer cells (36). Cell lines harboring activating mutations in the CTNNB1 gene, encoding the Wnt pathway signaling regulator b-catenin, were on average up to five times more sensitive to TTK inhibitors than cell lines wild-type for CTNNB1 (37). EMT has also been involved in the generation of cancer stem cells (38). High expression of the EMT-TFs ZEB1in cancer cells activated the expression of stemness factors SOX2 (39). The ZEB1/miR-200/BMI1 pathway was involved in pancreatic cancer stem cells, which explained how EMT enabled stemness (40). Mesenchymal and stemness traits are known to characterize cancer stemness cell, which is responsible for tumor metastasis and resistance to conventional therapy (41,42). In stemness-related genes and co-expressed genes we identified, PLK1 and CDC6 have been reported to involve in the progress of EMT (29,31). The role of other stemness-related genes in EMT is worth investigating. The potential compounds targeting the HNSCC stemness were screened. Top potential drugs and mechanisms of action were found. The CDK inhibitor was the top hit, following by the PLK inhibitor. To date, High-dose cisplatin, given concurrently with radiotherapy as part of a definitive chemoradiotherapy regimen, is the standard of care, with established survival benefits for patients (43). The compounds targeting CDK and PLK inhibitors may bring novel insights into HNSCC chemotherapy.
In summary, the comprehensive transcriptome profiling of HNSCC using the OCLR algorithm was conducted and stemness-related genes were identified. We also constructed a risk score model for HNSCC. The co-expression network of stemness-related genes was analyzed. CDK and PLK inhibitors  play an important role in HNSCC chemotherapy. The application of machine learning in HNSCC transcriptome research brings new perspectives into tumor research. Considering gaps of cell based exploration remained in this study, more experiments need to be conducted to evaluate the effect of stemness-related genes, which will strengthen the elucidation of the diversity and heterogeneity in cancer research. We foresee conducting more researches to further investigate cancer stemness.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories.

ETHICS STATEMENT
The study was approved by the Institutional Review Board of Nanjing Medical University. The patients/participants provided their written informed consent to participate in this study.