Identification of Candidate Biomarkers and Analysis of Prognostic Values in Oral Squamous Cell Carcinoma

Objectives: Oral squamous cell carcinoma (OSCC) is the most common oral cancer with a poor prognosis owing to limited understanding of the disease mechanisms. The aim of this study was to explore and identify the potential biomarkers in OSCC by integrated bioinformatics analysis. Materials and Methods: Expression profiles of long non-coding RNAs (lncRNAs), microRNAs (miRNAs), and messenger RNAs (mRNAs) were downloaded from The Cancer Genome Atlas (TCGA) and differentially expressed RNAs (DERNAs) were subsequently identified in OSCC by bioinformatics analysis. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were used to analyze DERNAs. Then, the competing endogenous RNA (ceRNA) network was constructed in Cytoscape and the protein -protein interaction (PPI) network was established in the STRING database. We established a risk model to predict the overall survival of OSCC on the basis of DElncRNAs with Kaplan–Meier analysis and combined with logrank p test. Furthermore, we identified potential biomarkers by combining univariate Cox regression with overall survival rate, which were then validated in Gene Expression Omnibus (GEO), OSCC cell lines and OSCC specimens. Results: A total of 1,919 DEmRNAs, 286 DElncRNAs and 111 DEmiRNAs were found to be dysregulated in OSCC. A ceRNA network included 46 DElncRNAs,7 DEmiRNAs and 10 DEmRNAs, and the PPI network included 712 DEmRNAs including 31 hub genes. Moreover, a 7 lncRNAs risk model was established and four genes (CMA1, GNA14, HCG22, HOTTIP) were identified as biomarkers on overall survival in patients with OSCC. Conclusions: This study successfully constructed a ceRNA network and a PPI network which play a crucial role in OSCC. A risk model was established to predict the prognosis, and four DERNAs are revealed with overall survival in patients with OSCC, suggesting that they may be potential biomarkers in tumor diagnosis and treatment.


INTRODUCTION
Oral squamous cell carcinoma (OSCC) is one of the most common oral cancers worldwide (1) and has the most severe impact on the quality of life of patients. The most significant risk factors in OSCC are cigarettes, alcohol and betel nut consumption, which seem to have a synergistic effect. One recent study showed that it may be associated with the infection of human papillomavirus (HPV) (2). Although medical equipment and treatment methods are more advanced in recently years, the 5-year overall survival rate of OSCC remains only 40-50% (3) on account of relatively low responsiveness to treatment, severe drug resistance (4), diagnosis at terminal stage, etc., whereas the 5-year overall survival rate can rise markedly to more than 85% in patients diagnosed with early-staged tumors (5). Therefore, early diagnosis is very important for improving the prognosis of patients with OSCC.
As is known, more than 70% of the human genome can be transcribed to functional products (6). Among them, long non-coding RNAs (lncRNAs), RNA transcripts longer than 200 nucleotides and those with limited protein-coding potential are in the majority. LncRNAs affect various aspects of cellular homeostasis in OSCC, including proliferation, survival, migration, and genomic stability (7). Moreover, cancer-specific lncRNA expression patterns appear more tissue-and stage-specific than those of protein-coding genes, indicating the potential development of lncRNAs as powerful alternative biomarkers and therapeutic targets (8). Meanwhile, recent research demonstrated that lncRNA combined with mRNA biomarkers may improve diagnosis (9). However, there are fewer effective lncRNAs and mRNAs biomarkers used for diagnosis in OSCC.
In 2011, Salmena et al. put forward a competing endogenous RNA (ceRNA) hypothesis (10). MiRNA as a regulatory molecule regulates mRNA expression by targeting the 3'-UTR (11), which typically inhibits the translation and the stability of mRNAs (12). However, lncRNAs can compete with the miRNA binding to the mRNA to regulate gene expression. Recently, increasing evidence indicated that this hypothesis was closely related to the development and initiation of OSCC. For example, lncRNA TUG1 through sponging miR-524-5p to mediate DLX1 expression promotes proliferative and migratory potentials in OSCC (13). In addition, proteinprotein interaction (PPI) also plays a crucial role in various cancers. PPI is composed of proteins interacting with each other to participate in various biological processes such as Abbreviations: OSCC, oral squamous cell carcinoma; TCGA, The Cancer Genome Atlas; DERNAs, differentially expressed RNAs; DEmRNAs, differentially expressed mRNAs; DElncRNAs, differentially expressed lncRNAs; DEmiRNAs, differentially expressed miRNAs; GEO, Gene Expression Omnibus; ceRNA, competing endogenous RNA; PPI, protein-protein interaction; DMEM, Dulbecco's modified Eagle's medium; FBS, fetal bovine serum; MREs, miRNA response elements; ROC, receiver operating characteristic; GO, Gene ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; ncRNAs, non-cording RNAs. 3'-UTR, 3'-Untranslated region. biological signal transmission, regulation of gene expression, energy metabolism, and cell cycle regulation. For example, c-Myc, MMP-2, and GSK3β regulated the expression of MMP9 to accelerate OSCC progression and invasion (14). However, fewer studies have been reported on the ceRNA and PPI networks in OSCC.
In this study, we aim to analyze the differentially expressed profile of non-coding RNAs(ncRNAs) in the OSCC and establish a Cox regression model to predict the overall survival of patients with OSCC. Further, we analyzed and predicted the functions of the ceRNA and PPI networks. This study will contribute to understanding the molecular mechanism and provide new therapeutic targets for OSCC.

Data Preparation
All primitive data of OSCC (oral cavity, floor of mouth, palate, buccal mucosa, the anterior 2/3 of the tongue, gingiva, and so on) from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/) were downloaded through GDC Data Transfer Tool, including the RNA-Seq and miRNA-Seq of Transcriptome Profiling and Clinical data. Then, we excluded three samples because of their lowquality clinical data. Finally, 316 OSCC samples and 32 normal control samples were collected in our study. Gene expression profiles of OSCC were obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm. nih.gov/geo/), including GSE9844 (26 OSCC samples and 12 normal controls), GSE30784 (167 OSCC samples and 45 normal controls), and GSE74530 (6 OSCC samples and 6 normal controls).

Collection of OSCC Specimens
A total of 49 pairs of OSCC specimens and normal adjacent tissues (about more than 1.5 cm from the edge of the tumor) were collected at Nanfang Hospital, Southern Medical University (Guangzhou, China), and written informed consent was obtained from all patients. The tissue specimens were stored in RNA Wait and then frozen at −80 • C. All tumor tissues and normal adjacent tissues were pathologically confirmed as squamous cell carcinoma and normal tissues, respectively.

Differentially Expressed Gene Analysis
EdgeR (Version 3.8) package in R software was used to analyze and identify differentially expressed RNAs (DERNAs) and differentially expressed miRNAs (DEmiRNAs) with the thresholds of |log2 (fold change [FC])|>2.0 and FDR (adjusted P-value) <0.01 (15). Then, we used the annotation file in GTF format (Homo_sapiens.GRCh38.95.chr.gtf) to identify and annotate differentially expressed long non-coding RNA (DElncRNAs) with the thresholds of |log2FC|>2.0 and FDR<0.01. The heatmap and volcano were constructed by the gplots package in R software.

Functional Enrichment Analysis
We employed DAVID (https://david.ncifcrf.gov/) to obtain information for Gene Ontology (GO) including biological processes, the cellular component and molecular function. The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis was used to annotate the potential functions. A significance level of P < 0.05 was set as the cutoff criteria and the plots were constructed by the gplots package in R software.

Protein-Protein Interaction Analysis
The DEmRNAs were enrolled in a protein-protein interaction (PPI) network through the STRING database (https:// string-db.org/) with a confidence score >0.9, and the PPI network was visualized in Cytoscape (Version 3.7.1) software. Moreover, genes with degree>=25 were selected as hub genes. Subsequently, module analysis (16) of the PPI network was performed using the Molecular Complex Detection (MCODE) tool of Cytoscape software, and GO and KEGG analysis of the modules was carried out using the DAVID database.

Cox Risk Regression Establishment and Validation
The lncRNAs raw data were transformed and normalized in a log 2 (x+1) manner (21). OSCC samples were randomly divided into a training set and a validation set. Univariate Cox regression was used to select prognosis-associated genes (p < 0.05). Subsequently, we performed Cox regression analysis combined with LASSO to establish a prognostic risk score model, and the penalty regularization parameter lambda (λ) was chosen through cross-validation with an n-fold equal to 10 by using the R package glmnet (21). Lambda.min was identified to pick out the variables. According to these variables, a stepwise regression was performed to establish the Cox model. Finally, a validation set and Kaplan-Meier survival curves along with a logrank p test were applied to validate its accuracy. In addition, receiver operating characteristic (ROC) analysis was used to estimate the predictive power of this signature.

Western Blot Analysis
Cells and OSCC tissues were lysed in RIPA lysis buffer. According to the procedure, proteins were separated by electrophoresis, transferred to membranes and then sealed with 5% skim milk. The primary antibodies including CMA1 (dilution 1:1,000), GNA14 (dilution 1:1,000) and α-tublin (dilution 1:1,000) were incubated in 4 • C for one night. Subsequently, goat anti-mouse and goat anti-rabbit secondary antibodies were incubated for 1 h at room temperature, and finally immunoreactive bands were visualized with a chemiluminescence system.

Biomarkers Screening and Validation
The status and survival time of OSCC patients were extracted. Subsequently, the mRNA was enrolled in the PPI and ceRNA networks, and lncRNA and miRNA identified in ceRNA were selected for screening biomarkers. We used univariate Cox regression to screen prognostic factors (p < 0.05), and those prognostic factors whose expression levels were significantly relevant to patients' overall survival (p < 0.01) were selected as primitive biomarkers. Ultimately, the gene expression profile from the GEO and OSCC cell lines and tissues were used to verify these biomarkers. In addition, a combination of two or more biomarkers was performed to predict OSCC overall survival according to the gene expression in TCGA.

Statistical Analysis
Statistical analyses were performed using SPSS23.0 software (IBM). The differences between groups were tested using a twotailed Student's t-test. The survival analysis between two groups was conducted by logrank test. P-values < 0.05 were considered statistically significant. Differences were considered significant if * p < 0.05; * * p < 0.01; * * * p < 0.001; or * * * * p < 0.0001.

Functional Analysis of DERNAs
GO and KEGG enrichment analysis were used to explore the potential function of DERNAs. The results indicated that these genes were mainly enriched in tissue development, cell differentiation and calcium binding (Figures 2A,B). Moreover, the majority of genes were located in the extracellular region. In addition, KEGG pathway analyses demonstrate that the most significant pathways were the calcium signaling, protein digestion and absorption and focal adhesion pathways (Figures 2C,D).

Protein-Protein Interaction (PPI) Network Analysis
A total of 712 proteins and 3,181 edges were selected in the PPI network. The confidence score is >0.9. A total of 31 hub genes were selected from PPI network with degree>=25 ( Figure 3A). Among them, 6 mRNAs including GNA14, GRM5, KRT3, KRT26, TACR1, and HTR2C were related to overall survival. In addition, according to module analysis, three modules were identified in the PPI network (Figures 3B-E). Surprisingly, most hub genes, including all of these associated with overall survival (Supplementary Figure 1), were enriched in module 2 indicating that module 2 plays a vital role in the PPI network. According to the GO terms analysis, three modules related to cell adhesion, tissue development, and protein binding played a crucial role in cancer. With respect to KEGG enrichment analysis, modules 1 and 2 regulated metabolic pathways such as protein digestion and absorption (Tables 1, 2). Moreover, modules 1 and 3 were significantly relevant to the PI3K-Akt signaling and calcium signaling pathways, which were associated with the occurrence and development of tumors ( Table 3).

Construction of ceRNA Network in OSCC
A total of 46 lncRNAs, 10 mRNAs and 7 miRNAs were enrolled in the ceRNA network ( Figure 4A). We employed miRcode to predict the potential miRNA target by DElncRNAs. As a result, 46 of 286 DElncRNAs and 7 of 111 DEmiRNAs formed 119 lncRNA-miRNA pairs ( Table 4). Then, miRDB, miRanda and TargetScan were used to predict the miRNA-mRNA pairs. Only the miRNA-mRNA interactions that exist in all three databases were brought into the ceRNA network ( Table 5). Finally, there were 10 DEmRNAs could target to the 7 miRNAs ( Figure 4B). Therefore, according to Table 6, the ceRNA network including 46 lncRNAs, 10 mRNAs and 7 miRNAs was completely constructed, and the lncRNA-miRNA-mRNA network was visualized in Cytoscape (Version 3.7.1).

Establishment and Validation of Cox Regression Model
A total of 316 OSCC samples were randomly divided into a training set and validation set. Subsequently, combined univariate Cox regression with a LASSO Cox regression model along with 10-fold cross-validation, 11 variables (AC073130.1, AFAP1-AS1, AQP4-AS1, C11orf97, HOTTIP, HOXA11-AS, LINC00460, LINC01234, LINC01391, SLC8A1-AS1, and SRGAP3-AS2) were identified (Figures 5A,B). Furthermore, a stepwise regression was performed NKX2-1-AS1 C8orf49 LINC00520 patients were opposite (Figures 5D,E). Correspondingly, the protective genes in high risk group expression level were lower than low group. On the contrary, risky genes were higher in high-risk group (Figure 5F). Kaplan-Meier curves along with logrank p test were used to evaluate its accuracy in the training and validation set. According to the risk formula, obvious differences in survival analysis were determined in high-and low-risk groups (Supplementary Figures 2A,B). Meanwhile, stratified analysis in disparate grades was further carried out and indicated that risk level was relevant to prognosis (Supplementary Figures 2C,D). Subsequently, the ROC was plotted, and its area Under the Curve (AUC) is 0.776 (Supplementary Figure 2E).

Screening Biomarkers
A total of 6 genes (GNA14, CMA1, DKK1, HOXC6, HCG22, and HOTTIP) were identified as primitive biomarkers. The results of univariate Cox regression indicated that there were 36 mRNAs ( Table 7-3) and 6 lncRNAs ( Table 7-1) regarded as prognostic factors (p < 0.05). However, none of the miRNAs were related to prognosis ( Table 7-2). Meanwhile, following the combined Kaplan-Meier curves with logrank p test, the genes were clearly associated with overall survival (p < 0.01) and selected as primitive biomarkers. Finally,4 mRNAs (Figures 6A-D) and 2 lncRNAs (Figures 6E,F) were identified. Subsequently, GEO expression profiles were used to verify the 6 biomarkers. As expected, most of the biomarker expression levels in GEO were consistent with TCGA. However, lncRNA HOTTIP upregulated in TCGA and there was no difference in GEO.

Validation for Biomarkers
A total of 2 mRNAs (GNA14 and CMA1) and 2 lncRNAs (HCG22 and HOTTIP) were differentially expressed in OSCC cell lines and OSCC tissues. Our results revealed that CMA1, GNA14, and HCG22 had low expression in OSCC cell lines and tissues. However, lncRNA HOTTIP was highly expressed in tumor tissues compared with adjacent normal tissues. Meanwhile, Kaplan-Meier analysis suggested that low expression of GNA14, CMA1, and HCG22 were related to poor survival (Figures 7A-C). High HOTTIP expression showed obviously poorer overall survival than those with low HOTTIP expression ( Figure 7D). Unfortunately, there was no difference for DKK1 and HOXC6 in OSCC tissues, indicating that the 2 mRNAs might not be biomarkers in OSCC (Supplementary Figure 3A). Meanwhile, the protein levels of CMA1 and GNA14 were detected in OSCC cell lines and tissues (Supplementary Figure 3B). Finally, a combination of these biomarkers can predict overall survival better (Supplementary Figures 3A,B).

DISCUSSION
In the past decades, the 5-year survival rate of OSCC has remained low (3) in spite of advances in surgical treatment and radiotherapy. Hence, it is vital to find and identify biomarkers for early diagnosis and prognosis of OSCC.
In this study, a total of 1,919 DEmRNAs, 286 DElncRNAs and 111 DEmiRNAs were identified. GO analysis revealed that the function of DERNAs is mainly associated with tissue development, cell differentiation and calcium binding, which play a vital role in tumorigenesis. In addition, KEGG pathways analysis showed that DERNAs mainly enriched in the protein digestion and absorption, calcium signaling and focal adhesion pathways. Among these pathways, the calcium signaling and focal adhesion pathways are significantly associated with cancers. For example, abnormal Ca2+ level is related to glioblastoma and gastric adenocarcinoma (22,23). In addition, research shows that focal adhesion is relevant to therapy resistance and plays a vital role in carcinogenesis, tumor progression and metastasis (24). In the present study, 712 mRNAs were enrolled in the PPI network, and module analysis was performed. The majority of these genes were relevant to the PI3K-Akt signaling and calcium signaling pathways, which are associated with occurrence and development of tumors (25,26). Meanwhile, the PI3K-Akt signaling pathway also played a significant role in OSCC (27). Six hub genes associated with overall survival-GNA14, GRM5, KRT3, KRT26, TACR1, and HTR2C-were enriched in module 2, indicating that module 2 plays a vital role in the PPI network and OSCC prognosis. To our surprise, GNA14 was identified as a potential biomarker by PCR assay and as a hub gene by bioinformatics analysis, which indicated that GNA14 may play a crucial role in OSCC carcinogenesis.
In our ceRNA network, 2 lncRNAs (HOTIP,HCG22) were identified as prognostic biomarkers. Recently, increasing studies showed that aberrant expression of HOTTIP is associated with various cancers. For instance, knockdown of HOTTIP suppresses growth and invasion and induces apoptosis of oral tongue squamous carcinoma cells (28). However, HOTTIP expression has no difference in GEO, which may be associated with racial difference. In addition, bioinformatics analysis confirmed that HCG22 is differentially expressed in various cancers (29), and the mechanism in OSCC should be researched.
Because of poor prognosis and high recurrence, we constructed a risk score formula by comprehensive analysis of lncRNA to predict overall survival in OSCC. Finally, 7 lncRNA were enrolled in Cox regression and it can predict overall survival accurately. In oral cancer, overexpression of the lncRNA AFAP1-AS1 is associated with the proliferation, invasion and survival of tongue squamous cell carcinoma via     (32). In addition, GNA14 was identified as biomarkers and a hub gene, suggesting that GNA14 was obviously relevant to OSCC. Our analysis shows that lncRNA HOTTIP and HCG22 are also biomarkers of OSCC. HOTTIP and HCG22 can interact with hsa-mir-21 in the ceRNA network, which promote oral cancer invasion via the Wnt/β-catenin pathway (33). Accordingly, exploring their mechanism in OSCC may provide new therapeutic targets. Furthermore, DKK1 and HOXC6 were excluded because there was no difference in mRNA level in OSCC tissues. The reasons may be racial difference or limited sample size.
We successfully constructed the ceRNA network, which reveals the potential mechanisms between lncRNA and mRNA. It may provide a new vision on therapeutic targets in OSCC by exploring the underlying mechanisms. At the same time, we also constructed a 7 lncRNA risk score model which is positively associated with overall survival in OSCC. We can put forward reasonable therapy at an appropriate time according to the risk model. Meanwhile, determining revisiting patients and followup improve the overall survival in OSCC based on risk level. Eventually, based on the 4 biomarkers, it may be beneficial to realize early-stage diagnosis and provide new therapeutic targets in OSCC.
Though the study might have crucial clinical importance, we still need to consider several limitations. First, in terms of sample numbers, both the TCGA database and clinical specimens which were collected at Nanfang Hospital are far from inadequate. We need to collect more information to continue verifying its accuracy. Second, the information from TCGA may show bias. Although we have validated it in cell lines and clinical specimens, we should continue to do further research. Third, the function and mechanism of biomarkers in OSCC need to be further studied in vivo and in vitro.

CONCLUSION
In summary, our study identifies that 2 mRNAs and 2 lncRNAs might be novel important prognostic factors and potential treatment targets for OSCC. Furthermore, we established a disordered lncRNA-miRNA-mRNA ceRNA network which is beneficial to understanding the relationship between lncRNA and mRNA and provides efficient strategies for subsequent functional studies of lncRNAs. Meanwhile, construction of the PPI network provides a new vision for OSCC treatment, and the risk score model is helpful for improving the overall survival in OSCC.

ETHICS STATEMENT
The study protocol was approved by the Ethics Committees of Nanfang Hospital of Guangdong Province (NO: NFEC-2018-027).

AUTHOR CONTRIBUTIONS
GH: design and initiation of the study, quality control of data and algorithms, data analysis and interpretation, manuscript preparation and editing. QW: data acquisition. ZZ: statistical analysis. TS: patient recruitment and clinical support and oversight. X-ZL: study concept, design and initiation of the study. All authors are revision and approval the final version of the paper.