A Comprehensive RNA Expression Signature for Cervical Squamous Cell Carcinoma Prognosis

Clinicopathological characteristics alone are not enough to predict the survival of patients with cervical squamous cell carcinoma (CESC) due to clinical heterogeneity. In recent years, many genes and non-coding RNAs have been shown to be oncogenes or tumor-suppressors in CESC cells. This study aimed to develop a comprehensive transcriptomic signature for CESC patient prognosis. Univariate, multivariate, and Least Absolute Shrinkage and Selection Operator penalized Cox regression were used to identify prognostic signatures for CESC patients from transcriptomic data of The Cancer Genome Atlas. A normalized prognostic index (NPI) was formulated as a synthetical index for CESC prognosis. Time-dependent receiver operating characteristic curve analysis was used to compare prognostic signatures. A prognostic transcriptomic signature was identified, including 1 microRNA, 1 long non-coding RNA, and 6 messenger RNAs. Decreased survival was associated with CESC patients being in the high-risk group stratified by NPI. The NPI was an independent predictor for CESC patient prognosis and it outperformed the known clinicopathological characteristics, microRNA-only signature, gene-only signature, and previously identified microRNA and gene signatures. Function and pathway enrichment analysis revealed that the identified prognostic RNAs were mainly involved in angiogenesis. In conclusion, we proposed a transcriptomic signature for CESC prognosis and it may be useful for effective clinical risk management of CESC patients. Moreover, RNAs in the transcriptomic signature provided clues for downstream experimental validation and mechanism exploration.


INTRODUCTION
Cervical cancer (CC) is still the fourth most common cancer in women (Ferlay et al., 2015). Despite developed countries are low epidemic areas of CC by virtue of easier accessibility of routine screening test and human papillomavirus (HPV) vaccination, CC is still the second leading cause of cancer death among women aged 20-39 years in the United States in 2015 (Siegel et al., 2018). At present, clinical stage is the leading predictive characteristic for CC prognosis, although useful, significant variability is observed and the 5 years survival rate is still poor for women with advanced CC (30-40% for stage III and 15% for stage IV). Theoretically, clinicopathological characteristics are macroscopic emergence of molecules (e.g., genes, proteins) and CC patients with homogeneous clinical status may have completely diverse molecular patterns. Therefore, identification of robust and accurate molecular biomarkers for CC patient prognosis is valuable and in urgent need.
By comprehensively characterizing various molecules (DNAlevel, RNA-level, protein-level) in 100s of CC samples, The Cancer Genome Atlas (TCGA) has provided a comprehensive way to understand CC (The Cancer Genome Atlas Research Network et al., 2017). Enormous multiple omics data make the discovery of potential biomarkers for CC diagnosis, treatment and prognosis possible. Several studies have investigated the molecular signatures for CC prognosis based on the expression of CC genome. Hu et al. (2010) profiled 96 cancer-related microRNAs (miRNAs) in 102 CC samples and firstly proposed a two-miRNA expression signature for predicting the overall survival (OS) of CC patients. How et al. (2015) measured the miRNA omics of CC samples by miRNA arrays and proposed a prognostic nine-miRNA expression signature in their training set. However, the prognostic value of the nine-miRNA expression signature could not be validated in an independent cohort (How et al., 2015). Liu et al. (2016), Liang et al. (2017), Ma et al. (2018), and Ying et al. (2018) proposed a seven-miRNA expression signature, a three-miRNA expression signature, a three-miRNA expression signature, and a 2 two-miRNA expression signature for CC prognosis based on TCGA miRNA sequencing data, respectively. Huang et al. (2012) profiled 1440 human tumor related gene transcripts using custom oligonucleotide microarrays in 100 CC samples and identified a prognostic seven-gene expression signature. Based on TCGA gene sequencing data, Li et al. (2017b proposed a two-histone family gene signature and further proposed another independent gene signature to predict the OS of CC patients. However, some limitations should be noticed: (1) Previous studies focused on single omics independently, and there lacks a whole transcriptomic analysis which may provide more comprehensive and robust discovery (Hu et al., 2010;Huang et al., 2012;How et al., 2015;Liu et al., 2016;Li et al., 2017bLiang et al., 2017;Ma et al., 2018;Ying et al., 2018). (2) Prognostic miRNA signatures identified based on the same data source without cross-references are very different (Liu et al., 2016;Liang et al., 2017;Ma et al., 2018;Ying et al., 2018). (3) For prognostic miRNAs, previous studies did not distinguish miRNA isoforms (3p-arm or 5p-arm). Thus, it is unclear which isoform should be further investigated by experiment (Hu et al., 2010;How et al., 2015;Liu et al., 2016;Liang et al., 2017;Ma et al., 2018;Ying et al., 2018). (4) Pathologically, CC includes cervical squamous cell carcinoma (CESC) and cervical adenocarcinoma (CADC). Because there are significant differences in prognosis between CESC and CADC (Jung et al., 2017), it is not appropriate to mix them for identification of prognostic biomarkers. (5) Semi-parametric survival analysis method such as Cox regression analysis is loosely used without checking the proportional hazards (PH) assumption (Hu et al., 2010;Huang et al., 2012;How et al., 2015;Liu et al., 2016;Li et al., 2017bLiang et al., 2017;Ma et al., 2018;Ying et al., 2018). (6) To identify prognostic signatures from high-dimensional omics data, classical multivariate Cox regression analysis (MCA) is usually impeded by the "curse of dimensionality" (i.e., low sample size and large number of variables), which leads to over-fitting and unstable estimation of regression coefficients.
To address these limitations, we submit the transcriptomic data of CESC patients to a Least Absolute Shrinkage and Selection Operator (LASSO) penalized MCA (Simon et al., 2011) to identify a transcriptomic signature for CESC prognosis.

Data Acquisition
Level 1 clinical data, level 3 transcriptomic sequencing data, and the corresponding metadata of CCs were retrieved and downloaded from TCGA 1 repository in January 2018. Search strategies can be obtained in Section I of the Supplementary Material. CC patients were included in this study by the following criteria: (1) CC patients diagnosed as CESC; (2) CESC patients with at least 60 days follow-up; (3) CESC patients have both clinical data, gene sequencing data, and miRNA isoform sequencing data; (4) CESC patients do not have prior other malignancies; and (5) CESC patients do not receive any preoperative neoadjuvant therapy.

Data Preprocessing
Clinical eXtemsible Markup Language files were parsed by R "XML" package and the R code can be achieved in Section II of the Supplementary Material. Details on sequencing data preprocessing can be also available in Section II of the Supplementary Material. Hierarchical clustering was used to cluster samples to detect sample outliers and guided principal component analysis (gPCA) (Reese et al., 2013) was adopted to evaluate batch effects of the sequencing data.

Identification of Prognostic Demographic and Clinicopathological Characteristics
Kaplan-Meier (KM) survival analysis with log-rank test was applied to evaluate the prognostic effects of age at diagnosis, clinical stage, menopause status, ethnicity, birth control pill usage, tobacco usage, and lymphovascular invasion for CESC. Furthermore, MCA with demographic and clinicopathological characteristics as covariates was adopted to evaluate their independence for CESC prognosis.

Univariate Survival Analysis of RNAs
The miRNA isoform sequencing data only include miRNAs while the gene sequencing data include both messenger RNAs (mRNAs) and long non-coding RNAs (lncRNAs). For convenience of description, we termed mRNA and lncRNA as gene and further termed miRNA, mRNA and lncRNA as RNA.
Associations between OS and RNA expression profiles were preliminarily evaluated by univariate Cox regression analysis (UCA). The proportional hazards (PH) assumption was tested by Schoenfeld residual (Grambsch and Therneau, 1994) and a unified multiple testing (Strimmer, 2008) was applied for tail area-based false discovery rate (FDR) estimation. RNAs with FDRs < 0.1 and PH assumption test P values >0.1 (Kleinbaum and Klein, 2012) were considered to be preliminarily associated with OS of CESC patients. Furthermore, RNAs with hazard ratios (HRs) > 1 were defined to be risky for CESC prognosis, and those with HRs < 1 were considered as protective.

Multivariate Analysis of Preliminarily Survival Associated RNAs
For miRNAs, stepwise MCA was applied to preliminarily survival associated miRNAs to construct an independent miRNAonly expression signature for CESC prognosis. The Bayesian information criterion (BIC) (Schwarz, 1978) was adopted for model selection.
Because the number of preliminarily survival associated genes was comparable to the number of samples, a LASSO penalized MCA with 10-fold cross validation and 1000 iterations was adopted to select genes by penalizing low regression coefficients exactly to zero. To alleviate the local minimum problem, we repeated the LASSO penalized MCA 10 times with different initializations and the model that achieved the minimal partial likelihood was adopted. Furthermore, stepwise MCA was applied to genes selected by the LASSO penalized MCA to construct an independent gene-only expression signature for CESC prognosis.
Finally, a transcriptomic signature for CESC prognosis was constructed by stepwise MCA with gene-only signature and miRNA-only signature as covariates.

Risk Score
A normalized prognosis index (NPI) defined as the standard form of a linear combination of the observed values weighted by the regression coefficients was adopted as a synthetical index for CESC prognosis. Specifically, where PI is a prognostic index vector and the jth element of PI is the prognostic index of the jth patient, i.e.
β i is the regression coefficient of the ith variable (in this context, the ith gene/miRNA); G ij is the observed value of the ith variable in the jth sample (in this context, the expression of the ith gene/miRNA in the jth sample). For miRNA-only signature, gene-only signature, the integrated RNA signature (i.e., transcriptomic signature), and the previously identified prognostic signatures, we termed the corresponding NPI as miRNA-NPI, gene-NPI, RNA-NPI, and pre-NPI, respectively.

Model Evaluation and Comparison
CESC patients were stratified into a high-risk group (NPI > 0) or a low-risk group (NPI < 0) based on NPI. OS between the high-risk group and the low-risk group was compared by KM survival analysis. MCA was used to evaluate the independence of various NPIs and clinical factors. The abilities of various NPIs to predict CESC patient survival outcome were assessed and compared by calculating the area under the curve (AUC) of the time-dependent receiver operating characteristic (ROC) at 3, 5, and 10 years, respectively.

Gene Ontology and Pathway Enrichment
Unlike other mRNA target prediction software just based on sequence alignment, miRTarBase provided experimentally validated mRNA targets of miRNA. Both strongly and weakly validated mRNA targets of the prognostic miRNA were obtained from miRTarBase (version 7.0) (Chou et al., 2018). Metascapae (Tripathi et al., 2015) 2 was adopted for gene ontology and pathway enrichment of the prognostic mRNAs and targets of the prognostic miRNA.

Statistical Analysis Tools
P-value less than 0.05 or adjusted P-value less than 0.1 was considered to be significant. All analyses were performed by R software. Non-parametric survival analysis, semi-parametric survival analysis, and PH assumption test were performed by survival and survminer packages. LASSO penalized MCA was conducted by textitglmnet package. Multiple test correction was conducted by fdrtool package. Time-dependent ROC analysis was conducted by timeROC package.

Available Data
TCGA CC dataset included 307 CC patients who had generated 312 samples for miRNA sequencing (including 307 primary CC samples, 2 metastatic CC samples, and 3 normal samples) and 309 samples for gene sequencing (including 304 primary CC samples, 2 metastatic CC samples, and 3 normal samples). Due to small number of metastatic and normal CC samples, we only analyzed the primary CC samples. Based on the inclusion criteria and lowexpressed RNA filtering (Supplementary Material Section III), 214 CESC samples covering 401 miRNAs and 13631 genes (mRNAs and lncRNAs) were retained. Hierarchical clustering showed that there existed a miRNA sample outlier and seven gene sample outliers. After removing sample outliers and scaling the expressions of miRNAs and genes to zero sample mean and standard deviation, 206 primary CESC samples were included for identification of prognostic signatures. Batch effect analysis showed that there was no obvious separation on the first two guided principal components for both miRNA isoform sequencing data and gene sequencing data (Figures 1A,C) with permutation test P-values of 0.598 and 0.947 (Figures 1B,D), respectively. These results indicated that there was no significant batch effect in the sequencing data.

Prognostic Demographic and Clinicopathological Characteristics
Of the 206 patients in TCGA-CESC cohort, 55 patients were deceased and 151 were alive at the time of last follow-up. The median OS time was 3097 days (95% CI: 2859-NA days). Demographic and clinicopathological characteristics for TCGA-CESC cohort are summarized in Table 1. Age at initial diagnosis (HR = 1.81, 95% CI: 0.92-3.58), clinical stage (HR = 2.14, 95% CI: 1.12-4.09), tobacco usage (HR = 2.36, 95% CI: 1.24-4.48), and lymphovascular invasion (HR = 13.70, 95% CI: 5.64-33.31) were negatively associated with OS of CESC patients revealed by KM survival analysis (Table 1 and Supplementary Figure  S1). MCA revealed that only lymphovascular invasion was an independent predictor for CESC prognosis (Table 1). However, information of lymphovascular invasion was not available in more than half of the CESC patients (n = 106). Considering age at initial diagnosis, clinical stage, and tobacco usage as covariates (i.e., without lymphovascular invasion), MCA revealed that only clinical stage was an independent prognostic clinicopathological characteristic ( Table 1).
To exclude potential effects of mutations on CESC prognosis, we also investigated the mutational patterns of the genes in the identified transcriptomic signature using OncoPrinter. Mutations of ZIC2, MTMR11, EGLN1, and TPST1 were found in two, five, one, and two of total 206 CESC samples, and none of these mutations were found in 197 CESC samples. MCA  further showed that the identified RNAs were independent predictors for CESC prognosis in the non-mutated samples ( Supplementary Table S3). Thus, the prognostic roles of the identified RNAs were merely caused by their expressions.
Moreover, MCA revealed that clinical stage, RNA-NPI, and pre-NPI were independent predictors for CESC prognosis (Table 3). However, when considering lymphovascular invasion  as covariate, RNA-NPI was the only independent prognostic index ( Table 3). These results demonstrated that the transcriptomic signature was a better predictor for CESC prognosis compared with clinicopathological characteristics and previous proposed signatures.

Angiogenesis Related Functions and Pathways
Sixty-four genes were validated as targets of hsa-miR-532-5p. Gene ontology and pathway enrichment analyses indicated that targets of hsa-miR-532-5p and the prognostic mRNAs in the transcriptomic signature were mainly associated with angiogenesis ( Figure 5). Angiogenesis is a main hallmark of tumor progression and may be an independent prognostic factor in CC (Bremer et al., 1996). In this study, EGLN1 (Egl-9 family hypoxia inducible factor 1; also known as PHD2) is a risky gene (HR = 1.79, 95% CI: 1.45-2.21) and it is closely related with angiogenesis by regulating the stability of HIF1 in non-CESC cancers (Chan and Giaccia, 2010;Lu et al., 2013). Targets of hsa-miR-532-5p such as runt-related transcription factor-3 (Peng et al., 2006;Kim et al., 2016), insulin-like growth factor-binding protein-5 (Rho et al., 2008;Lee et al., 2016), and von Hippel-Lindau (Kong et al., 2011(Kong et al., , 2014Lu et al., 2013) were proved to be associated with angiogenesis in many types of cancer. These results prompted that further experiments aimed to explore functional mechanisms of the transcriptomic signature could be focused on angiogenesis related pathways.

DISCUSSION
In this study, a novel transcriptomic signature for CESC patient prognosis was identified. Our proposed transcriptomic signature includes two non-coding RNAs (hsa-miR-532-5p and lncRNA DLEU1) and six mRNAs (RBM38, CXCL2, ZIC2, MTMR11, EGLN1, and TPST1). It is natural to wonder if there are any mRNA targets of the non-coding RNAs in the transcriptomic signature. Interestingly, CXCL2 was reported to be a direct target of hsa-miR-532-5p in hepatocellular carcinoma and this miRNA-gene interaction inhibited hepatocellular carcinoma cell proliferation and metastasis (Song et al., 2015). However, correlation analysis revealed that no significant correlation between hsa-miR-532-5p and CXCL2 was observed. Forty-six genes were predicted to be targets of lncRNA-DLEU1 by starBase v2.0 (Li et al., 2014), but none of them were in the transcriptomic signature. For identification of prognostic biomarkers, it is expected to construct signatures that include as many independent biomarkers as possible. Due to the independence among the biomarkers in the transcriptomic signature, it is hard to find possible biological interactions among them. To find possible mechanisms for the independent RNAs in the transcriptomic signature, it is wise to explore possible correlations between the independent RNAs and the remaining RNAs of CESC transcriptome. Thus, the transcriptomic signature could just provide the initial molecules rather than complete biological mechanisms for further experimental exploration. For the protective RNAs in the transcriptomic signature, hsa-miR-532-5p was shown to be involved in many cancers either as a tumor suppressor or an oncogenic-miRNA (Song et al., 2015;Zhang J. et al., 2018). However, the role of hsa-miR-532-5p in CESC remains unknown, and our results prompted that it may exert a tumor suppressor role in CESC due to its positive correlation with OS of CESC patients. RNA-binding protein 38 (RBM38) was originally recognized as an oncogene and it was frequently found to be amplified in prostate, ovarian and colorectal cancer, chronic lymphocytic leukemia, colon carcinoma, esophageal cancer, dog lymphomas and breast cancer (Ding et al., 2015). But recently, more and more studies suggested that RBM38 might act as a tumor suppressor (Feldstein et al., 2012;Xue et al., 2014). Ding et al. found that the association between the expression of RBM38 and cancer prognosis varied from cancers and databases (Ding et al., 2015). These studies suggested that the function of RBM38 might be multidimensional in cancers. However, no study was performed to investigate the possible roles of RBM38 in CESC, and our analysis prompted to assume that RBM38 may be a tumor suppressor in CESC. Zic family member 2 (ZIC2) was shown to be oncogenic in many cancers such as ovarian cancer (Marchini et al., 2012) and hepatocellular carcinoma (Lu et al., 2017). In cervical cancer, ZIC2 was rarely investigated. Chan et al. (2011) demonstrated that ZIC2 was up-regulated in CC cell lines and the up-regulation of ZIC2 may enhance the activity of the Hedgehog signaling pathway through nuclear retention of Gli1. Although ZIC2 may be a risk factor that was deduced from the results of Chan et al., the prognostic role of ZIC2 in CESC patients was not investigated. Our analysis showed that ZIC2 may be a protective factor, and further studies are needed to elaborate on the disputable roles of ZIC2 in CESC.
Among the risky RNAs in the transcriptomic signature, Tyrosylprotein sulfotransferase 1 (TPST1) is an enzyme responsible for catalysis of tyrosine sulfation. Previous studies revealed that TPST1 could sulfate the tyrosine of C-X-C motif chemokine receptor (CXCR4) (Seibert et al., 2008;Xu et al., 2013) and the tyrosine sulfation might contribute to nasopharyngeal carcinoma metastasis (Xu et al., 2013). However, expressions, functions, and mechanisms of TPST1 in CESC are not clear. Consistent with these previous studies, our analysis revealed that TPST1 was harmful for CESC prognosis. C-X-C motif chemokine ligand 2 (CXCL2) was demonstrated to be upregulated in many types of cancer such as chronic lymphocytic leukemia and bladder cancer. The up-regulation of CXCL2 could enhance the cell survival of lymphocytic leukemia (Burgess et al., 2012) and it was correlated with poor prognosis of bladder cancer (Zhang et al., 2016). Recently, Zhang et al. revealed that AKP1 could promote angiogenesis and tumor growth by up-regulating CXCL1, CXCL2, and CXCL8 in CC cells . Our results revealed that CXCL2 was a risk factor for CESC prognosis, which is in line with these previous experimental studies in non-CESC cancers. Egl-9 family hypoxia inducible factor 1 (EGLN1) was a key cellular oxygen sensor, which played important roles in tumor angiogenesis (Chan and Giaccia, 2010) and tumor metastasis (Kuchnio et al., 2015). Both tumor-promoting and suppressive roles of EGLN1 have been reported in different types of cancer (Chan and Giaccia, 2010). Although EGLN1 was shown to be low-expressed in advanced CC (Roszak et al., 2011;Kuchnio et al., 2015), in our analysis, the expression of EGLN1 was a risk factor for CESC patient survival. Thus, further functional mechanisms of EGLN1 in CESC cells should be carried out to illustrate the worse prognostic effect of EGLN1. Myotubularin related protein 11 (MTMR11) was rarely reported in cancer. The lncRNA DLEU1 played multiple roles in different cancers. Previous studies revealed that DLEU1 could promote progression of ovarian carcinoma  and gastric cancer (Li et al., 2017a), while Balas and Johnson (2018). showed that DLEU1 may be a tumor suppressor. There are versatile forms of interaction for lncRNA. DLEU1 could exert its functions by binding to proteins (Li et al., 2017a;Liu et al., 2018) or miRNAs .
Moreover, the up-regulated DLEU1 was shown to be associated with the survival of gastric cancer by promoting proliferation of gastric cancer cells (Li et al., 2017a). However, the potential roles of DLEU1 in CESC remain unclear, and our analysis revealed that DLEU1 may also exert as a tumor suppressor in CESC.
Some limitations of the current study should be noticed.
(1) For PH assumption test, it is difficult to estimate the type II error (i.e., the false negative rate). Thus, it is hard to choose a threshold of PH assumption test P-value in multiple testing. (2) The transcriptomic signature was identified based on TCGA datamining and further independent validations and mechanism explorations are in need. (3) Complete mechanisms could not be revealed by the transcriptomic signature itself. However, the transcriptomic signature may be of potential applications for clinical management of CESC patients. Specifically, we can measure the expressions of the transcriptomic signature in CESC patients and calculate the NPIs for the patients based on the measured expression values. Furthermore, CESC patients can be stratified as high-risk and low-risk based on their NPIs. Finally, for high-risk patients, more aggressive therapies such as high-dose chemoradiotherapy may be given. Moreover, hsa-miR-532-5p, RBM38, EGLN1, and DLEU1 were demonstrated as both oncogenes and tumor-suppressors in non-CESC cancers and their roles in CESC were unclear. This study provided experimental directions for these novel genes and miRNA.

CONCLUSION
We have identified a novel transcriptomic signature for CESC prognosis, including 1 miRNA, 1 lncRNA and 6 mRNAs. The transcriptomic signature was more comprehensive and predictive than the miRNA-only, the gene-only, and the previously identified signatures for CESC prognosis.

AUTHOR CONTRIBUTIONS
JX conceived and designed the analysis. JX, SG, YS, LG, and ZB performed the analysis. JX wrote the paper.

FUNDING
This work was supported by the Postdoctoral Science Foundation of Central South University.

ACKNOWLEDGMENTS
We thank The Cancer Genome Altas project made the genomic data of CESC available. All data obtained from TCGA keep to the rules for usage and publication of TCGA. We wish to thank the editors and reviewers for helpful comments. In addition, we also thank Zheng Yu from University of Washington for English advice.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene. 2018.00696/full#supplementary-material FIGURE S1 | Kaplan-Meier plots for age at diagnosis (A), tobacco usage (B), clinical stage (C), and lymphovascular invasion (D). FIGURE S2 | The first row shows the densities, the second the distribution function and the last row the local and tail area-based false discovery rates of miRNAs.
FIGURE S4 | The first row shows the densities, the second the distribution function and the last row the local and tail area-based false discovery rates of genes.
TABLE S1 | MCA of RNA-NPI and miRNA-NPI.     METHODS | Details on data retrieval and preprocessing.