A Novel Ten-Gene Signature Predicting Prognosis in Hepatocellular Carcinoma

Hepatocellular carcinoma (HCC) has a dismal long-term outcome. We aimed to construct a multi-gene model for prognosis prediction to inform HCC management. The cancer-specific differentially expressed genes (DEGs) were identified using RNA-seq data of paired tumor and normal tissue. A prognostic signature was built by LASSO regression analysis. Gene set enrichment analysis (GSEA) was performed to further understand the underlying molecular mechanisms. A 10-gene signature was constructed to stratify the TCGA and ICGC cohorts into high- and low-risk groups where prognosis was significantly worse in the high-risk group across cohorts (P < 0.001 for all). The 10-gene signature outperformed all previously reported models for both C-index and the AUCs for 1-, 3-, 5-year survival prediction (C-index, 0.84 vs 0.67 to 0.73; AUCs for 1-, 3- and 5-year OS, 0.84 vs 0.68 to 0.79, 0.81 to 0.68 to 0.80, and 0.85 vs 0.67 to 0.78, respectively). Multivariate Cox regression analysis revealed risk group and tumor stage to be independent predictors of survival in HCC. A nomogram incorporating tumor stage and signature-based risk group showed better performance for 1- and 3-year survival than for 5-year survival. GSEA revealed enrichment of pathways related to cell cycle regulation among high-risk samples and metabolic processes in the low-risk group. Our 10-gene model is robust for prognosis prediction and may help inform clinical management of HCC.


INTRODUCTION
Hepatocellular carcinoma (HCC) is the sixth most common cancer and the fourth leading cause of cancer mortality worldwide (Bray et al., 2018). While the vast majority of HCC patients are no longer eligible for curative therapy at the time of diagnosis, those who have undergone curative resection or transplantation still face a 70% risk of recurrence in 5 years (Bruix et al., 2014;Mazzaferro et al., 2014; European Association for the Study of the Liver. Electronic address: easloffice@easloffice.eu, and European Association for the Study of the Liver, 2018). The dismal prognosis of HCC can be attributed to a number of factors and there exists a demand for a model that can effectively identify patients at a high risk of recurrence/metastases so that clinical actions could be taken proactively. Conventional prognostic models for HCC mainly involve integration of clinicopathological factors such as tumor size, number of lesions, microvascular invasion, and cirrhosis, supplemented by serum levels of certain single markers such as α−fetoprotein (AFP) and des−gamma carboxy−prothrombin (DCP) (Marrero et al., 2009;Chan et al., 2018). However, their specificity and sensitivity do not support distinguishment of meaningful patterns of prognosis, especially with the substantial heterogeneity of HCC.
With the advent of massively parallel sequencing, molecular characterization has identified key driver pathways in HCC and several schemes for subtyping HCC have been proposed according to genomic, transcriptomic, microRNA (miRNA), and proteomic profiles (Boyault et al., 2007;Hoshida et al., 2009;Guichard et al., 2012;Schulze et al., 2015). Over the past decade, gene signatures based on aberrant transcriptional profiles have gained widespread attention for demonstrating great promise in prognosis prediction for HCC. For example, Long et al. (2018) established a four-gene signature that could effectively recognize HCC patients at a high risk of death. Liu et al. (2018) on the other hand, identified a four differentially methylated gene pairs to predict recurrence. Several other reports also described mRNA expression signatures comprising various numbers of genes using similar approaches (Ke et al., 2018;Wang et al., 2018;Zheng et al., 2018;Chen et al., 2019;Liu et al., 2019;Qiao et al., 2019;Liu et al., 2020). However, research into molecular signatures for prognosis prediction is still in its early stage. For example, there is no consensus regarding the number and the identity of the genes taken into account by the models reported so far. Most of them displayed modest predictive capacity and were only validated retrospectively. Therefore, a considerable amount of evidence is required for this subfield to evolve and mature.
In this work, we developed a 10-gene prognostic signature using LASSO Cox regression model which outperformed previously reported HCC prognostic models and proposed a nomogram combining tumor stage and signature-defined risk group. Gene Set Enrichment Analysis (GSEA) was performed to gain a better understanding of the underlying mechanisms of our model.

Data Collection
For the 331 HCC patients of The Cancer Genome Atlas (TCGA) database, tumoral RNA-seq data were downloaded from the Genomic Data Commons (GDC) data portal 1 (TCGA) and 49 of the tumors also had mRNA expression data of paired normal tissue samples. Clinical data and mutational data were downloaded using the University of Santa Cruz (UCSC) Xena 2 and cBioPortal 3 platforms. For the 213-patient HCC cohort of the Gene Expression Omnibus database (GEO), microarray data as well as clinical data were downloaded with an accession number of GSE14520 4 and all of the tumors had mRNA expression data of paired normal tissue samples (Roessler et al., 2010). For the LIRI-JP cohort containing 240 HCC patients, RNA-seq data and clinical data were downloaded from the International Cancer Genome Consortium (ICGC) portal 5 . All data were downloaded from the public databases hence it was not required to obtain additional ethical approval for our study.

Identification of Genes Differentially Expressed Between the Tumor and Normal Tissue Samples
The raw count data of the 49 paired tumor and normal samples of the TCGA cohort were normalized using the Trimmed Mean of M-values (TMM) method and comparative analysis was conducted using the paired t-test to identify differentially expressed genes (DEGs) (Robinson and Oshlack, 2010). Any gene with a false discovery rate (FDR) of <0.05 and a |log2FoldChange| higher than a cutoff as determined using the formula: mean[abs(log2FoldChange)] + 2 × sd[abs(log2FoldChange)], was regarded as a candidate DEG. Data of the GSE14520 cohort were analyzed in a similar fashion. Genes that were consistently up-regulated or down-regulated in tumor tissue in both cohorts were confirmed as DEGs.

Development of the 10-Gene Signature
The 331-patient TCGA cohort was used as a discovery cohort to develop a gene signature for prognosis prediction. Only patients with an overall survival (OS) longer than one month were included for survival analyses. Univariate Cox regression analysis was performed to identify DEGs that are significantly associated with survival (as defined by a P value of <0.05). LASSO Cox regression analysis was subsequently conducted to select a panel of genes that are related with OS in HCC patients using the glmnet R package (Friedman et al., 2010). In order to select the optimal lambda parameters and corresponding coefficients in LASSO Cox regression, we performed 200 iterations of 10-fold cross-validation with binomial deviance minimization criteria on the discovery cohort. The parameters lambda via 1-SE (standard error) criteria was selected to screen for the optimal gene set. To determine the optimal gene composition, best subsets regression was adopted for gene selection (Farkas and Heberger, 2005). Finally, a 10-gene signature for predicting prognosis in TCGA discovery cohort was constructed where risk score could be calculated using the following formula: where n is the number of prognostic genes, Exp i is the expression level of the gene, and Coef i is the estimated regression coefficient of the gene.

Validation of the 10-Gene Signature
In order to validate the predictive capacity of the signature, the TCGA discovery cohort was divided into a training dataset and a validation dataset using the createDataPartition function in the caret R package. The 240 HCC patients from the ICGC database served as a separate validation cohort. Briefly, risk score was calculated for each patient. The surv_cutpoint function in survminer R package was introduced to determine the optimal cut-off value for dissecting the population into a high-risk subset versus a low-risk subset, according to the correlation between expression levels of the signature genes and patients' OSs in the training dataset. Kaplan-Meier (KM) survival curves combined with a log-rank test were used to test the differences in prognosis between highand low-risk groups using the survival R package. Timedependent receiver operating characteristic (ROC) analysis and Concordance index (C-index) were adopted to evaluate the performance of the prognostic signature for predicting 1-, 3-, and 5-year survival.

Identification of Independent Prognostic Markers
To identify independent prognostic markers, the 10-gene signature may predict prognosis and other clinicopathological factors such as age, gender, race, body mass index (BMI), AFP, residual tumor, tumor mutational burden (TMB), tumor grade, TNM stage, and vascular tumor invasion were subjected to univariate and multivariate Cox regression analyses.

Construction and Assessment of a Predictive Nomogram
All independent prognostic factors as revealed by multivariate analyses were combined to derive a nomogram for predicting the probability of 1-, 3-, and 5-year survival of HCC. The performance of the nomogram was evaluated using Harrell's concordance index (C-index) and calibration curves. Decision curve analysis (DCA) was employed to compare the reliability of the nomogram with that of tumor stage or risk group alone.

Gene Set Enrichment Analysis
Gene set enrichment analysis was performed on the high-risk and low-risk subgroups of the TCGA discovery cohort using GSEA v.3.0. Molecular Signatures Database v.7.0 was searched to identify enriched pathways associated with survival in the two risk groups, respectively. Gene sets with a P < 0.05 and a FDR < 25% were considered significantly enriched.

Statistical Analyses
Statistical analyses were performed using R software v3.6.0 (R Foundation for Statistical Computing, Vienna, Austria). If not specified otherwise, tests were two-tailed, and a P-value of <0.05 was considered statistically significant.

DEG Identification
A flow chart illustrating the study process is presented in Figure 1. All data were obtained from surgical tissue samples and post-operative survival data were used for all subsequent survival analyses. By analyzing the RNA-seq data of the 49 paired tumor and normal tissue samples of the TCGA cohort, the expression of 12,301 genes were found to be significantly altered (270 up-regulated and 12,031 down-regulated) in the tumor tissue. Similarly, 235 genes were up-regulated and 10,008 were down-regulated in the tumor samples according to the GSE14520 microarray dataset. By taking the intersection of the two datasets, 5,970 genes (50 up-regulated and 5,920 downregulated) were confirmed as differentially expressed in tumor tissue (Supplementary Figure S1A).

Development of the 10-Gene Signature
The 331 primary HCC tumors of the TCGA discovery cohort were divided into a training set (N = 199) and an internal validation set (N = 132). The baseline characteristics were summarized in Table 1. Clinicopathological features were in large balanced between the training and internal validation populations. By inputting the 5,970 DEGs identified above, a total of 987 genes were demonstrated to be significantly associated with OS for the training set using univariate Cox regression analysis. A LASSO Cox regression model was subsequently utilized to select from the 987 candidates for genes highly associated with survival as indicated by a P < 0.05 in univariate analyses. Twenty-six genes were identified with the lambda was 0.1 and subsequently used to construct an all subset regression model. Finally, ten genes, YBX1, TTC26, SLC41A3, RCBTB2, PON1, MAPK7, INPP5B, CCDC134, C16orf71, and BMI1, were identified to be associated with prognosis in HCC patients using the regsubsets function in leaps R package. The LASSO deviance profiles, the coefficient profile plots, and the best subset regression model were shown in Supplementary Figures S1B-D. A risk score for prognosis prediction is calculated as follows: risk score = 0.921 * Expr YBX1 + 0.149 * Expr TTC26 + 0.732 * Expr SLC41A3 + − 0.631 * Expr RCBTB2 + − 0.227 * Ex pr PON1 + 0.539 * Expr MAPK7 + 1.28 * Expr INPP5B + 0.645 * Expr CCDC134 + −0.478 * Expr C16orf71 + 0.861 * Expr BMI1 , where Expr stands for the expression level of each gene. Using the training set's survival data and the surv_cutpoint function of the survminer R package, a risk score of 0.969 was defined as the optimal cut-off value to dissect the population into a high-risk group (N = 100) and a low-risk group (N = 99), and this cut-off value was used for all subsequent stratification.

Validation of the 10-Gene Signature
The prognostic capacity of the 10-gene signature was validated in the training set, the internal validation set, as well as the entire TCGA discovery cohort, where the high-risk group included 100, 66, and 165 patients, respectively, as defined by the cutoff value of 0.969. As demonstrated by the time-dependent ROC curves, the area under the curves (AUCs) for 1-year, 3-year, and 5-year OS were 0.869, 0.851, and 0.869 for the training set, 0.810, 0.730, and 0.719 for the internal validation set, and 0.838, 0.798, and 0.837 for the entire discovery cohort, respectively. Patients in the high-risk groups in the three datasets also displayed significantly worse OSs than the low-risk groups (P < 0.001 for the training set, P = 0.0016 for the internal validation set, and P < 0.001 for the entire TCGA discovery cohort) (Figures 2A-C and Supplementary Figures S2A-C).
Additionally, a 240-patient ICGC cohort was used as an external validation set, where 118 patients were regarded as high-risk using the same cut-off risk score. Likewise, the AUCs for the 1-year, 3-year, and 5-year OS were 0.704, 0.774, and 0.741 and the high-risk group's prognosis was significantly worse than that of the low-risk group (P < 0.001) ( Figure 2D and Supplementary Figure S2D). Sub-group analysis showed that the 10-gene signature remained a robust prognosis predictor across subgroups stratified according to disease stage in the discovery cohort (Supplementary Figure S3).
Since several multi-gene signatures have been previously proposed for predicting HCC prognosis, their performance was evaluated in parallel to that of our 10-gene signature using timedependent ROC curves and C-indexes. The 10-gene signature unequivocally outperformed the other eight models in terms of both C-index and the AUCs for 1-, 3-, 5-year OS prediction (Figure 3 and Supplementary Table S1).

Identification of Independent Prognostic Markers
Univariate and multivariate Cox regression analyses were conducted on the TCGA discovery and the ICGC validation cohorts to evaluate the 10-gene signature-defined risk group The bold values were less than 0.05.
Frontiers in Cell and Developmental Biology | www.frontiersin.org   Table 2 and Supplementary Figure S4.

Construction and Assessment of a Predictive Nomogram
As tumor stage and the 10-gene signature were demonstrated to be independent prognostic factors for HCC, a nomogram incorporating tumor stage and risk group was built to predict 1-year, 3-year, and 5-year OS ( Figure 4A). Calibration plots showed that the nomogram was better at predicting short-term survival (1-and 3-year) rather than long-term survival (5-year), as indicated by agreement between the predicted survival and actual survival ( Figure 4B). The C-index of the nomogram (0.732, 95% CI: 0.686-0.778) was higher than that of either tumor stage (0.610, 95% CI: 0.559-0.660) or risk group (0.692, 95% CI: 0.652-0.732) alone. According to DCA curves, the nomogram also offered the highest net benefit among the three factors examined (tumor stage, risk group, and nomogram) ( Figure 4C).

Gene Set Enrichment Analyses
In order to unravel the molecular mechanism underlying the 10-gene signature, GSEA analysis was conducted on the TCGA discovery cohort. As shown in Table 3 and Figure 5, the G2M checkpoint (P = 0.002, FDR = 0.090), mitotic spindle (P = 0.012, FDR = 0.075), E2F targets (P = 0.008, FDR = 0.068), and DNA repair (P = 0.048, FDR = 0.125) pathways/terms were significantly enriched in the tumors of the high-risk group as defined by the 10−gene signature. In the low-risk group, the significantly enriched pathways were mainly related to various metabolic processes (including bile acid and xenobiotic metabolism), coagulation, adipogenesis, and peroxisome function.

DISCUSSION
Hepatocellular carcinoma (HCC) represents a major health threat worldwide, especially in East Asia. Even after curative resection, the long-term outcomes for HCC patients remain dismal. Developing a prognostic model has thus gained increasing attention. In this study, a 10-gene signature for HCC prognosis prediction was generated and validated using the mRNA expression data from three publicly accessible HCC cohorts. Of the 10 genes, only BMI1 was positively associated with survival, while the others were all negative prognostic markers. According to risk scores calculated based on the expression profiles of the signature genes, we were able to effectively classify patients into high-risk and low-risk groups, where the lowrisk subset exhibited a significantly more favorable prognosis pattern than the high-risk group. This 10-gene signature also proved to be an independent prognosis factor for HCC survival. A nomogram combining both tumor stage and risk group was proposed, which proved to be a better predictor than tumor stage or risk group alone. Additionally, it was noteworthy that patients in the low-risk group were more closely associated with alterations in metabolic pathways, while the high-risk group were more enriched in cell proliferation-related pathways such as G2M checkpoint. YBX1 is a widely known oncogene implicated in multiple malignancies (Xu et al., 2017). Accumulating evidence has shown that aberrant YBX1 expression is closely associated with tumor progression, drug resistance, metastasis and poor prognosis in cancers (Imada et al., 2013;Davies et al., 2014;Kosnopfel et al., 2014;Wu et al., 2015). In HCC, the expression of YBX1 is activated by lncRNAs, which in turns regulates the PIK3CA pathway (Zhao et al., 2017). The TTC26 gene encodes an intraflagellar transport protein, which transports motilityrelated proteins into flagella (Ishikawa et al., 2014). Wang et al. (2018) built a six-gene signature, of which TTC26 was also a positive prognostic gene. However, the exact role of TTC26 in HCC remains unclear. SLC41A3 (Solute Carrier Family 41 Member 3) encodes a protein with cation transmembrane transporter activity that may contribute to Mg 2+ transportation (de Baaij et al., 2016). THPA (The Human Protein Altas) database showed that HCC patients had the lowest expression level of SLC41A3 across different cancer types, and high-expression group had significantly worse survival. Nevertheless, the role of SLC41A3 in HCC remains largely undefined. RCBTB2 (regulator of chromosome condensation and BTB domain containing protein 2) also known as CHC1L, has been proposed as a tumor suppressor gene in prostate cancer (Latil et al., 2002). An earlier study evaluated the relationship between RCBTB2 expression and carcinogenesis of multiple myeloma, and a low expression of RCBTB2 is linked to pathogenesis and progression of multiple myeloma (Legartova et al., 2010). However, RCBTB2 gene has not never been reported in association with HCC to date. PON1 is a member of the paraoxonase family, and it is an antioxidant defensive factor that is relevant for the pathogenesis of several inflammatory diseases (Mackness and Mackness, 2015;Borovkova et al., 2017). A growing body of evidence has suggested that PON1 could serve as an important clinical indicator of cancer progression for a number of cancers such as lung cancer and breast cancer (Bobin-Dubigeon et al., 2015;Aldonza et al., 2017). Data in THPA database showed that the expression of PON1 in HCC is significantly higher than that in other tumors, and its down-regulated expression has been implicated to be a poor indicator for survival in patients with HCC (Yu et al., 2018). MAPK7 (mitogen-activated protein kinase 7) encodes extracellular-regulated protein kinase 5 (ERK5). Zen et al. (2009) suggested that MAPK7 is a probable target of 17p11 amplification and that the ERK5 protein product of MAPK7 gene promotes the growth of HCC cells by regulating mitotic entry. INPP5B (inositol polyphosphate-5-phosphatase B) encodes an inositol polyphosphate-5-phosphatase and regulates calcium signaling by inactivating inositol phosphates. OCRL1 is a homologue of INPP5B, shared the same domain structure and substrate specificity (Lowe, 2005). As yet it is not clear whether INPP5B is related to HCC. MAPK cascades are critical signaling pathways involved in regulation of cellular processes such as growth, proliferation, differentiation, migration, invasion, and apoptosis (Dhillon et al., 2007). Previous studies showed that CCDC134 (coiled coil domain containing 134) acts as an inhibitor of Erk1/2 and JNK/SAPK pathways and its silencing promotes cell migration and invasion in cancers such as gastric cancer (Liang et al., 2005;Zhong et al., 2013). However, no such evidence has been found in HCC. BMI1 is recognized as one of the most commonly activated oncogenes in various tumor types, including prostate, colorectal and lung cancers (Kim et al., 2004;Vrzalikova et al., 2008;Ganaie et al., 2018). The over-expression of BMI1 correlates with therapy failure in breast, prostate, lung cancer and HCC patients (Glinsky, 2007;Vrzalikova et al., 2008;Wang et al., 2008). Recent studies demonstrated that increased expression of BMI1 resulted in therapy failure and indicated poor prognosis of HCC (Wang et al., 2008;Ruan et al., 2013).
While constructed on cohorts mainly comprising Caucasian HCC patients, our 10-gene signature was also shown to be a reliable predictor of prognosis among Asian patients in the ICGC cohort and the cut-off risk score for differentiating patient's risk of death, as trained using the TCGA cohort, could be directly applied to other populations. Compared to several existing multi-gene models, our signature also demonstrated better performance in distinguishing patients at high risk. This might be partly attributed to the fact that in our study, both model construction and validation were carried out using RNAseq data (although microarray data of the GEO database served as a supplementary source to confirm DEGs identification in the beginning), while most of the other signatures were developed and validated using data generated on different platforms, RNAseq for development and microarray for validation or vice versa. Another explanation could be that our DEGs selection was strictly based on paired tumor and normal tissue samples obtained from the same patients, which was obviously not the case in most other studies. However, our study do have some limitations. First, this was a retrospective analysis based on public datasets, therefore should be viewed as hypothesis generating rather than conclusive. Second, the validity of our signature could potentially be challenged by the heterogeneity of HCC due to sampling bias. Therefore, the 10-gene signature may warrant further prospective validation and the correlation between expression levels of these genes at protein level and patients' prognosis is also worth exploring.

CONCLUSION
Collectively, we established a robust 10-gene signature and a nomogram to predict OS of HCC patients, which may help recognize high-risk patients potentially benefiting from more aggressive treatment.

ACKNOWLEDGMENTS
We thank the patients and investigators who participated in TCGA, GEO, and ICGC for providing data.