Identification of a Novel Four-Gene Signature Correlated With the Prognosis of Patients With Hepatocellular Carcinoma: A Comprehensive Analysis

Purpose Hepatocellular carcinoma (HCC) is a common solid-tumor malignancy with high heterogeneity, and accurate prognostic prediction in HCC remains difficult. This analysis was performed to find a novel prognostic multigene signature. Methods The TCGA-LIHC dataset was analyzed for differentially coexpressed genes through weighted gene coexpression network analysis (WGCNA) and differential gene expression analysis. A protein-protein interaction (PPI) network and univariate Cox regression analysis of overall survival (OS) were utilized to identify their prognostic value. Next, we used least absolute shrinkage and selection operator (LASSO) Cox regression to establish a prognostic module. Subsequently, the ICGC-LIRI-JP dataset was applied for further validation. Based on this module, HCC cases were stratified into high-risk and low-risk groups, and differentially expressed genes (DEGs) were identified. Functional enrichment analyses of these DEGs were conducted. Finally, single-sample gene set enrichment analysis (ssGSEA) was performed to explore the correlation between the prognostic signature and immune status. Results A total of 393 differentially coexpressed genes were obtained. Forty differentially coexpressed hub genes were identified using the CytoHubba plugin, and 38 of them were closely correlated with OS. Afterward, we established the four-gene prognostic signature with an acceptable accuracy (area under the curve [AUC] of 1-year survival: 0.739). The ICGC-LIRI-JP dataset also supported the acceptable accuracy (AUC of 1-year survival:0.752). Compared with low-risk cohort, HCC cases in the high-risk cohort had shorter OS, higher tumor grades, and higher T stages. The risk scores of this signature still act as independent predictors of OS (P<0.001). Functional enrichment analyses suggest that it was mainly organelle fission and nuclear division that were enriched. Finally, ssGSEA revealed that this signature is strongly associated with the immune status of HCC patients. Conclusions The proposed prognostic signature of four differentially coexpressed hub genes has satisfactory prognostic ability, providing important insight into the prediction of HCC prognosis.


INTRODUCTION
It is estimated that nearly 42,810 new cases and 30,160 estimated deaths of hepatocellular carcinoma (HCC) will occur in 2020, leading to enormous socioeconomic pressure for HCC patients and their families (1). HCC accounts for 85%-90% of all primary liver cancer patients, and its occurrence is strongly associated with chronic hepatitis B virus (HBV) or hepatitis C virus (HCV) infection, alcohol consumption, and nonalcoholic steatohepatitis (2). HCC has high interpatient, intertumoral and intratumoral heterogeneity (3). Patients with localized HCC usually have poor survival (with a 5-year overall survival [OS] rate of 30%), and this rate is less than 5% for HCC patients with distant metastasis (4). Currently, due to the complicated etiologic factors and the high heterogeneity of HCC, it remains difficult to accurately predict the prognosis of HCC patients. Although there were some similar studies published previously, they usually required many genes in their gene signatures, which may cause some difficulties in real-world practice (5,6). Therefore, it is urgent to find the gene signature involved with less genes for the convenience of real-world practice.
With the rapid development of genome technology, bioinformatics analysis has been adopted for microarray datasets to further explore the underlying molecular mechanisms of diseases and detect disease-specific biomarkers (7). Weighted gene coexpression network analysis (WGCNA) is utilized to further understand gene coexpression networks and gene functions (8). WGCNA detects modules of closely correlated genes among samples to relate modules to external traits, providing significant insights into predicting possible functions of coexpressed genes (9). Additionally, differential gene expression analysis is often utilized in transcriptomic datasets to investigate potential biological and molecular mechanisms and quantify differences between the gene expression levels of experimental and control cohorts (10).
To increase the reliability of screening highly related genes, both methods mentioned above were used in our analysis. First, the RNA-Seq dataset and HCC clinical information were downloaded from The Cancer Genome Atlas (TCGA) database. Second, WGCNA and differential gene expression analysis were performed to obtain differentially coexpressed genes. Then, a protein-protein interaction (PPI) network was constructed, and 38 differential coexpression hub genes with prognostic value were detected. Afterward, we built a prognostic four-gene signature and verified it in the International Cancer Genome Consortium (ICGC) database. Ultimately, functional enrichment analysis was conducted to investigate the underlying biological mechanisms.

MATERIALS AND METHODS
The detailed process of data downloading, prognostic signature construction and external validation is presented in Figure 1. The details of each step are illustrated in the following subsections.

Datasets Downloaded From the TCGA and ICGC Databases
First, RNA-Seq and corresponding clinical data for liver hepatocellular carcinoma (LIHC) were obtained from the TCGA database (https://portal.gdc.cancer.gov/). A list of 424 samples was obtained, including 374 LIHC and 50 normal liver tissues, and RNA-seq count data on 19645 genes were obtained. The Illumina HiSeq 2000 platform was used to generate and annotate all data to a reference transcript set of the human hg38 gene standard track. The edgeR package tutorial suggested that genes with low read counts do not merit further analysis (11). Hence, genes with a count per million (CPM) <1 were omitted from this analysis. Next, the function rpkm in the edgeR package was adapted for further filtering. Consequently, 13,924 genes were acquired for subsequent analysis. Second, the RNA-Seq data and clinical data of HCC patients were acquired from the ICGC database (https://dcc.icgc.org/). A total of 260 HCC samples, which mainly originated from the Japanese population with HBV or HCV infection, were acquired (12). We chose the normalized read count values of the ICGC-LIRI-JP cohort. As a result, 22,913 genes were obtained for the next analysis.

Identification of Key Coexpression Modules Using WGCNA
The gene coexpression network of the TCGA-LIHC dataset was built through the WGCNA package (8). To build a scale-free network, a soft-power b = 7 (Figures 2A, B) was used in the TCGA-LIHC dataset. Next, the adjacency matrix was created according to the formula aij = |Sij| b (aij: adjacency matrix between gene i and gene j, Sij: similarity matrix made by Pearson's correlation coefficient of all gene pairs, as well as b: soft-power value). Subsequently, we converted this matrix into a topological overlap matrix (TOM) and the corresponding dissimilarity (1-TOM). The hierarchical clustering dendrogram of the 1-TOM matrix was established to aggregate the genes with similar expression patterns into the same coexpression module. Afterward, the module-trait relations between modules and external traits were analyzed to identify functional modules from the coexpression network. Hence, the modules with the largest correlation coefficients were regarded as modules that highly correlated with clinical traits. We chose the module that was positively associated with LIHC for our subsequent analysis.

Identification of Differentially Coexpressed Genes
The limma package is often used to perform differential gene expression analysis of gene expression profiles and RNA-Seq datasets (13). Here, we applied the limma package in the differential expression analysis of the TCGA-LIHC dataset to identify differentially expressed genes (DEGs) between LIHC and nontumorous tissues. To minimize the false discovery rate (FDR) to the greatest extent possible, we adjusted the P-value with the Benjamini-Hochberg (BH) method. The filtering criteria for DEGs were |logFC|>1 and adj. P <0.05. Afterward, we took the intersection of genes between DEGs and coexpressed genes to improve the reliability of screening closely related genes, and these differentially coexpressed genes were used for the next analysis.

PPI Network Construction and Hub Gene Identification
The PPI network of differentially coexpressed genes was built through the Search Tool for the Retrieval of Interacting Genes (STRING) database (14). Then, we established a visual network of molecular interactions with combined scores ≥0.7 using Cytoscape (15). In addition, the degree values of all nodes in the PPI network were calculated using the CytoHubba plugin (16).  The top 40 nodes with the highest degree scores were selected and regarded as hub genes associated with LIHC. The forty hub genes related to LIHC were displayed using the CytoHubba plug-in. In addition, we conducted gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses of the 40 hub genes to explore their biological functions. Adj. P values <0.05 were considered significant.

Survival Analysis of Hub Genes and the Correlation Network
To analyze the prognostic roles of the differentially coexpressed hub genes in LIHC, we performed univariate Cox regression analysis of OS using the survival package based on the TCGA-LIHC dataset. LIHC patients without follow-up information or a survival time=0 days were excluded from our analysis, and the other patients in the TCGA-LIHC dataset were classified into two groups considering the median expression levels of the differentially coexpressed hub genes. Log-rank P<0.01 was considered significant. Additionally, the correlation network of these differentially coexpressed hub genes was established through the igraph package. The filtering criterion was a cutoff >0.75.

Construction of the Gene Signature in the TCGA Database
To decrease the risk of overfitting to the greatest extent possible, we used least absolute shrinkage and selection operator (LASSO) Cox regression analysis to build the prognostic module of LIHC (17,18). The LASSO algorithm is widely utilized to select and shrink variables using the glmnet package. We used the expression matrix of the differentially coexpressed hub genes with prognostic value as the independent variable, while the OS and status of patients in the TCGA-LIHC dataset were used as the response variables. Then, we determined the penalty parameter (l) of this module using tenfold cross-validation following the minimum criteria, namely, the l value corresponding to the minimum partial likelihood deviance.

Nomogram and Validation of the Expression Patterns of the Gene Signature
We calculated the risk scores of all LIHC patients using the expression level of every gene and the corresponding regression coefficient. The following formula was used: score= e sum (every gene's expression × corresponding coefficient) . Then, LIHC patients were divided into high-and low-risk cohorts based on the median value of the risk score. Subsequently, we constructed a nomogram of the prognostic signature to predict the survival of LIHC patients. Furthermore, we built calibration curves and time-dependent receiver operating characteristic (ROC) curves to evaluate the discrimination and accuracy of the prognostic multigene signature. The GSE112790 dataset was used to validate the expression patterns of the genes in the signature between LIHC and nontumorous tissues.

Distribution and Prognostic Value of the Gene Signature
To analyze the prognostic value of the gene signature, we performed Kaplan-Meier survival analysis between the low-and high-risk groups using the survminer package based on the TCGA-LIHC and ICGC-LIRI-JP datasets. Additionally, to explore distribution in the low-and high-risk cohorts, we performed principal component analysis (PCA) and tdistributed stochastic neighbor embedding (t-SNE) on the TCGA-LIHC and ICGC-LIRI-JP datasets using the stats and Rtsne packages, respectively. To determine whether the risk score acts as an independent indicator of the prognosis of LIHC patients, we performed univariate and multivariate Cox regression analyses among all available variables using the TCGA-LIHC and ICGC-LIRI-JP datasets.

Differential Gene Expression Analysis and Functional Enrichment Analysis
To acquire the DEGs between the low-and high-risk groups, we performed differential gene expression analysis using the limma package in the TCGA-LIHC and ICGC-LIRI-JP datasets. The Pvalue was adjusted using the BH method. The filtering criteria for DEGs were |logFC|>2 and adj. P <0.05. Afterward, we conducted GO and KEGG pathway analyses of the DEGs between the lowand high-risk groups in the TCGA-LIHC and ICGC-LIRI-JP datasets. To further analyze the relationship between the risk score and immune status, we calculated the infiltrating scores of 16 immune cells and 13 immune-related functions or pathways using single-sample gene set enrichment analysis (ssGSEA) (19).

Identification of Key Coexpression Modules Using WGCNA
To find the pivotal module in LIHC, the gene coexpression network was established in the TCGA-LIHC dataset. A list of 11 modules was generated ( Figure 2C). Next, the heatmap revealed the correlations between the modules and clinical traits (normal and LIHC) in the TCGA-LIHC dataset ( Figure 2D). Furthermore, the yellow module of the TCGA-LIHC dataset positively correlated with LIHC tissues (r=0.57, P=1e-37) and was used for our next analysis.

Selection of Differentially Coexpressed Genes
The heatmap displayed the expression patterns of fifty upregulated and fifty downregulated genes in the TCGA-LIHC dataset ( Figure 3A). The volcano plot indicated that 2708 DEGs had a conspicuous dysregulation between LIHC and nontumorous tissues in the TCGA-LIHC dataset ( Figure 3B). The Venn diagram showed the intersection of coexpressed genes (Table S1) and DEGs (Table S2); namely, 393 differentially coexpressed genes were identified ( Figure 3C). Figure 4A displays the PPI network of the differentially coexpressed genes with 241 nodes and 4792 edges. Subsequently, we quantified the degree scores of all nodes in this PPI network through the CytoHubba plugin (Table S3) and chose the top 40 nodes as hub genes that are closely correlated with LIHC ( Figure 4B). In addition, GO analysis showed significant enrichment in the mitotic nuclear division, organelle fission and spindles terms ( Figure S1A). KEGG pathway analysis showed enrichment in the cell cycle and oocyte meiosis pathways ( Figure S1B).

Survival Analysis and Correlation Network of the Differentially Coexpressed Hub Genes
Univariate Cox regression analysis of the differentially coexpressed hub genes demonstrated that 38 hub genes were closely associated with the survival of LIHC patients ( Figure 5A). The heatmap revealed that the 38 hub genes with prognostic value were significantly overexpressed in LIHC tissues ( Figure  5B). Additionally, the correlation network suggested that the differentially coexpressed hub genes closely interact with each other ( Figure 5C).

Construction of the Gene Signature and Nomogram in the TCGA Database
We used the LASSO Cox regression module to build a prognostic signature based on the expression matrix of the 38 differentially  . Then, we established a nomogram to predict the 1-, 2-, and 3-year OS probability of LIHC patients ( Figure 6C). The calibration curves of 1-, 2-, and 3-year OS probability showed satisfactory calibration of this nomogram (Figures 6D-F). Moreover, based on GSE112790, we confirmed that CDCA8, KIF20A, KIF2C and CEP55 were significantly overexpressed in LIHC tissues compared with nontumorous tissues ( Figure S2).  Figure 7A). Afterward, all LIHC patients were divided into a low-risk cohort (n=183) and a high-risk cohort (n=182) based on the median risk score ( Figure 7B). The high-risk cohort in the TCGA-LIHC dataset  had more deaths ( Figure 7C), a poorer tumor grade, a higher clinical stage and a higher T stage (Table 1). Consistently, Kaplan-Meier survival analysis showed that LIHC patients in the high-risk cohort experienced shorter survival than those in the low-risk cohort ( Figure 7D, P=1.14e-4). In the PCA of the TCGA-LIHC dataset, the first principal component (PC1) could explain 88.6% of total variance, and the PC1 scores were negatively correlated with the risk scores of patients ( Figure 7E), while the second principal component (PC2) could explain 5.4% total variance ( Figure S3A). Moreover, PCA and t-SNE analysis revealed that most LIHC patients in the high-and low-risk cohorts were distributed in two different directions ( Figure 7F).

Verification of the Four-Gene Signature Module in the ICGC Database
To validate the robustness of the four-gene signature module from the TCGA-LIHC dataset, we chose the ICGC-LIRI-JP dataset for further verification. First, we stratified LIHC patients from the ICGC-LIRI-JP dataset into high-risk and low-risk cohorts according to the median value of the risk score, which was calculated using the formula mentioned above. Consistent with the outcomes from the TCGA-LIHC dataset, the four-gene signature had an excellent AUC ( Figure  8A, 1-year survival: 0.752; 2-year survival: 0.751; and 3-year survival: 0.782). Moreover, the high-risk group correlated with a higher rate of mortality ( Figures 8B, C). Additionally, patients from the high-risk cohort experienced significantly shorter survival than those in the low-risk cohort ( Figure 8D, P=1.24e-3). In the PCA of the ICGC-LIRI-JP dataset, the PC1 could explain 79% of total variance, and the PC1 scores were positively correlated with the risk scores of patients ( Figure 8E), whereas the PC2 could explain 13% total variance ( Figure S3B). In addition, t-SNE analysis validated that most patients in the high-and low-risk cohorts were distributed in two different directions ( Figure 8F). In general, these outcomes in the ICGC-LIRI-JP dataset were similar to those in the TCGA-LIHC dataset.

Independent Prognostic Role of the Four-Gene Signature
To determine whether the risk score plays an independent prognostic role, we performed univariate and multivariate Cox regression analyses of the survival of LIHC patients. The univariate Cox regression analysis indicated that a higher risk score was closely correlated with worse survival in LIHC patients using the TCGA-LIHC ( Figure 9A

Differential Gene Expression Analysis and Functional Enrichment Analysis
Differential gene expression analyses were conducted in the TCGA-LIHC and ICGC-LIRI-JP datasets, and 499 and 185 DEGs between the high-and low-risk groups were obtained (Tables S4, S5), respectively. To explore the biological functions of the DEGs in the high-and low-risk groups, we again performed GO enrichment and KEGG pathway analyses. In the TCGA-LIHC dataset, GO enrichment analysis indicated significant enrichment in the organelle fission, nuclear division, chromosomal region and ATPase activity terms ( Figure 10A). GO enrichment analysis of the ICGC-LIRI-JP dataset showed similar outcomes to the TCGA-LIHC dataset ( Figure 10B). Additionally, KEGG pathway analysis of the TCGA-LIHC dataset showed significant enrichment in the cell cycle, oocyte meiosis and progesterone-medicated oocyte maturation pathways ( Figure 10C). In the ICGC-LIRI-JP dataset, KEGG pathway analysis also demonstrated the analogical outcomes of the TCGA-LIHC dataset ( Figure 10D).
To explore the correlation between the risk score and immune status, we calculated the infiltrating scores of 16 immune cells and 13 immune-related functions or pathways using ssGSEA. The scores of activated dendritic cells (aDCs), mast cells and follicular helper cells (Tfhs) were notably different between the high-and low-risk groups in the TCGA-LIHC dataset (all adj. P<0.001, Figure 11A). In the TCGA-LIHC dataset, the scores of cytolytic activity, type I interferon (IFN) response and type II  IFN response were obviously higher in the low-risk group, while the score of MHC class I was lower in the low-risk group (all adj. P<0.01, Figure 11B). Moreover, the ICGC-LIRI-JP dataset showed that aDCs, mast cells, MHC class I and type II IFN responses were significantly different between the two risk cohorts ( Figures 11C, D), which is consistent with the results of the TCGA-LIHC dataset.

DISCUSSION
As a common solid-tumor malignancy with high mortality, HCC has brought great socioeconomic pressure to HCC patients and their families. Owing to the complex etiological factors and high heterogeneity of HCC, it remains difficult to accurately predict the survival of HCC patients. Thus, it is urgent to detect effective prognostic biomarkers to monitor the progression and predict the prognosis of HCC patients. In this study, 393 differentially coexpressed genes were obtained through WGCNA and differential gene expression analysis. Then, these genes were used to construct a PPI network, and 38 hub genes were observed to be closely correlated with OS. Subsequently, we established a novel four-gene prognostic signature in the TCGA-LIHC dataset and built a nomogram based on this novel module, which showed acceptable accuracy and calibration. Afterward, the four-gene signature module was verified in the TCGA-LIHC dataset using the LASSO algorithm.
To improve the robustness of the signature, we used the ICGC-LIRI-JP dataset for further validation. The four-gene signature was still found to have independent prognostic value. Finally, ssGSEA revealed significant differences in aDCs, mast cells, MHC class I and type II IFN responses between the two risk cohorts. Several prior analyses have also shown that certain gene signatures may predict patient survival (20)(21)(22)(23)(24)(25)(26); however, our study has some differences and/or advantages compared with similar analyses. First, the gene signatures built in previous studies require many genes (20)(21)(22)(23), which possibly leads to some difficulties in real-world practice. Our novel signature requires only 4 genes, and the predictive ability of our signature is acceptable, which increases the feasibility of the use of our signature in real-world practice. Second, in our study, we simultaneously used WGCNA, differential gene expression analysis, PPI network construction, univariate Cox regression analysis and LASSO Cox regression analysis, and these methods were rarely used together in one study for the construction of a prognostic module of HCC, which is a novel point of our study. Third, some previous studies did not verify their gene signature (24)(25)(26) using other datasets; however, we used two datasets (the ICGC-LIRI-JP dataset and GSE112790) for external validation, which is helpful to enhance the reliability of our findings. Interestingly, we observed that most differentially coexpressed hub genes (38/40) were significantly associated with survival time according to the results of the univariate Cox regression analysis. This finding suggests the possibility of establishing a prognostic signature using these differentially coexpressed hub genes.
The prognostic module proposed in our analysis was composed of CDCA8, KIF20A, KIF2C and CEP55, all of which are often reported as being dysregulated in HCC tissues (27)(28)(29)(30). First, cell division cycle associated 8 (CDCA8) is regarded as a significant oncogene that is involved in the pathological development of various cancers, including HCC (27) and pancreatic ductal adenocarcinoma (31). Wu et al. reported that CDCA8 is obviously overexpressed at the mRNA and protein levels in HCC tissues, and the authors validated this finding at the mRNA level using realtime quantitative PCR (RT-qPCR) (32). Similarly, CDCA8 is closely correlated with cell division and growth in HCC, and CDCA8 is strongly associated with the pathological grades and T stages of HCC (33). Second, kinesin family member 20A (KIF20A) and KIF2C are the members of the kinesin superfamily proteins, both of which are closely regulated by E2F1. The depletion of KIF20A or KIF2C results in deforming microtubule structures, influencing cell motility and inhibiting cancer metastasis (34). A recent study suggested that KIF20A and KIF2C are obviously upregulated in HCC tissues, and higher expression of KIF20A and KIF2C correlates with worse survival (including OS and disease-free survival [DFS]), higher tumor stages and poorer pathological grades (35). Moreover, by conducting basic experiments, this study also showed that the downregulation of KIF20A and KIF2C can effectively inhibit the proliferation of HCC cells and increase G1 arrest in HCC cells (35). In addition, Lu et al. observed that high KIF20A expression was associated with more high-grade   (36). KIF2C contributed to cell proliferation, adverse invasion, and metastasis in vitro and in vivo by performing both gain-and loss-of-function assays, and the authors further suggested that KIF2C plays an important role in mediating the crosstalk between Wnt/b-catenin and mammalian target of rapamycin complex 1 (mTORC1) signaling in the pathogenesis of HCC (37). Third, centrosomal protein 55 (CEP55) contributes to the carcinogenesis of many cancers and regulates PI3K/AKT signaling (38). Yang et al. showed that CEP55 is upregulated in HCC tissues, and CEP55 overexpression correlates with poor tumor grades and high T stages; the authors also showed that CEP55 acts as an independent predictor of the OS of HCC patients using multivariate analysis (39). In addition, CEP55 was found to promote cell migration and adverse invasion via the regulation of the JAK2-STAT3-MMP signaling pathway in HCC, and the knockdown of CEP55 strongly suppressed HCC cell migration and invasion (40). Several limitations to our analysis exist. 1) The TCGA-LIHC dataset provides multiple HCC tissue samples, and the ICGC-LIRI-JP dataset and GSE112790 were applied for external validation. However, these datasets were obtained from public databases, and additional real-world datasets are required to validate the clinical utility of the four-gene prognostic signature.
2) Although we utilized comprehensive bioinformatics approaches to construct and validate this prognostic signature in HCC, it may not be very accurate for HCC patients with different grades and stages. 3) We did not verify the correlation between the risk score and immune status by conducting basic experiments, which is a significant issue that deserves further investigation in the future.

CONCLUSION
This comprehensive analysis proposes a novel prognostic signature of four differentially coexpressed hub genes that has satisfactory prognostic value. This model was an independent predictor of OS in the TCGA-LIHC and ICGC-LIRI-JP datasets, providing insight into the prediction of HCC prognosis. Nevertheless, additional studies are required to further explore the underlying mechanisms of these differentially coexpressed hub genes and tumor immunity.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
ZM had full access to all of the data in the manuscript and takes responsibility for the integrity of the data and the accuracy of the data analysis. Concept and design: All authors. Acquisition, analysis, and interpretation of data: ZM. Drafting of the manuscript: ZM. Critical revision of the manuscript for important intellectual content: All authors. Statistical analysis: All authors. Supervision: All authors. All authors contributed to the article and approved the submitted version.

FUNDING
This study is supported by the Public Welfare Technology Application Research Program of Huzhou (No. 2019GY35, 2019GY01) and Young Talents Project of Huzhou Central Hospital (NO. 2020YC09), without the involvement of commercial entities. The funder had no role in the design or performance of the study; the collection, management, analysis, and interpretation of the data; the preparation, review, and approval of the manuscript; or the decision to submit the manuscript for publication.