Construction and Validation of a Reliable Six-Gene Prognostic Signature Based on the TP53 Alteration for Hepatocellular Carcinoma

Background The high mutation rate of TP53 in hepatocellular carcinoma (HCC) makes it an attractive potential therapeutic target. However, the mechanism by which TP53 mutation affects the prognosis of HCC is not fully understood. Material and Approach This study downloaded a gene expression profile and clinical-related information from The Cancer Genome Atlas (TCGA) database and the international genome consortium (ICGC) database. We used Gene Set Enrichment Analysis (GSEA) to determine the difference in gene expression patterns between HCC samples with wild-type TP53 (n=258) and mutant TP53 (n=116) in the TCGA cohort. We screened prognosis-related genes by univariate Cox regression analysis and Kaplan–Meier (KM) survival analysis. We constructed a six-gene prognostic signature in the TCGA training group (n=184) by Lasso and multivariate Cox regression analysis. To assess the predictive capability and applicability of the signature in HCC, we conducted internal validation, external validation, integrated analysis and subgroup analysis. Results A prognostic signature consisting of six genes (EIF2S1, SEC61A1, CDC42EP2, SRM, GRM8, and TBCD) showed good performance in predicting the prognosis of HCC. The area under the curve (AUC) values of the ROC curve of 1-, 2-, and 3-year survival of the model were all greater than 0.7 in each independent cohort (internal testing cohort, n = 181; TCGA cohort, n = 365; ICGC cohort, n = 229; whole cohort, n = 594; subgroup, n = 9). Importantly, by gene set variation analysis (GSVA) and the single sample gene set enrichment analysis (ssGSEA) method, we found three possible causes that may lead to poor prognosis of HCC: high proliferative activity, low metabolic activity and immunosuppression. Conclusion Our study provides a reliable method for the prognostic risk assessment of HCC and has great potential for clinical transformation.


Conclusion:
Our study provides a reliable method for the prognostic risk assessment of HCC and has great potential for clinical transformation.
Keywords: hepatocellular carcinoma, TP53, prognostic, signature, biomarker BACKGROUND Hepatocellular carcinoma (HCC) is a major cause of cancer mortality due to its high incidence rate, high recurrence, and limited molecular targeted therapeutic options (1,2). Thus, there is an urgent need to discover novel biomarkers and design novel therapeutic strategies for HCC.
TP53 mutations occur in almost every type of cancer at rates from 38%-50%. Additionally, it has been shown that TP53 mutations were more frequent in cancer patients with lower survival rates among all cancer types studied (3). Wild-type TP53 monitors abnormal activities in cells, senses cell pressure or damage, and prevents the proliferation of damaged cells (4). However, under TP53 mutation, cells exhibiting DNA damage are capable of escaping apoptosis and transforming into carcinoma cells. In addition, TP53 mutant proteins lose their wild-type function and accumulate in the nucleus, a significant hallmark of malignant tumors (5). A large sample study of 10225 cancer patients found that TP53 mutations were more frequent in cancer patients with lower survival rates among all 32 cancer types studied (6). Professor Donhower's (6) study also found that in most TP53 mutant tumors, oncogene amplification increased, and tumor suppressor genes were deeply deleted. The expression patterns of RNA, miRNA and protein in TP53-mutated tumors are different from those in nonmutated tumors. The expression of cell cycle progression genes and proteins in TP53-mutated tumors is enhanced.
The TP53 mutation process is the most common mutation process in HCC. This gene is critical for maintaining the stable properties of the genome. Its loss of function can lead to a centrosome amplification process, aneuploid cell proliferation process and chromosome instability (7,8). Some related reports showed that the TP53 mutation process was not only related to the prognosis of HCC (9) but also correlated with serum alphafetoprotein (AFP), clinical stage, vascular invasion, tumor differentiation and Child-Pugh grade (10)(11)(12)(13). Therefore, TP53 mutation has a profound impact on the genomic structure, expression and clinical prospects of HCC. Understanding the effect of TP53 on the pathogenesis of HCC is critical to develop more effective treatments for HCC.
The more we learn about TP53, the more we can understand the basic biology of HCC and develop better treatments. In this study, we speculated that the change in the RNA expression pattern caused by TP53 mutation may be an important reason for the difference in prognosis. Based on this hypothesis, we have successfully developed an approach capable of predicting accurate HCC prognosis. Considering the accuracy and universal applicability of the model, the six genes included may be likely targets to treat HCC.

Research Object
The clinical and transcriptome (fragments per kilobase of transcript per million (FPKM)) data of 374 HCC samples were acquired from The Cancer Genome Atlas (TCGA-LIHC) website (https://portal.gdc.cancer.gov/projects/TCGA-LIHC). Among them, there were 116 cases of TP53 mutation and 258 cases of wild-type TP53, and a detailed list of TP53-mutated samples was obtained from the cBioPortal website (https://www.cbioportal. org/). A total of 365 cases had complete prognosis information. We acquired the gene expression profile and clinic-related data of the LIRI-JP dataset from the International Genome Consortium (ICGC) database (https://dcc.icgc.org/releases/ current/Projects/LIRI-JP) for external validation (n=229). TCGA and ICGC were all based on the Illumina HiSeq platform. The work here fully complied with the publication guidelines of TCGA and ICGC. Our research was based on public databases; therefore, further approval from the local ethics committee was not needed. The detailed clinical information of the TCGA and ICGC cohorts is shown in Table 1.

Screening of Prognosis-Related Genes
We screened prognosis-related genes by univariate Cox regression analysis and Kaplan-Meier (KM) survival analysis.
The genes with P values less than 0.05 obtained by the two methods were used for subsequent study.

Construction and Validation of Prognostic Model
To improve the generalization ability of the established prognostic model, 365 HCC cases with complete prognostic information were randomly divided into two independent cohorts: a training cohort (n = 184) and a validation cohort (n = 181). The aforementioned process was implemented using the R package 'Caret'. In the training cohort (n = 184), we used least absolute shrinkage and selection operator (LASSO) penalty Cox regression analysis to further select prognostic genes. The LASSO algorithm with penalty parameter tuning performed via 10-fold cross-validation was applied to exclude genes that may be highly correlated with other genes. We performed 1000 10-fold cross-validations of data sets and selected genes with more than 900 repetitions (14,15). A subset of genes was determined by shrinking the regression coefficient using a penalty proportional to their size. The genes with nonzero regression coefficients were retained for subsequent multivariate Cox regression analyses (16,17). Finally, the regression coefficient obtained by multivariate Cox regression analysis was multiplied by the expression level of each gene to construct a prognostic model. All subjects were divided into two risk groups according to the median risk of the training cohort, and individuals with a risk score higher or lower than the median risk were divided into high-risk and low-risk groups, respectively. The Kaplan-Meier (KM) survival curve and the receiver operating characteristic (ROC) curve were used to assess the predictive ability of the model. To evaluate whether the prognostic model is independent of the traditional clinical features, we conducted univariate and multivariate Cox regression analyses of the prognostic model and the traditional clinical features. For internal and external validation, the testing cohort (n = 181), TCGA cohort (n = 365), external validation cohort (ICGC, n = 229), whole cohort (all included samples, n = 594) and subgroup (n = 9) survival analysis were adopted to validate the predictive capability and applicability exhibited by the prognostication signal in HCC. LASSO regression was implemented using the 'glmnet' R package. The survival curve was generated using Kaplan-Meier survival analysis and visualized using the R package 'survminer', and data were analyzed using the log-rank test. The ROC curves were drawn, and the corresponding AUC values were calculated using the R package 'timeROC'.

Gene Set Variation Analysis
We first used the 'Limma' R package to identify genes with differential expression (DEGs) between the high-risk group and the low-risk group in the TCGA and ICGC cohorts. Adjusted P value < 0.05 and absolute value of fold change (FC) >1 were suggested to indicate statistical significance. Then, we performed gene set variation analysis (GSVA) on the DEGs to identify prognosis-associated signaling pathways associated with the signature using the 'GSVA' R package. For GSVA, we chose Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways as the reference and adj P value < 0.05 as the cutoff to screen significantly altered pathways.

Correlation Analysis Between Signature and Immunity
We conducted single sample gene set enrichment analysis (ssGSEA, 29 immune-related gene sets representing immune cell type, function, and pathway) to quantify the activity or enrichment levels of immune cells, functions, or pathways in the high-and lowrisk samples from TCGA and ICGC, respectively. The normalized enrichment score (NES) calculated from ssGSEA was calculated using the 'GSVA' R package. We used independent-samples t tests to explore the differences in immune infiltration levels and immune function between the high-and low-risk groups, and P < 0.05 was suggested to indicate statistical significance.

Gene Set Enrichment Analysis of TP53 in the TCGA Database
A brief workflow for this study is shown in Figure 1. Compared to the unaltered group, the patients with TP53 mutation had both significantly poorer overall survival and disease-free survival (Figures 2A, B). We carried out GSEA of HCC samples from wild-type TP53 (n=258) and mutant TP53 (n = 116) samples. The results showed that 7 of the hallmark gene sets (n = 50) were significantly enriched in TP53-mutated HCC samples (Table 2, Figure 2C). We extracted 1048 genes from these 7 gene sets for subsequent analysis.

Identification of Prognosis-Related Genes
By univariate Cox regression analysis and Kaplan-Meier (KM) survival analysis (patients were classified into low and high expression groups based on median gene expression data), a total of 51 genes were screened out, of which 49 genes were unfavorable to prognosis (HR>1) and 2 genes were favorable to prognosis (HR<1) ( Table 3).

Construction of a Six-Gene Prognostication Signal in the Training Cohort
A total of 184 HCC patients were enrolled in the training set. The flowchart of the prognosis-scoring model construction process is shown in Figure 3A. The aforementioned 51 genes were further reduced by Lasso penalty Cox regression analysis, and we subsampled the dataset 1000 times with 10-fold cross-validation. After that, 41 genes with zero Lasso regression coefficients were excluded with the optimal value of log l (-3.2), and 10 genes with nonzero Lasso regression coefficients were included in the multivariate Cox regression analysis. A novel risk score was calculated by multiplying the gene expression of each gene and its corresponding coefficient, which was obtained by multivariate Cox regression analysis. Risk score= EIF2S1*0.1055 + SEC61A1* 0.0082 + CDC42EP2*0.2127 + SRM*0.0155 + GRM8*0.1344 + TBCD*0.0600. The patients were grouped into high-risk or lowrisk categories using the median risk score of the training series as the cutoff point (0.81). There was a significant difference in RNA expression level between tumor (n = 374) and normal tissues (n = 50) of the 6 genes in the signature ( Figure 3B). Higher levels of the six mRNAs were associated with decreased overall survival (OS) in the TCGA cohort (n = 365) ( Figure 3C). The OS of HCC patients in the cohort exhibiting high risk was significantly lower than that of the cohort exhibiting low risk ( Figure 4A, P < 0.001). ROC curve analysis demonstrated the predictive ability of the risk score for 1-, 2and 3-year OS, with areas under the curve (AUCs) of 0.740, 0.757 and 0.756, respectively ( Figure 4B). Univariate and multivariate Cox regression analyses showed that only the risk score of the 6-gene signature was an independent prognostic element ( Figure 4C).

Internal Validation of the Prognostic Signature in the Testing and Entire TCGA Cohort
An identical risk score formula and cutoff value determined from the training set were used to analyze the testing cohort. Consistent with the results in the training set, the KM curves of the testing sets showed that the cohort exhibiting high risk had a worse prognosis than the low-risk group ( Figure 4D). The AUC values for the signature predicting OS at 1, 2 and 3 years were 0.770, 0.736 and 0.710, respectively ( Figure 4E), indicating good prediction accuracy. From the results of both univariate and multivariate Cox regression analyses, the risk score was found to be an independent poor prognostic indicator of HCC ( Figure 4F). To validate the accuracy of the risk model, we analyzed the model in the entire TCGA cohort, and the results were in line with the above-described results (Figures 4G-I).

External Validation of the Prognostic Signature in the ICGC Cohort
For the in-depth assessment of the reliability of the prognostication model, an external dataset from the ICGC database was employed to verify the six-gene signature. The risk score was determined for each case according to the six-gene signature, and the 229 patients were divided into a high-or low-risk cohort based on their risk score. The Kaplan-Meier plot indicated that patients in the cohort exhibiting high risk had significantly shorter overall survival than those in the low-risk group ( Figure 5A). The AUCs at 1, 2 and 3 years were 0.835, 0.786, and 0.809, respectively, indicating that the risk score played a significant role in the prediction of prognosis ( Figure 5B). As shown in Figure 5C, the six-gene expression level was increased significantly in the high-risk group. The risk of mortality tended to rise along with the risk score ( Figures 5D, E). Both univariate and multivariate analyses suggested that the risk score could be used as an independent prognostic indicator ( Figures 5F, G).

Gene Set Variation Analysis Between Different Risk Groups
We identified the differentially expressed genes (DEGs) between the high-risk and low-risk groups in the ICGC and TCGA   Figures 6A, C). Subsequently, we made use of these DEGs to perform gene set variation analysis (GSVA). KEGG pathway activities were scored per sample by GSVA between the different risk groups. We found a significant decrease in metabolism-related signaling pathway GSVA scores in the cohort exhibiting high risk ( Tables 4, 5 and Figures 6B, D). Furthermore, oocyte meiosis-associated genes exhibited higher enrichment scores in the cohort demonstrating high risk ( Tables  4, 5 and Figures 6B, D).

Differences in the Immune Landscape Between Different Risk Groups
We attempted to explore cases involving differences in prognosis in different risk groups from the perspective of the immune microenvironment. We used the ssGSEA score to quantify the activity or enrichment levels of immune cells and functions in the HCC samples ( Figures 7A, D). Similar results were obtained from two independent cohorts. In terms of immune cell infiltration, the infiltration abundance of aDCs, iDCs,    macrophages, NK cells, and Tregs in the cohort exhibiting high risk was significantly higher than that in the low-risk group (Figures 7B, E). In terms of immune function, the results revealed that the expression of immune checkpoints and MHC class I-related gene sets was upregulated in the high-risk group, while the expression of type I and II IFN response-related gene sets was upregulated in the cohort exhibiting low risk ( Figures  7C, F). We also detected the expression levels of PD1 (PDCD1), PDL1, TIGIT, CTLA4 and LAG3 in the high-risk and low-risk groups. This study reported that the risk score positively correlated with the expression levels of PD1 (PDCD1), PDL1, TIGIT, CTLA4 and LAG3. The expression levels of these five common immune checkpoints were all upregulated in the highrisk cohort (Figures 8A-C).

Integrated Analysis of the Prognostication Signal in the Whole Group
A total of 594 cases were covered in this pooled analysis.
The results were consistent with the conclusion from a previous study. In terms of OS, patients exhibiting high risk had a poorer OS than patients exhibiting low risk ( Figure 9A). survival rates of patients predicted by the risk score were 0.761, 0.750, and 0.772, respectively ( Figure 9B). Univariate and multivariate Cox regression analyses revealed that a high risk score indicated poor clinical outcome in HCC patients ( Figure  9C). In addition, TP53 and TNM stage were significant prognostic elements ( Figures 9D, F), and a positive correlation was found between risk score and TNM stage ( Figure 9E). Patients with TP53 mutations had higher risk scores ( Figure 9G).

Subgroup Survival Analysis
To validate the general applicability of the prognostication signal in HCC, we grouped the 594 patients according to their clinical characteristics. In each subgroup, cases exhibiting high risk were compared with those exhibiting low risk. The prognostication of cases with high risk was noticeably worse than that of cases with low risk in each subgroup ( Figures 10A-D). These results confirmed that the risk score had good stratification ability for the prognosis of HCC. A B D C FIGURE 6 | Gene Set Variation Analysis between high-and low risk groups in the TCGA and ICGC cohort (A) Differentially expressed genes (DEGs) between highand low-risk groups in the TCGA cohort (B) Gene Set Variation Analysis between high-and low risk groups in the TCGA cohort (C) Differentially expressed genes (DEGs) between high-and low-risk groups in the ICGC cohort (D) Gene Set Variation Analysis between high-and low risk groups in the ICGC cohort.

DISCUSSION
A majority of HCC patients are at the medium and advanced phases when diagnosed because there are no obvious symptoms in the early stage of HCC (18). Although many therapeutic strategies have been attempted, such as radiofrequency ablation, surgical resection and liver transplantation, the prognostication of HCC cases receiving these treatments is still unsatisfactory (19). Elucidating the molecular mechanism and process of HCC is very important for identifying new therapeutic targets and improving the clinical outcomes of HCC patients.
At present, the traditional methods for predicting prognosis based on a single biomarker or clinical data lack systematic evaluation, resulting in a lack of high sensitivity and specificity. Due to the complex molecular regulation mechanism of HCC, traditional clinical pathological staging cannot fully reflect the heterogeneity of tumors, and the ability to predict prognosis is poor. Therefore, the construction of a gene sequencing prediction model based on big data has great potential for clinical transformation and has become an urgent need to improve clinical efficacy.
The TP53 mutation process is one of the most common mutations in HCC and is considered to be the main driving factor of HCC (8,18). This gene plays a vital role in maintaining genomic stability, and its loss of function can lead to centrosome amplification, aneuploid cell proliferation and chromosome The differences of immune cell infiltration between high-and low risk groups in the TCGA cohort (C) The differences of immune function between high-and low risk groups in the TCGA cohort (D) The heatmap of immune cell infiltration and immune function in the ICGC cohort (E) The differences of immune cell infiltration between high-and low risk groups in the ICGC cohort (F) The differences of immune function between high-and low risk groups in the ICGC cohort. ***p < 0.001; **p < 0.01; *p < 0.05. ns, no significant.
instability (20). Compared with TP53 wild-type HCC patients, TP53 mutant HCC patients had shorter overall survival and relapse-free survival (13). We speculated that the poor prognosis of HCC patients with TP53 mutations may be partly due to the change in the RNA expression pattern caused by TP53 mutation. Following this hypothesis, we performed GSEA on HCC samples with and without TP53 mutation. We identified 7 gene sets that were positively correlated with TP53 mutation. After univariate Cox regression analysis, Kaplan-Meier (KM) survival analysis, Lasso penalty Cox regression analysis, and step-by-step multivariate Cox regression analysis, we selected six genes (EIF2S1, SEC61A1, CDC42EP2, SRM, GRM8, TBCD) to construct a prognostic risk score model.
To assess the reliability of the prognostic model, we conducted internal validation, external validation, integrated analysis and subgroup analysis. The AUC values of the ROC curves of 1-, 2-, and 3-year survival of the model were all greater than 0.7 in each independent cohort, which indicated that the signature composed of six genes had good performance in predicting the prognosis of HCC. The proposed signature was superior to a previously TP53-associated prognostic model developed by Long et al. (21) for the assessment of HCC OS.
To explore the regulatory mechanisms of the prognostication model, we performed GSVA on different risk groups. This work indicated that the high-risk cohort exhibited stronger proliferative activity and weaker metabolic activity, which was similar to the results of Gao et al. (22). Based on analysis of the tumor immune microenvironment, the cohort exhibiting high risk showed strong immunosuppression. Because of several important components representing immunosuppression, such as macrophages and Tregs (23,24), the infiltration abundance in the high-risk group was significantly higher than that in the low-risk group. Another important result was that the expression levels of immune checkpoints in the cohort with high risk were significantly higher than those in the cohort with low risk, which was consistent with the positive correlation between the risk score and the expression level of immune checkpoints. Interestingly, the risk score corresponding to TP53 mutation was also significantly increased. As suggested by Long et al. (25), the immune response of TP53 mutant HCC was significantly weaker. Therefore, we speculated that the poor prognosis of HCC patients in the cohort exhibiting high risk was associated with the tumor microenvironment of immunosuppression. Based on this evidence, we summarized the possible reasons that may lead to weak prognosis in the high-risk group: high proliferative activity, low metabolic activity and immunosuppression. To date, the six genes in the signature have not been reported in HCC. EIF2S1 participates in premature protein synthesis processes through the formation of a ternary complex with initiators tRNA and GTP (26). SEC61A1 refers to the main polypeptide conduction pathway in the endoplasmic reticulum membrane and the main subunit of the SEC61 complex. Its missense mutation can cause genetic immune-related diseases, such as plasma cell deficiency (27). CDC42EP2 is a member of the CDC42 subfamily that belongs to the Rho family, and the Rho family plays an important role in a variety of cellular processes covering skeletal myogenesis (28). SRM refers to a polyamine biosynthetic enzyme (29). Polyamines modulate the gene expression process through changes in DNA structures and the regulation of signal transduction pathways, and they are consequently associated with proliferation, tumor invasion, and metastasis (30,31). GRM8 pertains to 1 of eight G-protein coupled receptors in the glutamate family and couples to various intracellular second messenger mechanisms to modulate neuronal functions (e.g., neuronal excitability and development) (32). TBCD encodes TBCD (tubulin folding cofactor D), one of five tubulin-specific chaperones that critically impact microtubule assembly in all cells (33). Recently, variants in TBCD have been identified in patients with distinct progressive encephalopathy with an apparently wide clinic-related scope (34).
Our study used TP53 to find 6 new prognostic biomarkers of HCC, and the signature composed of these six genes showed good performance in predicting the prognosis of HCC. However, this work had limits that should be acknowledged. We developed and validated the prognostic risk mode by utilizing general databases, and the outcomes require in-depth confirmation through prospective research. Our major findings were generated from bioinformatics analysis, which demonstrates insufficient verification through in vitro and in vivo experimental processes. In future work, prospective laboratory studies to clarify the specific mechanisms of the six genes in our signature are warranted.

CONCLUSION
Our study provides a reliable method for the prognostic risk assessment of HCC and has great potential for clinical transformation.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: The datasets analysed for this study were obtained from The Cancer Genome Atlas (TCGA) (https:// portal.gdc.cancer.gov/) and International Cancer Genome Consortium (ICGC) (https://icgc.org/).

AUTHOR CONTRIBUTIONS
JH designed this study. LW and YZ contributed to the conception of the study. JH performed the data analyses and wrote the manuscript. LW helped perform the analysis with constructive discussions. All authors contributed to the article and approved the submitted version.