Genomic Instability of Mutation-Derived Gene Prognostic Signatures for Hepatocellular Carcinoma

Hepatocellular carcinoma (HCC) is one of the major cancer-related deaths worldwide. Genomic instability is correlated with the prognosis of cancers. A biomarker associated with genomic instability might be effective to predict the prognosis of HCC. In the present study, data of HCC patients from The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) databases were used. A total of 370 HCC patients from the TCGA database were randomly classified into a training set and a test set. A prognostic signature of the training set based on nine overall survival (OS)–related genomic instability–derived genes (SLCO2A1, RPS6KA2, EPHB6, SLC2A5, PDZD4, CST2, MARVELD1, MAGEA6, and SEMA6A) was constructed, which was validated in the test and TCGA and ICGC sets. This prognostic signature showed more accurate prediction for prognosis of HCC compared with tumor grade, pathological stage, and four published signatures. Cox multivariate analysis revealed that the risk score could be an independent prognostic factor of HCC. A nomogram that combines pathological stage and risk score performed well compared with an ideal model. Ultimately, paired differential expression profiles of genes in the prognostic signature were validated at mRNA and protein level using HCC and paratumor tissues obtained from our institute. Taken together, we constructed and validated a genomic instability–derived gene prognostic signature, which can help to predict the OS of HCC and help us to explore the potential therapeutic targets of HCC.


INTRODUCTION
Hepatocellular carcinoma (HCC) is the sixth most prevalent cancer, and its mortality remains the fourth cancer-related deaths worldwide (Bray et al., 2018). The overall survival (OS) rate for HCC remains low because of its late stage at diagnosis and high recurrence after curative treatment (Bruix et al., 2014;Joliat et al., 2017;Zheng et al., 2017).
In the past decades, serum alpha-fetoprotein (AFP) is the traditional biomarker for diagnosis, outcome prediction, and evaluation of response to therapy in HCC. However, the level of AFP is influenced by other factors, such as cancer stage, tumor size, and underlying liver disease. Currently, AFP is lack of sufficient specificity and sensitivity for diagnosis of HCC (Bruix and Sherman, 2011;Kudo, 2015). As a result, the prognosis of HCC patients remains highly heterogeneous. Hence, some effective and sensitive new biomarkers are needed to improve the prognosis of HCC patients.
It is known that genetic mutation plays an important role in the carcinogenesis and progression of cancers. Many previous studies have revealed that genetic mutation of some vital genes has significant relevance to the development and prognosis of HCC. Activating mutations in CTNNB1 are negatively correlated with the prognosis of HCC related to alcoholic cirrhosis (Schulze et al., 2015). While inactivating mutations of TP53 are negatively affected the prognosis of HCC with hepatitis B virus infection (Nishida et al., 1993). Thus, genomic instability may serve as an important prognostic biomarker for cancers (Suzuki et al., 2003;Ottini et al., 2006). Several prognostic signatures based on genomic instability have been constructed for predicting survival of patients with breast cancer (Habermann et al., 2009;Bao et al., 2020) and ovarian cancer (Wang et al., 2017). To date, genomic instabilityderived gene biomarkers for HCC have not been explored. In the present study, we construct a genomic instabilityderived genes (GIGs) prognostic signature (GIGSig) to predict the OS of patients with HCC and explore the potential therapeutic targets of HCC based on The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) databases.
In addition, UBQLN4 is closely correlated with genomic instability and overexpressed in various malignancies (Jachimowicz et al., 2019). UBQLN4 may act as a biomarker for genomic instability. Therefore, we also analyzed the expression profile of UBQLN4 between genomic-unstable (GU)-like group and genomic-stable (GS)-like group with the aim to evaluate the genomic instability in different groups of patients with HCC.

Data Source
The transcriptomic profiling of RNA sequencing data, simple nucleotide variation data, and clinical data of HCC patients were collected from the TCGA database. The patients were randomly categorized into a training (n = 186) and test (n = 184) set. The demographic characteristics of the training, test and TCGA sets were shown in Table 1. The clinicopathological variables were not significantly different between the training and test sets. To externally validate the predictive value of the prognostic signature, the simple nucleotide variation data and the LIRI-JP data set from the ICGC database were used.

Identification of Genomic Instability-Derived Genes in the Cancer Genome Atlas and International Cancer Genome Consortium Data Sets
The mutations that cause amino acid changes are defined as gene mutations. The cumulative numbers of gene mutations for each sample in the TCGA and ICGC sets were calculated by the Perl software. Subsequently, samples were ranked and classified into GU-like group (the top 25% of patients) and GS-like group (the last 25% of patients) according to the number of gene mutations. Then, differentially expressed gene analyses between GS-like and GU-like group were performed using the Limma R package; | log2 fold change| > 1 and false discovery rate <0.05 were set as the cutoff values. Genes differentially expressed in both the TCGA and ICGC data sets were defined as GIGs. The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) function analyses were performed to explore the potential molecular function of these GIGs. Based on GIGs, hierarchical cluster analyses of TCGA and ICGC sets were performed using Euclidean distances and Ward's linkage method.

Development and Validation of the Genomic Instability-Derived Gene Prognostic Signature
We then developed a GIGSig in the training set and validated in the test and TCGA and ICGC sets. The OS was selected as the primary efficacy end point in this study. Cox univariate analysis was performed in the training set to identify the OSrelated GIGs; P < 0.05 was selected as the cutoff value. Least absolute shrinkage and selection operator (LASSO) analysis was performed to avoid overfitting the prognostic signature. Cox multivariate analysis was performed to construct a GIGSig. Kaplan-Meier curves were plotted to compare the OS of highand low-risk groups. The receiver operating characteristic (ROC) curve was plotted, and the area under the ROC (AUC) was calculated to evaluate the predictive efficacy of this signature.

Genes Forward primers Reverse primers
The clinicopathological features and risk score were analyzed by univariate and multivariate Cox analyses to explore the risk factors that affected the survival of HCC patients. Subgroup analysis and the correlation analysis between the risk score and clinicopathological characteristics were performed.

Development and Validation of a Nomogram for Predicting OS of Hepatocellular Carcinoma
A nomogram that integrated the independent prognostic factors selected from multivariate regression analysis was developed to predict the OS of HCC patients at 1, 3, and 5 years. The prediction accuracy of this nomogram was validated by comparing the observed actual probability with the calibration curve.

Tissue Samples
Thirty HCC tissues and paired adjacent non-cancerous liver tissues were collected from patients who underwent liver resection in the First Affiliated Hospital of Sun Yat-sen University from June 2019 to December 2019. The samples were snap-frozen in liquid nitrogen and stored at −80 • C before RNA and protein extraction. This study was approved by the Ethics Committee of the First Affiliated Hospital of Sun Yat-sen University [Approval number (2021)158].

Quantitative Real-Time Reverse Transcription Polymerase Chain Reaction
Total RNA was isolated from the tissues using TRIZOL reagent (Invitrogen, CA, United States) according to the manufacturer's instructions. Complement DNAs (cDNAs) were reverse transcribed using SuperScript First-Stand Synthesis system (Invitrogen, Carlsbad, CA, United States). The synthesized cDNAs were used for real-time polymerase chain reaction (PCR) with reverse transcriptase-PCR System (Roche, LightCycler480 II, Switzerland), and the conditions were performed according to the manufacturer's instructions. The primers of GAPDH were used as internal loading control. All primers are shown in Table 2. Samples were normalized to internal housekeeping genes. All the values were standardized with 2 − CT method.

Western Blot
We extracted proteins by using RIPA lysis buffer with 1% proteinase inhibitor and quantified by a BCA kit (Thermo, United States). Equal amounts of proteins (20 µg) were separated by 7.5, 10, or 12.5% sodium dodecyl sulfate-polyacrylamide gel electrophoresis based on the molecular weight of the proteins and transferred onto polyvinylidene fluoride membranes (Millipore, United States). After blocking for 2 h with 5% skim milk powder at room temperature, membranes were incubated with primary antibodies at 4 • C overnight. The primary antibodies are shown in Supplementary Table 1. The membranes were then incubated with horseradish peroxidase (HRP)-conjugated goat antirabbit secondary antibody (1:5,000, Abcam, United Kingdom) for 11/2 h and visualized using the Immobilon Western Chemiluminescent HRP Substrate (Millipore, United States).

Statistical Analysis
All the statistical analysis was performed by R software (version 3.6.1) and Perl software (version 5.30). P < 0.05 was set as the cutoff value of significance.

Identification of Genomic Instability-Derived Genes and Development of a Genomic Instability-Derived Gene Prognostic Signature
A total of 363 samples of TCGA and 270 samples of ICGC data set were identified as gene mutation samples. Gene differential expression analysis identified 656 down-regulated and 58 upregulated genes in the TCGA set ( Figure 1A) and 1,853 down-regulated and 176 up-regulated genes in the ICGC set ( Figure 1B). Subsequently, 509 genes were identified to be differentially expressed in both the TCGA and ICGC sets, which were defined as GIGs in the present study ( Figure 1C). Gene Ontology analysis of these GIGs revealed that "extracellular matrix organization, " "collagen-containing extracellular matrix, " and "extracellular matrix structural constituent" were the most frequently involved biological terms under the biological processes, cellular components, and molecular functions, respectively (Supplementary Figure 1A). In the KEGG analysis, the main pathways related to these GIGs were the PI3K-Akt pathway, the Wnt pathway, and the MAPK pathway (Supplementary Figure 1B).
Based on the expression levels of the 509 GIGs, unsupervised hierarchical clustering analysis was performed on 374 samples of the TCGA set (Supplementary Figure 1C) and 237 samples of the ICGC set (Supplementary Figure 1D). The samples were categorized as the GS-like and GU-like groups. Somatic mutations counts were significantly higher in the GUlike group than that in the GS-like group in both TCGA (Supplementary Figure 1E) and ICGC sets (Supplementary Figure 1F). UBQLN4 is one of the key genes driving biological process of genomic instability. Therefore, the expression levels of UBQLN4 between the GS-like and GU-like groups were measured. Higher expression of UBQLN4 was identified in the GU-like group when compared to the GS-like group in both the TCGA (Supplementary Figure 1G) and ICGC sets (Supplementary Figure 1H).
Among them, 27 of 509 GIGs identified by the Cox univariate analysis were correlated with the OS in the training set ( Figure 1E), and 14 of 27 OS-related GIGs were suitable to establish a prognostic signature based on the LASSO analysis ( Figure 1D). Finally, 9 of 14 GIGs were selected by the Cox multivariate analysis to construct a prognostic signature ( Figure 1F).
The prognostic signature was presented as a risk score that was calculated by multiplying the gene expression levels and Cox regression coefficients.

Evaluation and Validation of the Genomic Instability-Derived Gene Prognostic Signature
Hepatocellular carcinoma patients of the training, test, TCGA, and ICGC sets were categorized into the high-and low-risk groups by the median value of the risk score in the training set. The heatmap analysis indicated that six GIGs were positively related to the risk score, and the other three GIGs were negatively associated with the risk score. Somatic mutation count and UBQLN4 expression of patients were positively related to the risk score (Figure 2). Furthermore, patients in the highrisk group have a higher somatic mutation count and higher UBQLN4 expression compared with those in the low-risk group in the training, test, TCGA, and ICGC sets (Figure 2). Clinical correlation analysis revealed that the risk score was related to pathological stage and tumor grade in the training and TCGA and ICGC sets (Figure 2).
Patients in the high-risk group had a significantly poorer OS than those in the low-risk group in the training (Figure 3A), test ( Figure 3B), TCGA (Figure 3C), and ICGC sets ( Figure 3D).
Cox univariate and multivariate analyses revealed that the risk score and pathological stage were independent prognostic factors for patients in the training, test, ICGC, and TCGA sets ( Table 3).

Performance Comparison of the Genomic Instability-Derived Gene Prognostic Signature With Previous Gene Signatures for OS Prediction in HCC Patients
To further evaluate the predictive efficacy of GIGSig on OS, we compared the AUC of OS at 3 years of GIGSig with other four published gene signatures using the same TCGA data set. These four signatures included a six-gene-based signature by Liu et al. (2019), a nine-immune-gene-based signature by Wang et al. (2020), a 6-immune-gene-based signature by Xu et al. (2021), and a 4-gene-based signature by Yan et al. (2019). The AUC of OS at the 3-year of GIGSig was 0.713, which was higher than   (Figure 4). These results suggested that the predictive performance of the present GIGSig is more accurate than the above four published gene signatures.

Subgroup Analysis
Subgroup analysis of the training, test, TCGA, and ICGC sets were performed based on age (>65 and ≤65 years), gender (female and male), tumor grade (grades 1-2  Figure 4), and ICGC set (Supplementary Figure 2), subgroup analysis showed that GIGSig performed well in predicting OS of HCC patients. While in subgroup analysis of the test set, except for patients in subgroups of age >65 years and G3-4, patients of other subgroups in the high-risk group have poorer OS than those in the low-risk group (Supplementary Figure 3).

Construction of a Nomogram for Predicting OS of Hepatocellular Carcinoma
As shown in Figure 5, we developed a 1-, 3-, and 5-year OS predictive nomogram by integrating the risk score and pathological stage. Calibration plots showed that our nomogram performed well in OS prediction.

Paired Differential Expression Analysis of Genes Included in Genomic Instability-Derived Gene Prognostic Signature
We performed paired differential expression analysis to explore the aberrant expression of the nine genes recruited in GIGSig between HCC samples and non-tumor liver samples in the TCGA and ICGC sets. Fifty paired tissues in the TCGA set and 198 paired tissues in the ICGC set were included.  The results showed that the mRNA expressions of seven genes in both TCGA and ICGC sets were significantly different between HCC samples and non-tumor samples, including six genes (CST2, EPHB6, MAGEA6, SEMA6A, SLC2A5, and SLCO2A1), and were highly expressed in HCC samples and one gene (PDZD4) expressed low in HCC samples (Figure 6).

Validation of the mRNA and Protein Expression Profile of Genomic Instability-Derived Gene Prognostic Signature in Clinical Tissues
We further validated the differential mRNA expression level of the above nine genes in 30 pairs of HCC tissues and adjacent noncancerous liver tissues collected from our hospital. The results showed that the mRNA levels of five genes (CST2, EPHB6, MAGEA6, SEMA6A, and SLCO2A1) were significantly elevated in the HCC tissues, whereas PDZD4 was down-regulated in the HCC tissues when compared with those in the paratumor tissues (Figure 6). These results were consistent with the results of TCGA and ICGC profile. In addition, the protein levels of the above nine genes in six pairs of HCC tissues and paratumor tissues were detected by using Western blot. As shown in Figure 7, the protein level of CST2 was significantly elevated in the HCC tissues of patients 1, 2, 4, 5, and 6, whereas it was decreased in patient 3. The protein level of EPHB6 was significantly elevated in the HCC tissues of patients 3, 4, 5, and 6 and decreased in patients 1 and 2. The protein level of MAGEA6 was significantly elevated in the HCC tissues of patients 1, 2, 4, and 5 and decreased in patients 3 and 6. The protein level of SEMA6A significantly elevated in the HCC tissues of patients 1, 2, 3, 4, and 5 and decreased in patient 6. The protein level of SLCO2A1 was significantly elevated in the HCC tissues of patients 1, 3, 4, and 5 and decreased in patient 2, and there was no significant difference in patient 6. The protein level of PDZD4 was significantly decreased in HCC tissues of patients 1, 3, 4, and 6 and increased in patients 2 and 5. The protein levels of RPS6KA2, MARVELD1, and SLC2A5 had no significant difference between the HCC tissues and paratumor tissues.

DISCUSSION
Genomic instability plays an important role in the development, progression, and recurrence of various cancers (Bartkova et al., 2005;Gorgoulis et al., 2005;Kronenwett et al., 2006;Mettu et al., 2010;Negrini et al., 2010). Therefore, genomic instability could be a promising biomarker to predict the oncological outcome of patients with cancers. In this present study, we constructed a novel survival predictive model for HCC patients by using nine genes derived from genome instability.
With a series of bioinformatics analysis by using the HCC database in TCGA, a novel OS predictive model, namely, GIGSig, based on the nine genes derived from genome instability was generated. Clustering analysis revealed that patients in the GUlike group have higher somatic mutation count and higher UBQLN4 expression than those in the GS-like group. Somatic mutation count and UBQLN4 expression level are positively correlated with genome instability process (Jachimowicz et al., 2019). Our results of clustering analysis revealed that these genes are associated with genome instability. In addition, GO and KEGG functional analysis revealed that the biological functions of these genes are mainly correlated with the development and progression of cancers. These results reveal that genes derived from genome instability may play a vital role in HCC prognosis and could be used to construct a prognostic signature.
According to GIGSig score, HCC patients could be divided into two distinct groups: the high-risk group and the low-risk group. Patients in the high-risk group had poorer OS, higher somatic mutation count, and higher UBQLN4 expression than those in the low-risk group. Subgroup analysis indicated that GIGSig can discriminate patients with poor OS, even though they were with early-stage tumor classified by conventional TNM staging. The AUC of GIGSig is the highest compared with the pathological stage, tumor grade, and the four published genebased prognostic signatures. These results collectively supported that our GIGSig is an effective prognostic signature to predict the OS of HCC patients.
In addition, the present signature was validated externally and internally, whereas the published gene-based signatures were validated only externally. We also performed subgroup analysis to further validate the predictive value of the signature for HCC patients in different statuses classified by clinical features. The results showed that the signature performed well. This result indicated that the gene signature we developed is more clinically meaningful compared with published articles.
To integrate GIGSig with other OS predictive factors, a nomogram model was proposed. Because the GIGSig score and pathological stage were independent OS predictors for HCC, we constructed a nomogram by integrating the GIGSig score and pathological stage to predict the OS of HCC. The GIGSigbased nomogram performed well in OS prediction, which was proved by the calibration curve. Therefore, our GIGSig-based nomogram may be a potential accurate predictive model for predicting the OS of HCC.
Nine genes in GIGSig were SLCO2A1, RPS6KA2, EPHB6, SLC2A5, PDZD4, CST2, MARVELD1, MAGEA6, and SEMA6A. These genes and their biological functions have been studied in some tumors. SLCO2A1 has been reported to promote the development of colon cancer by PGE2 uptake into the endothelial cells (Nakanishi et al., 2017). Metabolic abnormalities of EPHB6 promote cancer development and progression, which was proven in invasive melanoma (Hafner et al., 2003) and prostate, gastric, and ovarian cancers (Eph Nomenclature Committee, 1997). SLC2A5 is an oncogene and up-regulated in some cancers, that is, lung cancer (Weng et al., 2018) and acute myeloid leukemia (Zhao et al., 2018). Previous studies showed that CST2 was expressed higher in breast cancer tissues compared with normal mammary tissues (Egland et al., 2003). High CST2 expression led to a poor prognosis in gastric cancer patients . MAGEA6 was identified as an oncogene and can promote tumor progression via activating the AMPK signaling pathway (Pineda et al., 2015). PDZD4 is up-regulated in synovial sarcomas, and high expression of PDZD4 promotes the proliferation capacity of synovial sarcomas cells (Nagayama et al., 2004).
SEMA6A is down-regulated in lung cancer cells. High expression of SEMA6A suppressed cancer cell migration (Chen et al., 2019). Although in the work of Yu et al. (2012), MARVELD1 was down-regulated in HCC tissues, and high expression of MARVELD1 suppressed proliferation and enhanced chemosensitivity of HCC cells, our results showed that there was no significant differential expression of MARVELD1 in both TCGA and ICGC paired tumor and normal samples. Genetic variation of RPS6KA2 was related to the carcinogenesis of colorectal cancer (Slattery et al., 2011). RPS6KA2 was also identified as a cancer suppressor gene in epithelial ovarian cancer (Bignone et al., 2007).
We validated the expression profiles of the nine genes in the GIGSig at mRNA and protein levels using paired HCC and paratumor tissues from our department. The results showed that the mRNA expressions of six genes (CST2, EPHB6, MAGEA6, PDZD4, SEMA6A, and SLCO2A1) were differentially expressed between the HCC tissues and paratumor liver tissues from the TCGA, the ICGC, and our clinical samples. In addition, we also validated the protein expression level of these nine GIGs by using Western blot. The protein expression patterns were consistent with the mRNA. Notably, the expression profiles of these nine genes in HCC were somewhat different from their expression patterns in other cancers. This may be due to the different disease entities and the small sample size of our validated cohort.
Although preliminary in vitro experiments have verified the differential expression of these GIGs in HCC and paratumor liver tissues, the small sample size used for protein validations is the limitation of our research. Therefore, further in vivo and in vitro experiments are needed to explore the molecular mechanism of these GIGs in HCC.
To conclude, we constructed and validated a GIGSig that can help in predicting the OS of patients with HCC and may help us to explore the potential therapeutic targets of HCC.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: TCGA data portal (https://portal.gdc.cancer.gov/), ICGC data base (https://dcc. icgc.org/).

ETHICS STATEMENT
This study was approved by the Ethics Committee of the First Affiliated Hospital of Sun Yat-sen University. Approval number (2021)158.

AUTHOR CONTRIBUTIONS
S-QL: conceptualization and manuscript revision. Z-BS, G-PZ, and YY: data curation. Z-BS: tissue validation, data analysis and figure plot, and manuscript writing. All authors approved the final version of the manuscript.

FUNDING
This work was supported by the National Natural Science Foundation of China (82072663).