Development and Validation of Epigenetic Modification-Related Signals for the Diagnosis and Prognosis of Hepatocellular Carcinoma

Background Increasing evidence has indicated that abnormal epigenetic factors such as RNA m6A modification, histone modification, DNA methylation, RNA binding proteins and transcription factors are correlated with hepatocarcinogenesis. However, it is unknown how epigenetic modification-associated genes contribute to the occurrence and clinical outcome of hepatocellular carcinoma (HCC). Thus, we constructed the epigenetic modification-associated models that may enhance the diagnosis and prognosis of HCC. Methods In this study, we focused on the clinical value of epigenetic modification-associated genes for HCC. Our gene expression data were collected from TCGA and HCC data sets from the GEO database to ensure the reliability of the data. Their functions were analyzed by bioinformatics methods. We used lasso regression, Support vector machine (SVM), logistic regression and Cox regression to construct the diagnostic and prognostic models. We also constructed a nomogram of the practicability of the above-mentioned prognostic model. The above results were verified in an independent liver cancer data set from the ICGC database and clinical samples. Furthermore, we carried out pan-cancer analysis to verify the specificity of the above model and screened a wide range of drug candidates. Results Many epigenetic modification-associated genes were significantly different in HCC and normal liver tissues. The gene signatures showed a good ability to predict the occurrence and survival of HCC patients, as verified by DCA and ROC curve analysis. Conclusion Gene signatures based on epigenetic modification-associated genes can be used to identify the occurrence and prognosis of liver cancer.


INTRODUCTION
Hepatocellular carcinoma (HCC) is one of the most common and fatal malignancies in the world, and the number of HCC cases increases gradually by approximately 4% every year (1,2). Although early HCC can be treated by tumor resection and liver transplantation, the median survival time of this disease is only a few months (3,4). The poor prognosis and short lifetime of HCC, to some extent, are dependent on late diagnosis and lack of an effective treatment plan because early HCC has no specific symptoms.
The diagnosis of HCC mainly depends on biopsy and imaging evidence (5,6). However, due to technical difficulties, the above diagnostic indicators may be affected by subjective factors, resulting in false positive or false negative rates (7). The development of new diagnostic technology can better assist traditional diagnostic methods and help to improve the detection rate of early liver cancer patients. Accurate judgment of patient prognosis is helpful for guiding clinical decisionmaking and the implementation of precision and personalized medicine. Currently, the prognosis is mainly judged by BCLC and TNM stages, which is insufficient to predict the outcome of patients (8). Therefore, it is necessary to identify effective diagnostic and prognostic biomarkers to help optimize the treatment system of HCC (9). In the past few decades, researchers have identified an increasing number of tumorigenesis mechanisms. One of the breakthroughs is the participation of epigenetic processes in the development of cancer (10).
Epigenetics is a dynamic and heritable modification of independent DNA sequences (11). Abnormal epigenetic changes can destroy the expression balance of oncogenes and tumor suppressor genes and promote tumorigenesis. Common epigenetic modifications include DNA methylation, RNAm6a methylation, and histone acetylation, which are considered the main mechanisms of regulation during cancer progression. Previous studies have focused on the functional exploration of single epigenetic-related genes but lack extensive exploration. Moreover, the value of these genes in the diagnosis and prognosis of liver cancer is still unclear. In this study, we collected five kinds of epigenetically related genes (ERGs), a total of 2397 genes, including RNAm6a modification-related genes, histone modification-related genes, DNA methylation modificationrelated genes, RNA binding proteins and transcription factors (12)(13)(14)(15)(16)(17). To explore them in HCC, we analyzed the differentially expressed epigenetic-related genes in HCC by WGCNA and constructed an apparent regulatory network. To optimize the treatment system of HCC, we integrated epigenetic-related genes to build a diagnostic and prognostic model and compared the signal differences between the high-risk group and the low-risk group by GSEA and GSVA.

Data Acquisition and Processing
The mRNA expression and patient clinical data were downloaded from the GEO database (GSE14520), TCGA database (TCGA-LIHC data set), GTEx database (GTEX-liver data set) and ICGC database (ICGC-JP data set) (18)(19)(20)(21). Gene difference analysis was performed in the form of count data which correction with Limma package. The downstream function analysis is in the form of TPM data. ERGs consisting of m6A-related genes, histone modificationrelated genes, RNA binding proteins, transcription factors and DNA methylases were collected based on previous literature and databases (Table S1) (12)(13)(14)(15)(16)(17). Differential expression analysis was performed using "limma" R package. Cytoscape software was used to construct the regulatory network of ERGs and target genes with a criterion of a correlation coefficient greater than 0.75.

Functional Analysis Based on the WGCNA
The "WGCNA" R package was used for weighted correlation network analysis (WGCNA), and an appropriate soft threshold was chosen to cluster genes with similar coexpression in the same module (22). Clinical data were combined with the above modules to identify clinically meaningful gene clusters. We conducted GO and KEGG analyses with the criteria of a P value < 0.05 and a q value < 0.05 using the R package "enrichplot."

Construction of the Diagnostic and Prognostic Model
SVM analysis depended on the "e1071" R package, and lasso regression was conducted to screen diagnostic markers. Diagnostic models were built by logistic regression. Univariate Cox regression analysis was conducted to identify survivalrelated genes with the criterion of P<0.001. Lasso-multivariate Cox regression analysis was used to establish a prognostic model which performed by "survival" and "glmnet" R package. Kaplan-Meier survival curves were constructed to compare survival-time differences between the high-risk and low-risk groups. We performed the receiver operating characteristic curve (ROC) and decision curve analysis (DCA) to show the accuracy of the model which performed by "survival" R package (23,24). The model stability was shown using calibration curves.

Functional Prediction Analysis of the High-PERs and Low-PERs Groups
Based on the median prognostic epigenetic risk score (PERs) in the above TCGA-LIHC data set, the samples were divided into a high-PERs group and a low-PERs group. GSEA_4.0.1 software was used to explore the biological function of the PERs for the two groups based on the hallmark gene set. Gene set variation analysis (GSVA) used the "GSVA" R package and the hallmark gene set to intersect the results.

Evaluation of Drug Effects
The immune response for each sample depends on the tumor immune dysfunction and exclusion (TIDE) database (25). The CTRP2.0 database, which contains the sensitivity data for 481 compounds over 835 CCLs, and the PRISM database, which contains the sensitivity data for 1448 compounds over 482 CCLs, were used to evaluate the efficacy of compounds according to Knearest neighbor (KNN) imputation. The top and bottom 10% of the high and low subgroups were used as cutting points. Specific methods were described by Shixue Dai et al. (26).

Western Blots and RT-qPCR
Human HCC samples were obtained via surgical resection of HCC patients from The Second Affiliated Hospital of Chongqing Medical University. The protocols were reviewed and supported by the Ethics Committee of The Second Affiliated Hospital of Chongqing Medical University Approval Number (2020): Institutional Review Board (IRB) (STUDY) No. 88, Chongqing, China. Tissue lysis was performed using RIPA lysis buffer (Beyotime Biotechnology, Shanghai, China). Protein was extracted from tissues by centrifugation at 12,000g and detected by BCA assays at 562 nm. A rapid gel preparation kit (Beyotime Biotechnology, Shanghai, China) was used to configure the gel distribution at a 140-V constant voltage and 230 mA constant current electroconversion. Primary antibody was incubated at 4°C overnight. Secondary antibody was incubated at 37°C for 1 h. Phosphate-buffered saline was used to wash test strips three times for 10 min each. High-sensitivity ECL (Bio-Rad, USA) solution was used for the exposure strip. ImageJ software was used to analyze the results. RNA extraction and qPCR were performed using RNAiso reagent and a Prime Script ™ Reverse Transcriptase kit (TaKaRa, Shiga, Japan) following the manufacturer's instructions. The specific antibodies and primers are detailed in the supplementary materials. Immunohistochemistry I m m u n o h i s t o c h e m i s t r y w a s p e r f o r m e d u s i n g a n immunohistochemistry kit (Solarbio, Beijing, China). Samples were prepared through gradient dehydration of paraffin sections and EDTA antigen thermal repair. Primary antibody was incubated overnight. After three times washes with PBS, secondary antibody was incubated at room temperature for 1 h. Samples were then subjected to DAB staining.

Statistical Analysis
SPSS and R software were used for statistical analysis. Unless otherwise indicated, a P value < 0.05 was considered statistically significant. Two-tailed Student's t tests were used to test the differences between different groups.

Identification of Hub ERGs in HCC
The analysis flowchart is shown in Figure 1. We integrated the 2397 ERGs mentioned above. First, based on the combination of TCGA-LIHC and the GTEX liver sequence data set, we used differential gene analysis because the TCGA-LIHC data set has a few samples of normal liver tissue. The cutoff criterion was FC > 1.5 and FDR < 0.01. Considering the advantages and disadvantages of sequencing and microarrays in evaluating gene expression, we further verified the above results in the largest and most comprehensive online HCC microarray data set, GSE14520, with a cutoff value of FDR < 0.01, which identified 492 differentially expressed ERGs (Figure 2A). To further obtain clinically significant ERGs in HCC, we applied WGCNA to analyze the above genes and obtained five modules in HCC; among them, the blue, black, green, and yellow modules were closely related to patient DFI, PFI and OS based on the widely validated TCGA clinical data, resulting in 410 hub ERGs ( Figure 2B).

KEGG and GO Analysis and Construction of the Epigenetic Factor Regulatory Network
Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) analyses, which can identify gene functions in a high throughput manner based on previous gene annotation, were performed to explore the functions of 410 hub ERGs. The KEGG pathways included ribosome, spliceosome and RNA transport ( Figure 2C). In addition, the GO analyses showed that biological process (BP) terms were mainly regulation of RNA splicing, ribonucleoprotein biogenesis, and RNA splicing; the cell component (CC) terms were ribosomal subunit, ribosome and spliceosomal complex; and the molecular function terms showed significant enrichment in structural constituent of ribosome, catalytic activity, acting on RNA and nuclease activity ( Figure 2D). Furthermore, to determine the regulatory targets of ERGs, we identified the differentially expressed genes in HCC according to the previous analysis and constructed an apparent differentially expressed regulatory network according to a correlation coefficient greater than 0.75 ( Figure 2E).

ERGs for the Diagnosis of HCC
To evaluate the diagnostic value of hub ERGs, we strictly filtered the range of differences, and the cutoff log2FC value was greater than 2.5 based on the TCGA-GTEX data set, which identified 26 ERGs ( Figure 3A). For the accuracy and refinement of the model, we performed support vector machine (SVM) and least absolute shrinkage and selection operator (lasso) analyses, and four genes were selected from the intersection to construct a diagnostic epigenetic risk score (dERS) model using multivariate logistic regression analysis. The model formula is = (1.245*TOP2A)+ (−1.144*GRHL2)+(1.667*-RNASE4)+(1.963*CDKN2A) ( Figures 3B-D). In the TCGA-GTEX data set, the receiver operating characteristic curve (ROC) showed an excellent AUC value (0.996), specificity (0.972) and sensitivity (0.969) which cutoff value was −2.51 ( Figure 3E). We also verified the above diagnostic model with the independent data set ICGC-JP ( Figure 3F). In addition, we also verified the above diagnostic model in the microarray data set (GSE14520), indicating that the above model is not only suitable for sequencing data, but also suitable for microarray data ( Figure 3G).

Experimental Verification of Diagnosis-Associated Genes
For the reliability of the data, we further explored the expression of genes related to diagnosis of HCC. Two genes (TOP2A and CDKN2A) were reported by previous studies in HCC (27,28). Furthermore, we performed RT-qPCR assays, which confirmed the expression of RNASE4 and GRHL2 in 30 pairs of liver cancer and adjacent tissues ( Figure 4A). At the protein level, Western blot experiments were conducted in which RNASE4 and GRHL2 expression was significantly downregulated in HCC ( Figure 4B). Immunohistochemistry was also performed in 10 paired of liver cancer and adjacent tissues to furthermore measure the expression of RNASE4 and GRHL2 ( Figure 4C).

ERGs for the Prognosis of HCC
To explore the value of the above hub ERGs in HCC, we conducted univariate Cox regression analysis to identify 27 survival-related genes (P<0.001). Next, we established a prognostic epigenetic risk score (PER) consisting of seven genes by lasso-Cox regression analysis ( Figure 5A). PERs were calculated for patients in the TCGA-LIHC data set and divided into high-PER and low-PER groups according to the 50% cutoff point. The high-PERs group had shorter survival than the low-PERs group which cutoff value was 10.52 ( Figure 5B). The AUC values for 1-, 3-and 5-year OS were 0.830, 0.720, and 0.657, respectively ( Figure 5C). In addition, we performed decision curve analysis (DCA), which showed that PERs provided good benefits to patients (Figures 5D-F). The calibration curve also showed excellent stability ( Figure 5G). We also compared the differences of clinical features between the high and low PERs groups which verified the reliability of the results from the inside of data set ( Table 1). The above survival-related model was also verified with an independent data set, the ICGC-JP data set, and the result was consistent with the TCGA-LIHC data set which cutoff value was 10.52 ( Figures 5H, I).

Analysis of the Clinical Characteristics for the PERs Groups
For clinical practicability, a nomogram was drawn to predict 1-year, 3-year, and 5-year survival ( Figure 6A). Through univariate and multivariate regression analyses, we identified PERs as an independent prognostic factor for HCC ( Figures 6B, C). The differences in clinical characteristics between the high-PERs and low-PERs groups were analyzed in the above data sets, and the accuracy of the model was further verified based on the data set. Based on the TCGA-LIHC data set, the differentially expressed signaling pathways between the high-PERs and low-PERs groups were identified by using the GSEA and GSVA algorithms to obtain the intersection (Figures 6D, E, and S1) (29,30), which helps elucidate the difference between the high-PERs and low-PERs groups.

Pan-Cancer Analysis
One of the advantages of bioinformatics analysis is that it can obtain a variety of tumor data for analysis. To explore the applicability of these prognostic models in other tumors, we collected 14 tumor (BLCA, BRCA, COAD, HNSC, KIRC, KIRP, LGG, LUAD, LUSC, OV, PRAD, SKCM, STAD, UCEC) data sets with sample sizes greater than 300 in TCGA database. First, we performed univariate Cox regression analysis which only a few tumors, such as KIRP, and OV were suitable for the above survival model ( Figure 7A). However, KM survival analysis showed that only LGG and KIRP had significant differences ( Figure 7B), which may be attributed to different statistical methods, but both of them indicated the liver cancer specificity of the above models

Bioinformatics Prediction of Drug Candidates
To further evaluate the effects of various drugs on the different PERs subgroups, we evaluated each patient's response to immunotherapy using the TIDE database. We found that the low-PERs group was more suitable for immunotherapy ( Figure 8A). To identify candidate drugs for the high-PERs group, we comprehensively analyzed the PRISM and CTRP2.0 databases and found that four compounds may be effective in the low-PERs group. The resulted based on the PRISM found that five drugs (ABT-737, gemcitabine, GSK461364, paclitaxel and SB-743921) was better response in low-PERs group ( Figure 8B) and the resulted based on the CTRP2.0 found that one drug (LY2606368) was better response in low-PERs group ( Figure 8C). These results can provide some guidance for clinical treatment.

DISCUSSION
Although the level of medical diagnosis and treatment has been improved in recent years, the accuracy of diagnosis and the survival rate and prognosis of HCC are still poor (31). Future research should identify clinically significant genes, predict their functions and explore their prognostic value based on bioinformatics. At present, there is a lack of effective biomarkers with high accuracy for the diagnosis and prognosis of HCC (32). However, in the past, bioinformatics research often focused on a single database or only focused on prognostic value, which had some limitations. In recent decades, scientific workers have identified many aspects of epigenetic modification regulating gene expression that interfere with tumor progression. DNA methylation, RNA m6A methylation, histone-related modification and so on are hot spots of tumor research. Previous studies have mainly focused on the impact of single epigenetic-related genes on tumor prognosis and function. Based on bioinformatics analysis, the above genes can be widely recognized for the diagnosis and prognosis of liver cancer, and functional prediction can be carried out, which can provide help for later experimental research. In addition, the regulation of gene expression by epigenetic regulators is closely related to transcription factors and RNA binding proteins; thus, we collected five kinds of ERGs: RNA m6A modification-related genes, histone modification-related genes, DNA methylationrelated genes, RNA binding proteins and transcription factors. Based on these genes, we successfully constructed a stable and reliable diagnostic and prognostic model for HCC, which was verified by an independent data set.
In this study, to ensure the accuracy of the analysis, we identified 492 differentially expressed ERGs according to the combination analysis of three online data sets. WGCNA can cluster disease genes according to the correlation of their intrinsic expression and identify key gene modules by combining them with clinical phenotypes (22). TCGA database is the most authoritative tumor database, including a variety of histological data and clinical information. OS, DFI, and PFI are recommended for survival-related analysis of the TCGA-LIHC data set. Thus, using WGCNA, a total of five coexpression modules were separated, and we chose the blue, black, green and yellow modules for the next study because they are closely related to survival and acquired 410 hub IRGs. KEGG and GO analysis can fully elucidate gene set functions compared with individual studies on gene function (33,34). We performed KEGG and GO analyses, and the results showed that the above genes were mainly enriched in RNA splicing and ribonucleoprotein complex biogenesis (35,36). Previous studies have shown that these physiological activities are closely related to tumorigenesis. Furthermore, we constructed a regulatory network of ERGs and target genes, which can preliminarily and comprehensively describe the regulatory relationship of these genes. Next, we investigated the diagnostic and prognostic value of the above-mentioned hub ERGs. Support vector machines can not only be used to build classification models but can also be used to screen important variables to build a simpler and more accurate model (37). Based on the TCGA-GTEX data set, we found important variables through SVM and lasso analysis and established a diagnostic model consisting of four ERGs for HCC by logistic regression with a specificity of 0.972 and sensitivity of 0.969. We also validated the above diagnostic model in an independent data set, the ICGC-LIHC data set, which had a specificity of 0.867 and a sensitivity of 0.861. There were a few differences, which may be attributed to the difference in the source of the included groups. The results showed the reliability and stability of the diagnostic model. Furthermore, we screened seven genes (CENPA, TTK, RNF2, GNL2, BUB1, CBX3, and DYNC1H1) that were significantly associated with prognosis of HCC. As a transcription factor, CENPA is up-regulated in a variety of tumors. It is considered to be one of the potential therapeutic targets for liver cancer, and is  associated with HBV infection (38). TTK is an RNA-binding protein which closely related to the proliferation and metastasis of HCC (39). RNF2 is a transcription factor, which has been reported to be related to the tumor progression (40). GNL2, an RNA-binding protein, is lack of reports in tumor and has been reported to be involved in the biological development (41). At present, BUB1 is mostly studied in liver cancer, involving multiple signal transduction pathways (42). Although the mechanism remains unclear, CBX3 is reported to be involved in the development of HCC (43). DYNC1H1 is reported to be involved in ERK related signaling pathway (44). Based on this, we established a prognostic risk model including ten ERGs by lasso-Cox analysis. The AUC values for 1-, 3-, and 5-year OS were 0.830, 0.720, and 0.657 in the TCGA-LIHC data set and 0.708, 0.701, and 0.690 in the ICGC-JP data set, respectively. These results support the reliability of the prognostic model. For the reliability of functional prediction, we also used two methods (GSEA and GSVA) to identify different signaling pathways between the high-risk group and the low-risk group.
The results showed that hypoxia, glycolysis, apoptosis, and other signaling pathways were enriched in the low-risk group and may be potential therapeutic targets. There are significant metabolic abnormalities of fatty acids in tumors. The significant increase in lipid droplets in tumor cells can promote the transformation of tumor mesenchymal cells. Reducing the accumulation of fatty acids in cells can reduce the migration and invasion of tumor cells (45). Hypoxia also participates in the formation of a protumor microenvironment (46). Glycolysis is regarded as one of the key metabolic characteristics of malignant tumor (47). Apoptosis is one of the main means of anti-tumor (48). These studies suggest that the differential expression of multiple signaling pathways in the high-PERs and low-PERs groups may be used for potential therapeutic strategies, which also confirmed the reliability of our prognostic model within the gene set. In addition, we used the TIDE database algorithm to evaluate our data and found that immunotherapy may be more reliable for low-PERs groups. The KNN algorithm combined with a pharmacodynamic database was used to screen the effective compounds for the low-PERs group on a large scale, which can promote the development of clinical treatment in the future (49).

CONCLUSION
We first comprehensively analyzed ERGs in liver cancer. Compared with previous bioinformatics studies that only constructed a prognostic model, we discuss the influence of gene sets on diagnosis and prognosis. These results can help elucidate HCC development and can contribute to future experimental research. More importantly, we constructed two novel biomarkers to guide the management of liver cancer. Although the major limitation of this study is the lack of verification by enough experimental data, we tried our best to ensure the reliability of the data through multiple data sets or multiple algorithms and conducted appropriate clinical sample validation. Overall, this study will be valuable for the treatment and diagnosis of HCC.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the ethics committee of The Second Affiliated Hospital of Chongqing Medical University. The patients/ participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
MQL, DGW, SQ, XYJ and ZJL performed the research and wrote the draft. MQL and XYJ is responsible for the following manuscript modification. DW and SQ collected and analyzed the data. ZJL is the guarantor. All authors contributed to the article and approved the submitted version. The contributions of MQL, SQ and XYJ were the same, and the order of authors was random.