Bioinformatics Analysis Using ATAC-seq and RNA-seq for the Identification of 15 Gene Signatures Associated With the Prediction of Prognosis in Hepatocellular Carcinoma

Background Gene expression (RNA-seq) and overall survival (OS) in TCGA were combined using chromosome accessibility (ATAC-seq) to search for key molecules affecting liver cancer prognosis. Methods We used the assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) to analyse chromatin accessibility in the promoter regions of whole genes in liver hepatocellular carcinoma (LIHC) and then screened differentially expressed genes (DEGs) at the mRNA level by transcriptome sequencing technology (RNA-seq). We obtained genes significantly associated with overall survival (OS) by a one-way Cox analysis. The three were screened by taking intersection and further using a Kaplan–Meier (KM) for validation. A prognostic model was constructed using the obtained genes by LASSO regression analysis.The expression of these genes in hepatocellular carcinomas was then analysed. The protein expression of these genes was verified using the Human Protein Atlas(HPA) online datasets and immunohistochemistry. Results ATAC-seq, RNA-seq and survival analysis, combined with a LASSO prediction model, identified signatures of 15 genes (PRDX6, GCLM, HTATIP2, SEMA3F, UCK2, NOL10, KIF18A, RAP2A, BOD1, GDI2, ZIC2, GTF3C6 SLC1A5, ERI3 and SAC3D1), all of which were highly expressed in hepatocellular carcinoma. The LASSO prognostic model showed that this risk score had high predictive accuracy for the survival prognosis at 1, 3 and 5 years. A KM curve analysis showed that high expression of all 15 gene signatures was significantly associated with a poor prognosis in LIHC patients. HPA analysis of protein expression showed that PRDX6, GCLM, HTATIP2, NOL10, KIF18A, RAP2A and GDI2 were highly expressed in the hepatocellular carcinoma tissues compared with normal control tissues. Conclusions PRDX6, GCLM, HTATIP2, SEMA3F, UCK2, NOL10, KIF18A, RAP2A, BOD1, GDI2, ZIC2, GTF3C6, SLC1A5, ERI3 and SAC3D1 may affect the prognosis of LIHC.


INTRODUCTION
According to Global Cancer Statistics 2020, hepatocellular liver cancer has the seventh-highest incidence rate but the secondhighest mortality rate after lung cancer. As a common malignancy, with potentially fatal consequences, hepatocellular carcinomas have been widely studied (1). Although much research has focused on understanding hepatocellular cancer at the molecular level and therapeutic strategies have been developed, the biological mechanisms of hepatocarcinogenesis remain unclear. Due to slow progress in liver cancer research (2,3), the patient survival rate remains low (i.e. < 8 months) (4,5). Liver cancer is a more complex disease than other cancers, as its progression includes genetic modification processes, including gene mutations, gene deletions, translocations and DNA methylation (6). Therefore, early diagnosis and treatment are essential for improving the prognosis of liver cancer. To date, the only effective diagnostic method is detection of the serum tumour marker alpha-fetoprotein, which has an upper limit of normal value of 20 ng/mL (7). However, alpha-fetoprotein is nonspecific and has little statistical significance when detected in patients with different types of liver cancer (8). It is also ineffective for the diagnosis of early-stage liver cancer (9).
In eukaryotic cells, nuclear DNA and proteins combine to form chromatin, which then undergoes complex and orderly folding to form chromosomes (10). For genes to be expressed, chromatin must be in an open conformation. Open chromatin allows regulatory proteins to bind to DNA and regulate DNA function (11). The assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) enables highthroughput sequencing of open chromatin regions with the help of transposases. This simple method, which is very similar to ChIP-seq, requires only a small number of samples to obtain clear and reproducible sequencing results (12). ATAC-seq detects chromatin accessibility of related genes and indicates their regulatory mechanisms. Thus, genes with chromatin accessibility in promoters are more likely to be differentially expressed at the mRNA level and regulated by transcription factors (13).
Due to the complexity of gene expression regulatory mechanisms, it is crucial to be able to probe biological questions at different levels. Therefore, the integration and analysis of multi-omics is increasingly important. Differentially expressed genes (DEGs) can be analysed using RNA-seq and Chip-seq (14). Using Chip-seq, the regulatory role of specific transcription factors can be studied (15). ATAC-seq can shed light on the dynamics of chromatin accessibility. As chromatin accessibility is closely related to the binding of regulatory elements or transcription factors, it plays an important role in transcriptional regulation (12). Therefore, integration analysis can further explore the key factors of a biological process, as well as the target genes of a transcription factor. Currently, integration analysis studies combining ATAC-seq and RNAseq are uncommon, with no such studies conducted on hepatocellular carcinomas. Therefore, in this study, we constructed a 15 gene signatures for predicting the prognosis in hepatocellular carcinoma patients by analysing and integrating ATAC-seq and RNA-seq.

Data Sources
ATAC-seq data on LIHC were obtained from the database of the University of California Santa Cruz (UCSC) (https://xenabrowser. net/datapages/). In total, ATAC-seq data of 404 LIHC samples were obtained, 17 of which were from TCGA database. The data were downloaded in promoter peak data format, with normalized correction. The calibration process included count conversion to CPM, after a base 2 logarithmic transformation.
RNA-seq data on LIHC were downloaded from the TCGA database, with 371 tumour samples and 50 para cancer samples.

Chromatin Accessibility Analysis Using ATAC-seq
To explore the accessibility of chromatin, we first used the R package chromosome locator to show the peak regions on chromosomes. Peaks that could be mapped to TSS regions were aligned using the R-packaged ChIPseeker to construct a marker matrix. The nearest TSS region was selected for peak annotation. The annotation information was obtained from the R software. The relationship between open chromatin and promoter regions was revealed by UpSet plots.

Analysis of DEGs Using RNA-seq
To assets differential expression of mRNA, the Limma package of R software (version: 3.40.2) was used. Adjusted P values in the TCGA dataset were analysed to correct for false-positive results. DEGs were obtained by screening with |log2(FC)| > 1 (P < 0.05). Heat and volcano plots were plotted using the R package ggplot2.

Survival Analysis
The R package Survival was used for the survival analysis. The correlation between the expression levels of all known genes in LIHC and the overall survival (OS) of patients with hepatocellular liver cancer was analysed by a one-way Cox regression analysis, reporting hazard ratios (HRs) and their 95% confidence intervals (CIs). A Kaplan-Meier (KM) test was performed to analyse the difference between the survival of patients with high and low gene expression.

LASSO Model
To compare survival differences between multiple groups, the logrank test was used to test the KM survival analysis, and ROC analysis was used to compare gene prediction accuracy and risk scores. A LASSO regression was used for feature selection, with 10-fold crossvalidation. The R package glmnet was used for the above analyses.

Immune Cell Infiltration
The TIMER database (http://timer.comp-genomics.org/) was used to analyse the correlation of gene expression in LIHC with the level of immune cell infiltration.

Protein Expression Validation
Immunohistochemical staining maps of gene signatures for protein expression in both liver cancer tissue and normal tissue were downloaded from The Human Protein Atlas (HPA) database.

Identification of Chromatin Open Regions by ATAC-seq
We mapped the genomic coordinates from the Peak data to the 23 chromosomes of the human genome ( Figure 1A). As can be seen, most regions of each chromosome are covered, with some chromosomes, such as chr13, chr14, chr21 and chr22, having less coverage of the short arms. Figure 1B shows that most of the peaks are concentrated at a distance of 10-100 kb from the TSS. Among these, the binding site tend to be distributed more at the 3' end of the TSS. Figure 1C shows a large proportion of ATACpeaks are located close to TSS, which means that the TSS tends to bind to transcription factors.

Genomic Characterization and Enrichment Analysis
Using the annotation file, we annotated the genomic coordinates corresponding to Peaks. Figure 2A shows the proportion of different components. As can be seen, 44% of the binding sites are in the distal intergenic region, with only 10% bound within the 3 kb region upstream and downstream of the TSS, mainly because the TSS region constitutes a small proportion of the whole genome compared to other regions. Figure

DEG Screening
Differential expressed analysis of RNA-seq was performed on 371 LIHC tumour samples and 50 para tumour samples from TCGA database. Heat maps of the expression of each gene in each sample were drawn ( Figure 3A). Volcano maps show the upregulated genes (N = 2,371) and downregulated genes (N = 544) obtained from the screening ( Figure 3B). A one-way Cox analysis was used to derive 4,785 genes significantly associated with the prognosis of LIHC. The P values, risk factor HRs and CI column line table for the top 20 genes expressed, in addition to prognosis-related characteristics, are shown in Figure 3C. Finally, 190 overlapping genes were obtained by screening reproducible genes in ATAC-seq, DEGs in RNA-seq and prognosis-related genes ( Figure 3D).

KM Analysis of Overlapping Genes
Validation of 190 overlapping genes using the KM method yielded 126 genes that were significantly associated with OS in LIHC. Hazard coefficient HRs, with their 95% CIs and P values for 126 genes were derived by a log-rank test and univariate Cox proportional hazards regression ( Table 1).

LASSO Model Building
LASSO regression was applied to 125 genes for feature selection.
A prognostic model consisting of 15 gene signatures was obtained after 10-fold cross-validation ( Figures 4A, B). The complete gene names of 15 genes are shown in Supplementary  Table 1. Figure 4C shows the association between the risk score and survival time with survival status in the TCGA dataset. Figure 4D shows the distribution of KM curves of the risk model in the TCGA dataset. The gene signature model was divided into high-risk and low-risk groups according to the risk score, with a HR of 2.483 representing a risk factor. Figure 4E shows the ROC curves and AUC of the risk model at different times. The AUC values at 1, 3 and 5 years were 0.809, 0.723 and 0.706, respectively, indicating that the model has a strong predictive ability. shows the statistical significance (P value) of gene expression in different tissues and different stages. Fifteen genes were upregulated in the LIHC tissues. Most of these genes were not significantly expressed in different LIHC stages.

Prognostic Analysis of 15 Gene Signatures in LIHC Patients
The KM analysis of the survival prognosis of the 15 gene signatures in LIHC patients showed that high expression of all 15 gene signatures was significantly associated with a poor prognosis in LIHC patients ( Figure 6).

Validation of the Protein Expression of 15 Gene Signatures
The protein expression of the 15 gene signatures in hepatocellular carcinoma tissues and normal liver tissues was verified using the HPA online database (Figure 8). The results showed that PRDX6, GCLM, HTATIP2, NOL10, KIF18A, RAP2A and GDI2 were highly expressed in the hepatocellular liver cancer tissues compared to the normal liver tissues. SEMA3F, BOD1, SLC1A5 and ERI3 were not detected in cholangiocytes and hepatocytes. SAC3D1 was not detected in hepatocellular liver cancer tissues. In addition, UCK2 and ZIC2 were not detected in the hepatocellular carcinoma samples in the protein expression data.

GO and KEGG Pathway Enrichment Analysis of 15 Gene Signatures
In the GO and KEGG enrichment analyses of the 15 gene signatures, cellular processes, binding, metabolism and cancer were enriched (Figure 9).

DISCUSSION
Tumours usually have altered epigenetic patterns, and epigenetic regulators are frequently mutated in cancer (16). ATAC-seq is an innovative technique for epigenetic studies that cleaves nuclear chromatin regions that are open at a particular condition location by transposes and thus obtains the regulatory sequences of all active transcripts in the genome at that particular condition (12). In the present study, we performed genomic interval distribution statistics using ATAC-seq for peaks identified in different LIHC samples, while comparing chromatin open site differences between different samples. Furthermore, we conducted GO and KEGG enrichment analysis for genes adjacent to these differential sites. A portion of genes associated with the GO functional enrichment pathway was identified. Subsequently, RNA-seq was used to analyse DEGs in the LIHC samples, and a one-way Cox analysis was performed to analyse genes significantly associated with the prognosis in LIHC. In total, 190 key genes were obtained by overlapping the genes screened using the three methods. Further validation by KM analysis yielded 125 key genes. However, due to the large number of 125 key genes, featured genes needed to be further extracted.
To take account of the dimensional catastrophe problem, we used LASSO regression analysis for further analysis of the gene set. By using a penalty function to compress the regression coefficients of the independent variables, LASSO provides good shrinkage control and can compress the regression coefficients of some independent variables to 0 (17). Finally, we obtained a sparse model and then obtained genes with a higher significant correlation with the survival prognosis of liver cancer patients. According to 10-fold cross-validation, a penalty parameter, l = 0.0425, was finally selected, and then l was substituted into the regression equation of LASSO to ensure that the sum of the absolute values of the regression coefficients of all independent variables was less than or equal to the selected penalty parameter l. Finally, the regression coefficients of a large number of genetic variables were compressed to 0, and genes with regression coefficients except that were selected and subjected to LASSO regression. Fifteen gene signatures were identified: PRDX6, GCLM, HTATIP2, SEMA3F, UCK2, NOL10, KIF18A, RAP2A, BOD1, GDI2, ZIC2, GTF3C6, SLC1A5, ERI3 and SAC3D1. We generated a prognostic index for each sample for the risk scores of each gene in each sample and then divided the samples into high risk and low risk according to the prognostic index to analyse the overall survival time of the samples. The results revealed that the survival of patients classified as high risk was significantly worse than that of patients classified as low risk. The prognostic model had good predictive power.
Subsequently, we investigated the expression of each gene in LIHC patients and patients with various LIHC disease stages and found that each gene was significantly upregulated in cancerous tissues in LIHC patients compared with paracancerous tissues and that highly expressed genes were significantly associated with a poor patient prognosis. This finding suggested that these genes can be used as prognostic predictive biomarkers for LIHC and that the prediction model combining these genes performs well. We also analysed the correlation between gene expression and immune cell infiltration levels using TIMER data and found that other than PRDX6 and HTATIP2, all other 13 gene signatures were significantly and positively correlated with tumour purity and B cell, CD4+ T cell, CD8+ T cell, macrophage, neutrophil and dendritic cell infiltration levels. This finding suggested that these gene signatures may influence tumour progression by regulating the tumour microenvironment. PRDX6 is a unique bifunctional enzyme, which reduces both water-soluble and lipid-soluble peroxides and has unique phospholipase A2 activity (18). A previous study reported that inhibition of PRDX6 expression promoted apoptosis in Hepa1-6 cells (19). Another study reported that interventional treatment of primary liver cancer can reduce serum HTATIP2/TIP30 and B7-H4 levels, improve liver function and quality of life and prolong the survival times of patients (20). A number of studies reported that SEMA3F, UCK2, NOL10, RAP2A, ZIC2 and SAC3D1 are associated with prognosis and tumour immune infiltration in hepatocellular carcinomas (21)(22)(23)(24)(25). In addition, KIF18A was reported to be associated with tumour immune cell infiltration in adrenocortical carcinomas (26). SLC1A5 may serve as a potential target for enhancing antitumour immunity in the tumour microenvironment (27). According to the literature, GCLM gene polymorphisms are associated with hepatitis B virus-induced liver disease (28). There is limited research on BOD1, GDI2, GTF3C6 and ERI3' impact on hepatocellular carcinomas. BOD1 is a novel mitogenic protein required for chromosomal localization (29). It may act as a unique cytoplasmic interacting protein to regulate signal pathway, thereby having potential in the treatment of various diseases, including cancer (30). GDI2 belongs to a small family of chaperone proteins expressed mainly in hematopoietic, endothelial and epithelial cells (31). GDI2 expression is abnormal in many cancer types, including pancreatic, ovarian, gastric and oesophageal squamous cell carcinomas (32)(33)(34). There are few studies on GTF3C6 and ERI3. One previous study found that an integrated model based on a six-gene signature (35) or a novel ferroptosis-related gene signature (36) could predict OS in patients with hepatocellular carcinomas. Most previous studies mainly used only RNA-seq to perform bioinformatic analyses. In contrast, we used ATAC-seq and RNA-seq, which are mutually authenticating, further strengthens the findings of the present study.
In summary, we obtained 15 gene signatures associated with the survival prognosis of hepatocellular carcinoma patients based on ATAC-seq and RNA-seq integration analysis and LASSO regression analysis. Due to the limitations of the experimental conditions, it was not possible to obtain a detailed understanding of the specific mechanism of the action of each of the 15 hepatocellular carcinoma signature genes. Thus far, there have been few studies on BOD1, GDI2, GTF3C6 and ERI3' impact on hepatocellular carcinomas. The findings of our study provide

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.