Abstract
Hepatocellular carcinoma (HCC) remains one of the most lethal cancers around the world. Precision oncology will be crucial for further improving the prognosis of HCC patients. Compared with traditional bulk RNA-seq, single-cell RNA sequencing (scRNA-seq) enables the transcriptomes of a great deal of individual cells assayed in an unbiased manner, showing the potential to deeply reveal tumor heterogeneity. In this study, based on the scRNA-seq results of primary neoplastic cells and paired normal liver cells from eight HCC patients, a new strategy of machine learning algorithms was applied to screen core biomarkers that distinguished HCC tumor tissues from the adjacent normal liver. Expression profiles of HCC cells and normal liver cells were first analyzed by maximum relevance minimum redundancy (mRMR) to get a top 50 signature gene feature. For further analysis, the incremental feature selection (IFS) method and leave-one-out cross validation (LOOCV) were conducted to build an optimal classification model and to extract 21 potentially essential biomarkers for HCC cells. Our results provided new insights into HCC pathogenesis that might be valuable for HCC diagnosis and therapy.
Introduction
Hepatocellular carcinoma (HCC), with an annual incidence of 8.3 per 1,00,000 in population around the world, remains one of the most lethal malignancies in the digestive system. It is estimated that the 5-years survival rate for HCC patients is 18%, only a little bit higher than pancreatic cancer among all cancers, indicating that HCC is still one of the worst prognostic tumors worldwide (Siegel et al., 2020). Fortunately, with the development of modern cancer therapies, which integrate diverse neoadjuvant and adjuvant strategies with classic surgical resection, the survival rate of HCC has been gradually improving in the past few decades (McGlynn et al., 2020). However, the highly heterogeneous nature of HCC determines that a large proportion of patients receiving standardized treatment will inevitably relapse (Petrowsky et al., 2020). Thus, precision oncology, including novel predictive and therapeutic oncogenetic markers, signals in tumor immune microenvironment and microbiome, etc. will be crucial for further improving the prognosis of HCC patients (Nault and Villanueva, 2020).
Tumor heterogeneity is the biggest obstacle to the development of precision treatment for HCC, which is decided by heterogeneous HCC cells, a changeful, complex microenvironment, and their involuted interaction. With the rapid development of profiling technology, bulk DNA and RNA sequencing have provided a lot of information about molecular phenotypes and evolutionary characteristics of HCC. In 2020, Amanda J. Craig and her colleagues reviewed the most important and common genetic alterations of HCC, including mutations in the TERT promoter, TP53, and CTNNB1, copy number variations in multiple genes, and aberrations in DNA methylation at the genome level (Craig et al., 2020). Rebouissou and Nault (2020) discovered signal pathways that are frequently altered in HCC patients included telomere maintenance, including Wnt/β-Catenin, P53/cell cycle regulation, oxidative stress, epigenetic modifiers, AKT/mTOR signaling, and MAPK pathway. These findings have generated classification schemas of HCC molecular subtypes. However, these results still encounter many difficulties in real-world clinical applications. On the one hand, bulk sequencing could only detect the average condition of gene alterations or expression status, unable to distinguish the gene expression signature of diverse cells in cancer samples, or get a categorized gene feature between tumors and normal tissues. On the other hand, limited by sample size and traditional methods for differential expression analysis (like Limma and EdgeR), results of bulk RNA-seq often lack stability and repeatability among different batches of experiments. These drawbacks reduce the potential practical value of bulk RNA-seq results.
The advent of single-cell RNA sequencing (scRNA-seq) is a revolutionary development in the field of profiling. Since proposed by Professor F Tang in 2009 (Tang et al., 2010), researchers have immediately realized the infinite value of this technology. It enables the transcriptomes profiling of a lot of individual cells assayed in an unbiased manner, allowing researchers to sort and study the specific characteristics of a single cell or a group of cells individually (Stegle et al., 2015). This technology perfectly fits the research on cancer that has innegligible internal heterogeneity. Thus, scRNA-seq for tumor research has sprung up in the past decade, it has been applied in multiple cancers including HCC. Zhang et al. (2019) detected HCC specific immune cells for five HCC patients using scRNA-seq and discovered that CD45 immune cells, LAMP3(+) dendritic cells, and tumor-associated macrophages were specific infiltrating immune cells in HCC and were associated with patients’ poor prognosis. Ho et al. (2019) grouped HCC stem cells for two subgroups through scRNA-seq according to the expression of EPCAM; they also identified a CD24/CD44-enriched cell subpopulation within the EPCAM(+) cells which might indicate a novel stemness-related cell subclone of HCC. These studies illustrate the significance of scRNA-seq for deeply understanding the evolutionary differences among HCC patients, the heterogeneity between HCC tissues and normal livers, between HCC parenchymal cells and microenvironmental mesenchymal cells. Thus, the multi-dimensional interpretation of tumor heterogeneity by scRNA-seq will effectively solve the current clinical problem of chemoresistance and tumor recurrence and guide tumor immunotherapy and targeted therapy for HCC patients. However, limitations still exist in scRNA-seq research nowadays. First, it seems difficult for scRNA-seq to provide specific genetic markers to guide clinical diagnosis and treatment, which focus more on the alteration of organism and histology levels, making single-cell profiling too “microscopic.” Second, even though scRNA-seq provides a temporal map of the tumor microenvironment and cell development and many new clusters of tumor progenitors and immune cells were identified, it is still hard to conduct molecular biological research on these new discoveries to elucidate the pathogenesis underlying the course of diseases. In a word, these limitations were mainly caused by insufficient mining of scRNA-seq big data. Innovative algorithmic strategies are demanded to provide new biological implications for scRNA-seq.
Max-Relevance and Min-Redundancy (mRMR) algorithm provides a highly robust feature selection scheme in machine learning and has been applied in multi-omics medical research in recent years. However, during the process of continuously adding features, mRMR only considers the local optimal solution. Thus, after obtaining the feature set with the importance ranking from high to low through this algorithm, a secondary feature selection is usually followed. For example, Morgan et al. (2021) applied mRMR along with an explainable boosting machine (EBM) classifier for CT radiomics to predict local failure following chemoradiation for head and neck cancer patients. Gao et al. (2020) chose an mRMR plus Random Forest model to find the lncRNA signature in bulk RNA-seq for immunophenotype prediction in Glioblastoma. In scRNA-seq, Cheng et al. (2020) applied an mRMR plus Support Vector Machines (SVM) to screen core biomarkers that distinguish the discrepancy between GBM tumor and pericarcinomatous environment. Based on the above research, we believed that the algorithmic scheme centered on mRMR might be used to screen biomarkers between cancer and non-cancer in scRNA-seq as well. This might have biological significance in assisting tumor diagnosis and tumor tissue identification during biopsy as well as providing novel parenchyma and stromal biomarkers for a certain cancer type.
In this study, based on paired scRNA-seq results of HCC and adjacent normal liver cells from eight patients, we designed a new computational strategy, consisting of machine learning algorithms, to screen core biomarkers that could distinguish the discrepancy between HCC and normal liver tissue. Gene expression profiles of tumor cells and paired hepatocytes were analyzed by maximum relevance minimum redundancy (mRMR) to get a 50-hub-gene feature. For further screening and classification of the 50-gene-feature, a support vector machine (SVM) algorithm was adopted. Results yielded a gene set with 21 genes that might be essential biomarkers for HCC tumor patients.
Materials and methods
Single cell gene expression profiles of HCC tumors and normal liver tissues
Single-cell gene expression profile data of HCC was obtained and downloaded from Gene Expression Omnibus (GEO) database in NCBI, the accession number was GSE149614. In this dataset, >70,000 single-cell transcriptomes for 10 HCC patients were sequenced and further measured using Illumina NovaSeq 6000 platform (GPL24676). Here we extracted expression profiles of patient 8 as our training set and patient three to seven and 9-10 as our validation set. Patient No. 1 and 2 were excluded from our study for a lack of sequencing data of paired normal liver tissues. The number of expressed genes was counted in paired HCC and normal liver samples of each patient. We utilized this dataset to further establish our gene feature for the purpose of discriminating HCC cells from normal liver cells.
mRMR ranking of discriminative genes
To achieve the goal of best discriminating the two types of tissues using the least number of genes, the Max-Relevance and Min-Redundancy (mRMR) algorithm was applied (Peng et al., 2005). This algorithm aimed to find a gene set that had the biggest correlation between the selected genes and samples (Max-Relevance), but the least correlation between genes inside this gene set (Min-Redundancy). The redundancy between genes was minimized as genes with similar expression characteristics were removed, except for the most representative genes remained. This method was confirmed effective in finding core biomarkers in sequencing analysis, especially in scRNA-seq with large and spare expression data (Cheng et al., 2020). It effectively overcame the shortcomings of traditional differential expression analysis in bulk RNA-seq, helping us to get a smaller number of biomarkers with the highest representation.
The mathematical model of this algorithm was shown as followed. First, we defined all genes, selected genes, and to be selected genes as Ω, Ωs, and Ωt, respectively. The relevance (D) of gene g from Ωt with cell type t can be measured with mutual information (I)
And the redundancy R of the gene g with the selected genes in Ωs are
Now, our goal is to get the gene gj from Ωt so that D takes the maximum value (Dmax) and R takes the minimum value (Rmin), which can be expressed as the following function
After n rounds of evaluation, all genes (Ω) will be ranked as a new gene list
The subscript i here reflects the trade-off between relevance with tissue type and redundancy with selected genes. The smaller index i is, the better discriminating power the gene has, and the higher of the corresponding gene gi ranks.
Screening and optimization of single cell HCC biomarkers
We then constructed 50 support vector machines (SVM) classifiers and applied an incremental feature selection (IFS) method (Ye et al., 2017) using Top 50 mRMR genes to further screen optimized biomarker genes. The 50 gene sets are defined as
Each candidate gene set includes the top k genes in the mRMR gene set S.
To prevent overfitting and evaluate the generalization ability of prediction performance for each SVM classifier, the leave-one-out cross validation (LOOCV) (Cheng et al., 2017) was then applied. Here we briefly described the procedure of LOOCV. Supposed that a dataset has N samples, in each round of LOOCV, there are (N−1) samples adopted for training and the remaining one sample for testing. This process keeps running until all the N samples have been tested for one time after N rounds.
Since the positive and negative sample sizes are imbalance, the Matthews correlation coefficient (MCC) (Matthews, 1975), which considered both sensitivity and specificity, seems idealized for our IFS optimizing process. The calculation formula of MCC is shown as followed:where TP, TN, FP, and FN are the abbreviation of true positive, true negative, false positive, and false negative, respectively.
After all the above procedures, an IFS curve was finally formed. The x-axis of this curve denoted the number of genes in the SVM classifier (1–100) and the y-axis indicated the MCCs of it. Based on the IFS curve, an inflecting point that represented the usage of relatively few genes to get a relatively higher prediction accuracy was marked. The x-coordinate value of this inflecting point indicated number of genes in the final biomarker gene set and the y-coordinate value represented the prediction performance. This point was regarded as the most suitable SVM model for the final HCC biomarkers.
Biological significance analysis
For the gene signature predicted by the above algorithms, we then analyzed their expressed cell types, chromosomal location, and functions by GO, GENECARDS and literature reviewing. For GO, biological process (BP) was enriched and the p value as well as false discovery rate (FDR) based on hypergeometric distribution were calculated, FDR <0.05 was considered significantly enriched. GENECARDS database was available in https://www.genecards.org/. Literature reviewing was applied using NCBI pubmed databases (www.ncbi.nlm.nih.gov/pubmed/) to search publications for every gene in recent 10 years.
The workflow diagram of this study was shown in Figure 1.
FIGURE 1

Workflow diagram of this study.
Results and discussion
Identifying the most discriminative feature by mRMR algorithm
After mRMR algorithm was applied, a feature, composed of top 50 most significant genes was established. This gene set was listed in Supplementary Table S1. Based on the principle of the mRMR algorithm, we believed that this feature was the most relevant one to distinguish HCC tumor cells from normal liver cells and had the least redundancy among the elements inside this gene set.
Further screening for the optimal HCC biomarker genes by IFS method
Given that our aim was to discriminate sample groups most significantly using biomarker genes as few as possible, the feature of 50 genes formed by mRMR algorithm was obviously too large to possess a practical value. Thus, we needed to choose an optimized group from these 50 genes as the final marker. To achieve this goal, IFS method was adopted. In the first round, only rank-first gene in mRMR was included as feature gene, then an SVM classifier was built to predict the group of each sample and validation was achieved by LOOCV and quantified by MCC value. In the second round, the rank-second gene was added into the previous 1-gene-feature and the above steps were repeated. This process kept repeated for 500 times until all 50 mRMR genes were included in the SVM model and an IFS curve was formed. As shown in Figure 2, the best peak MCC was 0.974 when 21 genes were included in SVM model in patient No.8 (train set), this peak MCC was also detected in other patients for validation (Figure 2). This peak was also validated effective in other patients except for patient No. 3 and No. 4, with MCC value no more than 0.7 but acceptable. Thus, these 21 genes were adopted as our final optimal HCC biomarkers (Table 1).
FIGURE 2

The IFS curve of the top 50 mRMR genes. The x-axis was the included number of top genes and the y-axis was the prediction performance. The blue plot was the training set from patient NO. 8. The peak MCC was 0.974 when top 21 genes included to IFS model. These 21 genes were chosen as the optimal HCC biomarkers.
TABLE 1
| Rank | Gene | Rank | Gene | Rank | Gene |
|---|---|---|---|---|---|
| 1 | SPP1 | 8 | GPR18 | 15 | SLAMF6 |
| 2 | FCN3 | 9 | AKNA | 16 | TRGC1 |
| 3 | FCRL6 | 10 | FCMR | 17 | STAT4 |
| 4 | S1PR5 | 11 | AC092580.4 | 18 | SCML4 |
| 5 | CD8A | 12 | AIM1 | 19 | HBB |
| 6 | SAA1 | 13 | GZMM | 20 | PLAC8 |
| 7 | CD160 | 14 | IGHA1 | 21 | APOA2 |
The 21 optimal HCC biomarker genes got from IFS method.
We further applied t-distributed stochastic neighbor embedding (t-SNE) for predicted HCC and non-HCC cells to detect both the tumor purity and the robustness of our classifier based on the 21 genes. As shown in Figures 3A,B and Table 2, there were only a few false positive (red dots in Figure 3A) and false negative dots (black dots in Figure 3B) mixed with true positive and true negative samples. However, the proportion of those false dots was extremely low with true dots and hard to classify. These t-SNE plots suggested that the HCC cells might contain non-HCC cells and vice versa, but most cells from the corresponding group were acceptable and the algorithms we applied could get the robust single cell biomarkers even when there were little tissue purity issues.
FIGURE 3

The t-SNE plots of predicted HCC cells and normal liver cells. (A) The t-SNE plots of predicted HCC cells. The true positive cells (black dots) account for the vast majority, while the false positive cells (red dots) occupied only a few and was mixed in the true positive cells, which was difficult to distinguish. (B) The t-SNE plots of predicted normal liver cells. The true negative cells (red dots) account for the vast majority, while the false negative cells (black dots) occupied only a few.
TABLE 2
| Predicted HCC | Predicted non-HCC | |
|---|---|---|
| Actual HCC | 4,713 | 82 |
| Actual non-HCC | 105 | 4,728 |
Confused matrix of the 21 selected genes.
The biological functions of the 21 hub genes
The machine learning methods provided us with a new set of gene features for HCC to identify tumor from paired normal liver tissues. However, nothing was learned from the biological significance of this gene set. We first performed Gene Ontology (GO) enrichment analysis for biological process (BP) analysis of the selected 21 genes (Table 3). Results of GO enrichment showed that they were enriched in Natural Killer (NK) cells and T cells (Tc) related pathways, indicating that changes in immune microenvironment are the core difference between HCC tumors and normal liver tissues.
TABLE 3
| GO_Term | P | Adj.P | Genes |
|---|---|---|---|
| Lymphocyte mediated immunity | 2.49E-05 | 0.014115 | CD8A/CD160/GZMM/IGHA1/SLAMF6 |
| Leukocyte mediated cytotoxicity | 0.000165 | 0.030107 | CD160/GZMM/SLAMF6 |
| Positive regulation of natural killer cell mediated cytotoxicity | 0.000262 | 0.030107 | CD160/SLAMF6 |
| Positive regulation of natural killer cell mediated immunity | 0.000384 | 0.030107 | CD160/SLAMF6 |
| Alpha-beta T cell activation | 0.000402 | 0.030107 | CD160/GPR18/SLAMF6 |
| Adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains | 0.000447 | 0.030107 | CD8A/GZMM/IGHA1/SLAMF6 |
| Cell killing | 0.000648 | 0.030107 | CD160/GZMM/SLAMF6 |
GO (BP) enrichment results of the 21 selected genes.
To explore the functions of these 21 genes in more depth from a biomedical perspective, we reviewed the location, basic functions, as well as related biological pathways and processes for each gene through Genecards database (https://www.genecards.org/). Literatures about the biological functions of these 21 genes in HCC and/or other malignant tumors were also thoroughly searched through Pubmed database. We finally divided these 21 genes into four categories (Siegel et al., 2020) Markers related to the malignant phenotype or clinical prognosis of HCC (Table 4); (McGlynn et al., 2020) Markers without reports in HCC but were related to the pathogenesis and/or prognosis of other malignant tumors (Table 5); (Petrowsky et al., 2020) Marker genes expressed by immune cells (Table 6); (Nault and Villanueva, 2020) Other genes that have not yet been studied in cancers, including FCMR, TRGC1, and HBB. Subsequent research was worth exploring the role of these markers in the pathogenesis of HCC and their applicating prospects in HCC diagnosis, monitoring and treatment. Furthermore, it was worth mentioning the immune-specific genes in Table 6. These genes were all markers for Natural Killer (NK) cells and T cells, indicating that changes in cytotoxic effects might play a vital role in the HCC immunity. These immune markers might be promising targets for enhancing the efficacy of HCC immunotherapy.
TABLE 4
| GeneName | Location | Function summaries | Related pathways | Reported functions in HCC |
|---|---|---|---|---|
| SPP1 | 4q22.1 | 1. Forming an integral part of the mineralized matrix and is to cell-matrix interaction. | 1. Cytokine activity | 1. Prognostic marker for HCC Zheng et al. (2018), Ouyang et al. (2020) |
| 2. Acting as a cytokine enhancing IFN-γ and IL-12, reducing production of IL-10, essential in type I immunity. | 2. Integrin binding | 2. Enhancer of cell growth Wang et al. (2019) | ||
| 3. Protein binding | ||||
| 4. Extracellular matrix binding | ||||
| SAA1 | 11p15.1 | 1. A major acute phase protein that is highly expressed in response to inflammation | 1. Heparin binding and chemoattractant activity | Lowly expressed in HCC patients, indicating worse prognosis Zhang et al. (2020) |
| 2. Major biomarker for diverse tumors | 2. Activated TLR4 signalling | |||
| 3. Signaling by GPCR | ||||
| STAT4 | 2q32.2-q32.3 | A member of the STAT family of transcription factors activated by cytokines | 1. DNA-binding transcription factor activity | Tumor suppressor in HCC that inhibit proliferation and promote apoptosis Li et al. (2016) |
| 2. Sequence-specific DNA binding | ||||
| 3. JAK-STAT signaling pathway | ||||
| PLAC8 | 4q21.22 | A highly conservative protein, physiology function unknown | 1. Chromatin binding activity | Downregulated in HCC, indicating poor prognosis when lowly expressed by promoting cell proliferation Zou et al. (2016) |
| 2. Differentiation of white and brown adipocyte |
Summary for markers related to the malignant phenotype or clinical prognosis of HCC.
TABLE 5
| GeneName | Location | Function summaries | Related pathways | Reported functions in other cancers |
|---|---|---|---|---|
| FCN3 | 1p36.11 | 1. Calcium-independent lectin activity, found in all human serum. | 1. Carbohydrate binding activity | Highly expressed in ovarian cancer and leukaemia patients Szala et al. (2013), Sokołowska et al. (2020). |
| 2. Functioning in innate immunity | 2. Antigen binding activity | |||
| 3. Related to innate immune. system and creation of C4 and C2 activators. | 3. Complement pathway in association with MASPs and sMAP | |||
| APOA2 | 1q23.3 | 1. The second most abundant protein in high density lipoprotein (HDL) particles | 1. Lipoprotein metabolism | A minimally invasive biomarker for detecting pancreatic cancer, bladder cancer, and metastatic renal cell cancer patients Vermaat et al. (2012), Chen et al. (2015), Sato et al. (2020) |
| 2. Stabilizing HDL structure by its association with lipids | 2. Signaling by GPCR | |||
| 3. Protein homodimerization | ||||
| 4. Activity of lipid binding | ||||
| AC092580.4 | 2p25.1 | Affiliated with the lncRNA class | NA | 1. Highly expressed in relapse AML patients Feng et al. (2018) |
| S1PR5 | 19p13.2 | 1. Receptor for a bioactive lysophospholipid S1P | 1. G protein-coupled receptor activity | 2. Highly expressed and promoting proliferation and invasion in clear cell carcinoma (ccRCC) and colon cancer patients Peng et al. (2020), Zhou et al. (2020) |
| 2. Both intracellular as a second messenger and extracellular as a receptor ligand | 2. Sphingosine-1-phosphate receptor activity | |||
| AKNA | 9q32 | 1. Centrosomal protein that plays a key role in cell delamination | 1. RNA polymerase II proximal promoter sequence-specific DNA binding activity | 1. A tumor suppressor in gastric cancer by modulating EMT Wang et al. (2020) |
| 2. An epithelial-to-mesenchymal transition (EMT) regulator | 2. Proximal promoter DNA-binding transcription activator activity | 2. An immune activator in cervical cancer Manzo-Merino et al. (2018) | ||
| 3. A transcription factor that specifically activates the expression of the CD40 | ||||
| AIM1 | 6q21 | NA | NA | 1. A classical tumor suppressor with high mutational frenquency in melanoma Ray et al. (1997) |
| 2. Suppressing tumor migration in prostate cancer Haffner et al. (2017) |
Summary for markers reported in other cancers.
TABLE 6
| GeneName | Location | Function summaries | Related pathways | Immune cell type | Reported in HCC |
|---|---|---|---|---|---|
| IGHA1 | 14q32.33 | Constant region of immunoglobulin heavy chains | 1. Cell surface interactions | B cells | Not Reported |
| 2. Response to elevated platelet cytosolic Ca2+ | |||||
| 3. Antigen binding activity | |||||
| 4. Immunoglobulin receptor binding activity | |||||
| FCRL6 | 1q23.2 | MHC class II receptor | protein phosphatase binding | NK cells and CTLs | Not Reported |
| CD8A | 2p11.2 | 1. A classic surface glycoprotein on most CTLs mediateing immune cell interactions | 1. Protein homodimerization activity | CTLs | Reported Sangro et al. (2020) |
| 2. A coreceptor for MHC class I molecule:peptide complex | 2. Coreceptor activity | ||||
| CD160 | 1q21.1 | 1. A transmembrane on immune cells, mainly NK cells and activated T cells | 1. Innate Lymphoid Cell Differentiation Pathways | NK cells and activated T cells | Reported Sun et al. (2018) |
| 2. Upon persistent antigen stimulation, it may contribute to CTL exhaustion | 2. Class I MHC mediated antigen processing and presentation | ||||
| 3. MHC class I receptor activity | |||||
| GPR18 | 13q32.3 | 1. A cannabinoid-activated orphan G protein-coupled receptor | 1. Peptide ligand-binding receptors | Tumor-infiltrating B lymphocytes (TIL-Bs) and CD8+ T cells | Not Reported |
| 2. Selected expressed on immune cells | 2. G protein-coupled receptor activity | ||||
| GZMM | 19p13.3 | A member of granzymes, serine proteases acctivity | 1. Serine-type endopeptidase activity | NK cells | Not Reported |
| 2. Endopeptidase activity | |||||
| 3. Creation of C4 and C2 activators | |||||
| SLAMF6 | 1q23.2-q23.3 | A type I transmembrane protein, belonging to the CD2 subfamily of the immunoglobulin superfamily | 1. Class I MHC mediated antigen processing and presentation | Mainly in NK cells, also existing in T and B cells | Not Reported |
| 2. Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell |
Summary for immune cell markers.
In summary, using an integrated machine learning strategy, mainly composed of mRMR and IFS, we analyzed scRNA-seq data from eight paired HCC tissues and normal liver tissues. A 21-gene-feature consisted of both cancer markers and immune cell markers was established. This feature was regarded as the core to distinguish HCC from normal liver tissue. Considering that the tissue obtained from clinical needle biopsy is often a mixture of tumor parenchyma, stroma and normal liver tissue, this 21-gene-feature might help both the clinical diagnosis of HCC and the identification of biopsy-obtained tissue types. Besides, given that these 21 genes, most of which had not been fully explored in HCC, were expressed in different parenchymal and mesenchymal cells, the following research might focus on their biological function and molecular mechanism in distinct HCC-related cell cluster.
Statements
Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.
Author contributions
JS and HZ designed the concept of this research. WL and JL conducted all the studies in this paper. WZ and WG validated and proofread data. WG wrote this paper. XJ and ZY revised this paper.
Funding
This article was supported by the following fundings 1) National Natural Science Foundation of China (No. 81803085); 2) Natural Science Foundation of Hunan Province, China (Nos 2019JJ40414, 2018JJ2612, 2019JJ50285, and 2020JJ4919); 3) Science and Technology Project of Changsha, Hunan, China (No. kq2004146); 4) Scientifific Research Project for Distinguished Young Scientists of Provincial Education Department of Hunan, China (No. 18B460).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2022.873218/full#supplementary-material
References
1
ChenH.ChenY.TsaiR.ChenM.ChenS.XiaoM.et al (2015). A sensitive and selective magnetic graphene composite-modified polycrystalline-silicon nanowire field-effect transistor for bladder cancer diagnosis. Biosens. Bioelectron.66, 198–207. 10.1016/j.bios.2014.11.019
2
ChengH.GarrickD.FernandoR. (2017). Efficient strategies for leave-one-out cross validation for genomic best linear unbiased prediction. J. Anim. Sci. Biotechnol.8, 38. 10.1186/s40104-017-0164-6
3
ChengQ.LiJ.FanF.CaoH.DaiZ.WangZ.et al (2020). Identification and analysis of glioblastoma biomarkers based on single cell sequencing. Front. Bioeng. Biotechnol.8, 167. 10.3389/fbioe.2020.00167
4
CraigA.von FeldenJ.Garcia-LezanaT.SarcognatoS.VillanuevaA. (2020). Tumour evolution in hepatocellular carcinoma. Nat. Rev. Gastroenterol. Hepatol.17 (3), 139–152. 10.1038/s41575-019-0229-4
5
FengY.ShenY.ChenH.WangX.ZhangR.PengY.et al (2018). Expression profile analysis of long non-coding RNA in acute myeloid leukemia by microarray and bioinformatics. Cancer Sci.109 (2), 340–353. 10.1111/cas.13465
6
GaoM.WangX.HanD.LuE.ZhangJ.ZhangC.et al (2020). A six-lncRNA signature for immunophenotype prediction of glioblastoma multiforme. Front. Genet.11, 604655. 10.3389/fgene.2020.604655
7
HaffnerM.EsopiD.ChauxA.GürelM.GhoshS.VaghasiaA.et al (2017). AIM1 is an actin-binding protein that suppresses cell migration and micrometastatic dissemination. Nat. Commun.8 (1), 142. 10.1038/s41467-017-00084-8
8
HoD.TsuiY.SzeK.ChanL.CheungT.LeeE.et al (2019). Single-cell transcriptomics reveals the landscape of intra-tumoral heterogeneity and stemness-related subpopulations in liver cancer. Cancer Lett.459, 176–185. 10.1016/j.canlet.2019.06.002
9
LiJ.LiangL.LiuY.LuoY.LiangX.LuoD.et al (2016). Clinicopathological significance of STAT4 in hepatocellular carcinoma and its effect on cell growth and apoptosis. Onco. Targets. Ther.9, 1721–1734. 10.2147/OTT.S100040
10
Manzo-MerinoJ.Lagunas-MartínezA.Contreras-OchoaC.LizanoM.Castro-MuñozL.Calderón-CoronaC.et al (2018). The human papillomavirus (HPV) E6 oncoprotein regulates CD40 expression via the AT-hook transcription factor AKNA. Cancers10 (12), E521. 10.3390/cancers10120521
11
MatthewsB. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta405 (2), 442–451. 10.1016/0005-2795(75)90109-9
12
McGlynnK.PetrickJ.El-SeragH. (2020). Epidemiology of hepatocellular carcinoma. Baltimore, Md: Hepatology.
13
MorganH.WangK.DohopolskiM.LiangX.FolkertM.SherD.et al (2021). Exploratory ensemble interpretable model for predicting local failure in head and neck cancer: The additive benefit of CT and intra-treatment cone-beam computed tomography features. Quant. Imaging Med. Surg.11 (12), 4781–4796. 10.21037/qims-21-274
14
NaultJ.VillanuevaA. (2020). Biomarkers for hepatobiliary cancers. Baltimore, Md: Hepatology.
15
OuyangG.YiB.PanG.ChenX. (2020). A robust twelve-gene signature for prognosis prediction of hepatocellular carcinoma. Cancer Cell. Int.20, 207. 10.1186/s12935-020-01294-9
16
PengH.LongF.DingC. (2005). Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell.27 (8), 1226–1238. 10.1109/TPAMI.2005.159
17
PengR.WangY.MaoL.FangF.GuanH. (2020). Identification of core genes involved in the metastasis of clear cell renal cell carcinoma. Cancer Manag. Res.12, 13437–13449. 10.2147/CMAR.S276818
18
PetrowskyH.FritschR.GuckenbergerM.De OliveiraM.DutkowskiP.ClavienP. (2020). Modern therapeutic approaches for the treatment of malignant liver tumours. Nat. Rev. Gastroenterol. Hepatol.17, 755–772. 10.1038/s41575-020-0314-8
19
RayM.WistowG.SuY.MeltzerP.TrentJ. (1997). AIM1, a novel non-lens member of the betagamma-crystallin superfamily, is associated with the control of tumorigenicity in human malignant melanoma. Proc. Natl. Acad. Sci. U. S. A.94 (7), 3229–3234. 10.1073/pnas.94.7.3229
20
RebouissouS.NaultJ. (2020). Advances in molecular classification and precision oncology in hepatocellular carcinoma. J. Hepatol.72 (2), 215–229. 10.1016/j.jhep.2019.08.017
21
SangroB.MeleroI.WadhawanS.FinnR.Abou-AlfaG.ChengA.et al (2020). Association of inflammatory biomarkers with clinical outcomes in nivolumab-treated patients with advanced hepatocellular carcinoma. J. Hepatol.73 (6), 1460–1469. 10.1016/j.jhep.2020.07.026
22
SatoY.KobayashiT.NishiumiS.OkadaA.FujitaT.SanukiT.et al (2020). Prospective study using plasma apolipoprotein A2-isoforms to screen for high-risk status of pancreatic cancer. Cancers12 (9), E2625. 10.3390/cancers12092625
23
SiegelR.MillerK.JemalA. (2020). Cancer statistics, 2020. Ca. Cancer J. Clin.70 (1), 7–30. 10.3322/caac.21590
24
SokołowskaA.ŚwierzkoA.GajekG.GołosA.MichalskiM.NowickiM.et al (2020). Associations of ficolins and mannose-binding lectin with acute myeloid leukaemia in adults. Sci. Rep.10 (1), 10561. 10.1038/s41598-020-67516-2
25
StegleO.TeichmannS.MarioniJ. (2015). Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet.16 (3), 133–145. 10.1038/nrg3833
26
SunH.XuJ.HuangQ.HuangM.LiK.QuK.et al (2018). Reduced CD160 expression contributes to impaired NK-cell function and poor clinical outcomes in patients with HCC. Cancer Res.78 (23), 6581–6593. 10.1158/0008-5472.CAN-18-1049
27
SzalaA.SawickiS.SwierzkoA.SzemrajJ.SniadeckiM.MichalskiM.et al (2013). Ficolin-2 and ficolin-3 in women with malignant and benign ovarian tumours. Cancer Immunol. Immunother.62 (8), 1411–1419. 10.1007/s00262-013-1445-3
28
TangF.BarbacioruC.NordmanE.LiB.XuN.BashkirovV.et al (2010). RNA-Seq analysis to capture the transcriptome landscape of a single cell. Nat. Protoc.5 (3), 516–535. 10.1038/nprot.2009.236
29
VermaatJ.GerritseF.van der VeldtA.RoessinghW.NiersT.OostingS.et al (2012). Validation of serum amyloid α as an independent biomarker for progression-free and overall survival in metastatic renal cell cancer patients. Eur. Urol.62 (4), 685–695. 10.1016/j.eururo.2012.01.020
30
WangG.SunD.LiW.XinY. (2020). AKNA is a potential prognostic biomarker in gastric cancer and function as a tumor suppressor by modulating EMT-related pathways. Biomed. Res. Int.2020, 6726759. 10.1155/2020/6726759
31
WangJ.HaoF.FeiX.ChenY. (2019). SPP1 functions as an enhancer of cell growth in hepatocellular carcinoma targeted by miR-181c. Am. J. Transl. Res.11 (11), 6924
32
YeY.ZhangR.ZhengW.LiuS.ZhouF. (2017). Rifs: A randomly restarted incremental feature selection algorithm. Sci. Rep.7 (1), 13013. 10.1038/s41598-017-13259-6
33
ZhangQ.HeY.LuoN.PatelS.HanY.GaoR.et al (2019). Landscape and dynamics of single immune cells in hepatocellular carcinoma. Cell.179 (4), 829–845. e20. 10.1016/j.cell.2019.10.003
34
ZhangW.KongH.GaoX.DongZ.LuY.HuangJ.et al (2020). Immune infiltration-associated serum amyloid A1 predicts favorable prognosis for hepatocellular carcinoma. World J. Gastroenterol.26 (35), 5287–5301. 10.3748/wjg.v26.i35.5287
35
ZhengY.HuangQ.DingZ.LiuT.XueC.SangX.et al (2018). Genome-wide DNA methylation analysis identifies candidate epigenetic markers and drivers of hepatocellular carcinoma. Brief. Bioinform.19 (1), 101–108. 10.1093/bib/bbw094
36
ZhouH.YinX.BaiF.LiuW.JiangS.ZhaoJ. (2020). The role and mechanism of S1PR5 in colon cancer. Cancer Manag. Res.12, 4759–4775. 10.2147/CMAR.S239118
37
ZouL.ChaiJ.GaoY.GuanJ.LiuQ.DuJ. (2016). Down-regulated PLAC8 promotes hepatocellular carcinoma cell proliferation by enhancing PI3K/Akt/GSK3β/Wnt/β-catenin signaling./β-catenin signaling. Biomed. Pharmacother. = Biomedecine Pharmacother.84, 139–146. 10.1016/j.biopha.2016.09.015
Summary
Keywords
hepatocellular carcinoma, ScRNA-seq, biomarkers, MRMR algorithm, support vector machine
Citation
Li W, Liu J, Zhu W, Jin X, Yang Z, Gao W, Sun J and Zhu H (2022) Identification of biomarkers for hepatocellular carcinoma based on single cell sequencing and machine learning algorithms. Front. Genet. 13:873218. doi: 10.3389/fgene.2022.873218
Received
10 February 2022
Accepted
12 August 2022
Published
24 October 2022
Volume
13 - 2022
Edited by
Jian-Guo Zhou, University of Erlangen Nuremberg, Germany
Reviewed by
Ankur Sharma, Harry Perkins Institute of Medical Research, Australia
Andrew Dellinger, Elon University, United States
Updates
Copyright
© 2022 Li, Liu, Zhu, Jin, Yang, Gao, Sun and Zhu.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Wenzhe Gao, gaowenzhexy@163.com; Jichun Sun, sunjichun_1978@163.com; Hongwei Zhu, zhw_0509@yeah.net
†These authors have contributed equally to this work and share first authorship
This article was submitted to Cancer Genetics and Oncogenomics, a section of the journal Frontiers in Genetics
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.