Robust Biomarker Screening Using Sparse Learning Approach for Liver Cancer Prognosis

LncRNAs, miRNAs, mRNAs, methylation, and proteins exert profound biological functions and are widely applied as prognostic features in liver cancer. This study aims to identify a prognostic biomarker signature for liver cancer. Samples with inadequate tumor purity were filtered out, and expression data were retrieved from different resources. The sparse learning approach was applied to select lncRNA, miRNA, mRNA, methylation, and protein features based on their differentially expressed groups. The LASSO boosting technique was employed to construct the predictive model. A total of 200 lncRNAs, 200 miRNAs, 371 mRNAs, 371 methylations, and 184 proteins were observed to be differentially expressed. Five lncRNAs, 11 miRNAs, 30 mRNAs, 4 methylations, and 3 proteins were selected for further evaluation using the feature elimination technique. The highest accuracy, 89.32%, was achieved by training with the sparse learning methodology. The final outcomes revealed that the 5 lncRNA, 11 miRNA, 30 mRNA, 4 methylation, and 3 protein signatures could be potential biomarkers for the prognosis of liver cancer patients.


INTRODUCTION
The liver is one of the largest organs in the human body; it is crucial for metabolism and contributes to detoxification and the maintenance of homeostasis. Many ailments affect the liver, including hepatitis, fibrosis, genetic and metabolic disorders, and liver cancer, which is one of the leading causes of cancer-related deaths (Bray et al., 2018;Dooley et al., 2018;Yang et al., 2019). Hepatocellular carcinoma (HCC) can arise in a diseased liver and involves numerous molecular cascades (Kanda et al., 2019). More than 90% of liver cancers are reported to be HCC, a highly heterogeneous form of cancer at both the molecular and histological levels, as verified by high-throughput sequencing and gene expression profiling (Calderaro et al., 2019).
Gene therapy has progressed as an effective means of correcting disease-causing gene defects to restore a normal state. The approaches employed to treat illness by gene therapy consist of gene replacement, gene repair, gene addition, gene silencing, vaccination, and, more recently, gene-editing technology (Alsaggar and Liu, 2016;Chew et al., 2016;Karimian et al., 2019). Thus, the identification of a gene that can be used as a potential biomarker is an important step in the treatment of liver cancer (West et al., 2019;Zheng et al., 2019;Lu et al., 2020;von Felden and Villanueva, 2020).
GRAPHICAL ABSTRACT | The graphical abstract depicts the methodology pipeline, which represents the workflow for the identification of prognostic biomarkers for liver cancer.
Statistical approaches such as artificial neural networks employing BI-RADS (Baker et al., 1995) and logistic regression have been used in several reports to improve diagnostic performance. Statistical approaches are beneficial because they improve the identification of breast cancer by combining BI-RADS with the clinical and statistical information on patients' risk factors (Chhatwal et al., 2009). Regression procedures suffer from overfitting when a large number of prognostic covariates are involved. In such situations, the regression model fits the training data well, but this fit does not carry over to real-world cases. Variable selection therefore becomes necessary to obtain accurate predictions when there are many covariates, for instance, BI-RADS descriptors and statistical data. It is well known that conventional stepwise selection methods perform poorly for regression models with numerous covariates (Houssami et al., 2004). On the other hand, sparse penalized methods, such as the least absolute shrinkage and selection operator (LASSO), have gained considerable attention. LASSO is a penalized regression technique that estimates the regression coefficients by maximizing the log-likelihood function (or minimizing the sum of squared residuals) under the constraint that the sum of the absolute values of the regression coefficients, $\sum_{j=1}^{k} |\beta_j|$, is less than or equal to a positive constant $s$. One of the most attractive properties of LASSO is that the estimated regression coefficients are sparse, meaning that many components are exactly 0. As a result, unnecessary covariates are removed automatically by LASSO. LASSO is therefore believed to have many of the properties required for regression models with a large number of covariates.
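The sparsity property described above can be illustrated with a minimal sketch. It uses the penalized (Lagrangian) form of LASSO, solved by coordinate descent with soft-thresholding, on simulated data; the data, the penalty value `lam`, and all function names are illustrative assumptions and are not taken from the study.

```python
import random

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n) * sum of squared residuals + lam * sum(|beta_j|).

    X is a list of rows and y a list of responses; no intercept is fitted.
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(k)]
    for _ in range(n_iter):
        for j in range(k):
            # Covariance of feature j with the residual that ignores feature j
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][m] * beta[m] for m in range(k) if m != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

# Simulated data (hypothetical): only the first two of five covariates matter.
rng = random.Random(0)
n, k = 100, 5
X = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
y = [3 * row[0] - 2 * row[1] + 0.1 * rng.gauss(0, 1) for row in X]

beta = lasso_coordinate_descent(X, y, lam=0.5)
print([round(b, 3) for b in beta])
```

In this toy setting the coefficients of the three irrelevant covariates are set exactly to zero, while the two informative ones are kept (shrunk toward zero by the penalty).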
Efficient optimization algorithms are widely available for the linear regression model as well as for generalized linear models. To our knowledge, this work is the first attempt to build a logistic LASSO regression model that could assist in the diagnosis of breast cancer based on statistical and radiological findings.
This study aimed to compare the performance of image analysis in predicting liver cancer depending on whether logistic LASSO regression or stepwise logistic (SL) regression was employed, and to evaluate the feasibility of integrating statistical data into the image analysis to improve liver cancer diagnosis.

Gene Expression Databases
The Cancer Genome Atlas (TCGA) catalog can be accessed to obtain information on gene alterations, long noncoding RNAs (lncRNAs), methylations, miRNAs, mRNAs, CNA, mutational expression, and proteins involved in HCC. TCGA is a freely accessible repository (Cerami et al., 2012;Gao et al., 2013). The cancer study "HCC were obtained from TCGA" and the data type priority "Mutation and CNA (DNA copy-number alterations)" were selected before analyzing genomic alterations of cell cycle control in the TCGA data on HCC. No statement of approval or informed consent was required because the information was retrieved from a public repository.

Genomic Alterations Summary
Genomic alterations of cell cycle control across tumor samples were summarized. Genomic alterations, including mutations and CNA (amplifications and homozygous deletions), were summarized using glyphs and color coding to show the variations in gene expression. This was the first step toward understanding the various types of gene signaling in HCC. Mutual exclusivity and co-occurrence among cell cycle control genes were studied as well. Mutually exclusive genomic events associated with a specific cancer tend not to occur together across a set of tumors, i.e., only a single genomic event is expected in each cancer sample. The other pattern is co-occurrence, in which several genes are altered in the same sample (Gao et al., 2013); this was a preliminary way to collect information on the various types of gene signaling in HCC.

Mutations in Cell Cycle Control in HCC
For the mutations of cell cycle control, the frequency and position of all mutations within Pfam protein domains were specified. Colored bars denote the full length of the cell cycle control proteins, and the gray base of each bar denotes the amino acid count. Protein domains are represented by boxes colored in red, blue, and green. The lines and points indicate the position and frequency of mutations. Frameshift or nonsense mutations are shown in red, missense mutations in green, and in-frame deletions in black (Fang et al., 2015).

Survival Analysis
Survival analysis is of great importance in prognosis because it highlights changes in the survival rate. Here, differences in overall survival were evaluated by survival analysis between samples with one or more alterations in the queried gene(s) and samples with no alteration.
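The comparison described above rests on a survival curve estimated per group. The following is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python; the follow-up times and event indicators are hypothetical toy values, not data from the study, and the function name is an assumption for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed death.

    times: follow-up time per patient; events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t
        at_risk -= leaving
    return curve

# Hypothetical toy cohorts: samples with and without an alteration.
altered = kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])
unaltered = kaplan_meier([6, 15, 22, 30, 30], [1, 0, 1, 0, 0])
print(altered)
print(unaltered)
```

A formal comparison of the two curves would additionally need a test such as the log-rank test, which is not implemented in this sketch.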

Statistics
For the correlation analysis, a scatter plot of lncRNA, methylation, miRNA, mRNA, CNA, mutational expression, or protein level in every sample was produced. The Kaplan-Meier approach with log-rank tests was used to compare overall and disease-free survival of HCC cases with at least one alteration against those without any alteration in the queried gene(s). Samples with up-regulation were identified by the threshold Z > 2 (expression more than 2 SDs above the mean). The significance level was set at 0.05.
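The Z > 2 rule can be sketched as follows. This is a minimal illustration that assumes the mean and SD are computed over all samples in the cohort (the reference population is not stated in the text); the sample identifiers and expression values are hypothetical.

```python
from statistics import mean, stdev

def upregulated_samples(expression, threshold=2.0):
    """Return the sample IDs whose expression Z-score exceeds the threshold."""
    mu = mean(expression.values())
    sd = stdev(expression.values())  # sample standard deviation
    return [s for s, v in expression.items() if (v - mu) / sd > threshold]

# Hypothetical expression values for one gene across ten samples.
expression = {
    "S1": 5.0, "S2": 5.2, "S3": 4.8, "S4": 5.1, "S5": 4.9,
    "S6": 5.0, "S7": 5.3, "S8": 4.7, "S9": 5.0, "S10": 12.0,
}
up = upregulated_samples(expression)
print(up)
```

Only the sample with the markedly elevated value passes the threshold in this toy cohort.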

Acquisition of Patient's Data
LncRNAs have emerged as potential features in the field of oncology. RNA-seq data were obtained from TCGA, while the exploration of lncRNAs in cancer is provided by the open-access web app "TANRIC." TANRIC (The Atlas of ncRNA In Cancer; https://ibl.mdanderson.org/tanric/_design/basic/index.html) allows rapid and intuitive analyses of lncRNAs in the context of experimental and other molecular information. Through TANRIC, a large number of lncRNAs with potential biomedical significance were identified, the majority of which show robust associations with established therapeutic targets and biomarkers across cell lines or tumor types. We retrieved lncRNA, miRNA, mRNA, methylation, and protein expression data from TANRIC (Li et al., 2015) for all the TCGA liver cancer patients (Ciriello et al., 2015). The corresponding clinical data were retrieved from the Genomic Data Commons (GDC; https://gdc.cancer.gov/).
FIGURE 2 | Differentially expressed genes on chromosomes in HCC (LIHC). Expression analysis reveals significant under- and overexpression of cell cycle control in HCC, indicating that these were hotspots for activation. The X-axis represents the over- and underexpression of genes, while the Y-axis indicates the chromosome number.
Frontiers in Bioengineering and Biotechnology | www.frontiersin.org
Purity estimation was performed for the patients using the consensus purity estimate and the Clonal Heterogeneity Analysis Tool (Li et al., 2012;Li and Li, 2014). Patients with purity estimates below 60% were filtered out.

Feature Identification for lncRNAs, miRNAs, mRNAs, Methylations, and Proteins
To identify promising discriminative lncRNAs, miRNAs, mRNAs, methylations, and proteins between the survival groups, differential expression of these features was analyzed with the R limma package.

Feature Selection of lncRNAs, miRNAs, mRNAs, Methylations, and Proteins
The differentially expressed lncRNAs, miRNAs, mRNAs, methylations, and proteins were used as input features for predictive modeling. Sparse learning was applied to select features. The sparse learning and LASSO methods ranked the features by their importance.

Predictive Modeling and Expression Landscape
We used sparse learning and LASSO to construct the predictive model of survival groups. LASSO is a powerful penalized regression method that is widely used in many biomedical tasks.
where b_i is the coefficient vector of the expressions of RNAs other than RNA i, | · | is the L-1 norm, and the residual is denoted as ε_i. The j-th element of b_i indicates a directed regulatory relationship from RNA j to RNA i in the linear model, where zero indicates no relationship between them. In contrast with correlation-based RNA regulatory networks, linear regression-based RNA regulatory networks can capture the main effects of multiple ...
where λ is a hyper-parameter for sparsity regularization, and || · ||_2 is the L-2 norm of a vector.
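The two display equations that these definitions refer to are not present in the text. A plausible form consistent with the definitions is sketched below. The symbols x_i (the expression vector of RNA i across samples) and X_{-i} (the matrix of expressions of all other RNAs) are assumptions introduced here for illustration; only b_i, ε_i, λ, and the two norms come from the text.

```latex
% Linear model for RNA i (assumed form)
x_i = X_{-i}\, b_i + \varepsilon_i

% L-1 penalized least-squares estimate of b_i (assumed form)
\hat{b}_i = \arg\min_{b_i} \; \lVert x_i - X_{-i}\, b_i \rVert_2^{2} + \lambda \, \lvert b_i \rvert
```

Under this reading, a larger λ drives more elements of b_i to exactly zero, which corresponds to fewer inferred regulatory relationships.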

RESULTS AND DISCUSSION
In this study, data from TCGA were used to explore, visualize, and analyze the genetic and clinical features of alterations in cell cycle control found in HCC cases. To our knowledge, this study is the first data mining approach to investigate the connection between alterations in cell cycle control and patients' prognosis. Many of the conclusions of this study are consistent with previously reported data. Notably, we found that alterations in cell cycle control are common in HCC. Alterations in these genes appear to lie on independent pathways to HCC and follow an uncommon pattern of accumulating gene changes. Although no cell cycle control gene was linked with any of the survival events (disease-free and overall survival) in this work, the study provides a fresh perspective for jointly investigating biological alterations and clinical features through data exploration.

Genomic Landscape and Alterations Summary
Based on the results, the majority of the cases carried alterations in cell cycle control, and nearly all of these were missense mutations. Others included deep deletions and a few amplifications. The remaining cases had alterations in cell cycle control that comprised mostly truncating and missense mutations. The mutual exclusivity analysis suggests that events in cell cycle control tended to co-occur in HCC, as shown in Figure 1 through principal component analysis.

Expression in Cell Cycle Control in HCC
The expression analysis reveals significant under- and overexpression of cell cycle control in HCC, suggesting that these were hotspots for activation, as illustrated in Figure 2.

Survival Analysis
To inspect the survival rate, Kaplan-Meier plots were used for survival analysis of HCC cases with and without cell cycle control overexpression. For overall survival, mutations in cell cycle control were found to be concurrent and were not linked to decreased overall survival (p = 0.0615). Likewise, none of the cell cycle control genes was linked with any of the survival events (Figure 3).

Liver Cancer Prognosis Markers and Expression Landscape
It was observed that lncRNAs, miRNAs, mRNAs, methylations, and proteins may exert a more profound biological impact than a single gene by virtue of their intrinsic regulatory nature. Therefore, predictive modeling was also performed for liver cancer.

Predictive Modeling
The optimization problem can be solved by a LASSO solution. For reliable RNA inference, we applied the Random LASSO (Lee et al., 2014). This technique is divided into two main steps: (1) generating importance measures of the features for the RNAs and (2) training LASSO with the features weighted by importance. The method uses a bootstrapping approach in which only a small subset of variables is trained at a time instead of all variables at once. The coefficient estimates are therefore more reliable in each individual training run on high-dimensional data (Figures 4, 5).
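The two steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the LASSO solver is a basic coordinate-descent routine without an intercept, the importance of a feature is taken as the absolute value of its average bootstrap coefficient, and the data, the penalty `lam`, the number of bootstraps, and all function names are hypothetical.

```python
import random

def soft_threshold(z, gamma):
    return z - gamma if z > gamma else z + gamma if z < -gamma else 0.0

def lasso(X, y, lam, n_iter=50):
    """Coordinate-descent LASSO (no intercept) on a list-of-rows matrix."""
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    resid = list(y)  # residual y - X @ beta for beta = 0
    for _ in range(n_iter):
        for j in range(k):
            col = [row[j] for row in X]
            sq = sum(v * v for v in col) / n
            if sq == 0.0:
                continue
            # Add feature j's contribution back, then refit it alone
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / sq
            resid = [r - c * (new - beta[j]) for c, r in zip(col, resid)]
            beta[j] = new
    return beta

def weighted_sample(rng, items, weights, size):
    """Draw `size` distinct items with probability proportional to weights."""
    items, weights = list(items), list(weights)
    chosen = []
    for _ in range(size):
        pick = rng.choices(range(len(items)), weights=weights)[0]
        chosen.append(items.pop(pick))
        weights.pop(pick)
    return chosen

def bootstrap_fit(rng, X, y, lam, n_feat, weights):
    """Fit LASSO on a bootstrap sample of rows and a small subset of features."""
    n, p = len(X), len(X[0])
    rows = [rng.randrange(n) for _ in range(n)]
    feats = weighted_sample(rng, range(p), weights, n_feat)
    Xb = [[X[i][j] for j in feats] for i in rows]
    yb = [y[i] for i in rows]
    coef = [0.0] * p  # features that were not drawn keep a zero coefficient
    for j, b in zip(feats, lasso(Xb, yb, lam)):
        coef[j] = b
    return coef

def random_lasso(X, y, lam=0.3, n_boot=30, n_feat=3, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    # Step 1: importance measures from uniformly sampled feature subsets
    fits = [bootstrap_fit(rng, X, y, lam, n_feat, [1.0] * p) for _ in range(n_boot)]
    importance = [abs(sum(f[j] for f in fits)) / n_boot for j in range(p)]
    # Step 2: feature subsets sampled in proportion to importance
    weights = [w + 1e-6 for w in importance]  # keep every weight positive
    fits = [bootstrap_fit(rng, X, y, lam, n_feat, weights) for _ in range(n_boot)]
    coef = [sum(f[j] for f in fits) / n_boot for j in range(p)]
    return importance, coef

# Simulated data (hypothetical): only feature 0 drives the response.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(80)]
y = [2.0 * row[0] + 0.1 * data_rng.gauss(0, 1) for row in X]
importance, coef = random_lasso(X, y)
print([round(v, 3) for v in importance])
print([round(v, 3) for v in coef])
```

Because each bootstrap run fits only `n_feat` variables, every individual LASSO fit is low-dimensional even when the full feature set is large, which is the motivation given above for the approach.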

Validation of Biomarkers
Based on literature reports, an initial principal component analysis leads to separation. The first question we asked was therefore whether the whole classification is feasible using a standardized and shared dataset. Hence, to evaluate our data for consistency and applicability, a fivefold leave-group-out cross-validation was employed using LASSO and sparse learning. All the datasets were examined separately, so each dataset was used as a training set with class labels corresponding to the overall receptor status, for references versus liver cancer. The receptor status of all the untrained datasets was then predicted using the resulting sparse learning and LASSO models. The accuracy of classifying the patients' data with the sparse learning and LASSO models was high. Matthews correlation coefficients were reasonable and error rates were low (0.1) for the classification of references versus liver cancer. The bootstrap sampling was modified during model construction to address class imbalance in the various datasets by drawing an equal number of samples from each group.
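Two of the ingredients mentioned above, the Matthews correlation coefficient and the class-balanced bootstrap, can be sketched as follows. The labels and predictions are hypothetical toy values, and drawing as many samples per class as the smallest class contains is an assumption made here for illustration.

```python
import math
import random

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def balanced_bootstrap(labels, rng):
    """Draw, with replacement, the same number of indices from each class."""
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    size = min(len(members) for members in by_class.values())
    sample = []
    for members in by_class.values():
        sample.extend(rng.choice(members) for _ in range(size))
    return sample

# Hypothetical predictions: 1 = liver cancer, 0 = reference.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
error_rate = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
print(mcc, error_rate)

# Imbalanced toy labels: six cancer samples and two references.
labels = [1] * 6 + [0] * 2
sample = balanced_bootstrap(labels, random.Random(0))
print(sample)
```

The Matthews correlation coefficient is useful alongside the error rate because it remains informative when the two classes differ in size.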

CONCLUSION
Detecting cancer at an early, curable phase and removing the affected tissue can prevent the progression to lethal invasive cancers, which would save countless lives. It is now widely reported that lncRNAs, mRNAs, and miRNAs could be potential biomarkers for various cancers. Identifying disease-related lncRNAs, mRNAs, and miRNAs improves the understanding of diagnosis and pathogenesis. Numerous computational models have therefore been developed to investigate the disease associations of lncRNAs, mRNAs, and miRNAs. Nevertheless, only a few studies have focused on identifying lncRNA, mRNA, and miRNA signatures for the diagnosis of early-stage liver cancer. Consequently, in the current study, we put forward a new classification technique based on lncRNAs, mRNAs, and miRNAs for distinguishing the early and advanced phases of liver cancer. The growing use of machine learning methods and the latest developments in personalized medicine have improved cancer prediction. Different machine learning techniques and feature selection algorithms are widely used to identify the main factors that could influence cancer development, recurrence, and survival. In general, machine learning-based cancer prediction studies use mRNA/miRNA expression profiles, clinical factors, and histological variables as input. Success in developing computational models for cancer prediction rests on understanding the biological information and the shortcomings of the training dataset, for instance, a small number of high-dimensional samples, known as the "curse of dimensionality." Nevertheless, the overfitting problem may be overcome through appropriate feature selection and cross-validation approaches.
Our findings provide a new perspective for exploring the biological functions of lncRNAs, miRNAs, mRNAs, methylations, and proteins in liver cancer, and the screened novel potential biomarker signature (lncRNAs, miRNAs, mRNAs, methylations, and proteins) could serve as a biomarker for the prognosis of liver cancer patients. The logistic LASSO regression model showed better performance for liver cancer, with a significant improvement in prediction.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation, to any qualified researcher.

AUTHOR CONTRIBUTIONS
AK, XD, and D-QW designed the experiments. XD, D-QW, and AK performed the entire computational experiments and assisted in writing the manuscript. D-QW and AK analyzed the data and wrote the manuscript. AK, D-QW, XD, and AM read the manuscript and advised on method development. All authors approved the final version of the manuscript.