ESRRG, ATP4A, and ATP4B as Diagnostic Biomarkers for Gastric Cancer: A Bioinformatic Analysis Based on Machine Learning

Based on multiple bioinformatics methods and machine learning techniques, this study was designed to explore potential hub genes of gastric cancer with a diagnostic value. The novel biomarkers were detected through multiple databases of gastric cancer–related genes. The NCBI Gene Expression Omnibus (GEO) database was used to obtain gene expression files. Three hub genes (ESRRG, ATP4A, and ATP4B) were detected through a combination of weighted gene co-expression network analysis (WGCNA), gene–gene interaction network analysis, and supervised feature selection method. GEPIA2 was used to verify the differences in the expression levels of the hub genes in normal and cancer tissues in the RNA-seq levels of Genotype-Tissue Expression (GTEx) and The Cancer Genome Atlas (TCGA) databases. The objectivity of potential hub genes was also verified by immunohistochemistry in the Human Protein Atlas (HPA) database and transcription factor–hub gene regulatory network. Machine learning (ML) methods including data pre-processing, model selection and cross-validation, and performance evaluation were examined on the hub-gene expression profiles in five Gene Expression Omnibus datasets and verified on a GEO external validation (EV) dataset. Six supervised learning models (support vector machine, random forest, k-nearest neighbors, neural network, decision tree, and eXtreme Gradient Boosting) and one semi-supervised learning model (label spreading) were established to evaluate the diagnostic value of biomarkers. Among the six supervised models, the support vector machine (SVM) algorithm was the most effective one according to calculated performance metrics, including 0.93 and 0.99 area under the curve (AUC) scores on the test and external validation datasets, respectively. Furthermore, the semi-supervised model could also successfully learn and predict sample types, achieving a 0.986 AUC score on the EV dataset, even when 10% samples in the five GEO datasets were labeled. In conclusion, three hub genes (ATP4A, ATP4B, and ESRRG) closely related to gastric cancer were mined, based on which the ML diagnostic model of gastric cancer was conducted.


Data Collection and Preprocessing
The study design is shown in Figure 1. This systematic study comprehensively downloaded six datasets from the Gene Expression Omnibus (GEO) database and focused on the gene sequencing results of GC patients with each dataset containing more than 10 samples. These datasets were produced using three different microarray platforms: Affymetrix Human Genome U133 Plus 2.0 Array, Affymetrix Human Exon 1.0 ST Array, and Affymetrix Human Genome U133A Array. Raw data of these datasets were preprocessed by R packages "oligo" (Carvalho and Irizarry, 2010) and "affy" (Gautier et al., 2004), and then, the background was corrected and normalized through the Robust Multichip Average (RMA) function. In this study, GSE66229 was used to construct a weighted gene coexpression network due to the sufficient data and detailed clinical characteristics of the gastric cancer samples. Five datasets (GSE19826, GSE27342, GSE29272, GSE54129, and GSE66229) were combined into a total dataset (TD) which contains 780 samples and 11,181 genes for feature selection and building ML models. TD includes 435 tumor samples and 345 normal ones, i.e., a mild imbalanced dataset. The combat algorithm in the "sva" R package (Johnson et al., 2007) was used to eliminate batch effects between different platforms and experiments. GSE33335 acts as an independent dataset, based on which an external validation (EV) was performed to validate the authenticity of hub genes and the reproducibility and generalizability of the ML diagnostic models. Details of all datasets can be found in Supplementary Table S1.

WGCNA
The R package "WGCNA" (Langfelder and Horvath, 2008) was constructed to detect gene modules, and the correlation of each module with sample type was evaluated. The specific steps are as follows: (a) in the GSE66229 dataset, only normal and cancer samples from the same individuals (196 samples) were selected for further analysis. Then, the 196 samples were divided into "tumor" and "normal" groups according to their clinical records, with each group containing 98 samples; (b) the samples were clustered by the "hclust" function to detect the outliers. After employing the "hclust" function to the expression matrix evaluated by the average method, 35 offending samples were removed with a height cut at 125; (c) the best scale-free topology fitting index (soft threshold) was selected as 7 to achieve a higher average network connectivity with a scale-free fitting number β 0.86 ; d) the adjacency matrix was transformed into a topological overlap matrix (TOM) to define the gene co-expression similarity; (e) Based on the dissimilarity measured by TOM, the "hclust" algorithm was employed for gene hierarchical clustering; (f) the optimal module size was set as 30, and the dynamic tree was used to cut the identification module; (g) after each module was determined based on the signature gene expression profile and the sample type of patients, the correlation of the module signature genes with sample types was also determined.

Identification of the WGCNA Hub Genes
Cytoscape (Shannon et al., 2003) was used to visualize the coexpression network in the modules of the highest correlations. All genes in the selected modules were exported to Cytoscape and analyzed with the "NetworkAnalyzer" plugin (Assenov et al., 2007), which can give a comprehensive set of topological parameters for gene-gene interaction networks. Hub genes are defined as genes with high connectivity in the gene-gene interaction network. According to connectivity, i.e., node degrees in the output of "NetworkAnalyzer", the top-ranked 10% genes in the two most significant modules "red" and "turquoise" were selected (Fuxman Bass et al., 2013), which may have important implications for the progression of gastric cancer.

Supervised Feature Gene Selection With the Fisher Score Algorithm
The feature selection technique is a process of reducing the number of variables, especially important for developing a predictive model (Ali et al., 2018). The feature selection method can evaluate the relationship between each variable and the output and select those variables with the strongest relationship. Fisher score is one of the most widely used supervised feature selection methods, which returns the ranks Frontiers in Physiology | www.frontiersin.org June 2022 | Volume 13 | Article 905523 3 of the variables based on the Fisher score in the descending order (Gu et al., 2011). The Fisher score S i of the i-th feature is calculated as follows: where μ ij and σ ij are the mean and standard deviation of the i-th feature in the j-th class, respectively. n j is the number of samples in the j-th class, and μ i is the mean of the i-th feature. In this study, to select the most relevant genes that are strongly related to the sample type, feature selection using the Fisher score algorithm was applied to the combined five datasets. Here, a gene was regarded as a feature, and TD was splitted into five folds during feature selection. A list of genes ranked by their scores returned in each fold, where we picked the top-ranked 10 feature genes with a cutoff at S i ≈ 0.5 for each list for further study. The feature genes were determined as the intersection of the features of five folds. The final biomarkers in this study were obtained by the intersection of the hub genes filtered by the gene-gene interaction network and the feature genes.

Validation of the Final Hub Genes
GEPIA2 can be used to verify the expression difference of the hub genes in tumor samples and normal ones (Tang et al., 2019). The RNA-seq datasets used in GEPIA2 were based on UCSC Xena (http://xena.ucsc.edu), which was computed by standard pipelines to analyze the RNA-sequencing expression of tumor and normal samples from the TCGA (Colaprico et al., 2016) and GTEx (Lonsdale et al., 2013) datasets. In this study, we used the TCGA and GTEx gastric cancer RNA-seq data integrated by the GEPIA2 platform for a comprehensive validation. With |Log 2 FC| cutoff = 1 and p-value cutoff = 0.01, box plots of the RNA-seq data of the gastric cancer hub genes were drawn.
The immunohistochemistry (IHC) staining data for this study were downloaded from the Human Protein Atlas (HPA) database (Thul and Lindskog, 2018), and then, the results of gastric cancer pathology and normal gastric tissue were processed.
The Cytoscape plugin "iRegulon" was used to analyze the transcription factors regulating hub genes (Janky et al., 2014;Gao et al., 2020). This plugin predicts transcription factors by using the motif enrichment analysis and using track discovery in a set of regulated genes. The cutoff criteria were as follows: enrichment score threshold = 3.0, receiver operating characteristic (ROC) threshold for area under the curve (AUC) calculation = 0.03, rank threshold = 5,000, minimum identity between orthologous genes = 0.0, and false discovery rate (FDR) = 0.001. After all transcription factors were outputted, the factor which regulates all hub genes and ranks first in the normalized enrichment score (NES) was defined as the most relevant transcription factor.

Supervised Learning
At the first step, TD was randomly split into training and test datasets, with ratios of 80 and 20%, respectively. Then, a repeated stratified k-fold cross-validation was performed on the training dataset. The stratified k-fold can ensure that each fold has the same proportion of the sample type compared to the whole one, which is more suitable to imbalance datasets. The ML model was trained using of k-1 folds and validated on the one remaining fold for k times. The training performance of the model was reported on the average over k times. At last, a final evaluation was performed on the test dataset. The aforementioned steps can be regarded as an internal validation since both training and test datasets come from TD. To examine the robustness of the ML models, an EV was further performed on the independent dataset GSE33335.
In this study, k was set to 10, and the cross-validation was repeated 100 times with different randomizations in each repetition to ensure the estimated performance. The Matthews correlation coefficient (MCC) metric (Chicco and Jurman, 2020) was chosen as the performance score for the model evaluation during the training process, which is suitable to imbalance datasets.
To further reduce the affection of the dataset imbalance, the synthetic minority oversampling technique (SMOTE) was applied to the training dataset (Chawla et al., 2002). The SMOTE can synthesize new samples based on randomly picked existing samples and their k-nearest neighbors. In this study, a grid search of k ranging from 1 to 7 was also performed.  In order to select a proper classifier for the ML diagnosis model, six widely used algorithms, namely, support vector machine (SVM) (Byvatov and Schneider, 2003), k-nearest neighbors (KNN) (Zhang, 2016), decision tree (DT) (Chen et al., 2011), random forest (RF) (Chen and Ishwaran, 2012), neural network (NN) (Lancashire et al., 2009), and eXtreme Gradient Boosting (XGB) (Chen and Guestrin, 2016), were examined through their performance metrics for classification results. Hyperparameters of all the models were finely tuned using the scikit-learn GridSearchCV method, according to the highest "MCC" scores. The best model for each algorithm was selected after exploration of the whole grid. Best hyperparameters and the corresponding training performances of all supervised ML diagnostic models can be found in Table 1. Finally, the performance of each model on test and EV datasets was evaluated by these performance metrics: accuracy (Heidaryan, 2018), specificity (Altman and Bland, 1994), sensitivity (Altman and Bland, 1994), precision (Heidaryan, 2018), F1 score (Chicco and Jurman, 2020), and MCC. Furthermore, the ROC curve and AUC are also given.

Semi-Supervised Learning
To deal with a different problem, such as handling large amounts of samples with only a few diagnosed ones, a semi-supervised learning model based on a label-spreading algorithm is also examined (Zhou et al., 2003). Semi-supervised learning can learn from small amounts of labeled samples, combined with the use of unlabeled data to better capture the underlying properties and generalize better to new samples (Chapelle et al., 2009). To some extent, semi-supervised learning can be regarded as a hybrid of supervised and unsupervised learning. In this study, TD was randomly split into a labeled dataset and an unlabeled one, with five unlabeled ratios including 50, 60, 70, 80, FIGURE 3 | Heatmap of the relationship between module eigengenes and clinical traits of GSE66229. WGCNA labeled heatmaps for GSE66229, each row represents a module characteristic gene encoded by color, and the three columns represent clinical characteristics of overall survival time (OST), overall survival status (OSS), and sample type, respectively. Each cell represents the Pearson correlation coefficient and p-value (in parentheses) of the corresponding module characteristics, and the color of each cell represents the value of correlation.
Frontiers in Physiology | www.frontiersin.org June 2022 | Volume 13 | Article 905523 6 and 90%. For each ratio, the semi-supervised model was crossvalidated by 100 times of random permutations and further evaluated on the EV dataset GSE33335. The performance metrics of prediction on the unlabeled dataset and EV dataset are given.
All supervised and semi-supervised ML models in this study were implemented by Python language programming on Intel Xeon silver 4110 CPU.

Construction of the Gene Co-Expression Network
In order to find the correlation between clinical features and genes, this study used the R "WGCNA" package to construct 20,862 genes and 161 samples in the GSE66229 dataset into a gene network. A sample clustering figure was plotted (Figure 2A). To guarantee a scale-free topology and zero mean connectivity, the threshold was determined to be 7 ( Figure 2B). The dissimilarity of the modules was set as 0.2, and a total of 14 modules were generated ( Figure 2C). Two modules (red: r = 0.73 and P = 5e-28; turquoise: r = -0.84 and P = 6e-44) with the positive and negative highest correlations were acquired as the significant modules for subsequent analyses (Figure 3).

Feature Gene Selection
In order to select critical genes to the diagnostic model, feature selection was performed for the combined five datasets using the Fisher Score method on five folds. In each fold, a cutoff around the Fisher Score S i ≈ 0.5 s was applied, and as a result, 10 genes with the highest scores were selected. All selected genes as well as their Fisher Scores are listed in Table 1. At last, the intersection of all picked genes in the five folds is investigated, resulting in six intersection elements: TIMP1, ATP4A, ESRRG, CBLIF, ATP4B, and INHB.

Identification and Validation of Hub Genes
In the results of WGCNA, two significant models, the red and turquoise ones, were exported to Cytoscape. Two gene-gene interaction networks were constructed and analyzed in Cytoscape. Then, the top 10% target genes of each network were selected, according to the connectivity degree. As a    result, 30 and 330 genes in the "red" and "turquoise" modules were selected. The gene-gene network of 30 genes in the "red" module is shown in (Figure 4), while genes in the "turquoise" module are listed in Supplementary material S1. Together with the six feature genes selected through the Fisher Score method, three hub genes (ESRRG, ATP4A, and ATP4B) were finally selected in the red module, while none was selected in the turquoise module.
The expression of three genes in cancer and normal samples was validated in GEPIA2. The box plot of GEPIA2 presented the expression levels of the three genes in the standard of expression-log2 (TPM+1) ( Figure 5A). We observed that the expressions of ESSRG, ATP4A, and ATP4B in tumor samples were significantly lower than those in normal ones. This study also performed an IHC analysis in the gastric data of hub genes from the HPA database. The results of IHC staining are shown in Figure 5B, which were consistent with GEPIA2.
Among all transcription factors regulating the three hub genes, FOXA1 with the highest NES (NES = 5.142) was considered as the most important transcription factors ( Figure 5C). Existing studies have shown that the expression of FOXA1 affects the proliferation and invasion of gastric cancer cells (Lin et al., 2018;Dai Y. et al., 2021). The result verified the objectivity of the three hub genes in gastric cancer.

Establishment and Validation of the Machine Learning Model
After going through the hyperparameter grid and the SMOTE grid, the best model was selected according to the MCC metric. The corresponding hyperparameters, k-nearest neighbors in the SMOTE, and values of MCC on the training dataset are listed in Table 2. One might see that the SVM model has the best performance with an average MCC score at 0.666 ± 0.093.With the fixed hyperparameters, the performance of all six ML diagnostic models on the test dataset is shown in Figure 6. The trained SVM had the highest accuracy with 89.1%, while the RF showed the lowest but a close accuracy with 85.3% ( Figure 6A), which demonstrated the robustness of both hub genes and ML methods.
Based on the results, the weakest performance of the sensitivity metric was the KNN algorithm with a ratio of 81.7% ( Figure 6A). As a contrast, the SVM algorithm again had the highest sensitivity of 93.9%, showing the great ability for predicting tumor samples ( Figure 6A). For specificity, the NN algorithm had the best performance with a 90.5% specificity to predict normal samples. The RF algorithm had the lowest specificity of 82.4%. The SVM algorithm had the second lowest specificity of 83.8% ( Figure 6A). These results demonstrated the six models have both advantages and disadvantages.
MCC and F1 scores could serve as more reliable metrics which involve all four terms: true positive (TP), true negative (TN), false positive (FP), and false negative (FN) in the confusion matrix. According to the ratios of the MCC and F1 scores, the SVM should be the best model with 78.4 and 89%, respectively ( Figure 6A).
ROC curves for all six ML classification diagnostic assistants were built using the predicted probability of belonging to different classes. Except for the AUC of DT having the lowest value of 85%, all the other models had an AUC of 93-95% (Figure 7).
The prediction performance of six ML diagnostic assistants was further evaluated on the EV dataset (25 tumor samples and 25 normal ones). The results showed the six models can classify all the normal samples correctly with specificity and precision both equaling to 1 ( Figure 8B); however, the prediction on tumor samples varies. SVM and NN have the best performances on successively predicting 23 tumor samples ( Figure 6D). As a result, SVM and NN share the highest MCC and F1 scores of 92.3 and 96%, respectively ( Figure 6B). The AUC of SVM and NN on EV is 99% (Figures 7A, D). Therefore, one can conclude that the SVM model based on the expression profiles of three hub genes may have a potential diagnostic value for gastric cancer.

Semi-Supervised Diagnostic Model
Semi-supervised ML can learn from a combination of small amounts of labeled samples and large amounts of unlabeled ones, which is especially suitable for the scenario of annotating large amounts of samples with expensive costs or miscellaneous steps. In this study, the label spreading (LS) algorithm was tested on 50% (LS50), 60% (LS60), 70% (LS70), 80% (LS80), and 90% (LS90) randomly unlabeled samples in TD. Each learning model was cross-validated 100 times with random permutation. The results shown in Figure 8 demonstrate that the LS algorithm can successfully learn and predict the sample type even when small amounts of labeled data are available. The mean MCC and F1 scores are 0.649 ± 0.029 and 0.824 ± 0.015, respectively, with 50% unlabeled samples. As the ratio of unlabeled samples increases, the performance of the LS slightly decreases. However, with 90% unlabeled data, the LS90 model still has mean MCC and F1 scores of 0.635 ± 0.020 and 0.816 ± 0.012, respectively. Furthermore, all LS models achieved a good prediction performance for the EV dataset, for example, the LS90 model has mean MCC and F1 scores of 0.845 ± 0.066 and 0.919 ± 0.037, respectively.

DISCUSSION AND CONCLUSION
Gastric cancer is still a major disease threatening human health, so it is particularly important to find a comprehensive and effective set of biomarkers with diagnostic values. This study systematically used a series of bioinformatics methods to select key features, i.e., hub genes, which were further confirmed by both, the GEPIA2 tool and IHC experiments. The transcription factor-hub gene regulatory network confirmed that three genes are closely associated with gastric cancer in the level of transcription factors. Based on these features, ML diagnostic assistants for the diagnosis of gastric cancer were established by both supervised and semi-supervised learning. The performance of the ML models on the EV dataset further approves the potential diagnostic ability.
In this study, five GEO datasets were downloaded for construction, and one independent GEO dataset GSE33335 was used for external validation. Comprehensive data collation can make the construction of diagnostic assistants more objective (Ahluwalia et al., 2021;Dai W. et al., 2021;Ye et al., 2021). GSE66229 was used for the WGCNA analysis. WGCNA is a widely used target therapy analysis tool, which clusters related genes, according to some clinical characteristics of research subjects. There have been many studies on gastric cancer tumor markers in recent years, and most of the WGCNA clusterings are based on differentially expressed genes (DEGs) in the research dataset Xiang et al., 2021;Zhang et al., 2022). In contrast, this study performed the WGCNA analysis on all gastric cancer-related genes in one dataset and fused them with selected features using a supervised learning method, i.e., Fisher score algorithm on five combined datasets, preserving the diversity of the gastric cancer hub genes. This also reduces the hub gene bias caused by clustering with a certain clinical feature traditionally (Yang et al., 2022). We put WGCNAsignificant modules into Cytoscape to construct a gene-gene interaction network. Previous research shows that gene-gene interaction networks can reveal the principle and mechanisms of cancer (Zeng et al., 2013;Rana et al., 2020). In order to enhance the objectivity and authenticity of the hub genes, genes that are highly associated with gastric cancer screened by the gene-gene interaction network were intersected with the selected feature genes.
Three hub genes were crucial to the next machine learningbased bioinformatics approach. Although no studies used them as combined biomarkers for gastric cancer diagnosis, some studies have screened these genes in the identification of gastric cancer biomarkers and explored them to a certain extent in the field of human and animal experiments on gastric cancer (Lozano-Pope et al., 2017;Peng et al., 2020;Liu et al., 2021). ESRRG belongs to the estrogen-related receptor family. In one aspect, it has been classified that ESRRG inhibits the occurrence of gastric cancer by inhibiting the Wnt pathway by activating DY131 (Kang et al., 2018). In another, ESRRG can directly bind to the TFF1 promoter, which is a recognized tumor suppressor and inhibits Helicobacter pylori infection (Kang et al., 2021). Helicobacter pylori infection is a common cause of chronic atrophic gastritis, which is a precancerous lesion (Rolig et al., 2012). ATP4A and ATP4B belong to a family of P-type cationtransporting ATPases. These two genes belong to the gastric proton pump and are antigens of gastric parietal cells, which are diagnostic markers for immune gastric lesions including atrophic gastritis. Through in vivo and in vitro experiments in animals and humans, researchers have found that ATP4A and ATP4B were partially or fully methylated in gastric cancer cells. It was also verified that the reactivation and demethylation of ATP4A and ATP4B can effectively inhibit the progression of gastric cancer (Lin et al., 2017;Cao et al., 2020). Hence, ATP4A and ATP4B are important tumor suppressor genes.
Six supervised diagnostic models and one semi-supervised diagnostic model were developed based on different algorithms including SVM, RF, KNN, DT, NN, and XGB (supervised), and LS (semi-supervised). The performance was evaluated by seven metrics, namely, accuracy, specificity, sensitivity, precision, MCC, F1 score, and AUC. All the models were trained through crossvalidation and further examined on the EV dataset GSE33335. The results suggested that SVM and LS can serve as the most appropriate algorithm for prediction. For example, LS90 can learn from only 10% of labeled data and achieve 0.906 ± 0.008 and 0.986 ± 0.007 AUC scores for 90% unlabeled data and the EV dataset. Therefore, this study demonstrates the potential ability of the ML diagnostic model created with three hub gene expression profiles of 780 samples.
In recent years, bioinformatics analyses based on machine learning have been popularly used in individual medicine. For example, multi-classifiers and deep neural networks are being applied in cancer research (Huang et al., 2018;Zhang et al., 2021). Comparing to previous studies, our research may be more robust in model development and evaluation. First, we included five datasets with 780 samples in the model development and internal validation. Second, we also used an independent dataset only for external validation. Huang et al. (2018) applied multi-classifiers to select gastric cancer-related miRNAs in one dataset and validate their performance in another two datasets. Huang et al. and our team both explored the application of SVM in the diagnosis of gastric cancer. Their SVM diagnostic model's AUC was 95% in the training dataset, which is slightly higher than our corresponding AUC (93%). However, their model achieved a biased performance on the two valid datasets: one was 97%, while the other was less than 80%. Relatively fewer samples in model development may be responsible for this performance. Moreover, their two validation datasets were also involved in biomarker selection; thus, they might not be totally independent. Compared with the WGCNA and network control analyses used in our study to screen potential cancer-related genes, Zhang et al. (2021) fused gene expression data and DNA methylation data to obtain relatively more biomarkers for training their deep neural networks. On one hand, their study got an extremely high performance in six metrics. The accuracy, precision, recall, F1 score, and AUC value were all around 99%. On the other hand, the absence of an external validation report makes the generalization ability of their study remain unclear.
More several strengths of this study should be emphasized. First of all, data sources in this study come from Asia. Consistency in data sources may strengthen the pertinence of the model. Second, rich data in six datasets are sorted and then integrated into a comprehensive one to build an objective and effective diagnostic model. Third, hub genes selected from three robust methods were used in combination (WGCNA, gene-gene interaction network, and feature gene selection). Fourth, the selected hub genes are multiplevalidated by GEPIA2, HPA databases, and transcription factor-hub gene regulatory network, the results of which further confirm the importance of the selected biomarkers. Finally, the diagnostic model is improved with the SMOTE and passes advanced machine learning analysis on an EV dataset and presented more convincing statistical results than previous studies. This study still has some flaws. First, this study deserves to be verified by subsequent independent experiments. Second, although comprehensive bioinformatics analyses were conducted in this study, an in-depth mechanistic study of three hub genes had not been advanced.
Finally, this study systematically established a gastric cancer diagnostic assistant based on multi-database bioinformatics and machine learning analysis. Our results have a moderate effect on auxiliary diagnosis. We expect future research to test the stability of the model.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary Material.