Identification of Early Recurrence Factors in Childhood and Adolescent B-Cell Acute Lymphoblastic Leukemia Based on Integrated Bioinformatics Analysis

Over the past 50 years, great progress has been made in the diagnosis and treatment of acute lymphoblastic leukemia (ALL), especially in pediatric patients. However, early recurrence is still an important threat to the survival of patients. In this study, we used integrated bioinformatics analysis to look for biomarkers of early recurrence of B-cell ALL (B-ALL) in childhood and adolescent patients. Firstly, we obtained gene expression profiles from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) database and the Gene Expression Omnibus (GEO) database. Then, we identified differentially expressed genes (DEGs) based on whether the disease relapsed early. LASSO and Cox regression analysis were applied to identify a subset of four genes: HOXA7, S100A11, S100A10, and IFI44L. A genetic risk score model was constructed based on these four optimal prognostic genes. Time-dependent receiver operating characteristic (ROC) curves were used to evaluate the predictive value of this prognostic model (3-, 5-, and 10-year AUC values >0.7). The risk model was significantly associated with overall survival (OS) and event-free survival in B-ALL (all p < 0.0001). In addition, a high risk score was an independent poor prognostic risk factor for OS (p < 0.001; HR = 3.396; 95% CI: 2.387–4.832). Finally, the genetic risk model was successfully tested in B-ALL using an external validation set. The results suggested that this model could be a novel predictive tool for early recurrence and prognosis of B-ALL.


INTRODUCTION
Acute lymphoblastic leukemia is the most common cancer in children, and includes both B-cell and T-cell lineages. B-ALL accounts for about 85% of pediatric ALL (1). With advances in chemotherapy and hematopoietic stem cell transplantation, the cure rate in childhood ALL is currently around 90% (2). The prognosis for adolescent patients aged 15 to 20 years receiving pediatric protocols is similar to that of children, with a 5-year OS of 87.9% (3). In spite of significant progress in long-term survival, 15-20% of patients will suffer a recurrence, which is an important factor for increases in mortality (4). Relapses occurring less than 18 months after diagnosis are defined as very early relapses, relapses appearing between 18 months of initial diagnosis and 6 months after cessation of frontline therapy is regarded as an early relapse, and relapses developing after 6 months of cessation of frontline treatment are classified as late relapses (5). The time from diagnosis to relapse is an independent risk factor for overall survival (6), and the risk ratios for adverse outcome in patients with late, early, and very early relapse were 1, 2.4, and 2.9, respectively (p < 0.001) (5). Patients with late recurrence had better survival rates than those who relapsed earlier (OS: 45-73% vs. 22-38%) (7)(8)(9). Thus, identifying patients with a high risk of earlier relapse is crucial to improving prognosis.
Genomic abnormalities have been confirmed to be related to treatment response and disease relapse in acute leukemia (10)(11)(12). An understanding of genetic abnormalities can guide the selection of subsequent regimens after the induction of remission. Technological progress has led to more detailed genetic profiling of leukemia, including DNA sequence abnormalities, gene expression abnormalities, chromosomal rearrangements, and abnormal epigenetic modifications. It has been suggested that the development of gene expression profiles for acute leukemia can improve the prediction of prognosis (13). For example, Ng et al., established a 17-gene stemness score for rapid determination of risk in acute myeloid leukemia (AML) (14). Overexpression of Wilms tumor 1 gene indicates an adverse prognosis and poor response to treatment in AML (15). In Ismail et al.'s study, the expression level of the gene BIRC6 was an adverse risk factor in acute childhood leukemia (16). Therefore, we aimed to investigate the relationship between genetic profiling and early relapse of childhood B-ALL.
In this study, we identified differentially expressed genes (DEGs) between individuals with early recurrence and no early recurrence of B-cell ALL, using data from the GEO and TARGET public databases. We constructed a prognostic risk model based on DEGs, using regression analysis. We verified the prognostic value of this risk model using an external validation set from the TARGET database. Using the potential biomarkers of early relapse in B-cell ALL, we provided a new approach to early diagnosis and treatment of patients at high risk of recurrence.

Database
We downloaded one dataset (GSE13576) from the GEO database 1 . The data from GSE13576 were based on the GPL570 platforms (Affymetrix Human Genome U133 Plus 2.0 Array) and included data on 197 pediatric ALL patients with mRNA expression information. The mRNA expression dataset of 486 B-cell ALL patients and their corresponding clinical information were obtained from the TARGET database 2 . This dataset 1 https://www.ncbi.nlm.nih.gov/geo/ 2 https://ocg.cancer.gov/programs/target/ included 370 microarray data points and 116 mRNA-seq data points. The data of 370 cases were based on the GPL570 platform. miRNA-seq was performed in 53 of 486 patients from the TARGET dataset.

Analysis of Differentially Expressed Genes
The definition of early recurrence in this study was less than 24 months from diagnosis. Data from 370 childhood and adolescent B-cell ALL patients from the TARGET database, and 197 pediatric patients from GSE13576 were classified into early relapse and non-early relapse groups. Of the 370 patients from the TARGET database, 127 had early recurrence and 243 did not have early recurrence. In the GSE13576 dataset, 24 patients were in the early relapse group, and 173 patients belonged to the no early recurrence group. We used the limma R package to normalize the data and performed gene differential analysis by comparing the two groups in the R computing environment (Version 3.6.2). The differences in the gene expression levels were adjusted for multiple testing using the Benjamini-Hochberg method. Genes with a p-value < 0.05 and | Log2(fold change) | > 0.5 were defined as DEGs in this study. Finally, a volcano plot and heatmap of the DEGs was drawn using the ggrepel and heatmap R packages, respectively. The Venn diagrams of the DEGs were drawn using the website tool Venny 2.1.0 3 .

Selection of Candidate Genes
The 370 samples from the TARGET database were randomly separated into training and test (6:4) sets for constructing and validating the prognostic models, using the base R package. Univariate Cox regression analyses were used to explore the correlations between the overall survival of B-cell ALL patients and the expression level of DEGs in training set (n = 222). The genes with a p-value < 0.01 were selected as candidate prognosisrelated genes. To further investigate the candidate genes, least absolute shrinkage and selection operator (LASSO) regression analysis was applied using the glmnet R package. This algorithm minimizes the usual sum of squared errors with tenfold crossvalidation, and identifies the optimal lambda value. The genes with coefficient are related to prognosis. The optimal prognostic genes derived from the LASSO regression model were validated using the Oncomine website 4 . We selected genes associated with early recurrence as hub genes with which to construct the prognostic model. Unpaired t-tests were used to compare the levels of gene expression between groups, using GraphPad Prism (Version 7.03).

Genetic Risk Score Model Construction
The gene-related prognosis model in this study was established using the training set. We applied a multivariate COX regression model to assess the role of hub genes as independent prognostic risk factors. Then, a four-gene-based genetic risk score model was constructed. The risk score formula is as follows: The β in the above formula is the regression coefficient of each gene, and the Exp is the expression level of the gene. The cutoff value of the risk score was defined as the median score. The 222 patients were separated into low-and high-risk groups based on the threshold. To evaluate the predictive power of this risk score model, we used the survival ROC R package to plot timedependent receiver operating characteristic (ROC) curves, and calculated the area under the ROC curves (AUC).

Validation of the Genetic Risk Score Model
For internal and external validation, the testing set (n = 148), total set (n = 370) and external validation set (n = 116) were used to assess the predictive power of the genetic     (upregulated DEGs)   TOMM22, COTL1, S100A11, TSPO, HOXA7, CTSC, BIK, TUBB2A, RAPGEF5, SUCLG2, SNX10, TOP1MT, CRIP1, S100A10, ANXA2P2, CXXC5, S100A4, CAMK2D,  IGFBP7, S100A6, DAD1, CPVL, PKIG, TUBB6, HK2, WIPI1, MEIS1, LGALS1, TST, CHN2, MAP7, PPP1R14A, CLEC11A   Gene Symbol (downregulated DEGs)   ZNF91, GFOD1, SFMBT2, ZNF704, SIPA1L2, GBP4, CD69, HERC5, IFI44, TCF4, SHANK3, DAPK1, IFI44L, MYO5C, MRC1, ST3GAL6, MAN1A1, NAV1, FHIT,  CYP46A1, STK32B, COL5A1, MX1, PDE4B, S100Z, RASD1, POU4F1, ITGA6, MDK, P2RY14, DDIT4L, DNTT  prognostic model for B-ALL. The risk score of each patient was calculated using the regression coefficients of four genes in the training set. The patients were divided into high-and low-risk groups according to the median risk score of the training set. Time-dependent ROC curves were plotted for the validation sets. Survival analysis of the patients was constructed using the R software, with the survival and survminer packages. We created Kaplan-Meier plots to illuminate the correlations between the genetic risk scores and the survival index of patients, including overall survival and event-free survival (17). Univariate and multivariate COX regression analyses were used to test whether the genetic risk score model is an independent prognostic risk factor relative to the clinical characteristics of the total set. Statistical significance was tested using log-rank tests. The prognostic risk factors were analyzed using Cox Regression with the survival R packages. A p-value < 0.05 was considered to be statistically significantly different.

Analysis of Differentially Expressed MicroRNAs
For studying the regulation of gene expression, we obtained miRNA-seq data for 53 B-ALL patients from the TARGET dataset. Eight patients relapsed within 24 months, and were included in the early recurrence group, and the remaining 45 patients were included in the no early recurrence group. The edgeR R package was used to identify differentially expressed miRNAs (DEMs) between the two groups. Genes with a adjusted p-value < 0.05 and | Log2(fold change) | > 0.5 were defined as DEMs. Volcano plots were performed using R packages.

mRNA-miRNA Network Construction
The miRNAs which were target DEGs were predicted by the miRWalk 5 and DIANA-Tarbase v.8 databases (18). We selected the miRNAs that existed in both prediction databases to construct an mRNA-miRNA network using Cytoscape software (Version 3.7.2).

Characteristics of B-ALL Patients in the Total Set From the TARGET Database
To determine the association of genetic profiles with early recurrence in B-ALL, we obtained microarray profiles from

Identification of DEGs Based on Disease Recurrence
The flowchart of the construction of our risk model is shown in Figure 1. To identify the DEGs, we analyzed the gene expression data of 370 precursor B-cell ALL patients obtained from TARGET, and 197 pediatric ALL patients from GSE13576. Using a threshold of p < 0.05 and | fold change| > 0.5, we identified a total of 1359 DEGs in the TARGET dataset, including 747 upregulated genes and 612 downregulate genes. There were 457 DEGs in GSE13576, including 215 upregulated genes and 242 downregulate genes. The DEGs of the early relapse vs. non-early relapse groups in these two datasets are shown in the volcano plot and heatmap, separately (Figure 2). Through integrated analysis, we identified 33 commonly upregulated genes and 32 downregulated genes from the TARGET and GEO datasets (Figure 3 and Table 2).

Construction of the Prognostic Risk Score Model Using the Training Set
The 65 common DEGs (including 33 upregulated and 32 downregulated genes) were analyzed for prognostic value using univariate Cox regression on the training set. As shown in Table 3, 27 common DEGs had a p-value < 0.01, including 17 with a Hazard Ratio (HR) > 1 and 10 with HR < 1.
Using the LASSO algorithm, the optimal lambda value was 0.038, and 15 of the 27 DEGs were selected as potential prognosis-related genes ( Figure 4A and Table 4). Subsequently, we searched for the relationship between 15 common DEGs and early relapse using the Oncomine database. As shown in Figures 4B-F, the gene expression of BIK, HOXA7, S100A10 and S100A11 were significantly higher in B-ALL patients who relapsed within 24 months (both p < 0.05). The gene IFI44L had lower expression levels in early relapse patients (p < 0.05). These five genes associated with early recurrence were selected to build a genetic risk score model for B-ALL patients, using multivariate Cox regression. The risk model was constructed from the four optimal genes: HOXA7, S100A10, S100A11, and IFI44L. The genetic risk score model is as follows: Risk score = (0.329 * expression value of HOXA7) + (0.136 * expression value of S100A10) + (0.275 * expression value of S100A11) + (−0.151 * expression value of IFI44L) The AUC values for OS at 3 years, 5 years and 10 years were 0.820, 0.825 and 0.787 in the training set, respectively ( Figure 5A). The risk score ranged from 1.535 to 6.539, and we selected the median risk score as the cut-off value (4.045). If the risk score was greater than the cut-off value, the patient was defined as belonging to the highrisk group, otherwise the patient was classified in the lowrisk group.

Verification of the Prognostic Risk Score Model in the Validation Set
We used the test set (n = 148), the total set (n = 370) and the external validation set (n = 116) to assess the predictive value of the risk score model constructed using the training set. We used the median value of the risk score formula defined using the training set as the threshold for distinguishing high-and low risk groups. The AUC curve showed that the 3-year, 5-year and 10-year OS of the test set, using the total set and the external validation set were 0.785, 0.764, 0.720, 0.806, 0.801, 0.761, 0.809, 0.761, and 0.739, respectively (Figures 5B-D). The distribution of risk score, survival status, and the heatmap of risk genes in internal total set and external validation set are shown in Figure 6. Among these four prognosis-related genes, HOXA7, S100A10, and S100A11 were overexpressed in high risk group and IFI44L was downregulated. The prognosis of patients was worse with the increase in risk scores. In the total set, the mortality of patients in the high-risk group (64.3%) was higher than that in low-risk group ( (Figure 7).

Identification of DEMs Based on Disease Recurrence
Amongst the 53 B-ALL patients from the TARGET database, 17 differential microRNA precursors (pre-miRNAs) were identified based on early disease recurrence (Figure 9), including eight upregulated miRNAs and nine downregulated miRNAs ( Table 5).

DISCUSSION
Recurrence is an important factor for the increase of mortality in ALL, and the early recurrence of leukemia predicts a worse prognosis. Therefore, the identification of factors associated with early recurrence is vital for the identification of new risk subgroups in ALL patients, and also offers new directions for targeted therapies in high-risk patients. MLL rearranged, hypodiploid, BCR/ABL fusion gene and high end-induction MRD are the common risk factors for relapse in childhood ALL FIGURE 10 | The mRNA-miRNA network. Pink color represents target genes and blue color represents miRNA. Yellow color shows hsa-miR-124-3p, hsa-miR-103a-3p, and hsa-miR-486-3p.
patients (19). And several published articles showed that the biomarkers for early recurrence in B-ALL include the persistence of MRD, chromosome 19p13 translocations, upregulation of nucleotide excision repair genes, deletion of CDKN2A/B and overexpression of LINC00152 (Supplementary Table S1) (20)(21)(22)(23)(24). In this study, we looked for genes involved in early relapse of B-ALL on a larger sample. DEGs were identified from early recurrence vs. non-early recurrence groups both in the total set from the TARGET database and in the GSE13576 dataset. The common DEGs were subjected to univariate COX regression and LASSO regression in the training set. The optimal candidate DEGs were validated using the Oncomine website. We constructed a genetic risk score model of B-ALL using stepwise multivariate COX regression analysis, and verified its prognosis predictive value.
In this study, a four-gene risk score model was identified from 65 common DEGs. The AUCs of the risk model in the training and verification sets were all greater than 0.7, indicating that the risk score had a good prognosis predictive value in >70% of patients. The survival indexes OS and EFS in the highrisk group were confirmed to be worse than those in low-risk group with statistically significant differences. To further verify the relationship between the model and prognosis, we included clinical factors for univariate and multivariate COX analyses. The results showed that high genetic risk score and MRD ≥ 0.01% at day 29 were independent adverse prognostic factors.
The HOXA7 gene is a member of the homeobox (HOX) gene family, which plays vital roles in hematopoiesis and cell differentiation (25). Previous research has shown that the gene HOXA7 is overexpressed in MLL-fusion leukemias, which commonly have very poor outcomes (26)(27)(28)(29). The proteins encoded by S100A10 and S100A11 are members of the S100 family. S100 proteins are involved in many cellular processes, such as cell growth and motility, cell cycle progression and differentiation (30). Dysregulated expression of S100 proteins is a common feature of human cancers, including pediatric ALL (31,32). High S100A10 expression has a link to poor outcomes and chemoresistance in several types of cancer, including leukemia (33). The overexpression of S100A11 is also associated with cancer progression and poor survival. The IFI44L gene is a type I interferon-stimulated gene. Although the function of IFI44L is unclear, some studies have revealed that it is involved in the inflammatory response, and may be a tumor suppressor (34)(35)(36). According to data from a study by Bhojwani et al., the expression of the genes HOXA7, S100A10, and S100A11 was significantly higher in B-ALL patients who relapsed within 24 months than in those who did not (4). In contrast, IFI44L was highly expressed in patients with no recurrence or late recurrence. This finding is consistent with those of our research. These results suggest the possibility of the use of the four genes as early prognostic biomarkers for childhood and adolescent B-ALL.

CONCLUSION
Our study identified four genes related to early recurrence of B-ALL in childhood and adolescent patients, using integrated bioinformatics analysis. The four-gene risk score model is an independent prognostic risk indicator. Detection of the expression levels of these four genes provides a new signature for the early identification of high-risk patients.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary Material.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by all data are from public database on the internet. Written informed consent from the participants' legal guardian/next of kin was not required to participate in this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
YH was mainly responsible for the idea design and writing of the article. JL was in charge of the writing of the manuscript. YC was in charge of the data collection. PJ was responsible for the chart making. LW provided the statistical assistance. JH contributed to the idea design and article modification for this article. All authors contributed to the article and approved the submitted version.