Construction of a solid Cox model for AML patients based on multiomics bioinformatic analysis

Acute myeloid leukemia (AML) is a highly heterogeneous hematological malignancy. The bone marrow (BM) microenvironment in AML plays an important role in leukemogenesis, drug resistance and leukemia relapse. In this study, we aimed to identify reliable immune-related biomarkers for AML prognosis by multiomics analysis. We obtained expression profiles from The Cancer Genome Atlas (TCGA) database and constructed a LASSO-Cox regression model to predict the prognosis of AML using multiomics bioinformatic analysis data. This was followed by independent validation of the model in the GSE106291 (n=251) data set and mutated genes in clinical samples for predicting overall survival (OS). Molecular docking was performed to predict the most optimal ligands to the genes in prognostic model. The single-cell RNA sequence dataset GSE116256 was used to clarify the expression of the hub genes in different immune cell types. According to their significant differences in immune gene signatures and survival trends, we concluded that the immune infiltration-lacking subtype (IL type) is associated with better prognosis than the immune infiltration-rich subtype (IR type). Using the LASSO model, we built a classifier based on 5 hub genes to predict the prognosis of AML (risk score = -0.086×ADAMTS3 + 0.180×CD52 + 0.472×CLCN5 - 0.356×HAL + 0.368×ICAM3). In summary, we constructed a prognostic model of AML using integrated multiomics bioinformatic analysis that could serve as a therapeutic classifier.


Introduction
Acute myeloid leukemia (AML) is a highly heterogeneous group of hematological malignancies that are characterized by various cytogenetic and molecular heterogeneities (1,2). Although substantial progress has been achieved with combinatorial therapies including radiation, chemotherapy, immunotherapy and/or targeted therapy, the cure rate of patients remains only 35%-40% in younger patients (age< 60 years) and 5%-15% in older patients (age< 60 years) (3). Relapse and refractory disease continue to be major obstacles in the treatment of AML, with 29% or fewer patients living beyond 5 years.
Several studies have shown that changes in the bone marrow (BM) microenvironment in AML largely promote distinct biological processes in leukemogenesis, drug resistance and leukemia relapse (4). Thus, insights into BM microenvironment action may provide better diagnosis and treatment strategies for AML patients.
The leukemia immune microenvironment presents with immune dysregulation and suppression, leading to an imbalance of suppressor T cells and effector T cells, T cell exhaustion and an increase in myeloid-derived suppressor cells (MDSCs) and leukemia-supporting macrophages compared to normal bone marrow tissue (14). Recent studies on the characterization of the leukemia immune microenvironment could aid in the search for novel prognostic biomarkers and potential therapeutic targets (15,16). In addition, treatments such as chemotherapy, immunotherapy, and combination therapy to alter the immune microenvironment of AML have been widely used (17,18). However, different immune cells have different effects in AML. Therefore, understanding the distribution and function of immune-related genes in different immune cells is of great significance to further explore the BM immune microenvironment of AML patients.
Here, we investigated the impact and potential mechanisms of immune-related genes on the prognosis of AML. We are the first group to screen for hub genes in this disease using multiomics analysis. We constructed a LASSO-Cox regression model to predict the prognosis of AML according to the characterization of the leukemia immune microenvironment. The distribution of hub genes in immune cells was revealed through single-cell sequencing data and may provide the potential for precise patient stratification and treatment.

Datasets
The test cohort of AML was downloaded from The Cancer Genome Atlas (TCGA) database (https://www.cancer.gov/) and includes mRNA data from 151 cases (RNASeq V2), miRNA data from188 cases (Illumina HiSeq miRNAseq) and Illumina Human Methylation450 Bead Array data from 140 cases. Samples were selected for the study according to the following criteria: 1) acute myelocytic leukemia was pathologically diagnosed, 2) all three kinds of data were available for the patient, and 3) the clinical information was complete. Finally, 97 patients were selected for our following study.
The single-cell RNA sequence dataset GSE116256, including 16 untreated samples (D0), was used to reveal the distribution of hub genes in immune cell types. The immune gene set, including 776 genes, was acquired from a previous study (19).

Screening of candidate genes and hierarchical clustering
Differential mRNA and miRNA expression were analyzed by the DESeq2 (20) function (P<0.05, |logFC|>1). For each probe in the methylation data, the value shown is the b value (b =U/ (M+U+1)), where M is the methylated probe signal strength and U is the unmethylated probe signal value. The methylmix package (21) was used to analyze the correlation between the gene methylation level and mRNA expression value (Pearson correlation coefficient test, R>0.5, P<0.05). Unsupervised hierarchical clustering was performed (Euclidean distances and ward.D2 method) based on survival-related immune genes (SIGs) to establish an immunogenomic classification of TCGA-AML patients.

Immune infiltration analysis
The enrichment scores of 28 immune signatures in each AML sample were quantified using single-sample gene set enrichment analysis (ssGSEA) (22). Stromal, immune, and estimate scores were further calculated to evaluate tumor purity and immune cell infiltration in tumor tissues based on the mRNA expression data using the ESTIMATE (Estimation of STromal and Immune cells in MAlignant Tumor tissues using Expression data) algorithm (23). The ESTIMATE algorithm is based on the immune signature genes and stromal signature genes in the mRNA data, and then calculates the immune score and stromal score by ssGSEA.
Protein-protein interaction network construction and gene ontology functional enrichment analysis mRNA interaction data were obtained from the STRING database (https://string-db.org/) (24). The threshold of interaction score is 0.4. The PPI network was established using Cytoscape software (25). The Database for Annotation, Visualization and Integrated Discovery (DAVID, https://david. ncifcrf.gov/) (26, 27) was used for GO enrichment analysis (P < 0.05), and the results were plotted by the GO plot package (28).

Survival analysis and prognostic model construction
A Cox regression model was constructed to identify immune-related genes that significantly correlated with the OS (SIGs) of AML patients in TCGA. The genes that met the standard of P < 0.05 were used for subsequent study. Least absolute shrinkage selection operator (LASSO)-penalized Cox regression analysis (29,30) was applied to distinguish the most important SIGs and to construct a prognostic model for AML based on a linear combination of the regression coefficients. Kaplan-Meier survival curves and receiver operating characteristic (ROC) curves were used to test and validate the performance of the classifier.

scRNA dataset analysis
We employed the Seurat (31,32) and SingleR (33) packages to generate Uniform Manifold Approximation and Projection (UMAP) plots and reveal the distribution of hub genes in each immune cell type.

Patients and samples
A total of 55 patients with newly diagnosed AML were enrolled between Jan 2016 to Dec 2020 at the Xinqiao Hospital of the Army Medical University in China. Samples were selected for the study according to the following criteria: 1) acute myelocytic leukemia diagnoses were based on morphological findings, karyotype, and immunophenotypic features of leukemia cells by consultant hematologist, 2) patients were newly diagnosed, and remained untreated at the time of collection, 3) the clinical information was completed.

Molecular docking
The virtual screening of molecular docking was performed using AutoDock Vina 1.1.2 (34) to predict the most likely optimal ligands. The three-dimensional structure of hub genes was retrieved from the Protein Data Bank (https://www.rcsb. org/). A library of 2115 FDA approved compounds were extracted from ZINC15 druglike database (http://zinc.docking. org/). The visualization of active interactions between proteins and compounds was performed by Discovery Studio Visualizer v4.5.0 (BIOVIA).

Classification of AML based on immunerelated genes that significantly affect patient prognosis
For a more extensive study of immune genes in AML, we retrieved transcriptome, microRNA, and DNA methylation profile data and integrated clinical information for 97 samples from TCGA database (Table S1). And the flowchart of analysis is shown in Figure S1. A Cox proportional hazard regression model was employed to analyze 776 immune-related genes (19) in the mRNA expression data of 97 samples (P<0.05), and 98 survival-related immune genes (SIGs) that significantly affected the survival of AML patients were identified (Table S2).
Using unsupervised clustering analysis (Euclidean statistics and complete linkage method) of 98 SIGs, those 97 samples were clustered into three distinct immune subtypes (Im1: immune cluster 1, Im2: immune cluster 2, Im3: immune cluster 3) based on the gene expression signature ( Figure 1A). As shown in immune gene heatmaps, most of the SIGs were highly expressed in the Im1 and Im3 clusters but expressed at low levels in the Im2 cluster ( Figure 1B). Kaplan-Meier survival analysis revealed that the prognosis of the Im2 cohort was significantly better than that of the Im1 and Im3 cohorts ( Figure 1C, log-rank test, P=0.0008).

The immune infiltration was significantly different in different cluster
As the immune microenvironment was significantly correlated with the occurrence and development of AML, a single-sample gene set enrichment (ssGSEA) algorithm was utilized to explore differences in the immune microenvironment among the three immune clusters. The results showed that the Im2 cluster had fewer infiltrating immune cells than the Im1 and Im3 clusters (Figure 2A), and the specific infiltration of immune cells in different clusters is shown in Figure S2. Consistent findings demonstrated that the immune scores were significantly lower in the Im2 cluster ( Figure 2B, unpaired t test, P < 0.001), while tumor purity was significantly higher in the Im2 cluster but significantly lower in the Im1 and Im3 clusters ( Figure 2C, unpaired t test, P < 0.001). Generally, we can conclude that patients with less immune infiltration and lower immune scores may have a better prognosis than those with more immune infiltration and higher immune scores.

DEG-SIGs were screened by mRNA expression data analyzing
Based on the significant differences in immune infiltration and survival trends between the Im2 cluster and Im1/3 cluster, we defined Im2 as the immune infiltration-lacking subtype (IL type) and Im1/3 as the immune infiltration-rich subtype (IR type). To reveal the potential mechanisms of different prognoses between IL and IR subtypes, an elaborate analysis of the mRNA expression profiles of the two types of AML patients was implemented. We performed differentially expressed gene analyses and identified 1936 differentially expressed genes (DEGs) with significant differences between IL and IR subtypes. There were 42 SIG-DEGs which were common members of 1936 DEGs and 98 SIGs (P < 0.05, |Fold Change| >2 or <0.5) ( Table 1) (Figures 3A, B).
To elucidate the mechanism of prognosis difference between IL and IR subtype we obtained the interaction data of the 42 DEG-SIGs from STRING website (https://string-db.org/) (interaction score > 0.4), and then constructed protein-protein interaction (PPI) network by Cytoscape. (Figure 3C). Gene ontology (GO) functional enrichment analysis distinguished some enriched terms in three subontologies: biological processes (BP), cellular component (CC), and molecular function (MF) ( Figure 3D). For BP, 42 DEG-SIGs were enriched in defense response, inflammatory response, and immune system process. With regard to CC, 42 DEG-SIGs were enriched in integrin complex, external side of plasma membrane, and cell surface. For MF, 42 DEG-SIGs were enriched in cell part, tertiary granule, and whole membrane. These results may partially illustrate the potential mechanisms of 42 DEG-SIGs affecting the prognosis of AML patients. Immune functional characters of 3 AML patients clusters. (A) Heatmap of the Im1, Im2 and Im3 cohorts of 97 TCGA-AML patients using ssGSEA scores from 28 immune cell types. Violin plots depict the immune score (B) and tumor purity (C) of the Im1, Im2 and Im3 cohorts (***: P<0.001).
19 hub genes were screened by integrated analysis of mRNA expression data, miRNA expression data and methylation data Considering the complex mechanism of leukemogenesis and progression, we next conducted integrated multiomics analysis to identify hub genes that were associated with prognosis. Comparing the miRNA expression profiles of patients between IL and IR subtypes, we revealed 93 miRNAs that were significantly differentially expressed (P<0.05, |FC|≥2) ( Figure 4A). A total of 7294 target miRNA genes (TDEmiRs) were identified using DIANO TOOLS/microT-CDS (threshold=0.9). Through integrated bioinformatics analysis, we selected 15 commonly differentially expressed genes from 42 DEG-SIGs and 7294 TDEmiRs between the IL and IR subtypes ( Figure 4C).
Combined analysis of mRNA and methylation profiles indicated that there were significantly negative correlations between mRNA expression level and degree of methylation for 355 genes (R < -0.5, p< 0.05). When these 355 methylation correlation genes (MethylCor) were cross-referenced with the 42 DEG-SIGs, we identified 6 common genes associated with immune infiltration and differential expression, methylation and prognosis between IL and IR subtypes (Figures 4B, C).

A prognostic model based on 5 hub genes was constructed
Having observed significant differences in immune infiltration, gene expression and clinical behavior between IL and IR types, we next developed a LASSO-Cox proportional hazards regression model based on 19 immune-correlated DEGs by combining microRNA and epigenetic regulation data. Using the LASSO model, we built a classifier based on the 5 hub genes to predict the prognosis of AML (risk score = -0.086×ADAMTS3 + 0.180×CD52 + 0.472×CLCN5 -0.356×HAL + 0.368×ICAM3) (Figures 5A, B). Kaplan-Meier plots displayed OS differences between patients in various subtypes (P=3.931×10 -06 ) ( Figure 5C), and the ROC curve suggested that the model can effectively predict the 1-, 3-and 5-year prognosis of AML (AUC=0.82, 0.83, 0.99, respectively) ( Figure 5D). Consistent with earlier analysis, we found similar predictive performance for 151 mRNA samples of the TCGA-AML profile (P=3.369×10 -06 , AUC=0.63, 0.74, 0.83) (Figures 5E, F). To test this model further, validation cohorts were obtained from the GEO database. Kaplan-Meier plots and ROC curves at 1, 3 and 5 years confirmed the prognostic value of the 5-hubgene-based model: GSE106291 (log-rank test, P=3.311×10 -06 , AUC=0.66, 0.66, 0.61) (Figures 5G, H). After stratification by disease classification, the results showed that the risk score of the IL type was significantly lower than that of the IR type (P=4.39×10 -07 ) ( Figure S3). These evaluations demonstrated that the 5-hub-gene-based model could identify a group of high-risk patients within conventionally assigned risk groups and may guide clinical practice.

The better efficacy of the prognostic model for the prognosis of AML patients was further verified by clinical samples
For verifying the prognostic value in the 5-hub-gene-based model, we collected 6575 genes mutation detected in 200 newly diagnosed AML patients (TCGA.LAML.mutect.somatic.maf, https://portal.gdc.cancer.gov/files/27f42413-6d8f-401f-9d07-d019def8939e) and 38 genes mutation detected in 55 newly diagnosed AML patients (Xinqiao Hostpital). The common mutated genes were DNMT3A, IDH1, NRAS, RUNX1 and TET2. In this model classification, the high risk was significantly associated with the mutation of RUNX1 (p=0.015) and TET2 (p=0.054) considered by chi-square test (Table 2). Kaplan-Meier analyses of 55 patients with prognostic information indicated that patients with mutation of RUNX1 ( Figure 6A, p=0.0001) and TET2 (Figure 6B, p=0.2257) were correlated with a poor prognosis and had a shorter median survival duration.

The diverse distribution of hub genes in immune cells of AML patients
To explore the value of these five hub genes in AML pathogenesis, we further identified the single-cell sequencing dataset GSE116256 to describe the distribution of the 5 hub genes in immune cells using the Seurat package for clustering and the SingleR package for annotation ( Figure 7A). As shown in the scatter plot ( Figure 7B) and violin plot ( Figure 7C), CD52, ICAM3 and CLCN5 were widely expressed in granulocytes, monocytes, T lymphocytes, B lymphocytes, dendritic cells and NK cells, whereas ADAMTS3 was rarely expressed in those cells.
HAL is highly expressed in granulocytes and monocytes but rarely expressed in other immune cells. Accordingly, we hypothesized that these hub genes play various roles through the regulation of gene expression in specific cells. The hub gene expression of blood cells in the Protein Atlas database (https:// www.proteinatlas.org/) further confirmed this result ( Figure S4).

Investigation of best-fitting compounds on hub genes
To investigate best-fitting compounds, we performed virtual screening of molecular docking using the three-dimensional structure of CD52 (PDB ID: 6OBD), CLCN5(PDB ID: 2J9L), ICAM3(PDB ID: 1T0P) and 2115 FDA approved compounds in ZINC15 database. The predicted binding affinities of the top 2 hit compounds against respective targets are ranked from highest to lowest. Binding energy (Kcal/mol) for interaction of proteins and compounds are as follows: CD52 with ZINC164528615 (Glecaprevir) (−6.4 Kcal/mol), CD52 with ZINC3938684 (Toposar) (−6.3 Kcal/mol); ICAM3 with ZINC52955754

Discussion
AML is a highly heterogeneous malignant tumor with poor prognosis. A large number of studies have demonstrated that the immune microenvironment of AML patients is altered significantly to promote leukemogenesis in AML (35,36). In this study, we classified AML patients according to the difference in the SIG expression signature. The results showed that the prognosis of AML patients with immune infiltration deficiency (IL subtype) was significantly better than that of patients with immune infiltration enrichment (IR subtype), which was contrary to the effect of the immune infiltration degree in solid tumors (37). Recent studies have shown that the heterogeneity of immune infiltration may be affected by the components of cytokines in the microenvironment (38,39) and the expression of tumor driver genes (40). These heterogeneities were significantly associated with immunotherapeutic effects.
We further compared the gene expression profiles of IL and IR patients and found that the functions of genes with differential expression between the two subtypes were mainly enriched in defense response, inflammatory response and immune system process (BP); integrin complex, plasma membrane and cell surface (CC); and cellular partial, tertiary granules, and whole membrane (MF). Consistent with our results, several studies have confirmed that the biological functional diversity of the immune system is significantly altered in the BM immune microenvironment. Previous studies have constructed many prognostic models on the basis of diverse omics data. Hu et al. constructed a DNA methylation-based prognostic model for AML patients (41), but the accuracy of the model (AUC=0.67-0.75) was lower than that of our model (AUC=0.82-0.99). A three-miRNA signature was built for non-M3 AML patients by Xue et al., but due to the lack of validation by any independent data, its clinical application value may need further verification (42). A four-gene-based prognostic model from Huang et al. (43) was also less accurate (AUC=0.66-0.71) than our model. These conventional prognostic analyses generally consider only mRNA, miRNA, or methylation data. To the best of our knowledge, this is the only study to date to construct a prognostic model through the integrated analysis of multiple omics datasets (mRNA, miRNA and methylation data). The TCGA-AML test dataset and an independent GEO-AML validation dataset confirm that the model is very effective in predicting the prognosis of AML patients. Further, the prognostic model was verified by mutation data of AML patients in Xinqiao Hospital. Although the data were limited, it also indirectly proved that the model could effectively predict the prognosis of AML patients. The model consists of five hub genes: ADAMTS3, CD52, CLCN5, ICAM3 and HAL.
ADAMTS3 is a member of the ADAMTS family and plays an important role in the genesis and development of a variety of tumors (44,45). Previous studies demonstrated that ADAMTS3 is expressed in mast cells (19), and mast cells have both tumorpromoting and tumor-inhibiting effects (46), thus leading to different prognoses in different tumors (47-49). Our study confirmed that ADAMTS3 expression was significantly negatively correlated with the prognosis of AML and is rarely expressed in other immune cells. Therefore, we will pursue further in-depth analysis of how ADAMTS3 exerts its antitumor effect through mast cells to explore its potential therapeutic value.
CLCN5 (chloride voltage gated channel 5) is a member of the chloride channel family. It is widely expressed in a variety of tumor cells (50) and can enhance the chemotherapy resistance of chronic lymphocytic leukemia and multiple myeloma (51,52). Hsa-let-7c-5p and hsa-mir-495-3p, two miRNAs that target CLCN5, were found to be involved in the occurrence and development of a variety of tumors (53,54). Our study confirmed that CLCN5 plays a role in promoting tumorigenesis in AML, but its specific mechanism needs deeper investigation.
HAL (histidine ammonia lyase) is the rate-limiting enzyme of histidine catabolism (55). Kanarek et al. found that tumor cells with higher HAL expression levels possess higher sensitivity to methotrexate. Acute lymphocyte leukemia (ALL) patients with higher HAL expression were more likely to benefit from methotrexate treatment (56). In our study, we found that the expression level of HAL was significantly correlated with the degree of gene methylation. However, as a demethylation agent, hypomethylating agent (HMA) alone has difficulty achieving the intended effect in the clinical treatment of AML (57). We speculated that the combination of methotrexate and HMA may achieve better efficacy when changes in HAL expression are detected. Surprisingly, we found that HAL was only expressed in granulocytes and monocytes. Consistent with AML cells, granulocytes and monocytes originate from myeloid precursor cells (14). Therefore, insight into the methylation mechanism of HAL and its effect on the innate immune response and AML cells will be conducive to improving the therapeutic effect of demethylation agents. We also assessed the interaction between HAL and hsa-miR-582-3p, which has been previously indicated as a tumor suppressor miRNA in AML (58). Thus, deeper investigation of the relationship between HAL and miR-582-3p will be helpful to further understand the potential mechanism of malignant progression of AML.
Our study found that CD52 was widely expressed in leukemia cells and all types of immune cells. Evidence has revealed the high expression of CD52 in CD34+ stem cells of AML (5q-) patients and a significantly negative correlation between the CD52 expression value and the prognosis of AML (59), which is consistent with our results. ICAM-3 (intercellular adhesion molecule 3, CD50) belongs to the ICAM (intercellular adhesion molecule) immunoglobulin superfamily and plays important roles in immune response and tumor development (60,61). Consistent with our scRNA sequencing results, ICAM-3 is expressed in granulocytes, monocytes and lymphocytes (62). In vivo and in vitro experiments confirmed that ICAM-3 participates in the proliferation, stemness and radiotherapy resistance of various tumors through the FAK pathway, PI3K/ Akt pathway or other mechanisms (63-65). Interestingly, we found that the expression levels of CD52 and ICAM-3 were both significantly correlated with the degree of methylation but had opposite impacts on the prognosis of AML patients. Therefore, the roles of CD52 and ICAM-3 should be considered in HMA reagent treatment.
In conclusion, using a multiomics analysis and validation approach, we constructed and validated a novel, 5-hub-genebased model that allows robust risk stratification and facilitates the identification of prognosis in AML. The distribution of the 5 hub genes in immune cells was revealed through scRNA sequencing analysis. Furthermore, we conducted virtual screening of three genes (CD52, CLCN5 and ICAM3) with known protein structure, and found the compounds with the lowest binding energy with them, which provided ideas for further searching for targeted inhibitors. Of course, this study is limited to bioinformatics analysis, and the proposed approaches need to be further tested in the clinic.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.