Identification and Validation of a Prognostic Gene Signature for Diffuse Large B-Cell Lymphoma Based on Tumor Microenvironment-Related Genes

Diffuse large B-cell lymphoma (DLBCL) is an extremely heterogeneous tumor entity, which makes prognostic prediction challenging. The tumor microenvironment (TME) has a crucial role in fostering and restraining tumor development. Consequently, we performed a systematic investigation of the TME and genetic factors associated with DLBCL to identify prognostic biomarkers for DLBCL. Data for a total of 1,084 DLBCL patients from the Gene Expression Omnibus database were included in this study, and patients were divided into a training group, an internal validation group, and two external validation groups. We calculated the abundance of immune–stromal components of DLBCL and found that they were related to tumor prognosis and progression. Then, differentially expressed genes were obtained based on immune and stromal scores, and prognostic TME‐related genes were further identified using a protein–protein interaction network and univariate Cox regression analysis. These genes were analyzed by the least absolute shrinkage and selection operator Cox regression model to establish a seven-gene signature, comprising TIMP2, QKI, LCP2, LAMP2, ITGAM, CSF3R, and AAK1. The signature was shown to have critical prognostic value in the training and validation sets and was also confirmed to be an independent prognostic factor. Subgroup analysis also indicated the robust prognostic ability of the signature. A nomogram integrating the seven-gene signature and components of the International Prognostic Index was shown to have value for prognostic prediction. Gene set enrichment analysis between risk groups demonstrated that immune-related pathways were enriched in the low-risk group. In conclusion, a novel and reliable TME relevant gene signature was proposed and shown to be capable of predicting the survival of DLBCL patients at high risk of poor survival.


INTRODUCTION
Diffuse large B-cell lymphoma (DLBCL) is a heterogeneous tumor entity with a striking degree of genetic and clinical heterogeneity. Although more than half of DLBCL patients may achieve longterm remission, the disease remains a major clinical challenge, with approximately 30% of patients not being cured (1,2). The heterogeneity of the tumor, in particular, poses a major barrier to understanding the genetic basis of the disease and its response to therapy (3). Therefore, there is an urgent need to identify new individual prognostic and risk-stratified biomarkers.
In recent years, the role of the tumor microenvironment (TME) in tumorigenesis has gradually been discovered (4). Studies have revealed that tumor cells are targets of the immune system in the early stages of tumor development; however, over time, these cells begin to resist the innate immune response and then gradually weaken and adapt to it (5)(6)(7). Thus, a better understanding of the interactions between the TME and the immune response may provide new approaches to improve the efficiency of current immunotherapies, especially immune checkpoint inhibitor and chimeric antigen receptor (CAR) T cell therapies (8). Several studies have considered the latent role of the TME in the occurrence and development of DLBCL, but their results were controversial (9).
In the era of rituximab and immunotherapy, the ability of the International Prognostic Index (IPI) to predict the prognosis of individual DLBCL patients has decreased (10). A better understanding of the interactions between the TME and IPI scores may provide new approaches to improve response rates to current treatment strategies. Thus, incorporating a prognostic factor from the TME into the existing IPI system would help greatly in the development of prognostic stratification of DLBCL. Here, we propose a compound prognostic nomogram combining a TME-related prognostic model with clinical features. This approach provides a basis for better understanding the molecular mechanisms underlying the prognoses of DLBCL patients.

Data Sources
Gene expression profiling and clinical data of patients with DLBCL were obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/). Data series were downloaded in a normalized expression matrix file format and were used directly for the analyses. Patients in the GSE31312 dataset were randomly divided into a training group (N = 282) and an internal validation group (N = 188) in a 6:4 ratio. In addition, two DLBCL microarray datasets were used for validation, including 414 DLBCL patients from the GSE10846 dataset and 200 DLBCL patients from the GSE11318 dataset.
Generation and Analysis of ImmuneScore, StromalScore, and ESTIMATEScore The ESTIMATE (Estimation of STromal and Immune cells in MAlignant Tumors using Expression data) package in R version 3.6.3 was used to evaluate the abundance of immune and stromal components of the TME from expression data (11). We obtained three scores, ImmuneScore, StromalScore, and ESTIMATEScore, which respectively represent the immune abundance, the stromal abundance, and the sum of both in the TME; for instance, a higher ImmuneScore means a higher abundance of immune components in the TME.

Identification of Differentially Expressed Genes
To obtain DEGs between high-and low-scoring samples, microarray data from GEO were analyzed using NCBI GEO2R (https://www.ncbi.nlm.nih.gov/geo/geo2r/). 470 patients with GSE31312 were divided by the median of ImmuneScore (StromalScore), and the high and low ImmuneScore (StromalScore) groups were obtained, respectively. Based on comparisons of the high-and low-scoring groups, Adj. P <0.05 and fold change >1.05 were set as the thresholds for DEG identification to obtain differential immune and stromal genes. We then took the intersection of immune differential genes and stromal differential genes. The DEGs were visualized using heatmaps (https://software.broadinstitute.org/morpheus/) and Venn plots (http://bioinformatics.psb.ugent.be/webtools/Venn/).

Functional Enrichment Analyses
DAVID (https://david.ncifcrf.gov/summary.jsp), an online tool for gene functional enrichment, was used for gene ontology (GO) analysis (with respect to cellular component, molecular function, and biological process) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of the 183 DEGs shared between the ImmuneScore groups and the StromalScore groups. The results were displayed using the ggplot2 R package. P <0.05 was considered statistically significant.
Gene set enrichment analysis (GSEA) was performed using the gsea-4.0.3 software downloaded from https://www.gseamsigdb.org/gsea/index.jsp to explore whether immune pathways were significantly different between the high-risk and low-risk groups.

Protein-Protein Interaction Network Construction and Univariate Cox Regression Analysis
The PPI network was obtained from the STRING database (https://string-db.org/) and reconstructed with version 3.7.2 of Cytoscape. Univariate Cox analysis of overall survival (OS) was used to determine the relationships between expression of 183 DEGs and prognosis. In order to obtain the most critical and meaningful genes to construct the model, we used a method from Bi et al. (12) to further screen the genes. In the PPI network, we kept the core gene located in the center by eliminating genes with fewer peripheral nodes. In univariate Cox regression, the smaller the p-value, the more significant the prognosis. Therefore, the sharing factors between the degree of the nodes ≥3 in PPI and the p <0.0001 of univariate Cox regression analysis were carried out.

Generation of the Risk Prediction Model
The training dataset GSE31312 was used to establish the TME risk model. The R package "glmnet" was used for least absolute shrinkage and selection operator (LASSO) Cox regression analysis. A risk formula for predicting prognosis was established by LASSO Cox regression analysis. Then, we calculated the individualized risk score of each patient by dividing all patients into high-risk and low-risk groups using the median risk score as the cut-off value. Kaplan-Meier survival analysis and log-rank test were used to evaluate the difference in OS between the high-and low-risk groups. Time-dependent receiver operating characteristic (ROC) curves were plotted to evaluate prognostic value (13). The analysis was performed using SangerBox (http://sangerbox.com/Tool).

Univariate and Multivariate Cox Regression for IPI Components and the Seven-Gene Signature
To assess whether the risk prediction model could be used as an independent prognostic indicator for DLBCL patients, univariate and multivariate Cox regression analyses were performed with SPSS 26.

Construction and Validation of the Nomogram
A nomogram integrating IPI components and the seven-gene model was established based on the GSE31312 cohort to assess the probability of 1-, 3-, and 5-year individualized OS via the rms R package (https://cran.r-project.org/web/packages/rms/). In addition, the discriminatory ability of the nomogram was graphically evaluated using a calibration map.

CIBERSORT and Tumor-Infiltrating Immune Cell Profile
We used the CIBERSORT computational method to estimate the abundance distribution of TICs in GSE31312. The abundance of 22 immune cells was detected by t-test to observe the differences among risk groups. All analyses were performed with R (version 3.6.3, https://www.r-project.org/).

Identification of Scores Associated With Survival and Clinical Features
The clinical information of DLBCL patients from the GSE31312, GSE10846, and GSE11318 datasets is summarized in Table 1.
To determine the correlations of ImmuneScore, StromalScore, and ESTIMATEScore with clinical features of DLBCL, clinical data from the GSE31312 dataset were analyzed. As shown in Figure  1A, the high-scoring group had longer OS than the low-scoring group for ImmuneScore, StromalScore, and ESTIMATEScore. Significant differences in ImmuneScore, StromalScore, and ESTIMATEScore were found at different stages of DLBCL by Kruskal-Wallis rank sum test ( Figure 1B). On average, stage IV ranked the lowest among all stages with respect to ImmuneScore, StromalScore, and ESTIMATEScore ( Figure 1B). Moreover, the scores showed a negative correlation with IPI scores by Kruskal-Wallis rank sum test ( Figure 1C). Taken together, these results show that the immune and stromal components of the TME are related to prognosis and progression of DLBCL.

DEGs AND ENRICHMENT ANALYSIS
To gain insight into the role of the TME in DLBCL, we divided patients into high-scoring and low-scoring groups according to the median ImmuneScore (StromalScore). The number of patients in the high ImmuneScore group who also had a high StromalScore group was 147, and the number of patients in the low ImmuneScore group who also had a low StromalScore group was also 147. Therefore, 294 patients had consistent ImmuneScore and StromalScore subgroups, representing 62.5% of all patients (Supplementary Table S1). We next investigated the changes in immune (or stromal) score-related DEGs between high-scoring and low-scoring samples. A total of 865 ImmuneScore-related DEGs were identified among which 836 genes were upregulated and 29 genes were downregulated ( Figure 2A). Similarly, there were 597 StromalScore-related DEGs, including 569 upregulated and 28 downregulated genes ( Figure 2A). A Venn diagram was used to depict the co-up/ downregulated genes associated with the different score groups (N = 183, Figure 2B, Supplementary Table S2). Enrichment analyses showed that these DEGs were related to immunerelated GO terms, including immune response and informatory response ( Figure 2C, Supplementary Table S3). KEGG pathway analyses also indicated that the genes were mainly involved in cytokine-cytokine receptor interaction pathways ( Figure 2D, Supplementary Table S3).

Sharing of PPI Network and Univariate Cox Regression
A total of 183 DEGs were used to construct the PPI network, which is shown in Figure 3A. Univariate Cox regression was performed on 183 genes, and the genes with p <0.0001 were selected ( Figure 3B). Then, intersection analysis was performed between the degree of the nodes ≥3 in the PPI and p <0.0001, resulting in 18 overlapping genes ( Figure 3C).

Construction of a Risk Prediction Model
Based on the TME According to the characteristics of variable selection and regularization, LASSO Cox regression was used to determine the optimal weight coefficient for the prognostic TME-related genes. Using one standard error of the best penalty parameter l value and 1,000-fold cross-validation ( Figure 4A), we obtained a seven-gene prognostic signature from the 18 genes identified above ( Figure 4B). The left line indicated the optimal values by l.min criteria ( Figure 4B). Then, coefficient values were extracted, and the coefficients of the seven genes were multiplied by their mRNA expression levels to calculate individual risk scores using the following formula: Risk score = the mRNA expression level of CSF3R*   Figure 4E).

Internal Validation of our Signature in GSE31312 Cohort
To verify the robustness of the seven-gene prognostic signature, we used the signature to calculate individual risk scores and divided patients from the GSE31312 cohort into high-risk and low-risk groups using the same cutoff value determined in the training set. The distributions of risk scores, survival status, and expression levels were consistent with those obtained in the training set (Supplementary Figure S1A). The verification results demonstrated that the AUC values were 0.71, 0.68, and 0.61 during 1-, 3-, and 5-year follow-up, respectively (Supplementary Figure S1B). The prognosis of the low-risk group was significantly better than that of the high-risk group (HR = 4.1; 95% CI = 2.3-7.33; p = 0.001; Supplementary Figure  S1C). Finally, we also applied the seven-gene signature to all GSE31312 samples and found that the relationships between the distributions of risk scores, survival status, and expression levels were again consistent with those obtained in the training set (Supplementary Figure S2A). Time-dependent ROC curve analysis showed that in predicting 1-, 3-, and 5-year OS, the AUC values were 0.71, 0.68, and 0.65, respectively, indicating an acceptable degree of distinction (Supplementary Figure S2B). Similarly, the prognosis of the low-risk group was significantly better than that of the high-risk group (HR = 4.45; 95% CI = 3.07-6.47; p < 0.0001; Supplementary Figure S2C). Thus, the model effectively provided prognostic classifications within the GSE31312 dataset.

External Validation of our Signature in GSE10846 and GSE11318 Cohorts
The seven-gene prognostic signature was validated in the GSE10846 and GSE11318 datasets. The results are presented in Figure 5 as Kaplan-Meier curves. Consistent with the above findings, there were significant differences in survival outcomes among different risk groups in both GSE10846 (HR = 1.96; 95% CI = 1.42-2.7; p < 0.0001; Supplementary Figure S1D) and GSE11318 (HR = 1.66; 95% CI = 1.14-2.42; p = 0.008; Supplementary Figure S1E). Thus, the signature was predictive in both internal and external datasets.   Figure 5A). Similar results were obtained in germinal center B-cell-like (GCB) patients (p < 0.0001; HR = 2.85; 95% CI = 1.7-4.48; Figure  5B). An analogous result was also obtained in patients at different stages; the low-risk group had a favorable prognosis compared with the high-risk group (Figures 5C, D). Moreover, patients in the low-risk group had significantly favorable OS compared with high-risk patients in both the IPI<2 group and the IPI≥2 group (Figures 5E, F). These results demonstrate the independent predictive ability of our signature in clinical applications.

Univariate and Multivariate Cox Regression for IPI Components and the Risk Prediction Model
To evaluate whether the risk prediction model could be used as an independent prognostic index for DLBCL patients, analyses were performed to identify the factors affecting the prognosis of DLBCL patients. These analyses were performed only on the datasets that included clinical IPI components data (GSE31312 and GSE10846), and the results showed that the seven-gene signature was an independent prognostic factor in these datasets ( Table 2).

Construction and Validation of the Nomogram
A nomogram was established to forecast 1-, 3-, and 5-year survival based on IPI components and the seven-gene model ( Figure 6A). The nomogram demonstrated that high total points predicted worse survival. The calibration chart showed an acceptable agreement between the predicted survival rate and   the actual survival rate, indicating that our proposed nomogram has stability in predicting DLBCL patient prognosis in clinical practice ( Figures 6B-D).

Immune-Infiltrating Cells and GSEA in Different Groups
In order to further verify the relationship between TME and DLBCL, we used CIBERSORT to analyze the proportions of 22 immune-infiltrating cells in GSE31312 samples between different risk groups. Figure 7A shows the compositions of immune cells in 470 patients, and Figure 7B shows the differences in the proportion of each immune cell type between the high-risk and low-risk groups. The results showed that 12 types of immuneinfiltrating cells were correlated with the risk group. Specifically, the proportions of naïve B cells, memory B cells, regulatory T cells, resting natural killer (NK) cells, and monocytes were significantly higher in the high-risk group compared with the low-risk group, whereas the proportions of CD8+ T cells, activated CD4+ memory T cells, gamma delta T cells, activated NK cells, M1 macrophages, M2 macrophages, and neutrophils were significantly lower ( Figure 7B). These results further confirmed that the prognostic signature was related to the immune activity of the TME. GSEA was used to probe the potential mechanism for the two risk groups to identify the enriched KEGG pathway. Two immune-related pathways were enriched in the low-risk group: intestinal immune network for IgA production and primary immunodeficiency diseases ( Figures 7C, D), and the rest of the enrichment results are shown in the Supplementary Figure S1. The immune pathway of the intestinal immune network for IgA production is associated with central deficits in the pathogenesis of the disease (14). Primary immunodeficiency is a pathway associated with primary immunodeficiency (15). These results suggest that the prognosis of DLBCL may be closely related to immune regulation in the TME.

DISCUSSION
DLBCL is characterized by a heterogeneous tumor entity with variable therapeutic outcomes. Risk stratification and prognosis of DLBCL patients remain challenges for clinicians and researchers (1,16). In the rituximab era, IPI, R-IPI, and NCCN-IPI risk scores are calculated using easily obtained clinical features that are part of standard diagnostic procedures. However, all three scoring systems fail to identify a very high-risk group with long-term OS clearly below 50% (10). Moreover, the progression of a tumor is affected not only by its intrinsic characteristics but also by the extrinsic TME. Immune system accumulation and immune cell infiltration could have a profound impact on carcinogenesis and prognosis (17). There is also increasing evidence that the microenvironment has an important role in predicting tumor progression and prognosis (17,18). Consequently, screening for a gene prognostic signature that adequately reflects the TME would be of great significance for individualized prevention and treatment of DLBCL patients (16,19). In this study, we established seven prognostic gene markers which robustly and reliably predicted an individualized immune prognostic model for DLBCL patients on the basis of immune genes. The seven-gene prognostic signature was combined with IPI components to build a composite prognostic nomogram, which was capable of accurate prediction and showed positive net benefit in DLBCL. The seven genes in our signature were TIMP2, QKI, LCP2, LAMP2, ITGAM, CSF3R, and AAK1. Although these genes are differentially expressed in immune cells or stromal cells, the expression of these genes in normal germinal center B cells and in a subset of DLBCL is uncertain, and our sequencing data comes from tumor tissues, so we cannot distinguish whether the expression level is caused by stromal or tumor components. High expression of TIMP2 has been reported to inhibit matrix metalloproteinases to produce anti-tumor activity (20). TIMP2 was also demonstrated to interact with multiple integrin pathways and is involved in angiogenesis in gastric cancer (21). LCP2 encodes a protein of 533 amino acids that participates in T cell activation and increases the activity of the IL-2 gene promoter through its transient overexpression (22).
High expression of LCP2 is associated with better outcomes in DLBCL patients (17,18). CSF3R is closely related to prognosis of patients with chronic neutrophilic leukemia or atypical chronic myeloid leukemia and thus represents a potentially useful criterion for disease diagnosis (23). QKI gene encodes an RNA-binding protein that regulates the functions of diverse mRNAs, which play critical parts in tumorigenesis through inactivation of tumor suppressor genes in multiple tumors (24,25). LAMP2 is essential for maintaining the structural integrity of the lysosomal compartment and relocalizes to the cell surface of some highly metastatic tumor cells. LAMP2 has been functionally validated as an essential mediator of drug resistance and tumor recurrence in hematological diseases (26)(27)(28). ITGAM and AAK1 have not been previously reported to be related to cancer, and our study is the first to suggest that they could be used as new prognostic markers of DLBCL. Furthermore, the GSEA results showed that enrichment of the seven-gene signature was significantly correlated with immunerelated signaling pathways, indicating that this model has potential clinical applications in predicting survival outcomes of patients.
Zamani-Ahmadmahmudi et al. constructed an independent seven-gene prognostic signature that could distinguish low-risk  patients with DLBCL from high-risk ones (29). In our study, we not only constructed a risk prediction model for the prognosis of DLBCL patients, but we also explore the relationship between TME and DLBCL. The results showed that immune and stromal components in the TME were negatively correlated with the prognosis of DLBCL patients. Our TME-related seven-gene prognostic signature was shown to have strong predictive power and to represent an independent prognostic factor. In the era of immune targeting, distinguishing high-risk patients from the perspective of TME can inform clinical decisions and lead to better outcomes. Although our TME-based prognostic model was shown to predict tumor prognosis in a large sample, this study had some limitations. First, owing to the retrospective design and the unavailability of control group samples in the GEO databases, the results were biased to some extent. Thus, a well-designed, prospective, international, multicenter clinical trial is needed to verify our findings in the future. In addition, further functional research is warranted to explore the molecular functions of the identified immune genes during DLBCL progression.
In conclusion, we established for the first time a TME-related prognostic signature in DLBCL patients, which is a promising prognostic model when combined with clinical IPI components. The results presented here not only help to clarify immune responses in the DLBCL microenvironment but also indicate new clinical applications for immune therapy and individualized therapy in patients with DLBCL.

DATA AVAILABILITY STATEMENT
Publicly available data sets were analyzed in this study. These data can be found here: https://www.ncbi.nlm.nih.gov/geo/.

AUTHOR CONTRIBUTIONS
HZ and LX conceived and designed the study and reviewed the manuscript. TP and YH collected, arranged, and analyzed the data and wrote the manuscript. HC, YL, RZ, and JX revised the statistical methodology. JP, SC, YZ, and LQ designed and prepared the figures and tables. All authors contributed to the article and approved the submitted version.