Identification of a Five-Gene Signature for Predicting Survival in Malignant Pleural Mesothelioma Patients

Malignant pleural mesothelioma (MPM), predominantly caused by asbestos exposure, is a highly aggressive cancer with poor prognosis. The staging systems currently used in clinics is inadequate in evaluating the prognosis of MPM. In this study, a five-gene signature was developed and enrolled into a prognostic risk score model by LASSO Cox regression analysis based on two expression profiling datasets (GSE2549 and GSE51024) from Gene Expression Omnibus (GEO). The five-gene signature was further validated using the Cancer Genome Atlas (TCGA) MPM dataset. Univariate and multivariate Cox analyses proved that the five-gene signature was an independent prognostic factor for MPM. The signature remained statistically significant upon stratification by Brigham stage, AJCC stage, gender, tumor size, and lymph node status. Time-dependent receiver operating characteristic (ROC) curve indicated good performance of our model in predicting 1- and 2-years overall survival in MPM patients. The C-index was 0.784 for GSE2549 and 0.753 for the TCGA dataset showing moderate predictive accuracy of our model. Furthermore, Gene Set Enrichment Analysis suggested that the five-gene signature was related to pathways resulting in MPM tumor progression. Together, we have established a five-gene signature significantly associated with prognosis in MPM patients. Hence, the five-genes signature may serve as a potentially useful prognostic tool for MPM patients.


INTRODUCTION
Malignant pleural mesothelioma (MPM), the most common form of malignant mesothelioma, is a highly aggressive neoplasm arising from the pleural mesothelial tissues covering the lung and is predominantly associated with occupational and environmental exposure to asbestos fibers (Wagner et al., 1960;Walker et al., 1983). Centers for Disease Control and Prevention (CDC) reports the annual number of malignant mesothelioma deaths increased by 4.8%, from 2479 in 1999 to 2579 in 2015, in the United States (Mazurek et al., 2017). Though it is still considered a rare disease, as many as 3000 new cases are diagnosed annually in United States alone (Vogelzang et al., 2003). Considering the latency period between the first asbestos exposure to MPM development often ranges anywhere between 20 and 71 years, the global incidence of MPM will likely be on the rise (Lin et al., 2019). Although the time from exposure to onset is long, the progression from onset is rapid. MPM patients often have non-specific symptoms at first which makes diagnosis extremely challenging at early stage. Moreover, the lack of an accurate and universally accepted staging system makes it even harder for investigation and treatment of the disease, leading to the poor prognosis of MPM (Rusch, 1995;Kindler et al., 2018).
Researchers have found several clinicopathological factors associated with poor prognosis of MPM, such as the male gender, elevated serum lactate dehydrogenase levels, chest pain, thrombocytosis, non-epithelial histology, and age > 75 years (Curran et al., 1998;Herndon et al., 1998). Simultaneously, potent biomarkers have been studied in relation to pathogenesis, diagnosis and prognosis of MPM. To date, non-tissuebased biomarkers have been characterized, including soluble mesothelin-related protein (SMRP), osteopontin and fibulin-3 (Hollevoet et al., 2012;Kindler et al., 2018). However, none of these biomarkers being evaluated at this time for MPM have demonstrated sufficiently rigorous prospective or blinded validation to recommend their use (Kindler et al., 2018). Some studies have evaluated the diagnostic value of microRNA to differentiate MPM from normal pleural mesothelial proliferation or other carcinomas (Busacca et al., 2010;Benjamin et al., 2010;Balatti et al., 2011). The six-microRNA signature constructed by Kirschner et al. (2015) was reported to predict the survival of MPM patients. However only a small number of long and short-term survivors were compared following extrapleural pneumonectomy, and normal samples were not included as a control, thereby limiting the use of this specific signature (Kirschner et al., 2015). Furthermore, an increasing number of studies using gene expression profiling have primarily discovered new oncogenes or tumor suppressor genes or have simply used MPM tumor samples without normal samples to construct the prognostic model. For instance, Gordon et al. developed and validated a nine-ratio (six gene) signature to differentially diagnose MPM from adenocarcinoma, and they also defined a molecular classification of MPM using transcriptional profiling by microarray, however did not independently validate the results (Gordon et al., 2002(Gordon et al., , 2005. The study took inspiration from previous research and concerning the difficulty to apply the MPM staging system into clinical work to make accurate evaluation of prognosis for MPM patients (Rusch et al., 2012), it is of great value to use bioinformatics methods to discover prognostic genes between MPM tumor and normal tissues as possible biomarkers and construct a risk-score model to clarify MPM patients into high-and low-risk subgroups, leveraging an objective approach in MPM patients' prognosis evaluation.
In this study, based on the gene expression profiling data from GEO and TCGA dataset, we developed and validated a reliable five-gene signature model independent of clinicopathological factors that improved the risk stratification for MPM patients.

Datasets
The two gene expression arrays of human MPM datasets GSE2549 (Gavin et al., 2005) and GSE51024 (Suraokar et al., 2014a) were derived from the Gene Expression Omnibus (GEO) 1 , which is a database repository of high throughput gene expression data and hybridization arrays, chips, microarrays. GSE2549 and GSE51024 sets were conducted on GPL96 (Affymetrix Human Genome U133A Array) and GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array) platforms respectively. The GSE2549 dataset includes 40 discarded human MPM tumor specimens and 9 normal specimens in which four are normal lung specimens and five are normal pleura. The GSE51024 dataset includes 41 MPM tumor tissues along with 41 paired normal tissues. For the validation of TCGA set, we downloaded the mRNA expression data (RNA-seq in FPKM value) which includes 85 MPM patients together with the clinical information from the Cancer Genome Atlas 2 .

Differential mRNA Analysis
To identify the differentially expressed genes (DEGs), GSE2549 and GSE51024 MPM datasets were employed. After annotating the probes into gene symbols by the platform annotation files, we used R package LIMMA (Ritchie et al., 2015) to get the DEGs between the tumor and normal tissue following the criteria of adjusted P-value < 0.05, |log 2 FoldChange| > 1 in both sets. Finally, by overlapping the upregulated and downregulated DEGs from the two datasets, we got the DEGs in both datasets.

Construction of the Prognostic Model by LASSO Cox Regression
By performing the univariate Cox regression analysis on the candidate DEGs from the two discovery sets, we calculated the correlation between each gene and the overall survival time of each patient and investigated the genes having strongest association with the patients' overall survival time following the criteria of P < 0.05. R package "survival" was used to perform the univariate Cox regression analysis (Gordon et al., 2020;Terry and Therneau, 2020). The least absolute shrinkage and selection operator (LASSO) Cox regression with 10-time cross validation was used to choose the penalty regularization parameter λ (Gui and Li, 2005). The coefficient of each gene was forced to shrink to zero which eliminated the correlation between the selected genes and prevented the model from being overfitting. By applying the minimum deviance, lamda.min, genes were selected. R package "glmnet" was used to perform LASSO Cox regression analysis (Goeman, 2010;Friedman et al., 2020). Together with the coefficient of each gene generated by multivariate Cox regression analysis, the prognostic risk score model was constructed. R package "survminer" was used to perform the multivariate Cox regression analysis (Kassambara et al., 2020). Based on the expression of each gene discovered, each patient's risk score was calculated according to the risk score model. The risk score model was then used to evaluate the prognosis of MPM patients.
Coeffcient i * Expression of Gene i

Construction of the Prognostic Model by Three Other Methods
To validate the prognostic model constructed by LASSO Cox regression method. Forward stepwise regression, bidirectional stepwise regression and relaxed LASSO methods were introduced to rebuild the model. Prognosis related genes detected by univariate Cox regression analysis were used to perform the analyses for the three methods. In forward stepwise regression analysis, genes were ranked by their z-score, which represented their predictive power. The gene with the highest z-score was first enrolled and gene was subsequently added according to the z-score rank list from high to low. The AIC (Akaike information criterion) for each model was calculated new gene was terminated being added to the model if the AIC stopped decreasing which indicated the genes already enrolled giving the best prognostic performance. The bidirectional stepwise regression procedure included the iterations between the "forward" and "backward" steps. R package "My.stepwise.coxph" was used to do the analysis (My.stepwise.coxph, 2017). We found the R package "fastcox" was for Lasso and Elastic-Net penalized Cox's regression in high dimensions models and it could help realize the relaxed lasso analysis (Yang and Zou, 2013a,b). The penalty regularization parameter λ was chosen via 10-time cross-validation. By applying the minimum deviance, lamda.min, genes were selected. The genes selected by each of the three methods were then enrolled in the multivariate Cox regression analysis respectively and the risk score models were then established with the method mentioned above.

Survival Analysis and ROC Analysis
According to the risk score formula, we calculated the risk score for each patient in GSE2549 discovery set and TCGA validation set. Patients from the two datasets were then divided into lowrisk and high-risk groups respectively with the median cutoff of the risk score. Kaplan-Meier survival curves were performed to evaluate whether there was significant difference between the low-risk and high-risk groups by log-rank test with P < 0.05. Univariate and multivariate analyses were conducted to see whether the five-gene signature could be a prognostic factor for MPM independent of other clinicopathological factors. P < 0.05 was considered significant. Hazard ratios and 95% confidence interval were also calculated. Univariate and multivariate analyses were performed using IBM SPSS version 23.0 (IBM Corp., NY, United States). Time-dependent receiver operating characteristic (ROC) curve analyses were conducted to evaluate the prognostic effectiveness of the risk score model compared with other clinicopathological factors and one literature model. We used R package "survivalROC" to perform ROC analysis (Heagerty et al., 2000;Patrick, 2013). C-index was calculated by R package "survival" (Terry and Therneau, 2020). R package "compareC" was used to perform the comparison of C-index value (Le Le . The method was for statistical comparison of two diagnostic or predictive systems, of which they could either be two biomarkers or two fixed algorithms, in terms of their C indices . Z score test was used for hypothesis testing.

Functional Enrichment Analysis
Gene Set Enrichment Analysis was conducted between the low-risk and high-risk groups to predict the possible molecular mechanisms responsible for the poor prognosis of MPM. Molecular Signatures Database (MSigDB) C2 Canonical pathways gene set database was used to screen the significant pathways with the criteria of |NES| (normalized enrichment score) > 1, NOM P < 0.05 and FDR (false discovery rate) q < 0.05 after performing 1000 permutations (Liberzon et al., 2011). The enrichment analysis was performed by GSEA 4.0.3 3 .

Identification of Candidate DEGs
To obtain the differentially expressed genes (DEGs) between human MPM tumor and normal tissues, two expression datasets GSE2549 and GSE51024 were enrolled as discovery datasets (Figure 1). We firstly screened for the DEGs between MPM tumor and normal tissues in these two datasets using LIMMA analysis in R (q < 0.05, |log 2 Fold Change| > 1). A total of 1438 DEGs, including 837 upregulated genes and 601 downregulated genes, were identified in GSE2549 between 40 MPM tumor specimens and 9 normal specimens. A total of 727 DEGs, including 211 upregulated genes and 516 downregulated genes were screened in GSE51024 between 41 MPM tumor tissues and 41 paired normal tissues. The volcano plots were generated for both datasets (Figure 2A). Moreover, 92 upregulated genes and 133 downregulated genes overlapped between GSE2549 and GSE51024 ( Figure 2B). Therefore, a total of 225 candidate DEGs were selected.

Construction and Validation of the Prognostic Risk Score Model for MPM
By performing univariate Cox regression analysis between these 225 candidate DEGs and survival data of discovery set GSE2549, 36 genes were detected (P < 0.05). A LASSO Cox regression analysis together with 10-time cross validation was then conducted to eliminate the number of genes and select those with non-zero coefficient ( Figure 2C). Five genes were identified and using multivariate Cox regression analysis, the coefficient of each gene was calculated. Therefore, a five-gene signature risk score model was developed based on the five genes along with their coefficients and gene expression level. Risk score = 0.1197 × expression of CDH2 + 0.6824 × expression of CKS2 + 0.5594 × expression of KIF11 + 0.7141 × expression of KIF18B + 0.5004 × expression of LOX. To confirm the validity of the signature, we used three other methods, forward stepwise regression, bidirectional stepwise regression and relaxed LASSO, to construct the prognostic models and compare them with the five-gene signature. We found the genes selected by the three methods overlapped with the five-gene signature which confirmed the validity of the genes selected by LASSO Cox regression method (Supplementary Tables 1, 2 and Supplementary Figure 1). We then used AIC, C-index and ROC curves to evaluate and compare between the models and found the five-gene model showed overall better performance (Supplementary Tables 2, 3 and Supplementary Figure 2). We further validated the differential expressions of the five genes between tumor and normal tissues in GSE2549 and GSE51024 sets. All five genes were significantly overexpressed in tumor tissues (Supplementary Figure 3A). At the same time, we did the Kaplan-Meier curve to evaluate the relationship between the gene expression and the overall survival (OS). The five genes were strongly negatively related to the OS in GSE2549 dataset (Supplementary Figure 4).
Each patient's risk score was calculated according to the risk score model in the discovery cohort GSE2549. The patients were divided into low-risk or high-risk group by the median risk score. Patients in high-risk group had much shorter survival time than those in low-risk group (hazard ratio = 3.909, 95% CI = 1.797-8.503, log-rank test P < 0.0001). Chi-square analysis showed the death rate was significantly higher in high-risk group than low-risk group (Figures 3A-C). Simultaneously, we used MPM TCGA dataset (85 cases) as a validation cohort to confirm the reproducibility of the risk score model (Figures 3D-F). The prognostic signature was successfully validated in TCGA dataset showing that patients in high-risk group had significantly shorter OS than low-risk group patients (hazard ratio = 3.929, 95% CI = 2.295-6.726, log-rank test P < 0.0001). In addition, all five genes were validated significantly negatively related to the OS in TCGA dataset (Supplementary Figure 5).
To evaluate whether the five-gene signature could be an independent prognostic factor for MPM, we conducted the univariate and multivariate Cox regression analyses in both GSE2549 and TCGA datasets. From the univariate analysis of GSE2549 dataset in Table 1, we observed that patients with mixed histological subtype, lymph node positive or high risk score were more likely to have shorter survival time and worse prognosis compared to patients with epithelial subtype, lymph node negative or low-risk score. Hence, these results showed that histological subtype, node status and risk score had a relatively significant impact on prognosis of MPM patients. From the univariate analysis of TCGA dataset in Table 2, risk score was the only factor found to influence the prognosis of MPM patients. After controlling for gender, race, histological type, Brigham stage, AJCC stage, tumor size, lymphatic metastasis status and organ metastasis status, the multivariate analysis results showed the five-gene signature remained an independent prognostic factor in both datasets (P < 0.001 in Table 1, P < 0.0001 in Table 2).

Stratified Survival Analysis of the Five-Gene Risk Score Model
Stratified survival analysis was further conducted in subgroups of patients with different clinical variables (gender, histological subtype, tumor size, lymph node metastasis status, AJCC stage,  and Brigham stage). According to Tables 3, 4, the five-gene risk score model was generally of statistically significant prognostic value. The MPM patients stratified by Brigham staging system, in either early stage or advanced stage, could be separated into the subgroups of better prognosis and poorer prognosis by the fivegene signature (Figures 4A,B). The model retained its prognostic value when stratified by AJCC staging system in TCGA dataset (Figures 4C,D). At the same time, we also found significant prognostic value of our model in lymph node negative or positive patients in both datasets (Figures 4E,F).
To further evaluate the prognostic value of the five-gene signature model, we compared our model with a three-gene prognostic model established by Zhou et al. (2019). The AUCs for 1-and 2years OS in either GSE2549 or TCGA dataset by our model were comparable with Zhou's model. We also used concordance index (C-index) to evaluate our model. And R package "compareC" was used to do the statistical comparison of the C-indices between our model and Zhou's model. The C-indices for GSE2549 dataset were comparable between our model and Zhou's model and no statistical difference was detected (Table 5). However, our model showed significant higher C-index value than Zhou's model (P < 0.05) in TCGA validation dataset ( Table 6).

Functional Analysis
To obtain the potential biological function of the five-gene signature in MPM tumorigenesis, the Gene Set Enrichment Analysis (GSEA) was conducted to identify the associated pathways between the high-risk and low-risk subgroups in GSE2549 discovery dataset and TCGA validation dataset.
Here we used the canonical pathway gene set from the Molecular Signatures database as our gene set database. The gene sets were considered significantly enriched when the definite value of normalized enrichment score was more than 1, Nominal P-value was less than 0.005 and FDR q-value was less than 0.05. From the GSEA report, we discovered that "RHO GTPASES activate formins, " "Mitotic spindle checkpoint, " "PLK1 pathway, " and "resolution of sister chromatid cohesion" pathways were significantly enriched in high-risk group patients from both discovery and validation sets. Simultaneously, several cancer related pathways such as "TP53 regulates transcription of cell cycle genes, " "regulation of TP53 activity through phosphorylation, " "cell cycle checkpoints, " "kinesins, " and "DNA double strand break repair" were also enriched in the high-risk group of TCGA. They were all reported to be involved in tumorigenesis and tumor progression (Supplementary Figure 8).

DISCUSSION
Over the years, several staging systems have been proposed for MPM (Rusch and Venkatraman, 1996). The TNM staging system proposed by the International Mesothelioma Interest Group (IMIG) subsequently accepted by the American Joint  HR, hazard ratio; CI, confidence interval. *One patient was not included in the stratified analysis of GSE2549 dataset, because the clinical information was not available.
All P values in bold were less than 0.05 which were considered statistically significant.

Commission on Cancer (AJCC) and the Union for International
Cancer Control (UICC) is the one that has been generally accepted. Nevertheless, the current AJCC/UICC staging classification for MPM is still difficult to apply to clinical staging and thus may be imprecise in predicting prognosis and providing appropriate treatment for MPM patients (Rusch et al., 2012). Therefore, it is of great significance to explore biomarkers with optimal prognostic value for MPM patients. Research shows that combination of multiple biomarkers will improve the prognostic value instead of a single biomarker (Nalejska et al., 2014). In this study, we developed and validated a five-gene signature to evaluate the prognosis of MPM.
First, we identified 225 candidate DEGs based on two expression microarrays from GEO database. Aiming to eliminate the correlation between the genes selected by univariate analysis and improve the applicability in clinical practice, a five-gene signature risk score model was constructed by LASSO Cox regression. The LASSO Cox regression model has been widely applied to the Cox proportional hazard regression model for survival analysis with high dimensional data (Tibshirani, 1997;Zhang and Li, 2007;Wei et al., 2015;Lin et al., 2020). By studying Hastie et al's research work on comparing the LASSO with other model selection methods (Hastie et al., 2017), we introduced three more methods to confirm the validity of the genes selected by LASSO Cox regression. The performances of the models were evaluated by AIC, C-index and AUC with respect to the degree of goodness-to-fit, the prediction accuracy and the predictive capacity. From the results, we could see the bidirectional stepwise regression showed the worst performance among these four methods (Supplementary Tables 2, 3 and Supplementary Figure 2). Our five-gene model by LASSO showed better or equivalent performance when compared with the models constructed by forward stepwise regression or relaxed LASSO. At the same time, we noticed four of the five genes (KIF18B, CKS2, LOX, and CDH2) in forward stepwise regression model and four of seven genes (CKS2, KIF11, KIF18B, and LOX) in relaxed LASSO model were overlapped with the five genes in LASSO model which further confirmed the validity of the genes selected by LASSO Cox regression method. Compared with the model constructed by relaxed LASSO method which included seven genes, our signature only had five genes and it showed better or equivalent prediction performance. The general principle for model selection was that for a given level of accuracy, a simpler or a more parsimonious model is preferable to a more complex one (Bozdogan, 1961;Stone, 1981). Taking into account the measurement cost to implement the model and the complexity of the model, the five-gene signature not only had better or equivalent prediction performance but also had higher practicability value. The model was validated in GSE2549 and TCGA datasets showing that high-risk group patients always had worse clinical outcome and shorter survival time than low-risk group patients. In stratified analysis, our model had good performance in predicting OS. The univariate and multivariate analyses indicate our five-gene signature to be an independent prognostic factor for MPM patients. Simultaneously, our results showed that histological subtype, gender, and lymph node had relatively significant impact on prognosis, and these results were in agreement with many other researches for prognostic factors (Spirtas et al., 1988;Anand, 1994;Curran et al., 1998;Vigneswaran et al., 2017). To evaluate the accuracy and discrimination power of our risk score model, we did the ROC and C-index analyses. Simultaneously, we compared our model with available clinicopathological factors. Our risk score model exhibited stable predictive performance compared with clinicopathological factors (Supplementary Figures 6, 7). Considering our risk score model was using relatively objective gene expression levels tested by microarray or RNAseq while evaluating the prognosis of MPM patients, some of the clinicopathological factors might be variable in evaluation of prognosis of MPM patients. And this might also be due to lack of the data. We also compared our model with the literature model by Zhou et al. (2019), the AUCs for 1-and 2-years OS were comparable in discovery GSE2549 dataset. While putting the two models in the larger validation TCGA dataset, although our model showed higher AUCs, but the confidence of intervals overlapped, indicating no statistical difference. Therefore, we could not determine yet which model was better at evaluating the prognosis of MPM patients from the statistical perspective. Maybe due to the lack of data, we could not detect the statistical difference of AUCs between our model and Zhou's model. When we used C-index to compare our model and Zhou's model in the validation TCGA dataset, our model showed higher C-index value with a P-value of 0.000649.
To gain more insights into the modulatory roles of the five genes in the signature, GSEA analysis was performed and showed that the "RHO GTPASES activate formins, " "mitotic spindle checkpoint, " "PLK1 pathway, " and "resolution of sister chromatid cohesion" pathways were significantly associated with poor prognosis in high-risk subgroup MPM patients. The Rho family of GTPases is a family of small signaling G protein that regulate actin cytoskeleton organization and dynamics (Vega and Ridley, 2008). Nearly 75% of MPM cases harbor loss of function of core components of the Hippo pathway, which negatively regulates YAP activity (Zhang et al., 2010;Murakami et al., 2011). RhoA may strongly enhance YAP/TAZ activity, thereby promoting the proliferation of MPM in the sense that high YAP/TAZ activity is positively related to high proliferation capacity for MPM (Dupont et al., 2011;Mizuno et al., 2012). Zhang et al. reported a proliferation inhibitory effect in MPM cell lines with GSK269962A, a selective inhibitor of Rho−Kinase (Zhang et al., 2017). These studies, together with our result, indicated that the "Rho GTPASES activate formins" pathway is likely a significant mechanism in MPM tumorigenesis, and might serve as a target in the treatment of MPM patients. Additionally, the "mitotic spindle checkpoint" and "resolution of sister chromatid cohesion" pathways are closely related. The spindle checkpoint delays sister chromatid separation until all chromosomes have undergone bipolar spindle attachment. Dysfunction of this checkpoint contributes to tumorigenesis. Moreover, certain components of the mitotic spindle checkpoint pathway, including CHEK1, BUB1, and MAD2L1 were found to be upregulated in MPM tumors via microarray technology (Crispi et al., 2009). Similarly, Suraokar et al. carried out a gene expression microarray experiment on 53 surgically resected MPMs tumors along with paired normal tissues and found that the mitotic spindle checkpoint pathway was the most significantly altered pathway in MPM patients  (Suraokar et al., 2014b). They also evaluated the indicator for the deregulated expression of the mitotic spindle checkpoint pathway in an independent cohort of 80 MPM tumors and found higher nuclear MAD2L1 expression associated significantly (P = 0.043) with lower rates of OS. These findings, along with our result, clearly demonstrate the important role of the "mitotic spindle checkpoint" pathway in MPM that the pathway, suggesting that it might be a possible target for MPM therapy. Lastly, PLK1 (polo-like kinase 1), has been identified as a candidate therapeutic target and independent prognostic marker of MPM by an RNAi-based screening study (Linton et al., 2014). The inhibitory effect on cell proliferation following treatment with the PLK1 small inhibitors BI6727, and BI2539 or the PLK1specific siRNA and artificial microRNA, have been validated in multiple MPM cell lines (Linton et al., 2014;Kato et al., 2016b). Hence, under our five-gene risk score classification, the high-risk MPM patients may benefit from PLK1 specific small molecule inhibitors, which are currently considered to be attractive therapeutic strategies against specific tumor types such as leukemia and non-small cell lung cancer (Medema et al., 2011;Lee et al., 2015).
All five genes (CDH2, CKS2, KIF11, KIF18B, and SEMA3G) in our model were confirmed to be significantly associated with the OS of MPM patients. Cadherin2 (CDH2), encoding N-cadherin protein, is closely related to the epithelial-mesenchymal transition (EMT) process and demonstrated prognostic significance in MPM (Schramm et al., 2010). It was reported that there was a substantial switch from epithelial markers such as E-cadherin and β-catenin to mesenchymal markers such as N-cadherin through epithelium to biphasic and sarcomatoid  C-index, concordance index; CI, confidence interval. All P values in bold were less than 0.05 which were considered statistically significant.
subtypes, indicating the three histological subtypes of MPM are the consequences of different steps in an EMT process, of which the sarcomatoid subtype has the worst OS (Fassina et al., 2012). Consistent with these previous studies, we found the expression of CDH2 was higher in MPM tumor tissues than normal tissues and the high expression level of CDH2 was associated with poor prognosis (Supplementary Figures 3, 4A, 5A). Cyclin-dependent kinases regulatory subunit 2 (CKS2) is the member of cell cycle dependent protein kinase subunits family. Although rarely reported with MPM, there is accumulating evidence showing CKS2 is upregulated in many types of tumor as a prognostic factor, including hepatocellular carcinoma, colorectal cancer, bladder cancer, breast cancer, gastric cancer and epithelium ovarian cancer (Kawakami et al., 2006;Shen et al., 2010;Tanaka et al., 2011;Yu et al., 2015;Huang et al., 2019;Xu et al., 2019). Our results revealed that the expression of CKS2 was elevated in MPM tumor tissues compared to normal tissues and it was harmful for the prognosis of MPM patients, which is in line with these previous findings of other types of tumor (Supplementary Figures 1, 4B, 5B). Investigators also found that CKS2 may advance tumor progression by promoting tumor cell proliferation and regulating apoptosis (Shen et al., 2013).
Kinesin family member 11 (KIF11, also known as EG5) and Kinesin family member 18B (KIF18B) are two kinesin superfamily members, yet they are from two different kinesin subfamilies and function differently (Lawrence et al., 2004). KIF11 is essential for cell growth and proliferation, involved in the formation of the bipolar spindle in cell mitosis. KIF11 was found overexpressed not only in MPM human tumor samples and MPM human cell lines, but also in blast crisis chronic myelogenous leukemia and pancreatic cancer (Nowicki et al., 2003;Liu et al., 2010;Kato et al., 2016a). This is consistent with our analysis that KIF11was significantly overexpressed in MPM tumor tissues and was confirmed as a poor prognostic factor for MPM patients (Supplementary Figures 3, 4C, 5C). Mitotic arrest as a result of KIF11 inhibition has been observed in a variety of tumors (Infante et al., 2012). Several compounds inhibiting KIF11 entered Phase I and II clinical trials (El-Nassan, 2013). Different from KIF11, KIF18B is involved in the regulation of microtubule dynamics (Lee et al., 2010). KIF18B was rarely reported with MPM, but it was reported promoting tumor progression in cervical cancer, hepatocellular carcinoma, and pancreatic cancer (Wu et al., 2018;Li et al., 2020). High levels of KIF18B were associated with poor prognosis in lung adenocarcinoma patients (Ji et al., 2019). Our analysis showed the high expression of KIF18B was significantly associated with poor prognosis of MPM patients (Supplementary Figures 4D, 5D).
Lysyl oxidase (LOX) was one of the five paralogs functioning primarily as the crosslink of collagens or elastin in extracellular matrix (Kim et al., 2011). Studies have shown that LOX mRNA level was increased in various cancer types, including head and neck squamous cell carcinoma, and breast and prostate cancers (Kirschmann et al., 2002;Lapointe et al., 2004;Erler et al., 2006). Recently, LOX was identified as a potential diagnostic biomarker in MPM (Kim et al., 2020). We also note that patients with high expression of LOX had shorter overall survival time and worse prognosis compared with patients with low expression of LOX (Supplementary Figures 4E, 5E).
Several limitations of our study should be pointed out. First, the clinical information of the patients was not comprehensive. Specifically, the clinical information of the discovery set GSE51024 was not accessible. In addition, there was no age, asbestos exposure history or any treatment information for the patients in either GSE2549 or TCGA dataset. Therefore, we could not evaluate additional possible prognostic factors. Second, given the low incidence of MPM, and the scarcity of available public databases, we could only adopt two datasets, GSE2549 and GSE51024, as our discovery datasets. GSE2549 dataset had a split of 40 tumor samples and 9 normal samples, while GSE51024 had 41 pairs of tumor and normal samples, this may lead to result bias and potentially cause the loss of viable DEGs. Third, only one MPM TCGA dataset was used as the validation dataset. The model was constructed based on the DEGs overlapping between the two discovery datasets using microarray technology, yet it was validated in the TCGA dataset using RNAseq technology which might affect the performance of the model. Compared with RNA-seq, microarray is unable to detect novel transcripts and might cause loss of possible DEGs, particularly genes with low expression. To test the practicality and accuracy of our model, rigorous validation in large prospective studies are needed. Finally, the molecular and biological mechanisms of these five genes in MPM need to be further investigated by additional research.

CONCLUSION
In conclusion, we have identified a five-gene signature risk score model as an objective and practical prognostic tool independent of clinicopathological factors for MPM patients, which complements the current MPM staging system. The accuracy and stability of our model provides opportunity for future clinical application.

AUTHOR CONTRIBUTIONS
YB, XW, JH, LG, KN, and LJ contributed to the study design. YB, XW, XL, ZR, LJ, and KN contributed to the statistical analysis. YB, ZR, and HG contributed to the manuscript draft. YB, XW, LJ, and KN prepared the figures and tables. All authors reviewed and approved the final manuscript and carried out in collaboration work.