Identification of miR-200c and miR141-Mediated lncRNA-mRNA Crosstalks in Muscle-Invasive Bladder Cancer Subtypes

Basal and luminal subtypes of muscle-invasive bladder cancer (MIBC) have distinct molecular profiles and heterogeneous clinical behaviors. The interactions between mRNAs and lncRNAs, which might be regulated by miRNAs, have crucial roles in many cancers. However, the miRNA-dependent crosstalk between lncRNA and mRNA in specific MIBC subtypes still remains unclear. In this study, we first classified MIBC into two conservative subtypes using miRNA, mRNA and lncRNA expression data derived from The Cancer Genome Atlas. Then we investigated subtype-related biological pathways and evaluated the subtype classification performance using Decision Trees, Random Forest and eXtreme Gradient Boosting (XGBoost). At last, we explored potential miRNA-mediated lncRNA-mRNA crosstalks based on co-expression analysis. Our results show that: (1) the luminal subtype is primarily characterized by upregulation of metabolism-related pathways while the basal subtype is predominantly characterized by upregulation of epithelial-mesenchymal transition, metastasis, and immune system process-related pathways; (2) the XGBoost prediction model is consistently robust for classification of the molecular subtypes of MIBC across four datasets (The area under the ROC curve > 0.9); (3) the expression levels of the molecules in the miR-200c and miR141-mediated lncRNA-mRNA crosstalks differ considerably between the two subtypes and have close relationships with the prognosis of MIBC. The miR-200c and miR-141-dependent mRNA-lncRNA crosstalks might be of great significance in tumorigenesis and tumor progression and may serve as the novel prognostic predictors and classification markers of MIBC subtypes.


INTRODUCTION
Urothelial bladder cancer (UBC) is one of the most common malignant tumors of urinary system. UBC can generally be classified into non-muscle-invasive bladder cancer (NMIBC) and muscle-invasive bladder cancer (MIBC), according to whether the cancer cells are restricted locally in the lamina propria or invade the muscularis propria (Kamat et al., 2016). A great number of studies have reported that according to shared RNA expression patterns or specific genomic alterations MIBC can be further classified into two major subtypes, namely basal and luminal (Sjodahl et al., 2012;Iyer et al., 2013;Choi et al., 2014a,b;Damrauer et al., 2014;Network, 2014;Robertson et al., 2017), which are strikingly similar to the molecular subtypes first described in breast cancer (Perou et al., 2000;Prat et al., 2010). The basal subtype has drawn much attention because it is associated with a more aggressive phenotype and has a higher risk of distant metastasis than luminal subtype (Choi et al., 2014a;Robertson et al., 2017). One reason for the difference is that the two subtypes develop from etiologically different pathways. Pathways that are involved in EMT and immune-associated pathways are upregulated in the basal subtype (Choi et al., 2014a). The molecular biomarkers and pathways involved in MIBC subtypes are the key to understanding its subtype heterogeneity and identifying subtype-specific biomarkers that can be used to better manage MIBC patients.
MicroRNAs (miRNAs) represent one of the most exciting areas of modern medical and biological sciences as they can modulate an immense and complex regulatory network of gene expression in a broad spectrum of developmental and cellular processes, such as cell proliferation, metabolism, apoptosis, and viral infection (Johnston and Hobert, 2003;Hatfield et al., 2005;Zhao et al., 2005;Chen et al., 2006;Oliveira-Carvalho et al., 2012;Huang M. et al., 2016). miRNAs not only have a well-established inhibitory effect on gene expression but also promote gene expression in some cases (Sayed and Abdellatif, 2011;Song et al., 2014), and long non-coding RNAs (lncRNAs) exhibit facilitative or suppressive effects on the gene regulatory network during tumor development (Gontan et al., 2012;Sun et al., 2013). Furthermore, aberration or perturbation in miRNA-mediated mRNA and lncRNA expression levels has a significant correlation with serious clinical consequences, including diseases of diverse origins and malignancy (Salmena et al., 2011;Valinezhad Orang et al., 2014;Tay et al., 2014;Yuan et al., 2014;Zeng et al., 2016;Hu et al., 2017).
Regarding molecular drivers of cancer development, oncogenic mutations and downstream signaling pathways in the pre-cancerous or cancerous cell have been thought to play a crucial role in the cancer formation and progression. In addition, recent studies have shown that metabolic reprogramming plays much more important roles than previously thought in cancer development (Cairns et al., 2011). It is possible that a great number of genomic mutations detected in cancer provide a selective advantage for the cancer cell in the stressful tumor microenvironment by reprogramming cell metabolic processes (Zhang et al., 2015). No matter what is the primary cause of cancer development, it is clear that both the oncogenic signaling and reprogrammed metabolisms involve numerous genes, working in a concerted manner in a complex network. Gene regulatory network-based view can, therefore, provide a deeper insight into the cancer development.
The aim of this study is to identify subtype-specific dysregulated miRNA-mediated mRNA-lncRNA interactions and discover new critical subtype-related genes in MIBC.

Data Acquisition and Pre-processing
The MIBC RNA-Seq (FPKM) and clinical data were obtained from The Cancer Genome Atlas (TCGA) public data portal 1 , and miRNA-Seq (RPM) data was downloaded from the Broad GDAC Firehose 2 . The gene expression datasets of 403 tumor samples and 19 adjacent normal tissue samples contain 19181 mRNAs, 14376 lncRNAs, and 2588 mature miRNAs. The microarray datasets (GSE32894, GSE13507, and GSE31684) derived from Gene Expression Omnibus (GEO) were used to evaluate the performance of classifiers and verify the prognostic use of marker genes 3 .

Clustering Analysis and Gene Set Enrichment Analysis
Consensus clustering (Monti et al., 2003) is a method that provides quantitative evidence for determining the number and membership of possible clusters within a dataset, such as RNA-Seq and microarray. For CC analysis, the RPKM gene expression data was pre-processed to detect the most highly expressed and variable genes across samples. We removed 25% genes that have the low arithmetic mean of the given gene across samples. Then the MAD was used to select the most highly expressed and variable 3,000 mRNAs, 300 miRNAs, and 3,000 lncRNAs. CC available in the R package "ConsensusClusterPlus" was performed on 3,000 mRNAs, 300 miRNAs, and 3,000 lncRNAs with 403 tumor samples, using the following key parameters: reps = 50, innerLinkage = complete, clusterAlg = hc, k = 6, and distance = pearson (Wilkerson and Hayes, 2010).
Cluster of cluster analysis is a method of integrating the primary clustering results into final cluster assignments. Each sample is represented as a binary vector, whose length is t i=1 K i (where t is the number of datasets and K i is the number of clusters for dataset (i), to implement subsequent clustering analysis. We first conducted the CoC analysis on the clustering results of mRNA, miRNA, and lncRNA dataset to obtain a binary dataset. The CC was once more performed on the binary dataset for generating final clusters. Number of final clusters (K) was estimated by commonly used methods including ASW, CPCC, Relative Change in Area under Cumulative density function [ (K)], and PAC (Şenbabaoglu et al., 2014).
In order to explore subtype-associated biological processes, GSEA (Subramanian et al., 2005) was conducted using three gene set datasets (GO-BP, KEGG, and Hallmark gene sets]. The following parameters were taken for GSEA: Number of permutations = 1000, Permutation type = gene_set, Enrichment statistic = weighted, Metric for ranking genes = Signal2Noise.

Differentially Expressed Genes and Machine Learning
"Ballgown" (R package) was used to identify DEGs between tumor and normal samples (Frazee et al., 2015). F-test was used in "Ballgown", and DEGs here were defined as those with FDR adjusted p-value < 0.05 (Benjamini-Hochberg method) and |log 2 fold change| > 0.57).
Three tree-based machine learning methods, namely DTs, RF, and eXtreme Gradient Boosting (XGBoost or XG), were performed on 3000 mRNAs, 300 miRNAs, and 3000 lncRNAs for MIBC subtype classification. The area under the ROC curve (AUC) was used to estimate the performance of the classification methods. For each classification method, MIBC samples were randomly divided into training (60%) and testing (40%). We performed RF with different parameter values of ntree and mtry, and used 10-fold cross-validation to acquire the mean accuracy. XGBoost was implemented with the following parameters: gamma = 1, min_child_weight = 1, max_depth = 14, nrounds = 2000. In order to optimize the parameter iter (number of iterations) of XGBoost, we obtained 10-fold crossvalidation performance for each iter and selected the iter value that generated the best performance. For DTs, the following parameters were taken: minCases = 20 and CF = 0.25. Moreover, the well-performed classifiers in this study were trained on the TCGA-derived RNA expression data and were tested on the GSE32894 to further evaluate their performance. All machine learning methods were implemented using R packages including "C5.0", "randomForest", and "XGBoost" packages (Liaw and Wiener, 2002;Chen and Guestrin, 2016;Kuhn et al., 2018).
The overlap between the feature genes obtained by the wellperformed classifiers and DEGs was referred to as DEFGs. GO enrichment analysis available in the R package "clusterProfiler" was performed on DEFGs to identify their enriched GO terms (Yu et al., 2012). A multiple-test correction was done using the method proposed by Benjamini and Hochberg, in which an adjusted p-value < 0.05 was considered to represent statistical significance.

Construction of a Subtype-Related mRNA-miRNA-lncRNA Network
Pairwise Pearson's correlation analysis was carried out on the DEFGs. The lncRNA-miRNA pairs, miRNA-mRNA pairs, and lncRNA-mRNA pairs with |r| > = 0.4 and p-value < 0.05 were considered to be co-expressed gene pairs. If both elements in a co-expressed lncRNA-mRNA pair are simultaneously coexpressed with a miRNA, it is defined as a miRNA-dependent lncRNA-mRNA co-expressed interaction. A miRNA-dependent lncRNA-mRNA network was established using Cytoscape software (version 3.5.1). miRWalk2.0 (Dweep et al., 2011) is an integration of six widely used databases (miRWalk, miRanda, miRDB, miRNAMap, RNA22, and Targetscan) and supplies the biggest available collection of predicted and experimentally verified miRNA-target interactions. Our inferred co-expressed interactions including mRNA-miRNA and lncRNA-miRNA interactions were compared to those derived from miRWalk2.0. An mRNA is considered to be a true target of miRNA if their interaction occurs in at least four databases, and an lncRNA is considered to be a true target of miRNA if their interaction is supported in at least one database among miRWalk, miRanda, and Targetscan.

Survival Analysis
We further assessed whether the genes in the inferred interactions are correlated with the overall survival of MIBC patients. Based on the mean expression level of the genes, patient samples were divided into high and low expression groups. We performed survival analysis available in R package "survival" (Therneau, 2015) using the Kaplan-Meier curve (K-M curve) method. A logrank test was used to compare survival times between two groups, and p < 0.05 was considered to represent the statistical significance.

Clustering Analysis and GSEA
We first performed the CC on mRNA, miRNA, and lncRNA expression datasets to obtain the clustering results. By applying the CoC analysis to the clustering outcomes of CC, a binary dataset was obtained, which was referred to as CoC dataset. The CC was once again performed on the CoC dataset to generate the different Ks, and the ASW, CPCC, K, and PAC were used to evaluate the optimal K (Supplementary Figure S1). As a result, for the CoC dataset, ASW evaluation suggests the optimal K of 6 and CPCC, K, and PAC indicate the optimal K of 2. Given that K = 2 is the consistent optimal value, we chose K = 2 as a solution, dividing MIBC samples into two subtypes, namely subtype-1 and subtype-2. The hierarchically clustered heatmap of K = 2 for CoC dataset was shown in Figure 1A. Survival curves regarding two subtypes were plotted using the K-M method. Our results have shown that 5-year overall survival rate with regard to subtype-1 is 55% and 30% for subtype-2, indicating that they differ considerably in clinical prognosis ( Figure 1B, p < 0.01). The heatmap depicting basal biomarkers, luminal biomarkers, and clinical indicators for the two subtypes was shown in the Figure 1C. The subtype-1 is characterized by the high expression of luminal markers such as CYP2J2, ERBB2, and KRT18, while the subtype 2 is characterized by high expression of basal markers such as CD44, CDH3, and KRT1. The Pearson's chi-squared test is utilized to compare clinical indicators between the two subtypes. The histology, stage, grade, and status are significantly different between the two subtypes, and gender almost differs between the two subtypes (Supplementary Table S1). The subtype-1 and subtype-2 resemble the luminal and basal subtype, respectively, in terms of K-M curves, biomarkers, and clinical indicators, therefore, which were redefined as luminal and basal subtypes (Choi et al., 2014a).
Gene set enrichment analysis was done for the basal and luminal subtypes, and the results were shown in Tables 1, 2. Upregulated pathways in luminal subtype are mainly involved in metabolism (e.g., oxidative phosphorylation, cytochrome P450, and fatty acid metabolism) ( Table 1). Whereas, upregulated pathways in the basal subtype are principally related to immune system process (e.g., extracellular structure organization, allograft rejection, mTORC1 signaling, and TNF-a signaling via NF-kB), metastasis, and EMT ( Table 2).

Differentially Expressed Genes and Machine Learning
The DEGs that could distinguish tumor from normal samples were analyzed and visualized as volcano plots ( Supplementary  Figures S2A-C). In total, 208 miRNAs (148 upregulated and 60 downregulated), 2488 lncRNAs (1402 upregulated and 1086 down-regulated), and 4167 mRNAs (2314 upregulated and 1853 downregulated) are differentially expressed.
We applied DTs, RF, and XGBoost for the basal and luminal subtype classification based on mRNA, miRNA, and lncRNA expression dataset, and AUC was used to evaluate their performance. As shown in Figure 2A, XGBoost outperforms RF and DTs, having AUC values of 98.6, 94.5, and 98.7%, respectively, in mRNA, miRNA and lncRNA-based classification. Details regarding 10-fold cross-validation procedure can be found in Supplementary Figure S3. DTs was excluded in the following comparison, as it is significantly inferior to RF and XG on average. By using the CC method, the GSE32894 dataset containing 28 biomarkers and 190 samples was grouped into two subtypes prepared for the classification task. The heatmap plots and the K-M curves for the two subtypes were shown in Supplementary Figure S4. We trained the well-performed classifiers (RF and XG) on mRNA dataset that was derived from TCGA and tested them on GSE32894 dataset. The results demonstrated that XGBoost has a better performance than RF ( Figure 2A4). The intersection between DEGs and feature genes obtained by RF and XG was defined as DEFGs, which includes 57 lncRNAs, 120 miRNAs, and 278 mRNAs. The Upset plot and heatmap  plots for DEFGs were shown in Figure 2B and Supplementary  Figures S2D-F. The genetic and clinical information of DEFGs was visualized in Figure 2C. GO enrichment analysis indicated that differentially expressed feature mRNAs are enriched with adherens junction, cell-substrate junction, cell-cell junction, cellsubstrate adherens junction, and focal adhesion ( Figure 2D). These GO terms have been found to play roles in tumorigenesis and tumor progression by regulating T-cell signaling, innate immunity, TGF-β signaling, and Wnt signaling through posttranslational modification (Kikuchi et al., 2006;Lönn, 2010;Liu et al., 2016;Cho et al., 2018;Kuwabara et al., 2018).

Survival Analysis of Crosstalk-Involved Genes
The association between expression levels of crosstalk-involved genes and MIBC prognosis was analyzed by K-M method. Strikingly, the results revealed that all of them are closely related to prognosis of MIBC. Specifically, the higher expression level of miR-141-5p, miR-141-3p, AC010326.3, AC073335.2, miR-200c-3p, and GATA3 predicts better prognosis, indicating that they may function as tumor suppressors (Figures 4B-F,H); In contrast, the higher expression level of MIR100HG, PALLD, and CLIC4 is associated with worse prognosis, suggesting that they may play an oncogenic role (Figures 4A,G,I). In addition, the association between MIBC prognosis and expression levels of crosstalk-related mRNAs (CLIC4, PALLD, and GATA3) was validated in two independent microarray datasets (GSE13507 and GSE31684), suggesting again their prognosis value in MIBC (Supplementary Figure S5).

DISCUSSION
In this study, we have investigated miRNA-dependent mRNA-lncRNA interactions in MIBC basal and luminal subtypes using bioinformatics approaches. On the basis of MIBC mRNA, miRNA, and lncRNA expression datasets obtained from TCGA, 403 MIBC samples were reliably classified into two intrinsic molecular types, which resemble basal and luminal subtypes identified previously (Choi et al., 2014a). A number of subtype-related pathways were identified through GSEA. Moreover, we conducted and compared subtype classification performance among tree-based machine learning algorithms, and found XGBoost outperforms other classifiers. Additionally, FIGURE 4 | Kaplan-Meier plots for crosstalk-involved genes. (A-I) The red lines represent the high expression of crosstalk-involved genes while blue lines represent the corresponding low expression. The p-value was calculated using a log-rank test, where p < 0.05 represents statistical significance.
we implemented a gene co-expression analysis on DEFGs and successfully identified subtype-specific mRNA-lncRNA crosstalks, which differ considerably between basal and luminal subtypes and have close relationships with the prognosis of MIBC. Subtype-related pathways presented in this study (Tables 1, 2) are largely consistent with the previously identified (Choi et al., 2014a;Hurst and Knowles, 2014;McConkey et al., 2016;Ochoa et al., 2016;Hau et al., 2017;Seiler et al., 2017;Baker et al., 2018). In general, pathways that are involved in the EMT, metastasis, and immune system process, are upregulated in the basal subtype, whereas, metabolic-related pathways are upregulated in the luminal subtype. Th pathways enriched in basal and luminal subtypes provide a biological explanation for their distinctively different clinical and pathological behaviors. However, the mechanisms by which some other pathways shown in our results, like valine leucine, isoleucine degradation, autoimmune thyroid disease, hematopoietic cell lineage and viral myocarditis, play a role in MIBC subtypes deserve further investigation.
Many machine learning methods have been broadly applied in many areas of biology such as gene family classification, hepatotoxicity prediction, RNA methylation prediction, cancer prediction and classification (Zou et al., 2014;Kourou et al., 2015;Liao et al., 2017Liao et al., , 2018Su et al., 2018;Wei et al., 2018a,b). As suggested in previous studies, RF is a powerful classifier for classifying gene expression data (Wu et al., 2003;Lee et al., 2005;Ishwaran et al., 2010). And XGBoost keeps winning in "every" Kaggle competition and has become a really popular tool among data scientists (Ren et al., 2017;Torlay et al., 2017;Zhang and Zhan, 2017). Recently, XGBoost has been successfully applied to many classification problems, such as pan-cancer classification  and prediction of RNA-protein interactions (Jain et al., 2018). However, no comparison between RF and XGBoost in terms of cancer classification has been made in the past. In this study, we compared the performance of DTs, RF, XGBoost in classifying basal and luminal subtypes. Our results clearly demonstrated the advantage of XGBoost in gene expression databased cancer classification (Figure 2A).
Previous studies investigated MIBC-associated miRNAs and their target genes without considering the genetic heterogeneity of MIBC subtypes (Martens-Uzunova et al., 2014;Huang T. et al., 2016;Xue et al., 2016;Zhong et al., 2016). It is therefore important to elucidate the subtype-related molecular pathways and identify novel biomarkers for MIBC subtypes. In this study, we systematically explored MIBC subtype-related gene coexpression networks. A total of three mRNAs (GATA3, CLIC4, and PALLD), three miRNAs (miR-200c-3p, miR-141-3p, and miR-141-5p), and three lncRNAs (AC010326.3, AC073335.2, and MIR100HG) were found in miRNA-mediated mRNA-lncRNA crosstalks, which differ considerably in their expression between basal and luminal subtypes (Figure 3), and their expression level is significantly associated with the prognosis of MIBC (Figure 4). It was previously observed that miR-141-5p, miR-141-3p, miR-200c-3p, and GATA3 are the most important markers of luminal subtype, which is consistent with our results (Robertson et al., 2017). Besides, previous studies found that the down-regulation of miR-200c and miR-141 is associated with elevated ZEB1 (Wiklund et al., 2011;Shan et al., 2013;Mahdavinezhad et al., 2015), and the down-regulation of miR-200c is also coupled with the down-regulation of BMI-1 and E2F3 , which play an important role in the invasion, migration, and EMT of bladder cancer.
It has been shown that some other genes in the crosstalk are also closely related to cancer. For example, AC073335.2, a highly expressed lncRNA in human glioblastoma, is involved in tumorigenesis via acting as a competing endogenous RNA of miR-940 (Shi et al., 2017). MIR100HG was previously reported to act as a regulator of hematopoiesis and oncogenes in many cancers (Emmrich et al., 2014;Nair, 2016;Shang et al., 2016;Wieczorek and Reszka, 2018;Zhang et al., 2018). In agreement with our findings, MIR100HG was reported to be down-regulated in MIBC and may serve as a significant biomarker for MIBC . As reported previously, GATA3 is a prognostic marker and inhibits cell migration and invasion in MIBC (Miyamoto et al., 2012;Choi et al., 2014a,b). And, GATA3 is differentially expressed between basal and luminal subtypes and can be used as a luminal-infiltrated marker (Robertson et al., 2017). CLIC4 has a complicated role in cancer. For instance, it functions as a tumor suppressor in lung adenocarcinomas (Okudela et al., 2014). And it promotes the metastasis and development of colorectal cancer (Deng et al., 2014;Peretti et al., 2015). Previous studies have established that the expression of CLIC4 in MIBC has a subtype-dependent pattern (Robertson et al., 2017). And the overexpression of CLIC4 in stroma increases cell migration and invasion and promotes epithelial to mesenchymal transition in multiple human cancers (Shukla et al., 2014). PALLD SNPs were reported to be a significant predictor of prostate cancer-specific mortality (Bao et al., 2011). Our findings are largely consistent with previously reported results, suggesting crosstalk-implicated genes might be of great significance in MIBC pathogenesis and post-transcriptional gene regulation.
The combination of bioinformatics and several machine learning approaches in this study have achieved reliable results regarding the MIBC subtype classification, subtype-associated pathways, and the network-associated markers for MIBC subtypes. The subtype-related genes can not only be used for subtype classification but also serve as a good predictor of cancer prognosis. It is worth noting that we can enhance our study in the following aspects in the future: (1) the crosstalks discovered through computational analyses need to be verified by biological experiments.
(2) DEFGs were defined as the overlap between DEGs and feature genes that were determined by XGBoost based on the ranking approximates of Information Gain. This procedure may result in the missing of some highly correlated genes that are also biologically important.

CONCLUSION
By conducting bioinformatics analyses, we identified two subtypes of MIBC and lncRNA-mRNA crosstalks mediated by miR-200c and miR-141, which are found to be significantly associated with prognosis, formation, and metastasis of bladder cancer. Our results should be informative for molecular subtype classification, prognosis and molecule-targeted therapy of bladder cancer. FIGURE S2 | Volcano plots for DEGs and heatmap plots for DEFGs. (A-C) Volcano plots for differentially expressed 4167 mRNAs, 208 miRNAs, and 2488 lncRNAs between tumor and normal samples (adjusted p-value < 0.05 and |log 2 fold change| > 0.57). (D-F) Heatmap plots for 278 DEFmRNAs, 120 DEFmiRNAs, and 57 DEFlncRNAs. Basal, luminal, and normal samples are represented by the red, blue, and yellow bar, respectively. FIGURE S3 | Parameter selection and Performance of RF and XG in mRNA, miRNA and lncRNA dataset. (A) The x-axis represents the number of mtry set for RF classifier (1,5,10,15,20,25). The y-axis represents the corresponding AUC. (B) The x-axis represents the number of ntree set for RF (20, 400, 600, 800). The y-axis represents corresponding obb error rates. The colors correspond to mtry numbers. (C) The x-axis represents the number of fold set for RF. The y-axis represents corresponding accuracy. The red color shows mean accuracy. (D) The x-axis represents the number of iter set for XG (1,400,800,1200,1600,2000,2400) and the y-axis represents the corresponding accuracy.