Identification of miRNA Biomarkers for Diverse Cancer Types Using Statistical Learning Methods at the Whole-Genome Scale

Genome-wide analysis of miRNA molecules can reveal important information for understanding the biology of cancer. Typically, miRNAs are used as features in statistical learning methods in order to train learning models to predict cancer. This motivates us to propose a method that integrates clustering and classification techniques for diverse cancer types with survival analysis via regression to identify miRNAs that can potentially play a crucial role in the prediction of different types of tumors. Our method has two parts. The first part is a feature selection procedure, called the stochastic covariance evolutionary strategy with forward selection (SCES-FS), which is developed by integrating stochastic neighbor embedding (SNE), the covariance matrix adaptation evolutionary strategy (CMA-ES), and classifiers, with the primary objective of selecting biomarkers. SNE is used to reorder the features by performing an implicit clustering with highly correlated neighboring features. A subset of features is selected heuristically to perform multi-class classification for diverse cancer types. In the second part of our method, the most important features identified in the first part are used to perform survival analysis via Cox regression, primarily to examine the effectiveness of the selected features. For this purpose, we have analyzed next generation sequencing data from The Cancer Genome Atlas in form of miRNA expression of 1,707 samples of 10 different cancer types and 333 normal samples. The SCES-FS method is compared with well-known feature selection methods and it is found to perform better in multi-class classification for the 17 selected miRNAs, achieving an accuracy of 96%. Moreover, the biological significance of the selected miRNAs is demonstrated with the help of network analysis, expression analysis using hierarchical clustering, KEGG pathway analysis, GO enrichment analysis, and protein-protein interaction analysis. Overall, the results indicate that the 17 selected miRNAs are associated with many key cancer regulators, such as MYC, VEGFA, AKT1, CDKN1A, RHOA, and PTEN, through their targets. Therefore the selected miRNAs can be regarded as putative biomarkers for 10 types of cancer.

Genome-wide analysis of miRNA molecules can reveal important information for understanding the biology of cancer. Typically, miRNAs are used as features in statistical learning methods in order to train learning models to predict cancer. This motivates us to propose a method that integrates clustering and classification techniques for diverse cancer types with survival analysis via regression to identify miRNAs that can potentially play a crucial role in the prediction of different types of tumors. Our method has two parts. The first part is a feature selection procedure, called the stochastic covariance evolutionary strategy with forward selection (SCES-FS), which is developed by integrating stochastic neighbor embedding (SNE), the covariance matrix adaptation evolutionary strategy (CMA-ES), and classifiers, with the primary objective of selecting biomarkers. SNE is used to reorder the features by performing an implicit clustering with highly correlated neighboring features. A subset of features is selected heuristically to perform multi-class classification for diverse cancer types. In the second part of our method, the most important features identified in the first part are used to perform survival analysis via Cox regression, primarily to examine the effectiveness of the selected features. For this purpose, we have analyzed next generation sequencing data from The Cancer Genome Atlas in form of miRNA expression of 1,707 samples of 10 different cancer types and 333 normal samples. The SCES-FS method is compared with well-known feature selection methods and it is found to perform better in multi-class classification for the 17 selected miRNAs, achieving an accuracy of 96%. Moreover, the biological significance of the selected miRNAs is demonstrated with the help of network analysis, expression analysis using hierarchical clustering, KEGG pathway analysis, GO enrichment analysis, and

INTRODUCTION
MicroRNAs (miRNAs) belong to the non-coding RNA family. They consist of 19-25 nucleotides and play an important role in the regulation of gene silencing. These non-coding RNAs are present in every eukaryotic cell and can also be encoded by a viral genome (Ray and Maiti, 2015;Bruscella et al., 2017). The miRNAs are formed by RNA polymerase II in the cell nucleus and are then transferred to the cytoplasm (Bartel, 2009) for biological activities such as cell cycle control, apoptosis, and oncogenesis. They interact with the complementary strand of mRNAs and lead to the degradation of the corresponding mRNAs; they also interfere with protein production by suppressing protein synthesis (Valencia-Sanchez et al., 2006). A miRNA molecule can bind one or more targets, thus forming a complex underlying regulatory network. These networks have a profound impact on cancer signaling pathways . Previously, lowthroughput and high-cost technologies were the main obstacle to answering systems-level biological questions. However, recent advancements in next generation sequencing (NGS) have enabled researchers to address such complex problems (van Dijk et al., 2014). Moreover, new sequencing technologies and genomic datasets have helped us to gain better understanding of the biological complexities related to genomic abnormalities in cancer. The considerable achievements in sequencing techniques have made high-throughput techniques a fundamental platform for miRNA, RNA, and DNA research. Generally, miRNAs are involved in a wide range of diseases, including neurological disease, heart disease, and cancer (Giza et al., 2014;Paul et al., 2018). In many cases of cancer in humans, dysregulation of miRNA expression has been observed, and it is well-known that miRNAs can serve as potential cancer biomarkers (Lu et al., 2005;Jacobsen et al., 2013;Wong et al., 2017). In this regard, scientific communities are also trying to understand the role of miRNAs in paring with mRNAs (Zhang et al., 2014;Shrestha et al., 2017), in different cancer types by ranking miRNAs  to elucidate their effects and drug resistance (Ma et al., 2010;Li and Yang, 2014;Cheerla and Gevaert, 2017).
To reduce the time taken for clinical trials, and to provide better and more accurate treatments while avoiding unnecessary interventions, the proper selection of miRNAs as biomarkers is crucial. For this purpose, miRNAs are often used as features in statistical learning methods viz. clustering, classification, and regression in order to identify potential biomarkers (Song et al., 2016;Yang et al., 2017;Yokoi et al., 2017). Song et al. (2016) performed a clustering analysis on breast cancer data in order to find miRNAs that could be prognostic biomarkers; these miRNAs are up-regulated in this type of cancer and are linked to local relapse, distant metastasis, and poor clinical outcomes. Similarly, Yang et al. (2017) used clustering to find miRNA biomarkers for breast cancer. The identified miRNAs have higher specificity and sensitivity than single-gene biomarkers. On the other hand, Yokoi et al. (2017) used a classification task to inform the development of a predictive model to distinguish patients with ovarian cancer tumors from healthy subjects. In this study, eight miRNAs were found as biomarkers for ovarian cancer. Jacob et al. (2017) conducted a study on colon cancer and identified 16 miRNA signatures, which act as prognostic biomarkers at cancer stages II and III. Apart from the aforementioned works, regression analysis has been used to predict the survival rates of patients with different types of cancer (Liang et al., 2018). Liang et al. (2018) used Cox regression analysis for pancreatic cancer and identified five miRNAs as independent prognostic factors. The progress in this regard can be found in the literature (e.g., Peng and Croce, 2016;Hosseinahli et al., 2018).
Generally, in miRNA-based cancer studies, statistical learning algorithms viz. clustering, classification, and regression are used separately for different cancer types, as described above and in the literature (Akhtar et al., 2015;Ang et al., 2016;Li et al., 2016;Lin and Lane, 2017). However, to leverage the advantages of the different algorithms, it may be useful to integrate them into a single method for identifying potential biomarker miRNAs. Besides, little work exists on multi-class classification of diverse cancer types using NGS data. These two facts motivated us to develop the method described in this paper, which can not only classify 10 types of cancer (bladder, breast, colon, glioblastoma, head and neck squamous cell, kidney renal clear cell, lung adenocarcinoma, lung squamous cell, ovarian, and uterine corpus endometrial carcinoma) but also find putative miRNAs that are highly associated with these cancer types. The proposed two-part wrapper-based feature selection method, referred to as the stochastic covariance evolutionary strategy with forward selection (SCES-FS), uses stochastic neighbor embedding (SNE) (Hinton and Roweis, 2003) in conjunction with the covariance matrix adaptation evolutionary strategy (CMA-ES) (Hansen et al., 2003) and a simple classifier, either random forest (RF) (Breiman, 2001), support vector machine (SVM) (Cortes and Vapnik, 1995), naive Bayes (NB) classifier (George and Langley, 1995), K-nearest-neighbors (K-NN) classifier (Altman, 1992), or decision tree (DT) (Quinlan, 1986), in the first part. Here SNE is used to reorder the features by performing an implicit clustering, such that neighboring features are highly correlated. Then, from these clusters of features, a subset of features is selected randomly in order to perform the multi-class classification task for the diverse cancer types. However, as the features are randomly selected, this classification task is treated as an underlying optimization problem for CMA-ES to find the features automatically. Hence, the final set of features/miRNAs is obtained by using forward selection (Whitney, 1971).

MATERIALS AND METHODS
In this section, we briefly describe SNE (Hinton and Roweis, 2003), CMA-ES (Hansen, 2006), and Cox regression analysis (Cox, 1972). More details about classification techniques and feature selection methods are given in the Supplementary Material 2 for this article. This section also describes the proposed method, which consists of two parts. The first part is the wrapper-based SCES-FS. In the second part, the selected features, such as expression of miRNAs and clinical 1 https://tcga-data.nci.nih.gov/tcga/ 2 http://www.nitttrkol.ac.in/indrajit/projects/mirna-prediction-multicalss/ data, are used in survival analysis to assess the importance of the selected miRNAs in different types of cancer and to evaluate the effectiveness of the selected miRNAs. Figure 1 shows the flowchart of the proposed method.

Stochastic Neighbor Embedding
Let X = {x 1 , x 2 , . . . , x n } denote a set of n observations, where x i ∈ R D . SNE (Hinton and Roweis, 2003) constructs a lowdimensional embedding that recreates X in a space of lower dimension as X In SNE, both X and X ′ are represented as discrete probability distributions P and Q, where (1) that model pairwise distances between data points. The values of var i ∈ R are adjusted in such a way that the entropies of all distributions P i are equal.
The mismatch between P and Q is reduced through minimization of Kullback-Leibler (KL) divergence objective KL(P||Q) = i j p ij log p ij q ij = i KL(P i ||Q i ), by altering Q with gradient-based optimization methods. Optimization is difficult due to the existence of multiple local optima, and entirely different embeddings may be obtained with different initial Q distributions.

Covariance Matrix Adaptation Evolution Strategy
Evolutionary strategies are black-box optimization algorithms which belong to a broader group of evolutionary algorithms. In such methods, a set of candidate solutions is maintained. In successive iterations of the procedure, these candidate solutions are perturbed and evaluated, and in each iteration the best solution is left unchanged and carried over to the next set of candidate solutions. CMA-ES (Hansen and Ostermeier, 1996) is an evolutionary strategy where the set of candidate solutions is modeled and sampled from a multivariate Gaussian distribution N (m, C). The covariance matrix C represents pairwise relationships between attributes. The objective function of CMA-ES maximizes two likelihoods: (a) the likelihood of having the best individuals in previous iterations, and (b) the likelihood of taking the best search steps in previous iterations. At the end of each iteration, (a) guides updates of the mean m as a weighted average of µ best solutions, is the ith best solution in iteration g + 1 and w i is its weight, and (b) updates the covariance matrix C as follows: (2) ∈ R D is a vector that amplifies the updates in favorable directions: where c 1 , c µ , c c ∈ R are weights, σ (g) ∈ R is an adaptive step size, which is dependent on the iteration, and z ∈ R is a normalizing constant. Details of the parameters of CMA-ES can be found in Hansen and Ostermeier (1996) and Hansen et al. (2003).

Cox Regression Analysis
The Cox regression model (Cox, 1972) is a proportional hazards regression model in which the hazard ratio is constant but other contents have the same baseline hazard function. Based on this assumption, the survival function is calculated as where H 0 (τ ) represents the cumulative baseline hazard function at time τ and S 0 (τ ) = exp(−H 0 (τ )) is the baseline survival function; H 0 (τ ) is taken to be Breslow's estimator (Breslow, 1974), which is the most widely used and given bŷ As the Cox model is based on the proportional hazards assumption, it is represented as for an given instance i = 1, 2, 3, . . . , n, where the baseline hazard function h 0 (τ ) can be an arbitrary negative function of time, and x i = (x i1 , x i2 , . . . , x iD ) is the corresponding covariate vector for instance i and is the coefficient vector. The Cox model is a semiparametric algorithm where the baseline hazard function h 0 (τ ) is unspecified. For any two instances x 1 and x 2 , the hazard ratio is given by This means that the hazard ratio is independent of the baseline hazard function.

Wrapper-Based Feature Selection
Integrating SNE and CMA-ES The first part of our proposed method performs the task of miRNA selection for diverse cancer types, which is considered a multi-class classification problem here. SNE is used to reorder the features by performing an implicit clustering such that neighboring features are highly correlated. Then, the underlying multi-class classification task is performed using well-known classifiers and treated as an optimization problem for which CMA-ES is used to find the miRNAs automatically. The miRNAs thus found are further refined using forward selection (FS). Therefore, we call this wrapper-based feature selection method as stochastic covariance evolutionary strategy with forward selection (SCES-FS). Algorithm 1 presents the SCES-FS method in detail. It starts with the dataset X = {x 1 , x 2 , . . . , x n }, which denotes a set of n observations with x i ∈ R D , the class label Y, the population size λ, the maximum number of generations N, the classifier A, and the number of runs R as inputs. In the dataset, each feature is characterized by expression levels of the samples. The features of the original dataset X are reordered using SNE in the ConstructEmbedding step, producing the embedding dataset X ′ whose size is the same as that of the original dataset. The parameters are initialized in the SetParamsCMAES step. The individuals/vectors in CMA-ES are encoded as a simple threshold weight vector, with a single weight for each miRNA. An individual x ∈ R H encodes a weight vector w ∈ R H and a threshold t ∈ R, where H ≤ D is the number of weights. Only those features whose weights exceed the threshold are eventually selected into a feature set S: where t ∈ R is a threshold that is chosen carefully. The population of vectors is drawn in the DrawPopulationCMAES step from N (m, σ 2 C), where m is the mean, C is the covariance matrix, and σ is the step-size control parameter of CMA-ES. In the ConstructFeatureSets step, each individual x i is translated to a feature set S i according to Equation (8).
In the ScoreFeatureSet step, each feature set is evaluated by training a classifier on a dataset restricted to only the selected features. The objective function of CMA-ES combines the accuracy obtained from the classifier for each individual x and the size of the feature set S as follows: where D is the dimension of the dataset, ϒ is the target number of features, and α ∈ R + balances the penalty for the excessive number of features. The objective function is designed in this way so that higher classification accuracy can be achieved for a small number of features, and it incorporates L 0 regularization term as penalty, which is bounded by ϒ in order to have redundancy in selected subsets of features. The optimization with CMA-ES allows inter-feature relationships with covariance matrices. Finally, the parameters are updated according to CMA-ES update rules in the UpdateParamsCMAES step, and the best set of features/miRNAs in a particular run are kept in S RBest .

Preparation of the Final Set of Features
The SCES-FS is random in nature to reduce the probability of returning sub-optimal solutions. Therefore, a single run of SCES-FS does not guarantee a reliable solution. To overcome this limitation, SCES-FS is executed up to a maximum number of runs, R. In each run, a set of best features/miRNAs are collected into a list, L. After completion of the maximum number of runs, RankFrequency sorts the cumulative set of features in descending order according to the frequency of occurrence in each run and produces a modified list, L ′ . Thereafter, ForwardSelection applies the forward feature selection method, using the classifier to evaluate the feature set iteratively to obtain the best feature set, S best .

Justification for miRNA Selection
Because of the random nature of the algorithm, there is a chance of false positives or false negatives occurring in the selection of miRNAs. To reduce the probability of having false positives or false negatives in the selected set of miRNAs, the SCES-FS algorithm is run 50 times, and then the miRNAs are ranked based on their frequencies of occurrence in 50 different sets of features. Thus, a stable set of miRNAs is selected on the basis of maximum classification accuracy, which is computed by considering the miRNAs cumulatively from the top of the list. By this process, 17 distinct miRNAs are selected.
This procedure does not, however, ensure the absence of false positives or false negatives, so additional measures are taken. The presence of false positives is made unlikely by the sorting procedure. In fact, since a given false positive is supposed to occur less frequently than all the true positives, it will appear in the tail after sorting. Consequently, it is likely that this false positive will be filtered out when selecting the final list of miRNAs. On the other hand, the presence of false negatives is related to the choice of the number of runs and the corresponding expected number of miRNAs belonging to the sorted list. In this regard, a mathematical argument is given below to justify that the SCES-FS does not exhibit random behavior.
Suppose that at each run the SCES-FS randomly selects 10 miRNAs from the whole collection of miRNAs. We first compute the expected number of distinct miRNAs reported after all the runs, and then we compare this number with the results of our experiments. We have the following parameters: D = 199 is the number of miRNAs, S = 10 is the number of miRNAs returned after each run, and R = 50 is the number of runs. For i = 1, . . . , D, let V i be a Bernoulli-distributed indicator variable, where V i = 1 if the miRNA m i never shows up. The probability that m i is selected in one run is S/D, so the probability that it is never selected is 1−S/D. Since each of the R runs is independent, the following equation can be written: Let V = D i=1 V i be the random variable that counts the number of miRNAs that do not belong to the final set of miRNAs reported at least once. By linearity of the expectation, we obtain the equation Hence, the number of expected miRNAs reported at least once is Substituting the above-mentioned values for the parameters, we obtain that • the expected number of miRNAs reported at least once after 50 iterations is 183; and • the number of new miRNAs added in a further iteration would be 1.
As the number of sorted miRNAs in our experiment is 39 (see the Supplementary Material), this difference suggests that 50 runs are enough to conclude that all the true positives are included in the sorted list and so false negatives are unlikely.

Cox Regression Analysis for Evaluating miRNAs
The second part of the proposed method is for the evaluation of selected miRNAs using Cox regression analysis. The primary objective of this stage is to assess the importance of the miRNAs selected in the first part of the method, and this is done using Cox regression analysis for survival in 10 different cancer types. Expression data of the selected miRNAs and the associated clinical data are used for the Cox regression analysis. In the clinical data, the vital status of each patient, indicating whether the patient is still alive or has died, and the number of days since the last followup are taken into account when performing the survival analysis. Based on the expression levels and the clinical data of the selected miRNAs in each cancer type, the Cox coefficient, hazard ratio, and p-value are computed. A higher value of the Cox coefficient signifies greater importance of that miRNA to the respective cancer type. Moreover, the up-and down-regulation of all the selected miRNAs are observed to understand their behavior with respect to that particular cancer type based on change in expression in tumor and normal samples.

Complexity Analysis
Let D be the number of features and n the number of samples in the input dataset. As the available approximations may considerably lower the overall complexity, we discuss the complexity of each building block separately. A single step of SNE requires computing relations between all data points. We embed a transposed dataset, making an optimization step in O(D 2 ) time. Computing SNE usually involves performing principal component analysis for preliminary dimension reduction, though its cost is negligible. The internal complexity of CMA-ES is estimated as O(D 2 ), due to sampling and updating of the covariance matrix. The matrix needs to be factorized, which can be done by eigen decomposition in O(D 3 ) time.
Factorization does not happen in every generation, which gives O(D 2 ) amortized time. Empirical evidence suggests that the sufficient number of objective function evaluations usually scales sub-quadratically with D ( Ros and Hansen, 2008). The computation time is similar, since the vast majority of it is time spent training similar classifiers. On the other hand, the time complexity of Cox regression analysis is O(nD 2 ) for a single run (Kelley, 1999).

Dataset Preparation and Parameter Setting
The miRNA expression and clinical datasets of bladder, breast, colon, glioblastoma, head and neck squamous cell, kidney renal clear cell, lung adenocarcinoma, lung squamous cell, ovarian, and uterine corpus endometrial carcinoma were obtained from TCGA. These 10 cancer types have also been studied previously (see, e.g., Jacobsen et al., 2013). Moreover, Hoadley et al. (2018) found that the characteristics of certain cancers out of 33 types provided in TCGA are overlapping in nature. As a result, 10-15 distinct groups of cancer were reported in Hoadley et al. (2018), which are similar to the cancer types studied in the present article. Our choice of cancer types was based on: (1) careful review of the literature, (2) the availability of tissue-specific tumor and normal samples to avoid the class imbalance problem in classification, and (3) the availability of common miRNA expression data for different cancer types and their corresponding clinical data. Thus, 10 cancer types were selected for our study. The expression data were generated using an Illumina high-throughput sequencing machine in the form of read counts of 199 miRNAs, normalized to reads per million, while the clinical data contain gender, age, days since last followup, and vital status. After removing miRNAs that contain more than 60% zeros in each cancer type and taking the miRNAs common to all the cancer types, we found 199 miRNAs and considered their expression in different cancer types. The number of samples and other details for each cancer type are given in Table 1; we also included 333 normal samples in the analysis with the expression of same miRNAs. For convenience, the expression datasets containing reads per million are further normalized onto a log 2 scale. These processed datasets can be downloaded from the Supplementary Material website 3 .
To construct the final ranking of miRNAs, the SCES-FS algorithm was run 50 times. Five-fold cross-validation was applied during each classification to avoid the issue of overfitting or underfitting. The parameters used in the experiments are shown in Table 2 and were obtained either experimentally or from the literature (Latinne et al., 2001;Oshiro et al., 2012;Hansen, 2016).

Experimental Outcomes
The problem of finding miRNAs that can correctly distinguish different cancer types were posed as a multi-class classification task using expression data of miRNAs, and the importance of the selected miRNAs was evaluated using Cox regression analysis with the help of clinical data. The classification results of SCES-FS using the classifiers RF, SVM, NB, K-NN, and DT over 50 runs are reported in It is to be noted that our results have been verified with FSCOX, which computes important miRNAs via Cox regression on all miRNAs and subsequently uses them for the classification. It has been observed that the overall classification accuracy of FSCOX is less than the accuracy attained by SCES-FS as reported in Table 3. Other omics-based analyses, such as protein array, copy number variation (CNV), and methylation studies, also have potential applications in pan-cancer classification, as explained in Zhang et al. (2015Zhang et al. ( , 2016Zhang et al. ( , 2019, where the overall accuracy achieved in the range of 93-97%. The cancer dataset used in Zhang et al. (2015Zhang et al. ( , 2016Zhang et al. ( , 2019 was carefully selected for protein array, CNV, and methylation studies and is not directly suitable for experimenting with miRNAs, as it creates a class imbalance problem when selecting miRNA expression data for both tumor and normal cases. However, the overall accuracy of our method is higher than 96%, which is on the higher side of the accuracy range reported by Zhang et al. (2015Zhang et al. ( , 2016Zhang et al. ( , 2019, suggesting that our selected miRNAs can also be considered potential markers for pan-cancer classification. The results of SCES-FS have been further validated by survival analysis, including Cox regression, network analysis, expression analysis using hierarchical clustering in the form of heatmaps, KEGG pathway analysis, GO enrichment analysis, and PPI network analysis, as described in the following. Table 4 reports the results of the Cox regression analysis, i.e., the Cox coefficient and hazard ratio values, for 10 cancer types in order to evaluate the importance of the 17 selected miRNAs with respect to these cancer types. The Cox regression analysis was performed by integrating the miRNA expression and clinical data to assess the effect of miRNA expression on cancer survival. Higher Cox coefficient and hazard ratio values indicate greater influence of the miRNA on the cancer type. For example, the miRNA hsa-mir-375 has the highest Cox coefficient, 0.9882, and hazard ratio, 1.7879, for the GBM cancer type. To help visualize the importance of each miRNA to the different cancer types, a circos plot of Table 4 is shown in Figure 2. In the figure, a broader band signifies a stronger association between the miRNA and the particular cancer type. Table 5 summarizes the cancer types that are most highly associated with the selected miRNAs. In this table, for each miRNA, the cancer type for which that miRNA has the highest Cox coefficient is shown, along with the associated p-value, false discovery rate (FDR), and up-or down-regulation of the miRNA expression in that cancer type. Similarly, Supplementary Table 2 gives the up-and down-regulations along with the corresponding FDR values of the selected miRNAs for all cancer types. The up-and down-regulations of the miRNAs are computed after evaluating the change in expression of the selected miRNAs between tumor samples and normal samples or the population for a particular cancer type using the ANOVA test; this is discussed further in section 3.2.3. Here, the change in expression is considered significant if the p < 0.05, and the up-or down-regulation is computed if the change in population expression is positive or negative, respectively. In summary, we find that hsa-mir-205, hsa-mir-10a, hsa-mir-196b, hsa-mir-10b, hsa-mir-375, hsa-mir-143, hsa-let-7c, hsamir-107, hsa-mir-378, hsa-mir-133a, hsa-mir-1, hsa-mir-30c,   hsa-mir-16, hsa-mir-30a, hsa-let-7i, hsa-mir-24, and hsa-mir-95 are highly associated with GBM, UCEC, OV, BLCA, COAD, and BRCA.

Network Analysis
For the 17 selected miRNAs, miRTarBase (Huang et al., 2020) was used to find their targets in order to elucidate their role in the different cancer types. To identify the most correlated targets, we computed the Pearson correlation between the expression values of miRNAs and mRNAs obtained from TCGA for the 10 cancer types and took the negative correlation value used in Zhou et al. (2015) as indicating strong association. The top five negatively correlated mRNAs associated with each of the 17 miRNAs are reported in Table 6; the rest are reported in Supplementary Material. To construct the interaction network, the miRNAs and their targets were ranked based on the cumulative negative correlation score and their presence in different cancer types as indicated by the association number. These results are reported in Supplementary Table 3.
For example, hsa-mir-16 and its target, PHYHIP, are related to six cancer types and the cumulative negative correlation score is −4.142. Similarly, hsa-mir-24 is correlated with C1QTNF6 in another six cancer types, with a cumulative negative correlation score of −3.906. The subset of such targeted mRNAs is used to construct the interaction network shown in Figure 3.  Figure 3 signify that the number of cancer types associated with the corresponding pair of miRNA and target mRNA is 6, 5, 4, 3, 2, and 1, respectively. The targets of the miRNAs are investigated further using KEGG pathway, GO enrichment, and PPI network analysis in the following subsections in order to see their impact on the different types of cancer.

Expression Analysis
Expression analysis was conducted using a one-way ANOVA test in order to evaluate the statistical significance of the differential expression of the 17 selected miRNAs. Alternative techniques can also be used (Conesa et al., 2016;Costa-Silva et al., 2017;Crow et al., 2019). To perform the test, the population of patients was divided into tumor and normal groups for a given miRNA in a particular cancer type. As a result of ANOVA, significant (p < 0.05) changes in expression were observed between the tumor and normal groups for the 17 selected miRNAs. For example, the p-values of hsa-mir-10a, hsa-mir-196b, hsa-mir-10b, hsa-mir-375, hsa-mir-143, hsa-let-7c, hsa-mir-107, hsa-mir-378, hsa-mir-133a, and hsa-mir-30c were 6.17E-24, 2.84E-48, 4.71E-03, 1.80E-14, 3.38E-57, 3.85E-38, 5.42E-24, 5.04E-20, 2.78E-20, and 4.48E-17 respectively. Additionally, box plots of the selected miRNAs in tumor and normal samples are provided in Supplementary Figure 3. To investigate the relationship between the expression levels of miRNAs for each cancer type, hierarchical clustering was performed on the tumor and normal samples of the 17 selected miRNAs. The results are shown in Figure 4, from which the change in expression levels of the miRNAs between tumor and normal samples is evident. In the cluster plots, red indicates high expression levels and green low expression levels; black corresponds to not significantly expressed samples.

KEGG Pathway Analysis
In order to perform KEGG pathway analysis for the 17 selected miRNAs in 10 cancer types, their targets were identified based on negative correlation values as described in section 3.2.2. Then, these target mRNAs were used in the DIANA tool (Vlachos et al., 2015) separately to identify significant KEGG pathways associated with the selected miRNAs in different cancer types. The five most significant pathways for each of the 17 miRNAs according to the FDR-corrected p-value within 5% statistical significance for the different cancer types are reported in The presence of such critical pathways for the 17 selected miRNAs suggests that these miRNAs play a significant role in various cancer types, including the 10 types considered in this paper.

Gene Ontology Enrichment Analysis
Similar to the KEGG pathway analysis, GO enrichment analysis was also performed with the targets of the selected miRNAs using the Enrichr tool (Kuleshov et al., 2016), to assess the significance of the roles the selected miRNAs play in different biological activities. The results of different analyses for biological processes, molecular functions, and cellular components are reported in Table 8 and in Supplementary Tables 4, 5, respectively; the details of the enrichment analysis are also given in Supplementary Material. In Table 8, we see that various biological process GO terms which are related to the targets of the 17 selected miRNAs have important roles in cancer development. For example, GO:0023051 is linked with regulation of signaling, which is involved in the development of colorectal cancer. Similarly, GO:0051252 is linked with regulation of the RNA metabolic process of bladder urothelial carcinoma,   Corr. Corr.
Corr. with an FDR-corrected p-value of 1.30E-03, which is <0.05. Other GO terms for biological processes, such as GO:0051171, GO:0048519, and GO:0009653, are linked with the regulation of nitrogen compound metabolic process, negative regulation of biological process, and anatomical structure morphogenesis, respectively, for different cancer types. As with the biological processes, GO terms for molecular functions were also found to be significant in the development of various types of cancer, as reported in Supplementary Table 4. For example, GO:0008134 is linked with transcription factor binding, which plays an important role in BRCA, GBM, KIRC, LUAD, OV, and UCEC, with respective FDR-corrected pvalues 8.10E-03, 8.31E-06, 6.80E-06, 4.70E-03, 3.76E-05, and 3.40E-03 all being <0.05. Other important GO terms such as GO:0044877, GO:0003723, GO:0003676, and GO:0008092 are found to be linked to protein-containing complex binding, RNA binding, nucleic acid binding, and cytoskeletal protein binding, respectively, for different cancer types. Supplementary Table 5 reports the significant GO terms for cellular components in various cancer types. GO:0070013 is linked to intracellular organelle lumen, which has FDR-corrected p-values within the 5% significance level for seven different cancer types, namely BLCA, BRCA, COAD, GBM, HNSC, KIRC, and LUSC. In addition, GO:0043227, GO:0005654, and GO:0044444 are associated with membrane-bounded organelle, nucleoplasm, and cytoplasmic part, respectively, which are critical processes in the progression of different types of cancer. The relationship between these GO terms and important activities in cancer development have also been cross-validated in other studies (Waldman et al., 1997;Dhillon et al., 2007;He et al., 2013;Reimand et al., 2013;McClurg and Robson, 2015). Taken together, all of these evidences point to the potential importance of the 17 selected miRNAs in the development of various types of cancer.

Protein-Protein Interaction Network Analysis
In PPI networks, a node and an edge signify the interaction of a given protein and the protein-protein association. Here, related proteins share common functions, although they do not necessarily physically interact with each other. In our study, to perform the PPI network analysis, 170 sets of targets relating to 17 miRNAs in 10 different cancer types were used to compute the PPI networks using the STRING database (Szklarczyk et al., 2019). Then, the 170 interaction networks were further analyzed to rank the proteins based on the degree of their nodes and their presence in 10 cancer types. The top 30 proteins are reported in Table 9, while the rest are given in Supplementary Material. For example, MYC has degrees 34, 28,41,85,33,133,54,25,38, and 32 with respect to BLCA, BRCA, COAD, GBM, HNSC, KIRC, LUAD, LUSC, OV, and UCEC, respectively. Therefore, the total degree of MYC is 503. Similarly, other proteins have certain degrees in different cancer types. Moreover, their presence in the different cancer types is indicated by the association count; for example, MYC has an association count of 10. The proteins in Table 9 are used to construct the final consolidated PPI network shown in Figure 5, which represents the associations between the top 30 proteins and 10 different cancer types. The average node FIGURE 3 | The interaction network of miRNAs and their target genes, where yellow nodes represent miRNAs and purple nodes represent genes. The edge colors red, blue, pink, dark green, light green, and black indicate that the corresponding miRNA-gene pair is associated with 6, 5, 4, 3, 2, and 1 cancer types, respectively. degree in this network is 13.7, with PPI enrichment p <1.0E-16. In Figure 5, the significant proteins are MYC, PTEN, CDK1, BRCA1, AKT1, PIK3R1, etc. MYC is the most important protein, known to be oncogenic for breast cancer. Similarly, AKT1 is very significant for breast, lung, and colon cancers. PIK3R1 also plays an important role in breast cancer. The detailed list of PPI networks is provided in Supplementary Material. In summary, the PPI network analysis suggests that the 17 selected miRNAs can be considered important biomarkers with respect to diagnosis of 10 different cancer types.

Web Application for Prediction of 10 Cancer Types
A web version of the predictor 4 was developed for use of the scientific and medical communities to aid in cancer diagnosis. The application uses the random forest and the expression of 4 http://www.nitttrkol.ac.in/indrajit/projects/mirna-prediction-multicalss/ 17 miRNAs to predict the occurrence of 10 different types of cancer. The random forest was chosen as it outperformed other classifiers in our experiments. To predict the type of cancer for a particular patient or set of patients, the expression values of 17 miRNAs in those patients need to be uploaded in a specific format, instructions are provided on the website. The application will return and display the predicted cancer types for the given set of patients.

CONCLUSIONS
We have proposed a statistical learning method for feature selection that integrates clustering, classification, and regression to find putative miRNAs by solving a multi-class classification problem on 10 cancer types. The first part of the method is a wrapper-based feature selection algorithm, called stochastic covariance evolutionary strategy with forward selection (SCES-FS), which involves stochastic neighbor embedding (SNE), Frontiers in Genetics | www.frontiersin.org   covariance matrix adaptation evolutionary strategy (CMA-ES), forward selection (FS), and a classification technique. Features are reordered using SNE by performing clustering of highly correlated features. From the result of the clustering, a subset of features is randomly selected to perform multi-class classification on 10 cancer types. Although the features are randomly selected, the underlying classification task is treated as an optimization problem for CMA-ES in order to find the features automatically. Thereafter, a final set of features/miRNAs is obtained using forward selection. The results of the first part of SCES-FS have been compared with the results of some well-known feature selection methods, including ESVM-RFE, LASSO, NSGA-II-SE, MOGA, SVM-nRFE, SVM-RFE, CMIM, ICAP, SCAD, JMI, CIFE, mRMR, FSCOX, DISR, SNRs, and RankSum, as well as the result with all features, in terms of classification accuracy. The SCES-FS method selected 17 putative miRNAs associated with 10 cancer types and achieved higher classification accuracy than the other methods. Using the 17 selected miRNAs, a web-based multi-class cancer predictor application has been developed.
These selected miRNAs are used in the second part of the proposed method, which employs Cox regression analysis to examine their importance with respect to survival of particular types of cancer. The analysis uses data on expression of the 17 selected miRNAs together with clinical data. A high Cox coefficient value signifies the importance of an miRNA for a particular cancer type. For example, it is found that hsa-mir-375 has the highest Cox coefficient, 0.9882, for the glioblastoma multiform cancer type. Similar results were obtained for other miRNAs, associated with different cancer types. The up-and down-regulations of the 17 selected miRNAs have been computed based on the ANOVA test. Furthermore, network analysis, expression analysis using hierarchical clustering, KEGG pathway analysis, GO enrichment analysis, and PPI network analysis have been performed to assess the biological significance of the selected miRNAs. The network analysis revealed the association of different cancer types with each pair of miRNA and its target mRNA. The hierarchical clustering program, and was partially supported by the Academy of Development as the Key to Strengthen Human Resources of the Polish Economy co-financed by the European Union under the European Social Fund. This research was also partially supported under the RENOIR Project of the European Union Horizon 2020 Research and Innovation Programme under the Marie Sklodowska-Curie grant agreement no. 691152 and by the Ministry of Science and Higher Education of Poland under grant nos. W34/H2020/2016 and 329025/PnH/2016.