Autoencoder Based Feature Selection Method for Classification of Anticancer Drug Response

Anticancer drug responses can be varied for individual patients. This difference is mainly caused by genetic reasons, like mutations and RNA expression. Thus, these genetic features are often used to construct classification models to predict the drug response. This research focuses on the feature selection issue for the classification models. Because of the vast dimensions of the feature space for predicting drug response, the autoencoder network was first built, and a subset of inputs with the important contribution was selected. Then by using the Boruta algorithm, a further small set of features was determined for the random forest, which was used to predict drug response. Two datasets, GDSC and CCLE, were used to illustrate the efficiency of the proposed method.


INTRODUCTION
The prediction of drug responses for individual patients is an essential issue in the research of precision medicine. It is known that the drug response for various patients can be different (Wilkinson, 2005). Thus, there are different therapeutic effects when using the same anticancer drug for a cohort of patients (Dong et al., 2015). It has been suggested that the patients with similar response to an anticancer drug can have similar genetic features, like gene mutations and expressions (Wang et al., 2017). These features can be used as the biomarkers to predict the drug response (La Thangue and Kerr, 2011).
Because the clinical trials are of high time and economic costs, the researchers prefer to use the cell lines obtained from the cancer patients for investigating drug responses. These investigations lead to several drug response databases, like Genomics of Drug Sensitivity in Cancer (GDSC) (Yang et al., 2012) and Cancer Cell Line Encyclopedia (CCLE) (Barretina et al., 2012). By using these databases, constructing models for the prediction of drug response becomes feasible. Primarily, researchers always use IC50 (Barretina et al., 2012;Garnett et al., 2012), which indicates the concentration required for 50% inhibition in vitro, to measure the sensitivity of drug response. Taking IC50 as the dependent variable, linear regression models, including ridge regression, lasso, and elastic net, were developed to predict drug response (Barretina et al., 2012;Garnett et al., 2012;Basu et al., 2013;Iorio et al., 2016). Further complex models, like support vector regression, artificial neural network, and random forest (RF), were also constructed for this purpose (Riddick et al., 2010;Menden et al., 2013;Ammad-Ud-Din et al., 2014;Ammad-ud din et al., 2016;Costello et al., 2014;Ospina et al., 2014;Cichonska et al., 2015;Dong et al., 2015;Zhang et al., 2015). Neto et al. (2014) proposed the STREAM algorithm that combined a Bayesian inference strategy with ridge regression for the prediction of drug response. Besides the regressions, several network-based models were also proposed Fey et al., 2015;Zhang et al., 2015). Model ensembles have also been considered by some works (Wan and Pal, 2014;Cortés-Ciriano et al., 2015). Meanwhile, deciding whether an individual patient is sensitive or not to the anticancer drugs is meaningful for treatment. By setting a proper threshold value for IC50, drug response can be divided into two categories: sensitivity and non-sensitivity. In this case, classification models can be fitted for predicting drug response. To this end, the recommender system, naive Bayes classifier and support vector machine have been used (Barretina et al., 2012;Dong et al., 2015;Suphavilai et al., 2018). Nilsson et al. (2007) indicated that the appropriate selection of small feature set gives the best possible classification results. Thus, selecting an appropriate feature set from a large number of genetic feature candidates is a crucial issue for classification models for predicting drug response. In this paper, we developed a drug response prediction model, called AutoBorutaRF, by using autoencoder (Liou et al., 2008) and Boruta algorithm (Kursa et al., 2010) for feature selection and RF for classification. We first constructed the autoencoder network (Liou et al., 2008), which is a type of artificial neural network, for the reduction of genetic features. By using the Gedeon method (Gedeon, 1997), we initially reduced the total number of features. We further selected a smaller feature set feasible for RF by using the Boruta algorithm. By applying AutoBorutaRF to GDSC and CCLE, we proved that our proposed method is of excellent prediction accuracy. We further analyzed the biomarkers obtained from the lung cell lines in GDSC by the proposed feature selection method.

Datasets and Preprocessing
In this research, we used two datasets, including GDSC (Garnett et al., 2012) and CCLE (Barretina et al., 2012). The datasets were downloaded by using R package PharmacoGx (Smirnov et al., 2015). We used the sensitivity measure IC50 (Barretina et al., 2012;Garnett et al., 2012) as the response variable (denoted by y rs,c ) for cell line c. We used three types of genetic features as the explanatory variables, including the gene expression (denoted by x rna,g ), the single-nucleotide mutation (denoted by x snv,g ), and the copy number alternation (denoted by x cna,g ) for gene g. Note that the elements in x rna,g and x cna,g are real-valued; the elements in x snv,g are binary-valued, i.e., "1" for mutation and "0" for wild type. In the two datasets, some cell lines missed the values of the response variable, the single-nucleotide mutation features, and the copy number alteration features. There was no missing value in the gene expression features. We first removed the features with the cell lines missing values more than 50%. Then, we removed the cell lines with more than 50% features missing values from the datasets. For the remaining cell lines with missing values, we used a weight mean method to compensate the missing values as follows: 1. Let z * c,g denote the missing value for the cell line c in the response variable or the genetic feature g. Let x rna,c denote the vector of gene expression features for the cell line c.
2. Assume the cell line k has no missing data for the features involved in z * c,g . The diversity between the cell lines c and k is obtained by d(c, k) = x rna,c − x rna,k 2 2 . Search K cell lines nearest to g with respect to d(c, i). 3. If g is the response variable or the copy number alternation feature, z * c,g is compensated by 4. If g is the single-nucleotide mutation feature, z c,g is compensated by where 1() = 1 for the true statement in the parenthesis and 1() = 0 for the negative statement in the parenthesis.
We set K = 10 for the preprocessing of GDSC and CCLE datasets.

Label Assignment for Cell Lines According to IC50
This research is to construct classification models for predicting how the cell lines respond to the drugs under study. The drug responses can be divided into two categories: "sensitivity" and "non-sensitivity" (Liu et al., 2016). So far, several works have used various threshold values of IC50 to classify the drug responses (Brubaker et al., 2014;Li et al., 2015). Brubaker et al. (2014) used a hard threshold 0.1 to label sensitivity for IC50< 0.1 and to label non-sensitivity (i.e., resistance in this work) for IC50≥ 0.1. However, by investigating the histograms of IC50, we found that the statistics of drugs are various. It can be supposed that the decision of labels should be driven by the data of individual drugs. To this end, we adopted the strategy introduced in Li et al. (2015), which used the median of the observed IC50 values as a data-driven threshold.
We labeled a cell line as "sensitivity" if its IC50 is smaller than the median overall the cell lines for an individual drug. We labeled a cell line "non-sensitivity" if its IC50 is equal to or larger than the median overall the cell lines for an individual drug.

Classification Model
The drug response data are often of imbalanced classifications. Because RF is outstanding for the imbalanced classification problem, we used it as the classification model. In RF, we used classification and regression trees (CART) algorithm as the basic classifier. RF randomly generalizes 1,000 CARTs. Each CART is trained by using ⌈0.632 × N sample ⌉ bootstrapping samples, where N sample is a total of cell lines. The ultimate results were determined through voting with the prediction results of all CARTs.

Feature Selection With the Autoencoder and Boruta Algorithm
Feature selection is crucial for improving the prediction performance of the classification models. We used the Boruta algorithm, which aims to the feature selection problem for RF (Kursa et al., 2010) (Figure 1). The considerable cardinality of the feature candidate set leads to the curse of dimensionality for the Boruta algorithm. Thus, we first used the autoencoder network, to roughly screen out the features to a proper dimension. The detailed two-stepwise feature selection procedure is described as follows: Step 1: We trained two single-hidden-layer autoencoder networks, with hyperbolic tangent being the activation functions, for screening out the features of the gene expression and the features of the copy number alteration, respectively. Different from the straight application of the hidden layers of the autoencoder, we used Gedeon method (Gedeon, 1997) to calculate the proportional contributions to select the significant genes. The contribution of the ith input (gene) to the jth output (gene) is calculated as Here K denotes the total number of the neurons of the hidden layer. P ik is the contribution of the ith input to the kth neuron of the hidden layer calculated by with G being the total number of the inputs and W i * k s being the weights linking the corresponding neuron couples. P kj is the contribution of the kth neuron of the hidden layer to the jth output, whose calculation is similar to that of P ik . The total contribution of the ith input is calculated by We ranked the inputs of the autoencoder in the descending order with respect to q i and removed the last 50% features. We also removed the features, whose means of correlation coefficients with other features were more than 0.95.
Step 2: From the features obtained by Step 2, the Boruta algorithm was used to select features for RF as follows:

EasyEnsemble for Imbalanced Datasets
The total number of cell lines sensitive to drugs is much smaller than that of cell lines non-sensitive to drugs. Thus, the datasets in this research are the class imbalance. Let N and R denote the sample set of majority class (non-sensitivity) and that of minority class (sensitivity), respectively. The imbalance ratio IR = |N |/|R| is used to measure the class imbalance, with | · | being the cardinality of a set. For the various drugs under study, the values of IR are different. In this research, for the drugs with IR≤ 2, the feature selection and classification method were directly used; for the drugs with IR> 2, we used EasyEnsemble (Liu et al., 2009) resampling strategy to deal with the imbalance class problem. The core procedure of EasyEnsemble used here is described as follows: 1. Equally divide N into T subsets {N i |i = 1, 2, · · · , T}, with T = ⌊IR⌋. Such that |N i | ≈ |R|. 2. The RF classifier F i (x) is constructed on each training subsets {N i , R} for i = 1, 2, · · · , T. 3. Take the majority vote according to the T predictions of {F i (x)|i = 1, 2, · · · , T}.

Evaluation Criteria
We used the following metrics to evaluate the performance of the classification models: Besides the metrics above, AUC was also obtained. Because the total number of samples was much smaller than that of the features, the above evaluation criteria were obtained by using 10-fold cross validation (CV). The dataset was randomly partitioned into 10 equal sized subsets. Of the ten subsets, a single subset was used as the test set to calculate the evaluation criteria of the models trained by the remaining nine subsets. The above process was then repeated 10 times, and the mean of the evaluation criteria obtained in the 10 times was used as the final criteria. In this way, the test datasets can be ensured to be independent of the training datasets.

Data Description
There are missing data in both datasets. These missing data were compensated by using the weighted mean method described in the section Materials and Methods. The total numbers of samples for each variable are listed in Table 1.
According to their histograms, the most of distributions of drug responses of cell lines in two datasets can be approximated by the Gauss distribution (Figure 2). t-hypothesis test showed that the significance of two groups divided by median of IC50 in GDSC is of p-values from 4.27 × 10 −160 to 6.89 × 10 −46 ; such significance in CCLE is of p-value from 7.14 × 10 −95 to 4.05 × 10 −4 .

Prediction Performance of AutoBorutaRF
To illustrate the effectiveness of our AutoBorutaRF method, we demonstrated its prediction performance on GDSC and CCLE datasets. Meanwhile, we compared it with other four algorithms,  (363) The number in the parenthesis means a total of cell lines corresponding to the features. including naive Bayes classifier (Barretina et al., 2012), SVM-RFE (Dong et al., 2015), FSelector for k-nearest-neighbors (KNN) algorithm (Soufan et al., 2015), and AutoHidden. The naive Bayes method first selected the top 30 features using either nonparametric Wilcoxon Sum Rank Test (for the gene expression features) or Fisher Exact Test (for the gene mutations). Then, the remaining significant features (p < 0.25) were clustered using a message-passing algorithm for each type of features. Then, they combined these two-part features and used a naive Bayes classifier for the drug response classification prediction. SVM-RFE is a wrapper method using a recursive feature selection and SVM classifier. The parameters of feature number, gamma and cost were set to be 10, 0.5, and 10, which were the optimal parameters selected by SVM-RFE. FSelector selected features using FSelector based on the information entropy and applied to the KNN algorithm. In AutoHidden, we directly use the hidden layer of the autoencoder constructed in our AutoBorutaRF, as the features. The overall prediction performance of the five methods for the two datasets is illustrated in Tables 2, 3 and Figure 3. All the metrics in the figure were obtained by using 10-fold CV. Figure 3 showed that our method was of the best performance with respect to AUC, accuracy, recall, specificity, F 1 score, and Matthews correlation coefficient.
Among the 98 drugs in GDSC, ABT-888 presented the worst prediction with AUC being 0.5935, and the best prediction is for RDEA119 with AUC being 0.8282. Meanwhile, RDEA119, PD-0325901, 17-AAG, and Vorinostat were the only four drugs with AUC >0.8. However, there were 59 drugs, whose AUCs were higher than 0.7. Among the 24 drugs in CCLE, the worst prediction is for AEW541 with AUC being 0.6509. The best three predictions are for Nutlin-3, LBW242, and AZD6244, with AUC being 0.9633, 0.9300, and 0.9079, respectively. The AUCs of Irinotecan, Panobinostat, PD-0332991, PD-0325901, PHA-665752, PLX4720, and Topotecan are higher than 0.85. The receiver operating characteristic (ROC ) curves are listed in Supplementary File 1.

Identified Biomarkers Are Associated With Cancer and Drug Target Pathway
We used 95 lung cell lines in the GDSC database to illustrate the biological significance of the identified biomarkers. Figure 4A  shows the prediction performance of AutoBorutaRF for the lung cell lines. AutoBorutaRF showed satisfying prediction performance for predicting the drug responses for the lung cell lines. We used the non-parametric Wilcoxon sum rank test for the genetic features of gene expression and copy number alternation and a Fisher exact test for the genetic feature of single-nucleotide mutation, to test the significant difference of the genetic features between the sensitive and nonsensitive populations. Among all the identified 1,087 features (Supplementary File 2), a total of features with p < 0.05 was 1029, shown by Figure 4B. These results showed that most of the identified features were of significantly different genetic profiles between two classes (Supplementary File 3).
We further use PLX4720 and BIBW2992 as two examples to illustrate the biological significance of the features selected for the lung cell lines. Prediction metrics of these two drugs are shown in Figure 5. PLX4720 is the inhibitor for B-raf and targets at MAPK signaling pathway (Michaelis et al., 2014). The selected significant features for PLX4720 were CCL19, CCRL2, CST7, GPR143, HDAC5, and IDO1. CCRL2 inhibits p38 MAPK phosphorylation and up-regulates the expression of E-cadherin . Besides, CCR7, CST7, GPR143, HDAC5, and IDO1 are also related to lung cancer or the MAPK pathway (Liu et al., 2014(Liu et al., , 2018Li and Seto, 2016;Matthews et al., 2016;Rose et al., 2016).
BIBW2992 inhibits ERBB2 and EGFR and targets at EGFR signaling pathway (Iorio et al., 2016) and has been widely investigated for cancers, like lung cancer and melanoma (Rinehart et al., 2004;Nehs et al., 2010;Varmeh et al., 2016). The selected significant features were FYN, KCNH2, REST, CDH12, FIGURE 3 | Box plots of the six evaluation metrics overall the cell lines in the (A) GDSC and (B) CCLE datasets. Our method was of the best performance with respect to AUC, accuracy, recall, specificity, F 1 score, and Matthews correlation coefficient. The naive Bayes classifier and SVM-RFE outperformed at specificity.  LRRC8E, SCG2, PHF8, PCSK1, ANXA2, and MIR6730. FYN was an authentic Effector of oncogenic EGFR signaling, by limiting EGFR tumor cell motility (Lu et al., 2009). CDH12 plays an important role in non-small-cell lung cancer(NSCLC) geneses, resulting from that the mutations of CDH12 and other PRAME family members were equally distributed among tumors of different grades and stages (Bankovic et al., 2010). SCG2 is in connection with the alteration of miRNA profiles in A549 human non-small-cell lung cancer cells (Shin et al., 2009). KCNH2, REST, LRRC8E, PHF8, PCSK1, ANXA2, and MIR6730 have been also proved to be related to signaling pathway EGFR and lung cancer (Bonilla and Geha, 2006;de Castro et al., 2006;Kreisler et al., 2010;Wang et al., 2012;Demidyuk et al., 2013;Shen et al., 2014;Díaz-Rodríguez et al., 2018). The function descriptions and interaction networks of the identified features for PLX4720 and BIBW2992 are included in Supplementary File 4.

DISCUSSION
The prediction of anticancer drug response is crucial for many applications, like the preclinical setting and clinical trial design.
The prediction models for drug response include regression models and classification models. This research developed AutoBorutaRF for predicting the drug response for a twofold aim: achieving proper features for RF and investigating biologically significant biomarkers for the explaining drug response. Because the genetic feature candidates are a vast set, we cannot directly apply the well developed Boruta algorithm for feature selection. We first drastically reduced the dimension by constructing the autoencoder network. Different from the typical application of a hidden layer of the autoencoder, we extracted the inputs with large contributions evaluated by the Gedeon method.
Considering AUC= 0.7 as a pass mark, 22 of 24 drugs in CCLE were of qualified prediction performance; 59 of 98 drugs in GDSC were of qualified prediction performance. Further analysis should be conducted to investigate the reasons leading to the prediction difference between two datasets.
We further investigated the biological significance. We proved that most of the identified genetic features between the sensitive and non-sensitive cell lines were significantly different. By using PLX4720 and BIBW2992 as two examples, we illustrated that many genes identified by AutoBorutaRF were reported to have close relationship with tumorigenesis or cancer progression. The detailed function explanations and interaction networks of the selected features can be referred to Supplementary File 4. Thus, AutoBorutaRF can be considered to be a capable machine learning method for determining the biomarkers for predicting the drug response for the preclinical and clinical purposes.
Note that our proposed method used no prior information to obtain the optimal feature set in the sense of prediction performance. In future research, the pre-determined information, like pathway knowledge, and the prior distribution describing the uncertainties of anticancer drugs can be considered to be embedded in our method.

DATA AVAILABILITY
The source code and datasets for this study can be downloaded from https://github.com/bioinformatics-xu/AutoBorutaRF.

AUTHOR CONTRIBUTIONS
XX and PQ processed the data, designed the algorithm, and the programming codes, and wrote the manuscript. YW supported result interpretation and manuscript writing. JW and HG supervised the project and contributed to writing the manuscript.