DSPLMF: A Method for Cancer Drug Sensitivity Prediction Using a Novel Regularization Approach in Logistic Matrix Factorization

The ability to predict the drug response for cancer disease based on genomics information is an essential problem in modern oncology, leading to personalized treatment. By predicting accurate anticancer responses, oncologists achieve a complete understanding of the effective treatment for each patient. In this paper, we present DSPLMF (Drug Sensitivity Prediction using Logistic Matrix Factorization) approach based on Recommender Systems. DSPLMF focuses on discovering effective features of cell lines and drugs for computing the probability of the cell lines are sensitive to drugs by logistic matrix factorization approach. Since similar cell lines and similar drugs may have similar drug responses and incorporating similarities between cell lines and drugs can potentially improve the drug response prediction, gene expression profile, copy number alteration, and single-nucleotide mutation information are used for cell line similarity and chemical structures of drugs are used for drug similarity. Evaluation of the proposed method on CCLE and GDSC datasets and comparison with some of the state-of-the-art methods indicates that the result of DSPLMF is significantly more accurate and more efficient than these methods. To demonstrate the ability of the proposed method, the obtained latent vectors are used to identify subtypes of cancer of the cell line and the predicted IC50 values are used to depict drug-pathway associations. The source code of DSPLMF method is available in https://github.com/emdadi/DSPLMF.


INTRODUCTION
Cancer is a genetic disease that results when cellular changes and accumulation of different types of mutations cause the uncontrolled growth and division of cells. There are more than 200 different types of cancer, having a significant global impact on public health. Since cancer is a disease of genetic complexity and diversity, the drug response for different patients can be different. The main reason for this occurrence is the difference in the molecular and genetic information of individuals, such as gene expression data, the type of mutation in the genome and copy number alteration information. These findings and achievements have recently made a significant challenge in the prediction of drug response for an individual patient in the research of precision medicine.
High-throughput drug screening technologies on several panels of cancer cell lines have been provided. For instance, two recent consortiums Genomics of Drug Sensitivity in Cancer (GDSC) Yang et al. (2012) and Cancer Cell Line Encyclopedia (CCLE) Barretina et al. (2012) have collected around 1,000 cell lines and their pharmacological profiles for several cancer drugs. The IC50 measure (minimal concentration of drug that induced 50% cell line death) is usually used as a sensitivity measure. To facilitate and speed up drug discovery and prediction process, many methods have been developed in these fields by researches from numerous domains such as computational biology, machine learning, and data mining approaches.
In the challenge of the DREAM project, the performance of 44 drug response prediction algorithms was considered for breast cancer cell lines. The introduced algorithms were evaluated using the weighted probabilistic c-index (WPC-index) and resampled Spearman correlation Costello et al. (2014). Various machine learning methods have been proposed in this area. Barretina et al. proposed a method for predicting drug response based on naive Bayes classifier that selected importance features by two steps. First, they used Wilcoxon Sum Rank Test and Fisher Exact Test to select the 30 top features and then they applied naive Bayes classifier for drug response prediction Barretina et al. (2012). SVM-RFE method is a wrapper that used SVM classifier and recursive feature selection method Dong et al. (2015). FSelector method used k-nearest neighbor (KNN) algorithm based on selected features that are achieved by information entropy Soufan et al. (2015). Suphavilai et al. (2018) proposed the CaDRReS method as a predictor cancer drug response model based on the recommender system and learning projections for drugs and cell lines into a latent space. AutoBorutaRF was presented by Xu et al., based on feature selection for classification of anticancer drug responses. The method first built a subset of essential features, then used Boruta algorithms Kursa et al. (2010) to select some features for applying Random-Forest classifier to predict drug response Lu et al. (2019).
In this paper, we modeled the cancer drug sensitivity problem based on "Recommender Systems" approach. A logistic matrix factorization algorithm was used for predicting drug cancer response. By applying the proposed model to GDSC and CCLE datasets, we proved that DSPLMF is of excellent prediction accuracy.

MATERIALS AND METHOD Datasets
The performance of drug response prediction algorithms was evaluated on two benchmark datasets, including GDSC and CCLE. The datasets were downloaded by using R package PharmacoGx Smirnov et al. (2015). In these datasets, there are several types of information such as IC50 values according to the set of cell lines and drugs and some other information such as gene expression profile, copy number alteration, and singlenucleotide mutation that used in the model designing for more efficiency. Since in these datasets some of the above information is missing, the method of compensating for missing values given by Lu et al. (2019) is used. The missing value for a cell line can belong to response value, copy number alteration, and singlenucleotide mutation features. The cell lines with more than 50% missing value were removed from the dataset and for remaining, the missing values were predicted from the known values of knearest cell lines. At the end, 555 cell lines and 98 drugs remain without any missing value for GDSC and 363 cell lines and 24 drugs for CCLE datasets.

Method
The main idea of the model DSPLMF is to construct a classification model for predicting how a cell line responds to a drug. Since drug response can be divided into two classes "sensitivity" and "resistance," there are many ways for the purpose of classification based on IC50 values. By considering the histograms of IC50, we observed some histograms are normal-like, and others have skewness. Also, it can be supposed that the labels of classes should be determined by the data of individual drugs. For normal-like histograms, median, and mean are the same. If the histogram is skewed right, the mean is greater than the median, and if the histogram is skewed left, the mean is smaller than the median. We chose medium because we wanted to set a single, universal standard threshold for all drugs. So, the strategy introduced by Li et al. (2015) was used and the median of IC50 values were applied as a threshold for classification. The "sensitivity" or class with label 1 was assigned to a cell line if its IC50 is smaller than the median of cell lines for an individual drug and "resistance" or class with label 0 to a cell line was assigned, otherwise. DSPLMF method has four main steps as follows.
In the first step, by converting the model to a classification problem, a 0,1 observation matrix was achieved, as cell lines and drugs are rows and columns of the matrix, respectively. Then, a logistic matrix factorization method for constructing the latent vectors for each cell line and drug is applied. In the second step, for improving the prediction accuracy of the model, the similarity information for cell lines and drugs are used. In the third step, a model is applied to learn to predict the probability that a new cell line would sensitive to a drug. Subsequently, with applying the threshold to predicted probabilities of the cell linedrug pairs, we classified each pair to sensitive or resistance class. In the next section, first the similarity matrices used in the model, were introduced and then the details of each step are explained in the following steps. The main scheme of DSPLMF algorithm is represented in Figure 1.
• Gene expression Similarity, Sim exp Gene expression information is an auxiliary feature for similarity between cell lines. Let e i denoted the gene expression vector of cell line c i in cancerous conditions. For pair of cell lines c i and c j , Sim exp (c i , c j ) is defined as the Pearson correlation between the vectors e i and e j and the gene expression similarity matrix between cell lines considered as FIGURE 1 | Scheme of DSPLMF algorithm. First, similarities between each pair of cell lines are constructed based on the information of gene expression, singlenucleotide mutation, copy number alteration, and IC50 values. Also, similarity between each pair of drugs is defined based on chemical substructure and the median of IC50 values are applied as a threshold for classification. Using DSPLMF model, the latent vectors for each cell line and drug are achieved. For each new cell line, decision tree classifier is applied to find its t-most nearest neighbors and the probabilities that this cell line is sensitive to drugs are estimated based on the latent vectors of its neighbors. Eventually, a threshold is applied on probabilities to assign sensitive or resistance class to each new cell line-drug pair.
To aggregate these similarities to a single matrix, Sim total = [SC ij ] n × n , the following formula is used: where g, l, f and y are parameters that represent the importance of each of the matrix and tuned in the model. The numbers of considered genes for two datasets GDSC and CCLE for Sim exp are 11,712 and 19,389, respectively. The mutation information of 54 genes is accessible for cell lines in GDSC dataset and 1,667 genes for cell lines in CCLE dataset.
The information of copy number alteration of 24,959 and 24,960 genes for two GDSC and CCLE datasets are accessible, respectively. Since three matrices Sim exp , Sim cnv , and Sim mut have been constructed by different sets of genes (the number of common genes between them is about 50%), there is not an additive relation between them. In general, an absolute correlation coefficient of >0.7 among two or more predictors indicates the presence of collinearity. But as Table 1 shows, all correlation coefficients between similarity matrices are very low, so there is not collinearity between matrices and they can be linearly combined.

Drug Similarity, Sim drug
Since it is expected that similar drugs have the same effect on cell lines, drug similarity information for predicting drug response was used in the proposed method. A drug can be represented as a binary feature vector, by using drug substructures, drug transporters, drug targets, drug enzymes, drug pathways, drug indications, or drug side effects information. Since there is only information about chemical substructures, for each drug we have a zero-one vector of size 881, where 881 is the number of known chemical substructures of a drug. In this vector one indicates the presence of a substructure of drug and zero otherwise. We downloaded the substructure for each drug from PubChem. The PubChem system generates a binary substructure fingerprint for chemical structures. These fingerprints are used by PubChem for similarity neighboring and similarity searching. Let V d i and V d j are the vectors correspond to the drugs d i and d j . Similarity (di, dj) is considered as Jaccard similarity between these two vectors. We construct the matrix Sim drug = [SD ij ] m × m as similarity matrix between each pair of drugs.

Logistic Matrix Factorization
Assume the set of cell lines is denoted by C = {c 1 , c 2 , …, c n } and the set of drugs is denoted by D = {d 1 , d 2 , …, d m }, where n and m are the numbers of cell lines and the numbers of drugs, respectively. The relationship between cell lines and drugs are represented by a binary matrix Q = [q ij ] n × m , where each element q ij ∈ {0, 1}. If a cell line is c i sensitive to a drug d j , q ij = 1 and otherwise q ij = 0. The probability of sensitivity of a cell line to a drug is defined by a logistic function as follows: where u i nd v j are the latent vectors of size L corresponding to i-th cell line and j-th drug, respectively and the latent vectors of all cell lines and all drugs are denoted by U and V, respectively. On the other hands, the non-negative values b c i and b d j are the bias parameters according to cell line i and drug j, respectively. Moreover, we denoted b c ∈ R n × 1 and b d ∈ R m × 1 as bias vectors for cell lines and drugs, respectively. Bias parameters are considered because some cell lines respond significantly to many drugs and there are cell lines that respond to few drugs. Similarly for some drugs, there are many cell lines that respond to them, and there are drugs that most cell lines do not respond to significantly. Thus, by applying these parameters, we try to reduce bias. The vectors considered as bias vector of the model.
In this model, all the data in the training set are assumed to be independent. So the probability that matrix Q occurred, considering the latent and bias vectors, can be computed as: When q ij = 1 then both r (1q ij ) and 1q ij are zero. Similarly, when q ij = 0, rq ij = q ij = 0. So, formula 3 is rewritten as follows: Finally, the above probability is shown as follows: Where (r ≧ ̸ 1) is used to control the importance levels of observed interactions. In some classification problems with two classes (0 and 1), lack of information make us to assign label zero to some objects. But, it may be that the real label of these objects are one. So, the members of class one are highly trusted, while some members assign to class zero because of lack of information. As an example, in drug-target prediction or drug-drug interaction prediction models, the observed interacting drug-target pairs or drug-drug pairs have been experimentally verified; thus, they are more trustworthy and important than the unknown pairs. Toward more accurate modeling for these prediction models, the authors can assign higher importance levels to the interaction pairs than unknown pairs. This importance weighting strategy (considering r > 1) has been demonstrated to be effective for personalized recommendations. On the other hand, in DSPLMF model, both classes (sensitivity and resistance) have the same importance and validity. So, we set r to be one. We also deposited zero-mean spherical Gaussian priors on latent vectors of cell lines and drugs as: where I denotes the identity matrix and s 2 c and s 2 d are parameters for controlling the variances of prior distributions of cell lines and drugs. Based on Bayesian theorem we have: Since U, V, b c , b d are the parameters in the model M, Bayesian theorem is as follows: So we can conclude the following relation: According to the Bayesian theorem and equations 5, 6, and 7, the log of the posterior distribution is estimated as follows: In formula 11, regarding how Bayesian theorem is applied to classification problems, we could convert the direct proportional relation between the left hand side and the numerator of the fraction of equation 10 to equalized, by adding constant term T to the formula. Where T is independent of the model parameters Hand et al. (1999).
The parameters of the model can be learned by maximizing the above formula, which is equivalent to minimizing the following objective function: where ‖·‖ F denotes the Frobenius norm of matrix. For regularization the objective function 12, for each cell line c i, we choose the set N k (c i ) that denotes the k-most similar cell lines to c (except c i ) using Sim total matrix. We constructed adjacency matrix A= [a ij ] nxn that represents cell line neighborhood information as follow: A is an n × n matrix, which for the row corresponding to cell line c i , the entries of columns corresponding to the kmost similar cell lines of c i are obtained from their similarities, Sim total matrix, and the other elements of this row are zero.
Similarly, for a drug d i , the set N k (di) denotes the k-most similar drugs to d i (except d i ) using Sim drug matrix. The adjacency matrix B to describe the drug neighborhood information is denoted by B is an m × m matrix, which for the row corresponding to drug d i , the entries of columns corresponding to the k-most similar drugs of d i are obtained from their similarities, Sim drug matrix, and the other elements of this row are zero.
To illustrate the data structure of these similarity matrices, as an example, for k = 5 and 24 drugs in CCLE dataset, the similarity matrix B is denoted in Figure 2A. Figure 2B, shows the graph corresponding to this matrix. As it can be seen from Figure 2A, each row i of the matrix has five nonzero elements corresponding to the five-most similar drugs of d i in Sim drug matrix, and the other elements are zero. In Figure 2B, the degree of each node is five and the red edges denote the neighbors of the nutlin-3. 5-most similar drugs to Nutlin-3 based on sim drug matrix are AEW541, AZD0530, Lapatinib, crizotinib, and sorafenib.
To minimize the distance between feature vector corresponding to cell line i and vectors of its nearest neighbors in latent space, we minimize two objective functions in formulas 15, 16 as follows: . a and b are two parameters for weighting the similarity between cell lines and drugs, respectively.
The values of two matrices A and B show the similarity of the cell lines to each other and the similarity of the drugs to each other, respectively. Using the calculation of Frobenius norm multiplied by the elements in A and B is because of we would like more similar cell lines (drugs), have closer latent vectors in the latent space. But, the parameters a and b determine the effectiveness of these two matrices A and B in the objective function. By these strategies and tuning the parameters a and b, we determine the impact of cell line similarity and drug similarity in DSPLMF method.
By plugging two equations 15, 16 into formula 12, we will have the following:  Finally, we upgrade the formula 17 as follows: By this function, we try to predict the latent vectors of cell lines and drugs, where the similar cell lines or drugs have closer latent vectors to their KNNs.
For optimization the above function, the alternating gradient descent method was used. In each iteration of this algorithm, first U and b c i are fixed to compute V and b d j and then V and b d j are fixed to compute U and b c i . Besides, to accelerate the convergence, the AdaGrad algorithm was applied and the details of this algorithm are deposited in the Supplementary File 3 (Data Sheet 3). The objective function in formula 18 is denoted by Y and the partial gradients of biases and latent vectors are calculated as follow: Once the latent matrices U and V and the biases b c i and b d j have been learned, the probability of sensitivity cell line i to drug j can be estimated by logistic function in formula 2. Since in our model, the importance of the positive observations and negative observations are the same, we set r = 1 in this logistic function.

Prediction
When a new cell line is given, its information of IC50 of the drugs is unknown and Sim IC50 matrix values cannot be calculated, while it must be calculated to predict the latent vectors of this new cell line. In this section, we introduced a classification model for predicting t-most nearest neighbors by using the similarity values between cell lines which are obtained from gene expression profile, copy number alteration and singlenucleotide mutation information. The purpose of this model is to find t-most nearest neighbors for the new cell line and then to estimate the latent vector for this new cell line based on average of latent vectors of its neighbors. After obtaining the latent vector, we can predict the IC50 values across all drugs for the new cell line. For training the model, 10-fold cross validation technique is used on cell line dataset, so the dataset was partitioned into 10 equal-sized subsets, nine subsets were used as the train set for learning this classification model. A single subset was used as the test set to predict the t-most nearest neighbors for each cell line of this set. In this classification model, the amounts of Sim IC50 matrix of train set were converted to 0 or 1. To do this, the values of each row of the matrix are sorted in descending order and then t-largest values are set to 1 and remaining values are set to 0. Among the methods available for classification, we chose "Decision Tree Classifier" method. It is one of the predictive modeling approaches that used tree models to predict the value of a target variable based on several input features. Where leaves represent class labels and branches denote conjunctions of features that lead to those class labels. Learned trees can be represented as sets of ifthen rules. Decision tree classifier is a heuristic and nonbacktracking search through the space of all possible decision trees. The main idea of decision tree classification is recursively partition data into subgroups. The functionality of decision tree classification is as follows: Polat and Güneş (2007) • Choosing an attribute and formulating a logical attribute test.
• Branching on each test result, transferring subset of examples (training information) to the appropriate child node to satisfy that result. • Running each child's node recursively.
• The end rule indicates when a leaf node is to be declared.
For decision tree classifier, the three features of train set, Sim exp , Sim cnv , and Sim mut , are considered as input and 0 or 1 value of each pair (c i ,c j ) are considered as output and then as the classifier train. If the number of predicted nearest neighbors for a cell line was less than t, we considered them as nearest neighbors for this cell line. If this number was greater than t, t neighbors were selected randomly. Finally, u i was estimated as the average of latent vectors of neighbors of the new cell line c i .
When the latent vector of the new cell line is predicted, the probabilities that this cell line is sensitive to drugs are estimated. Eventually, a threshold on probabilities to assign sensitive or resistance class to each cell line-drug pair is applied. So if the predicted value is lower than this threshold for a cell line-drug pair, the resistance class is assigned to it; otherwise, it is labeled as a sensitive class.

RESULT
We empirically evaluate our proposed approach and compare it against some of the state-of-the-art methods. This section first describe evaluation criteria and then demonstrate the performance of DSPLMF method.

Evaluation Criteria
To evaluation the performance of DSPLMF method, the 10-fold cross-validation Was performed and this process was repeated 30 times. The mean of following criteria was obtained in the 30 times and it was used as the final criteria to evaluate the predictive performance of the methods.
where TP or true positive prediction is the number of cell lines labeled with sensitivity and predicted as sensitivity. TN or true negative is the number of cell lines labeled with resistance and predicted as resistance. FP or false positive is the number of cell lines labeled with resistance and predicted as sensitivity. FN or false negative is the number of cell lines labeled with sensitivity and predicted as resistance.
In addition to the above metrics, we used area under the receiver operating characteristic curve (AUC), which is one of the most important evaluation metrics for checking the performance of any classification model. This metric was calculated for the methods.

Comparison With the State-of-the-Art-Methods
To demonstrate the effectiveness of our method, we compared the predictive performance of the proposed model against the state-of-the-art-methods such as naive Bayes Barretina et al. (2012), SVM-RFE Dong et al. (2015), FSelector Soufan et al. (2015), CaDRReS Suphavilai et al. (2018), AutoBorutaRF Lu et al. (2019), and the AutoHidden method, which is constructed based on the hidden layer of the autoencoder in AutoBorutaRF method as features Lu et al. (2019).
All the methods mentioned above are classification models except the CaDRReS, since this method predicted IC50 values as output, a threshold was applied for its output. So if the value predicted for a cell line-drug pair is smaller than this threshold, the resistance class was assigned to it; otherwise, it was labeled with sensitive class. The median of the IC50 values was chosen as the best threshold for this algorithm. The results of the mentioned methods on two datasets GDSC and CCLE are shown in Tables 2 and 3, and the bold number represents the best result. The results of Table 2 show that the value of Accuracy criterion by DSPLMF has increased by 0.03 compared to the result of the best algorithm, AutoBorutaRF. Furthermore, the value of Recall, F 1 Score, MCC, and AUC criteria have increased by 0.10, 0.05, 0.06, and 0.05 compared to the best algorithm. Only in the case of the Specificity criterion, the naive Bayes method performs significantly better than the other methods. The reason is that this method has predicted zero class data for most of the data, and by looking at the result of other criteria, such as Accuracy, Recall, and F 1 Score for this method, we can see that this method does not predict sensitive class data very well. The results of Table 3 are the same as those in the previous table, except that the best result for the AUC criterion belongs to the AutoBorutaRF method, demonstrating the effectiveness of this method. The best result for the Specificity criterion belongs to the AutoHidden method; the low performance of other criteria indicates that this method is weak in predicting sensitive data. In general, the results of these two tables show that the DSPLMF significantly outperforms other methods. Thus, it is evident our method able to find much The 10-fold cross validation is applied on the evaluation metrics and the mean value of them is used as criteria for comparison. The 10-fold cross validation is applied on the evaluation metrics and the mean value of them is used as criteria for comparison.

Emdadi and Eslahchi DSPLMF
Frontiers in Genetics | www.frontiersin.org February 2020 | Volume 11 | Article 75 more useful features for drug response prediction rather than other methods. Overall, DSPLMF improvement on the GDSC dataset is stronger.

Performance of the Novel Regularization Approach
To evaluate the improvement of the logistic matrix factorization method by applying the novel regularization approach, we compared the predictive performance of the DSPLMF model against the logistic matrix factorization method without the novel regularization approach. In this model, the classification method for predicting t-most nearest neighbors for each new cell line by using the similarity values between cell lines which are obtained from gene expression profile, copy number alteration and singlenucleotide mutation information, is not applied. The result of the above algorithm based on seven criteria on GDSC and CCLE datasets is calculated, and the 10-fold cross-validation is applied on the evaluation metrics, and the mean value of them is used as criteria for comparison. The results of Tables 2 and 4 show that the value of Accuracy criterion by DSPLMF on GDSC dataset has increased by 0.10 compared to the result of the logistic matrix factorization method without the novel regularization approach. Furthermore, the value of Recall, Precision, Specificity, F 1 Score, MCC, and AUC criteria have increased by 0.04, 0.10, 0.17, 0.07, 0.21, and 0.14 compared to this algorithm. The results of Tables 3 and 4 show that the value of Accuracy criterion by DSPLMF on CCLE dataset has increased by 0.10 compared to the result of the logistic matrix factorization method without the novel regularization approach. Furthermore, the value of Recall, Precision, Specificity, F 1 Score, MCC, and AUC criteria has increased by 0.05, 0.11, 0.10, 0.09, 0.16, and 0.10 compared to this algorithm. So, using of the classification method for predicting t-most nearest neighbors of each new cell line in logistic matrix factorization algorithm, will increase the performance by 10%.

Specific Tissue of Cell Line Type
The data in the GDSC dataset is related to different cancers. To demonstrate the performance of DSPLMF method on cancer tissue type, 73 hematopoietic cell lines and 98 drugs from GDSC dataset are considered. This specific type of cell lines are used to train the proposed model and predicted responses for the drugs based on this tissue type. Figure 3 shows the results of all mentioned criteria on these cell lines for the DSPLMF method using 30 times 10-fold cross-validation. The mean of these values are shown in Table 5. As the table shows, if the algorithm is specifically run on a particular type of cancer, it would be expected to yield better results than when considering different types of cancer. These results indicate that DSPLMF can also achieve consistent performance on a specific type of cancer.

Correlation Between Predicted and Observed Responses Values
For further evaluation and to demonstrate the performance of the proposed algorithm, the scatter plots of observed versus predicted responses values for four drugs in CCLE are illustrated in Figure 4. The values predicted by our model are probabilities that cell lines are sensitive to the drugs. For calculation correlation between predicted and observed responses values, the values (u i v T j + b c i + b d j ) in Formula 2 as the predicted IC50 values for cell line c i and drug d j were used. As the plots indicate, there is a high correlation between observed and predicted response values. The scatter plots of all 24 drugs in the CCLE dataset are illustrated in the Supplementary File 2 (Data Sheet 2: Figures S1-S4).

Learning Hyperparameters
For tuning hyperparameters, GDSC dataset has been used, and the obtained hyperparameters are considered for both datasets. The 10-fold cross-validation procedure is applied on GDSC and hyperparameters are chosen empirically by maximizing the summing up of the Accuracy, Recall, Precision, Specificity, F 1 Score, and MCC criteria. For each set of hyperparameters, the whole 10-fold process is repeated 30 times and the average value of the above summing has been calculated. Since the search space of hyperparameters values is large, a grid-search procedure for choosing the hyperparameters was applied.
The dimension of latent space, L, was selected between 1 and 98, the number of drugs. The number of KNNs for building N k (c i ) in equation 13 and the number of t-nearest neighbors in prediction section, were selected from 1 to 50 by step 2. The impact factors of nearest neighbors a and b in equations 15 and 16 were picked from {2 -5 , 2 -4 ,…, 2 2 } and the variance controlling parameters, l c and l d , were chosen from{2 -5 , 2 -4 ,…, 2 1 }. The g, l, f and y parameters The 10-fold cross validation is applied on the evaluation metrics and the mean value of them is used as criteria for comparison. represent the importance of each similarity measure between cell lines in formula 1 and were selected from 1 to 10. Threshold parameter applied on equation 2 for determining the label of the class for each new cell line c i , and was picked from 0.1 to 1 by step 0.1, and the best accuracy of the result is obtained by threshold=0.6. In Table 6, the learned hyperparameters using GDSC dataset is shown. For both datasets, these tuned hyperparameters are used to design the model, except to L, that is calculated for CCLE dataset separately and for this dataset it is set as 23.

Cell Line Subtypes in Latent Space
We used 555 cell lines from different cancerous tissue types in GDSC dataset. For representing the higher similarity between latent vectorsũ i of the cell lines from the same tissue type rather than the cell lines from different tissue types, the t-SNE plot for some tissue types of cancer cell lines is shown in Figure 5. Top five most frequent tissue types including, breast, central nervous system, hematopoietic and lymphoid tissue, COREAD, and lung cancer were considered. As it can be seen from Figure 5 (A), the embedded latent vectors of the cell lines with the same tissue type are located closer than the cell lines with diverse tissue types. This suggests that the proposed method assigned more similar latent vector to cell lines with the same tissue type. In the following, we consider an example of some latent vectors and the similarities between them: Let v 1 , v 2 and v 3 are three latent vectors obtained DSPLMF method of length 95 corresponding to Breast cancer cell line BT − 20, Breast cancer cell line BT − 549 and hematopoietic cancer cell line CA46, respectively. v 1 = [0.01, 0.23, −0.14,…, 0.12] 1x95 , v 2 = [0.17,0.67, −0.1,…,0.34] 1x95 and v 3 = [0.89, −0.9, 0.55,…, −0.17] 1x95 . Similarity (v 1 ,v 2 ) = 0.78, Similarity (v 1 ,v 2 ) = 0.13 and Similarity(v 2 ,v 3 ) = 0.04. As the results show, two vectors belonging to the same tissue types are more similar than two vectors that belong to two different tissue types. Also, in the t-SNE plot, these two vectors belonging to the same tissue types are closer than two vectors that belong to two different tissue types.
In Figure 5B, the latent vectors of different subtypes of lung cancer were considered. These different subtypes are: adenocarcinoma, large cell, squamous cell, and small cell carcinoma. In this figure, the closeness of vectors The 10-fold cross validation is applied on the evaluation metrics and the mean value of them is used as criteria for comparison. corresponding to cell line of the same subtype in this cancer justifies the efficiency of obtained latent vectors.

Investigation Drug-Pathway Association
For inferring drug-pathway associations, the heatmap of Pearson correlation between predicted drug responses and pathway activity scores similar to Suphavilai et al. (2018) is used. We considered 50 Biocarta pathway gene sets from MSigDB Liberzon et al. (2011), and pathway activity scores for CCLE cell lines were calculated as follows: Let PW is a pathway and G(PW) = {g 1 ,g 2 ,…, g r } is the set of genes corresponding to pathway PW. Let fold-change value of g i in cell line c j is x ij , which is obtained by: x ij = Log 2 (expression intensity of g i in cell line c j )− median(Log 2 (expression intensity of g i in all cell lines)) Pathway activity score of pathway PW for cell line c j , PAS j (PW) was calculated by formula 22.
Pathway activity score of PW for all cell line, PAS(PW), are considered as the vector PAS(PW) = [PAS 1 (PW), PAS 2 (PW),…, PAS n (PW)], where n is the numbers of cell lines. Also, the predicted drug responses by DSPLMF for each drug were considered as the vector IC50 predicted = [IC 1 , IC 2 ,…,IC n ].
Then, the association between drug d j and pathway PW is computed by the Pearson correlation between IC50 predicted for drug d j and PAS(PW). A positive correlation indicates that a pathway plays a role in drug resistance and negative correlation demonstrated that a pathway is important in drug sensitivity. The result of the Pearson correlation of 30 pathway gene sets and 24 drugs of CCLE dataset is shown in Figure 6 and the result of 20 other pathways is represented in the Supplementary File 1 (Data Sheet 1). In this figure, the blue is represented the assistance and the red is represented the resistance case. Below, we investigated several instances that indicates consistency between the result of calculated Pearson correlation and previous studies and researches.
• The activation score of the HDAC (Histone deacetylases) pathway is negatively correlated (assistant association) with predicted IC50 value of some drugs such as Panobinostat.   inhibitors drugs, namely, Lapatinib, Erlotinib, Vandetanib, and AZD0530. These observations matched the previous study that denoted the amplification of the EGFR gene is correlated with a high response to EGFR inhibitors Normanno et al. (2006). Moreover, the predicted IC50 values of the Crizotinib (ALK-inhibitor) were positively correlated with the activity score of this pathway and this issue was confirmed in the previous studies Sasaki et al. (2011). • The MTA3 (Downregulated of MTA-3 in ER-negative Breast Tumors) pathway was associated (positively correlated) with two predicted IC50 vectors belong to L-685458(gammasecretase) and PD-0332991(CDK4/6) drugs. Therefore, the cell lines with inactivated MTA3 pathway tend to sensitive to these two drugs Suphavilai et al. (2018). • The VEGF-Hypoxia-Angiogenesis (VEGF) pathway was assistance associated with two RAF inhibitors drugs, namely, PLX4720 and RAF265 drugs that were verified in the previous researches. One of these studies considered inducing the VEGF expression by Raf promotes angiogenesis and blocking RAF/MEK/ERK pathway by RAF inhibitors McCubrey et al. (2007). Moreover, the activity FIGURE 6 | Drug-pathway association based on Cancer Cell Line Encyclopedia (CCLE) dataset. For visualization, 30 Biocarta pathways across 24 drugs were selected. Negative and positive correlations between pathway activity and drug sensitivity scores are denoted as being "assistant" and "resistant" associations, respectively. The blue color is represented the assistance and the red color is represented the resistance.
score of the VEGF pathway was negatively correlated with Sorafenib drug Liu et al. (2006). • The activity score of the mTOR Signaling Pathway that is a central regulator of metabolism and physiology was negatively correlated with predicted Ic50 vector of some drugs such as Panobinostat. Various preclinical studies have been performed to combine panobinostat with several drugs as mTOR inhibitor Singh et al. (2016). • It has been shown that c-met inhibitor drugs such as PHA-665752 and Crizotinib can inhibit WNT pathway activity in tumour cells. We observed the activity score of this pathway was negatively correlated with predicted IC50 vectors of these drugs Tuynman et al. (2008); Zhang et al. (2018). • The assistant association was observed between L − 685458 drug and IGF-1 MTOR pathways. These observations were also reported by Shih et al Shih and Wang (2007). • We observed that the MEK inhibitors such as AZD6244 and PD − 0325901 were positively correlated with activity scores for the EIF2 pathway. Therefore, as mentioned in the previous researches, the cell lines with inactivated EIF2 p athway were sensitive to these drugs Quevedo et al. (2000); Liberzon et al. (2011).

Conclusion
In this work, we introduce a novel method for cancer drug sensitivity prediction based on a recommender system approach. A logistic matrix factorization is applied to predict the extent to which a cell line is sensitive to a drug. The advantage of this method is to obtain latent features of cell lines and drugs for better prediction performance. Since the similarity information of cell lines and drugs can improve higher predictive power, some information such as gene expression profile, copy number alteration and singlenucleotide mutation data for cell lines and Chemical structures of drugs are used.
To demonstrate the validity of DSPLMF method for identifying drug response 10-fold cross validation on CCLE and GDSC datasets are performed. The comparison of DSPLMF with six other the state-of-the-art prediction methods showed that DSPLMF outperformed other methods. The results indicated that the proposed method was able to uncover much more effective features than the other methods for drug response prediction.

DATA AVAILABILITY STATEMENT
The source code of proposed method and Datasets folder for GDSC and CCLE datasets as input data are available in https:// github.com/emdadi/DSPLMF and Supplementary File 4 (Data Sheet 4).

AUTHOR CONTRIBUTIONS
AE designed the algorithm, performed the experiments, and wrote the main manuscript text and the programming codes. CE conducted the experiments and analyzed the results. All authors reviewed the manuscript.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online a t : h t t p s : / / w w w . f r o n t i e r s i n . o r g / a r t i c l e s / 1 0 . 3 3 8 9 / fgene.2020.00075/full#supplementary-material SUPPLEMENTARY FILE 1 (DATA SHEET 1) | Results of drug pathway association on CCLE dataset.
SUPPLEMENTARY FILE 2 (FIGURES S1-S4) | The scatter plots of all 24 drugs in the CCLE dataset.