Gradient Boosting Decision Tree-Based Method for Predicting Interactions Between Target Genes and Drugs

Determining the target genes that interact with drugs—drug–target interactions—plays an important role in drug discovery. Identification of drug–target interactions through biological experiments is time consuming, laborious, and costly. Therefore, using computational approaches to predict candidate targets is a good way to reduce the cost of wet-lab experiments. However, the known interactions (positive samples) and the unknown interactions (negative samples) display a serious class imbalance, which has an adverse effect on the accuracy of the prediction results. To mitigate the impact of class imbalance and completely exploit the negative samples, we proposed a new method, named DTIGBDT, based on gradient boosting decision trees, for predicting candidate drug–target interactions. We constructed a drug–target heterogeneous network that contains the drug similarities based on the chemical structures of drugs, the target similarities based on target sequences, and the known drug–target interactions. The topological information of the network was captured by random walks to update the similarities between drugs or targets. The paths between drugs and targets could be divided into multiple categories, and the features of each category of paths were extracted. We constructed a prediction model based on gradient boosting decision trees. The model establishes multiple decision trees with the extracted features and obtains the interaction scores between drugs and targets. DTIGBDT is a method of ensemble learning, and it effectively reduces the impact of class imbalance. The experimental results indicate that DTIGBDT outperforms several state-of-the-art methods for drug–target interaction prediction. In addition, case studies on Quetiapine, Clozapine, Olanzapine, Aripiprazole, and Ziprasidone demonstrate the ability of DTIGBDT to discover potential drug–target interactions.


INTRODUCTION
Computational prediction of drug-target interactions (DTIs) plays a key role in drug discovery and repositioning Yu et al., 2015Yu et al., , 2017b. Drugs exert their functions by interacting with various targets, of which genes are one important group. Through binding, drugs can either enhance or inhibit the expressions of genes and thereby affect disease processes (Overington et al., 2006;Yu et al., 2016;Santos et al., 2017). However, in most cases, drugs may cause multiple side-effects because they can interact with several unintended targets. The identification of targets that interact with drugs by biological and chemical experiments is very laborious and expensive (Langley et al., 2017). Therefore, many studies have attempted to predict DTIs by using computational methods, to reduce the workload and costs in providing candidate DTIs for biologists to verify (Ding et al., ,b, 2019Shen et al., 2017).
Several prediction methods concentrate primarily on incorporating information from drug-target homogeneous networks (Mei et al., 2012;Xu et al., 2014aXu et al., ,b, 2016Li et al., 2015;Hao et al., 2017;Yu et al., 2017a). For example, Bleakley and Yamanishi constructed a support vector machine (SVM) framework named BLM, which is based on a bipartite local model, to predict DTIs (Bleakley and Yamanishi, 2009). However, because this method is trained with a large-scale bipartite graph model, high computational power is needed. Mei et al. analyzed DTI features from neighbors and predicted novel interactions (Mei et al., 2012); it is difficult to obtain enough neighbor information for this method. Ezzat et al. and Luo et al. incorporated topological information by applying a random walk on the homogeneous network and used graph regularized matrix factorization to calculate the propensities of DTIs (Ezzat et al., 2017;Luo et al., 2017). However, the accuracy of the results may be influenced when the features are projected into low-dimensional space, because some valuable information may be lost. Hao et al. proposed a method based on non-linear integral of similarity measurements (Hao et al., 2017). Although this method showed good performance, its accuracy depended heavily on the similarity measurements. DTI prediction has been treated as a binary classification problem in Lee's methods (Lee and Nam, 2018). The features of drugs and targets that were used for training a k-nearest-neighbors model were weighted by random walks. However, the known and unknown DTIs have a serious class imbalance, which has an adverse impact on prediction accuracy. In DDR, which was applied by Olayan et al., path category-based feature vectors were constructed to incorporate the topological information of the network, and a random forest was used for DTI prediction (Olayan et al., 2017). Random forest does not perform as well as in classification when it solves the regression problem, because it cannot yield a continuous output.
In this work, in order to further improve the accuracy of DTI prediction and mitigate the impact of class imbalance, we propose a novel computational method named DTIGBDT. We construct a drug-target heterogeneous network to extract features. A gradient boosting decision tree (GBDT)-based prediction model is used for calculating the propensities of interactions. We compare our approach with other prediction methods using various performance measurements: the results show that DTIGBDT outperforms the other methods.

MATERIALS AND METHODS
Our goal is to predict novel (that is, unknown) interactions between drugs and targets. In order to integrate the information of various connections and the node attributes, we construct a drug-target heterogeneous network. We then design a novel prediction model based on GBDT for the network, to obtain the interaction scores of drug-target pairs. The higher the score, the more likely they are to interact Zeng et al., 2017a).

Dataset for DTI Prediction
We obtained the drug-target interaction data from a published work (Luo et al., 2017). In this dataset, there are 1923 known DTIs, involving 708 drugs from DrugBank 5.0 (Wishart et al., 2017) and 1,412 targets from HPRD 9.0 (Keshava Prasad et al., 2008). For each pair of drugs and each pair of targets, we also extracted the similarities between them from these two databases. The similarity between two drugs was calculated by using the Tanimoto coefficient (Francesco et al., 2010), based on their chemical structures. The similarity between two targets is measured by the Smith-Waterman score (Wenhui et al., 2014), based on their primary sequences.

Construction of Drug-Target Heterogeneous Network
We defined a set of DTIs, which consists of a set of drugs D and a set of targets T, where D = {d 1 , d 2 ,..., d m } includes m drug nodes, and T = {t 1 , t 2 ,..., t n } contains n target nodes. The drug-target network can be considered as a heterogeneous network, which is constructed by a drug network and a target network. In these two networks, we added an edge to connect two drug nodes or two target nodes when the similarity between them were >0. Furthermore, the edge was weighted by the similarity between the two nodes. The edge between a drug and a target represented a known DTI and was weighted by 1. This heterogeneous network can be represented as in Figure 1A.
The interactions between D and T could also be represented as a matrix Y where Y ij is 1 if drug d i and target t j are observed to interact and 0 otherwise. The set of similarities between drugs was represented by S D ǫR m * m and the set of similarities between targets was represented by S T ǫR n * n . The element values in S D or S T are in the range of [0, 1] which represents how similar drugs or targets are to each other.

Similarity Calculation Based on Network and Selection of k Neighbors
Random walk with restart, a network diffusion algorithm, has been widely used to analyze complex biological network data (Köhler et al., 2008;Tong et al., 2008;Berger et al., 2010;Li and Patra, 2010;Xu et al., 2016;Cheng et al., 2018b;Gao et al., 2018). Random walk can consider the topological information of the network to fully analyze the potential associations between nodes. We conduct random walks on the drug and target networks separately, to extract the topological information of the networks. Based on these similarities, we select the k most similar neighbors for each node.
We take the drug network as an example to illustrate the random walk procedure. We defined a matrix N D , in which each element N D (i, j) describes the probability of a transition from d i to d j .
where S D (i, j) represents the similarity between two drugs, d i and d j . Next, we defined a matrix W t D ǫR m * m where W t D (i, j) is the probability that the walker reaches d j from d i after t iterations in the random walk process. The matrix W D t can be calculated as Equation (2).
where parameter a is the restart probability. The matrix W 0 D can be initialized by Equation (3).
The convergence condition of the random walk procedure is W t D − W t−1 D 1 < 10 −6 . After the condition is satisfied, the converged probability W t D (i, j) can be regarded as a similarity score between two drugs. This score incorporates the topological information in the drug network and is used to update the weight of the edge between d i and d j . Next, we selected the k most similar neighbors of d i based on the similarities. We obtained the matrix K D ǫR m * k where the ith row stores the k most similar neighbors of d i . Similarly, we conducted random walk on the target network to obtain the similarity matrix W t T (i, j)ǫR n * n and the matrix of the k most similar neighbors, K T ǫR n * k (Figures 1B,C).

Path Category-Based Features
Based on the assumption that similar drugs can usually interact with the same target and vice versa, we extracted an 18dimensional feature vector based on the path category for each drug-target pair. In this study, we worked with the path categories whose lengths are 2 and 3 (but not longer than that, because of the computational cost). If we limit paths to start at the drug nodes and end at the target nodes, there are only two path categories with length 2. These two categories can be denoted as C 1 : (D-D-T) and C 2 : (D-T-T), where D represents a drug node and T represents a target node. The four categories with paths of length 3 are C 3 :(D-T-T-T), C 4 :(D-D-T-T), C 5 :(D-D-D-T), and C 6 :(D-T-D-T). We considered these six categories of paths to predict whether the drug can interact with the target. In this process, we started from a given drug d i to reach a given target t j through a specific path category C h , where h is selected from {1, 2, 3, . . . , 6}. We only considered paths that pass through the k nearest neighbors of d i or t j . We denoted the set of such paths as R ijh . Next, for the qth path p q between d i and t j , we calculated a weight s by multiplying all weights on the edges of path p q as Equation (4).
where e x is the xth edge of p q , and w x is the weight of the edge. We defined three matrices V 1 ǫR i * j * h , V 2 ǫR i * j * h , and V 3 ǫR i * j * h , to store the features between d i and t j under each path category is the number of paths in the set.
We combined the three matrices into a new matrix where the row V f (i, j) represents the feature vector of d i and t j ( Figure 1D). We take the drug-target pair (d 7 , t 3 ) in Figure 1A as an example to describe the process of heterogeneous networkbased feature extraction. The paths from d 7 to t 3 are shown in Figure 2A, and the values of s for each path are listed in Figure 2B. There are two paths in the set R 733 , p 1 : d 7 -t 5 -t 2 -t 3 and p 2 : d 7 -t 5 -t 4 -t 3, and the values of s for these paths are 0.03 and 0.05, respectively. V 1 (7,3,3) is set as the sum of these s-values, 0.08. V 2 (7,3,3) is set as the maximum of them, 0.05. V 3 (7,3,3) is set as the number of the paths, 2.

DTI Prediction Model Based on GBDT
In our dataset, there are only 1,923 known drug-target interactions, while more than 300,000 interactions are unknown, which causes a serious class imbalance. Aiming to reduce the impact of class imbalance and make full use of the negative samples in the dataset, we constructed an ensemble learning model based on GBDT (Ye et al., 2009), and refer to it as DTIGBDT.
The feature of a drug-target pair was obtained by randomly sampling some of the features from V f (i, j). For each element in X i,j , we built a decision tree model that is used for predicting the potential DTIs. In this way, we obtained a set T i,j = {T 1 , T 2 . . . , T z } that denotes z decision trees. Finally, we obtained the interaction score of the pair by summing the score of all decision trees. This can be calculated as Equation (8).
where T k (x k ) represents the score of the decision tree T k . λ k is used to adjust the contribution of T k . The greater the value of score(i, j), the more likely d i is to interact with t j . We thereby obtained a matrixŶǫR m * n whereŶ ij = score (i, j) ( Figure 1E). We used the negative log-likelihood to calculate the loss of DTIGBDT.
where Y i,j is the actual interaction between d i and t j . We defined the objective function as Equation (10).
The first term is the loss of DTIGBDT. The second term is the regular term to prevent overfitting, and λ is the regularization parameter for adjusting this term's contribution. The converged Y is the interaction score matrix, which can be calculated by Figure 3.

Performance Evaluation Metrics
To evaluate our method and the state-of-the-art methods for DTI prediction, we performed five-fold cross validation (Cheng et al., 2015;Chen et al., 2017;Lin et al., 2017;Wei et al., 2017aWei et al., , 2018Zeng et al., 2017b;Bu et al., 2018;Su et al., 2018;Xu et al., 2018b,c). All known DTIs were randomly divided into five subsets with equal size, and the same operation was applied to the unknown interactions (Liu et al., 2017;Zhang et al., 2017;Zeng et al., 2018). In each cross-validation trial, a subset of known DTIs and another subset of unknown DTIs were selected in turn as the test set, while the remaining DTIs were used for training a prediction model. The known and unknown interactions were regarded as the positive and negative samples, respectively. After the prediction is performed, each sample was given a predicted score which represents the propensity of the drug to interact with the target. The positive and negative samples were ranked by their score. The higher the positive samples were ranked, the better was the prediction performance. For a given threshold δ, if the score of a positive sample was >δ, it was considered as a true positive sample (TP), and if the score was <δ, it would be considered as a false negative sample (FN). If the score of a negative sample was lower than δ, it would be regarded as a true negative sample (TN). If the score was <δ, it would be regarded as a false positive sample (FP). We obtained a receiver operating characteristic (ROC) curve (Streiner and Cairney, 2007) by calculating the true positive rates (TPRs) and false positive rates (FPRs) for various values of δ.
The areas under the ROC curves (AUCs) were used to evaluate the performance of each method (Lobo et al., 2008;Cheng et al., 2014Cheng et al., , 2018aDao et al., 2018;Feng et al., 2018;Nie et al., 2018;Tang et al., 2018;Xu et al., 2018a;Yang et al., 2018). It is generally believed that the closer the value of AUC is to 1, the better the performance is. However, in the case of imbalanced data, AUPR (the area under the precision-recall curve) can provide a more valuable metric (van Laarhoven et al., 2011;Saito and Rehmsmeier, 2015;Patel et al., 2017;Sahiner et al., 2017;Wei et al., 2017b;Jiang et al., 2018a,b). Therefore, we also used AUPR as another measurement to evaluate the performance of each method. The precision-recall curve was constructed by precision rates and recall rates, which are defined as Equation (12).
FIGURE 3 | Algorithm for predicting the potential drug-target interactions.
In addition, biologists usually select the top section of the prediction result for a wet-lab experiment to further validate. As a result, the accuracy of the top k candidates is more important for discovering novel DTIs. We demonstrate the recall rates within the top k (k = 50, 100, 150, 200, 250, 300) candidates to reveal how many of these positive samples are identified successfully.

Comparison With Other Methods
We compared DTIGBDT with four state-of-the-art methods for DTI prediction, including GRMF (Ezzat et al., 2017), DTINet (Luo et al., 2017), Lee's method (Lee and Nam, 2018), and DDR (Olayan et al., 2017). We describe these methods in more detail below. GRMF: This method proposed a matrix factorization-based model to predict novel DTIs. The drug-target interaction matrix Y were decomposed into two low-rank latent feature matrices A (for drugs) and B (for targets) by using the SVD algorithm. Alternating least squares was used to iteratively update A and B.
The optimization problem can be described as: whereL d andL t are the normalized graph Laplacians that were computed based on the similarities between drugs or targets. λ l , λ d , and λ t are parameters that adjust the contribution of the terms. The interaction scoreŶ i,j of drug d i and target t j can be calculated as:Ŷ where a i is the ith row of A and b j is the jth row of B. DTINet: Heterogeneous data sources provide diverse information for DTI prediction, so Luo et al. integrated four types of drug similarities and three types of target similarities. The random walk with restart algorithm was applied to extract the topological information of the drug network and the target network, and the result of the algorithm was a matrix S D . The low-rank model S D ≈ XW T used X to represent the corresponding low-dimensional feature vector of each drug. Similarly, the low-dimensional feature vectors of targets could be calculated and were represented by a matrix Y. Let P denote the interactions between drugs and targets; matrix Z can then be calculated by Equation (15).
The interaction score between drug d i and target t j was defined as follows: where x i is the ith row of X and is the feature vector of d i , and y j is the jth row of Y and is the feature vector of t j . Lee's method: In this method, each drug was represented by a bit vector, in which each bit suggests whether a specific sub molecular structure is contained by the drug. In addition, Lee et al. constructed a model based on random walk with restart to extract the topological information of the drug-drug interaction network. The rows of the matrix F d were used to store the bit vectors of each drug and a matrix N d was defined to denote the result of the random walk. The final representation of drug d i , denoted by ν d i , was calculated by Equation (17): where N d i and F d i are the ith row of N d and F d , respectively. Similarly, Lee et al. can calculate a vector ν t j to represent the target t j . The feature vector of the drug-target pair (d i , t j ) can be obtained by connecting ν d i and ν t j . On the basis of the Euclidean distance between each pair of drug and target, a k-nearestneighbor model was trained to infer whether a target interacted with the drug. DDR: DDR constructed a drug-target heterogeneous graph that contains the known DTIs with multiple drug similarities and target similarities. A non-linear similarity fusion method was performed to obtain the optimized drug similarities and the target similarities. For each drug-target pair, DDR constructed a path-category-based feature, which integrates the sum of the paths' weight and the maximum weight of the paths. A random forest-based model was performed to analyze the potential associations between each drug-target pair with these features.
Several parameters may influence the performance of DTIGBDT, including the restart probability a, the number of neighbors k, and the regularization parameter λ. The ranges of a, k, and λ are set to {0.2,0.4,0.6,0.8}, {10,20,30,40,50}, and {0.01,0.1,1,10}, respectively. The results of cross validation showed that our method achieves the best performance when a = 0.4, k = 30, and λ = 0.1. For fair comparison, the parameters of the other methods were also adjusted to obtain their best performance (n = 600, k = 5 in DDR; r = 0.8 in Lee's method; η = 0.5, d = 0.1, t = 0.1, l = 2 in GRMF; and λ = 1, r = 0.8 in DTINet). The performance of each method was obtained by using the optimum parameters in each case. The ROC curves and precision-recall curves of all these methods are shown in Figure 4. DTIGBDT achieves the best performance (AUC = 0.877, AUPR = 0.129), and it achieves 2.3% higher AUC and 4.3% higher AUPR than the second-best method, GRMF. Comparing to DTINet, DTIGBDT achieves 7.3% higher AUC and 5.7% higher AUPR. Both GRMF and DTINet have applied a lowrank model to reduce the dimension of the drug features and target features. However, a great deal of valuable information may be lost in this process. Lee's method does not perform well because it only used the same quantities of negative samples as that of the positive samples to train the k-nearestneighbor model and most of the negative samples were discarded. The AUC and AUPR of DTIGBDT are 11.6% and 9.7% higher than Lee's method, respectively. DDR shows the worst performance because its' prediction model fails to accurately estimate the interaction scores, and the AUC and AUPR of DTIGBDT are 12.9 and 6.6% higher than DDR, respectively. The superior performance of DTIGBDT is mainly due to our model based on GBDT that completely exploits all the negative samples.
We performed a paired t-test to evaluate whether DTIGBDT's performance (AUC and AUPR) is significantly better than that of other methods (Ruxton, 2006). The p-values are listed in Table 1

Case Studies on Five Drugs
To demonstrate the ability of DTIGBDT to discover potential DTIs, we used it to predict novel drug-related targets. We performed DTIGBDT for all the drugs. All the known DTIs were used to train the model, and the prediction results are listed in Supplementary Table 1. In particular, we executed case studies on five drugs, including Quetiapine, Clozapine, Olanzapine, Aripiprazole, and Ziprasidone. The top-ranked five candidate targets for each drug were collected and listed in Table 2. To confirm these novel interactions, we consulted several reference databases and the biomedical literature to support them.
DrugBank (Wishart et al., 2017) is a database with annotated cheminformatics resources which combines detailed drug data with target information. As shown in Table 2, 10 of the 25 novel interactions were reported in DrugBank, which confirms the drugs were indeed interacted with the targets. CheMBL (Gaulton et al., 2016) contains the binding and functional information of drug-like bioactive compounds and the information of their binding targets. Three of the 25 interactions were contained in CheMBL, indicating that these drugs can interact with their candidate targets. KEGG (Kanehisa and Goto, 2000) is another useful database dealing with genomes, biological pathways, drugs, and chemical substances. There are 15 interactions that can be found in KEGG, which suggests the expression of the genes can be upregulated or downregulated by the drugs. For example, the drug Aripiprazole can act as a potentiator to enhance the expression of the target gene GABRA1 in combination with another drug Phenobarbital.
In addition, a database named UniProt (Consortium, 2014), which collects the protein sequence and function information from research literature, is used to find whether a drug can interact with a specific target; this database includes two interactions. Specifically, the expression of two target genes, GABRG3 and GABRA4, can be reduced by drug Olanzapine to inhibit the activity of extracellular ligand-gated ion channels.  Finally, four novel interactions, which are labeled with "literature, " were confirmed by some of the published literature that can be found in PubMed (McEntyre and Lipman, 2001). These drugs were confirmed that they can enhance or inhibit the expressions of their candidate genes. For instance, Sugawara et al. found that drug Quetiapine can decrease the DNA methylation level of the promoter region of the gene SLC6A4 (Sugawara et al., 2015). Case studies suggests that DTIGBDT has powerful ability to discover the potential drug-interacted targets.

CONCLUSIONS
In this paper, we proposed a novel method, DTIGBDT, for predicting the target genes that interact with drugs. We incorporated topological information from the heterogeneous interaction network, and the feature vectors between the drugtarget pairs were constructed based on the path categories. A GBDT-based model was constructed for predicting candidate target genes, and it can mitigate the impact of class imbalance by completely exploiting the negative samples. The results of 5-fold cross-validation experiments confirm the superiority of DTIGBDT for DTI prediction. The case studies on five drugs further prove the ability of our model to discover the potential interactions. Therefore, DTIGBDT is a powerful tool which may provide reliable candidate target genes for subsequent identification of actual drug-target interactions with wet-lab experiments. In the future, we will develop our methods on parallel platforms (Zou et al., 2013;Guo et al., 2018) for handling the big data problem.

DATA AVAILABILITY
All datasets analyzed for this study are included in the manuscript and the Supplementary Files.

AUTHOR CONTRIBUTIONS
PX and CS conceived the prediction method. PX, CS, and YY they wrote the paper. CS and TS developed the computer programs. TZ and YD analyzed the results and revised the paper.