Prediction of Drug–Target Interactions From Multi-Molecular Network Based on Deep Walk Embedding Model

Predicting drug–target interactions (DTIs) is crucial in innovative drug discovery, drug repositioning, and related fields. However, traditional biological experimental methods for identifying DTIs suffer from high cost, long turnaround time, and low efficiency, which makes them difficult to apply widely. As a complement, in silico methods can provide helpful information for DTI prediction in a timely manner. In this work, a deep walk embedding method is developed for predicting DTIs from a multi-molecular network. More specifically, a multi-molecular network, also called a molecular associations network, is constructed by integrating the associations among drugs, proteins, diseases, lncRNAs, and miRNAs. Then, each node is represented as a behavior feature vector by using a deep walk embedding method. Finally, we compared behavior features with traditional attribute features on an integrated dataset by using various classifiers. The experimental results revealed that the behavior features performed better across different classifiers, especially with the random forest classifier. It is also demonstrated that the use of behavior information is very helpful for addressing the problem of sequences containing both self-interacting and non-interacting pairs of proteins. This work is not only well suited to predicting DTIs, but also provides a new perspective for predicting associations among other biomolecules.


INTRODUCTION
Prediction of drug-target interactions (DTIs) is one of the most important steps in the genomic drug discovery pipeline and in drug repurposing (Knowles and Gromo, 2003; Yildirim et al., 2007); its purpose is to discover putative new drugs and new uses for existing drugs. To our knowledge, the effects of many useful protein targets, including enzymes, ion channels, G protein-coupled receptors, and nuclear receptors, are modulated by their interactions with ligands (Yamanishi et al., 2010). The development of rapid sequencing technology and the implementation of the Human Genome Project have produced massive amounts of biological data and given birth to a new discipline, computational biology. Before this, many traditional biological experimental methods were used to discover the relationships between proteins, such as co-immunoprecipitation (Co-IP), tandem affinity purification (TAP), glutathione-S-transferase (GST) pull-down, phage display technology, and the yeast two-hybrid system. However, owing to limitations in throughput, precision, and cost, it is often difficult to identify DTIs on a large scale using traditional biological experimental methods. Therefore, computer-assisted methods are increasingly used in DTI prediction, and they provide an effective means for the discovery and screening of lead compounds.
Recently, several computational methods have been developed to discover DTIs (Chan and You, 2016; Luo et al., 2017). Many researchers have made great efforts to develop useful algorithms for various DTI-related prediction problems. The most commonly used approaches include docking simulations, literature text mining, machine learning, and network-based methods. Luo et al. (2017) proposed a network integration method for DTI detection and computational drug repositioning from heterogeneous information. Wong et al. (2015) analyzed the docking modes of 20 drugs and 28 proteins, determined that 13 drugs could target 11 proteins at the same time, and designed multi-target drug complexes to disrupt the mechanisms of action of various cancers. Heinemann et al. (2016) systematically analyzed publication patterns appearing along the drug discovery process of targeted cancer therapies in the literature, and provided a support tool for novel drug development. Mayr et al. (2018) obtained different types of molecular descriptors on a ChEMBL dataset, and carried out a wide-ranging comparison of several machine learning models for detecting DTIs. Lu et al. (2019), based on the assumption that similar drugs share similar patterns of relationships with target proteins, proposed a heterogeneous network embedding model, called HNEDTI, that predicts DTIs by integrating the drug-drug similarity network, the target-target similarity network, and known DTIs into a heterogeneous network. Zhang et al. (2019) described how to calculate drug-drug and target-target similarities, and summarized, analyzed, and compared different machine learning-based prediction models. Based on these methods, we proposed a multi-molecular network, also called a molecular associations network (MAN; Guo et al., 2019), to detect the interactions between drug candidates and related target proteins.
In the MAN, we not only used DTI data, but also added interaction information for other biomolecules to the network. The main idea of this work comes from computational systems biology (Kitano, 2002; Materi and Wishart, 2007), network biology (Barabasi and Oltvai, 2004; Emmert-Streib and Glazko, 2011; Cahan et al., 2014), and network representation learning (Yang et al., 2015; Zhang et al., 2018). Computational systems biology aims to reveal new biological characteristics from a systematic perspective and uses interdisciplinary tools to integrate and analyze large amounts of complex heterogeneous data from various experiments. It plays a key role in understanding many complex processes occurring in biological systems. Subsequently, as more and more large and diverse data were collected at multiple levels of systems biology, Barabasi and Oltvai (2004) proposed network biology to understand the cell's functional organization. Network biology refers to studying biological system networks using mathematical methods, graph theory, and network topology models. These studies have shown that cellular networks obey the general rules of network science, which is helpful for understanding the interactions between molecules inside a living cell. Afterward, inspired by deep learning and word embedding technology in natural language processing (NLP), automatically learning vector representations of network nodes has become a research hotspot (Goldberg and Levy, 2014; Pennington et al., 2014; Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019). This line of work has gradually been applied to the field of bioinformatics.
To summarize, Guo et al. (2019) were the first to propose a MAN that integrates the associations among miRNA, lncRNA, protein, drug, and disease, in which any kind of potential association can be predicted. In this paper, we constructed a biomolecular relationship network that contains nine kinds of associations among five types of molecules. All the molecules in the MAN were treated as nodes and all the relationships were regarded as edges. The associations between a node and the other nodes in the complex network are called the behavior of the node. This work uses two kinds of important information: the original attribute information of the node itself (e.g., sequences of proteins, molecular fingerprints of drugs) and the behavior information of the biomolecules. Then, a comparative experiment was carried out with a random forest (RF) classifier. The experimental results show that, for DTI prediction, the behavior of a node contains more useful information than the attributes of the node, and that better results can be obtained with it.

RESULTS AND DISCUSSION
To illustrate that the behavior features of nodes contain more useful information than the traditional attribute features of biomolecules, we compared the performance of well-known classifiers on these two types of features under five-fold cross-validation, using several evaluation criteria. Cross-validation is mainly used to guard against over-fitting caused by over-complicated models. It is a statistical method for evaluating how well a model trained on the available data generalizes. For five-fold cross-validation, the original data is randomly divided into five parts; each time, four parts are used as the training set and the remaining part is used as the test set. This procedure was repeated five times, and the average accuracy over the five runs was taken as the evaluation index of the final model. In this work, the sizes of the five training sets are 17,770, 17,770, 17,770, 17,770, and 17,776, respectively.
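The five-fold procedure described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names `k_fold_indices` and `cross_validate`, the seed, and the rule for distributing the remainder across folds are all assumptions, so the resulting fold sizes need not match those reported in the text.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition of the data
    # Assumption: spread the remainder over the first folds (sizes differ by at most one).
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Average a user-supplied evaluate(train_idx, test_idx) score over the k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Here `evaluate` stands for any routine that trains a classifier on the training indices and returns a score on the test indices.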

Performance Evaluation With Support Vector Machine on Two Different Features
In the experiment, we employed the state-of-the-art Support Vector Machine (SVM) method to compare the performance of the two different feature types on the integrated dataset: attribute features and behavior features. The attribute features are obtained from the molecular sequence information. The behavior features are derived from the MAN. We hypothesized that the MAN may assist in improving prediction performance. To keep the comparison fair, we used the same model parameters for both feature types. The results are shown in Tables 1 and 2.
Receiver operating characteristic (ROC) curves are widely applied in fields such as machine learning and data mining. We also used ROC curves, which plot the True Positive Rate against the False Positive Rate as the decision threshold varies, as a comprehensive measure. The area under the curve (AUC) summarizes the predictive performance of the classifier: the larger the AUC, the better the classifier. The ROC curves of the SVM classifier based on attribute features and behavior features under five-fold cross-validation are shown in Figures 1 and 2, respectively. The average AUC is 0.7028 using attribute information and 0.8188 using behavior information based on the MAN. Hence, the behavior information of nodes plays an important role in DTI prediction.
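As an illustration of what the AUC measures, the sketch below computes it directly as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which is equivalent to the area under the ROC curve. The function name `roc_auc` and the toy labels and scores are hypothetical; the quadratic loop is chosen for readability, not efficiency.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])  # 0.75
```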

Performance Evaluation With Random Forest on Two Different Features
To illustrate that the behavior features are indeed better than the attribute features, on a single linear classifier as well as on an ensemble classifier, we also applied the RF model in our experiment. As before, we used the same model parameters for both feature types; the results are shown in Tables 3 and 4. The ROC curves of the RF classifier based on attribute features and behavior features under five-fold cross-validation are shown in Figures 3 and 4, respectively. The average AUC is 0.8779 using attribute information and 0.9206 using behavior information based on the MAN. Thus, the behavior information of nodes again plays an important role in DTI prediction.
As shown above, the constructed MAN enables accurate DTI detection because more behavior information can be obtained from the complex biomolecular associations network. The presented network therefore contributes substantially to the prediction of DTIs. The main innovations can be summed up in the following two aspects: (1) Construction of the MAN, which integrates five types of biomolecules and nine known relationships between them. It provides a novel and potentially helpful tool for predicting new DTIs in bioinformatics; (2) Behavior features were obtained by the deep walk network embedding method, which can further improve the performance of classifiers. This method extracts more helpful information from the data than traditional attribute features. In short, the experimental results revealed that the presented network is not only well suited to DTI prediction, but also applicable to predicting other biomolecular associations.

Datasets Construction
In this article, the heterogeneous data input to the MAN is collected from nine known relationships: DTIs, drug-disease associations (DDAs), protein-protein interactions (PPIs), protein-disease associations (PDAs), lncRNA-target interactions, protein-miRNA interactions, lncRNA-disease interactions, lncRNA-miRNA associations, and miRNA-disease associations, which are shown in Table 5. These known relationships involve five types of biomolecules: drug, protein, disease, lncRNA, and miRNA, which are listed in Table 6. The MAN contains the topological relationships and distributions among all the molecules in the heterogeneous network. By considering both local and global connection modes, this work describes the basic context and intrinsic connection profile of every node. Therefore, the prediction of DTIs can draw on the connection relationships of the other nodes in the network.

Multi-Molecular Network
From the nine known relationships among five types of biomolecules, annotated in the well-known databases mentioned above, we constructed a multi-molecular network, also called the MAN, by linking any two nodes that share a known association. The complex MAN is shown in Figure 5. Based on the known associations, some biomolecules are suggested to interact with each other. In the network graph, the heterogeneous nodes correspond to the five types of biomolecules (drug, protein, disease, miRNA, and lncRNA), and the edges correspond to the associations among them. The construction of this systematic MAN provides a new perspective for predicting interactions between drugs and targets.
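A minimal sketch of how such a network might be assembled from association pairs is given below. The function name `build_network`, the `type:id` naming scheme, and the toy edges are hypothetical placeholders rather than records from any of the source databases; the sketch simply treats every association as an undirected edge.

```python
from collections import defaultdict

def build_network(associations):
    """Build an undirected adjacency map from (node_a, node_b) association pairs."""
    adjacency = defaultdict(set)
    for a, b in associations:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Sorted lists make the neighbor order deterministic.
    return {node: sorted(neighbors) for node, neighbors in adjacency.items()}

# Hypothetical toy edges; identifiers are placeholders, not database entries.
toy_edges = [
    ("drug:D1", "protein:P1"),     # drug-target interaction
    ("drug:D1", "disease:S1"),     # drug-disease association
    ("protein:P1", "protein:P2"),  # protein-protein interaction
    ("miRNA:M1", "protein:P2"),    # protein-miRNA interaction
    ("lncRNA:L1", "miRNA:M1"),     # lncRNA-miRNA association
    ("lncRNA:L1", "disease:S1"),   # lncRNA-disease association
]
network = build_network(toy_edges)
```

In this representation, the behavior of a node is simply the set of its neighbors, regardless of their molecule type.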

Drug Molecular Fingerprint
The drug molecular data was extracted from the DrugBank database. To process these data further, we calculated the Morgan fingerprints of the drug molecules with the RDKit tool (Landrum, 2013) in Python. The main idea of the molecular fingerprint method is that a molecular structure is encoded as a set of substructure fingerprints in a series of binary bits: a kernel is applied to a molecule to generate a bit vector or count vector. Substructure patterns can be matched using query molecules built from SMARTS strings taken from a predefined dictionary (Guba et al., 2015). For example, there is a SMARTS-based implementation of the 166 public MACCS keys (Cereto-Massagué et al., 2015). As shown in Figure 6, each fingerprint bit corresponds to a fragment of the molecule: if the corresponding known fragment appears in the given molecule, the bit is set to 1; otherwise, it is set to 0. Thus, each molecule can be represented as a Boolean array. Although this method divides the whole molecule into a great many fragments, the representation still retains the complexity of the drug molecule.

Protein Sequence
All protein sequence information was collected from the STRING database. For protein sequences, the 20 types of amino acids were classified into four categories according to the polarity of their side chains: (Ala, Val, Leu, Ile, Met, Phe, Trp, Pro), (Gly, Ser, Thr, Cys, Asn, Gln, Tyr), (Arg, Lys, His), and (Asp, Glu). Each protein sequence was then transformed into a 64-dimensional (4 × 4 × 4) feature vector by counting the frequency of every 3-mer of category labels appearing in the whole protein sequence; each dimension of the vector is the normalized frequency of the corresponding 3-mer in the sequence (Rizk et al., 2013).
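The 64-dimensional encoding can be sketched in a few lines of Python. The four groups follow the text, written with one-letter amino acid codes. Two details are assumptions of this sketch rather than statements from the source: frequencies are normalized by the total number of 3-mers, and residues outside the 20 standard amino acids are dropped. The name `protein_3mer_features` is hypothetical.

```python
from itertools import product

# Four side-chain polarity groups as listed in the text (one-letter codes).
GROUPS = ["AVLIMFWP", "GSTCNQY", "RKH", "DE"]
AA_TO_GROUP = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}
KMERS = ["".join(p) for p in product("0123", repeat=3)]  # 4 x 4 x 4 = 64 patterns

def protein_3mer_features(sequence):
    """Return the 64 normalized 3-mer frequencies over the reduced 4-letter alphabet."""
    # Assumption: non-standard residues (e.g. X, B, Z) are dropped before counting.
    reduced = "".join(AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP)
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(reduced) - 2):
        counts[reduced[i:i + 3]] += 1
    total = max(len(reduced) - 2, 0)
    # Assumption: normalize by the number of 3-mers; very short sequences give all zeros.
    return [counts[k] / total if total else 0.0 for k in KMERS]

# "AVKDE" reduces to "00233", which contains the 3-mers 002, 023, and 233.
features = protein_3mer_features("AVKDE")
```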

Network Embedding-DeepWalk
Perozzi et al. (2014) proposed DeepWalk, which learns latent representations of the vertices in a network. Analogous to word2vec, it uses the co-occurrence relationships among the nodes of the graph to learn vector representations of the nodes. The DeepWalk method has two stages: (1) Node sequences are constructed. Locally associated training data is obtained by applying a random walk generator that samples from each node in the homogeneous network, which yields a sequence for each node in a way that imitates text generation; (2) Skip-Gram is trained on the sampled sequences so that the discrete nodes are represented as vectors, and Hierarchical Softmax is used to handle the very large-scale classification over all nodes.

Generation of Sequence of Nodes
In the MAN, the five kinds of research objects (miRNA, lncRNA, drug, protein, and disease) were treated as nodes of a single homogeneous network at the cellular level. Given a network graph $G$, a random vertex $v_i$ is uniformly sampled as the root of a random walk. The walk then repeatedly samples uniformly from the neighbors of the current vertex until it reaches the maximum length. In this way, the process of text generation is simulated to obtain sequence information for each node in the network, e.g., $V_{14} \to V_{11} \to V_{12} \to V_{13}$, $V_{27} \to V_{23} \to V_{24} \to V_{21} \to V_{22}$, $V_{34} \to V_{32} \to V_{36} \to V_{31} \to V_{37}$, and so on. Random walks on the MAN are shown in Figure 7. Afterward, each node sequence is treated as a sentence, as in NLP, and used as input to word2vec to obtain the vector representation of the nodes.
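The walk generation step can be sketched as follows. This is an illustrative implementation in plain Python, not the authors' code. The names `random_walk` and `generate_walks`, the toy graph, and the parameter values are hypothetical. Following the usual DeepWalk convention, each pass visits every node once, in shuffled order, as the root of a walk.

```python
import random

def random_walk(adjacency, start, walk_length, rng):
    """Uniform random walk of at most walk_length nodes starting at `start`."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adjacency.get(walk[-1], [])
        if not neighbors:
            break  # dead end: the current node has no neighbors
        walk.append(rng.choice(neighbors))
    return walk

def generate_walks(adjacency, walks_per_node, walk_length, seed=0):
    """Start `walks_per_node` walks from every node, visiting roots in shuffled order."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        for root in nodes:
            walks.append(random_walk(adjacency, root, walk_length, rng))
    return walks

toy_graph = {  # hypothetical adjacency lists
    "v1": ["v2", "v3"],
    "v2": ["v1", "v3"],
    "v3": ["v1", "v2", "v4"],
    "v4": ["v3"],
}
walks = generate_walks(toy_graph, walks_per_node=2, walk_length=5, seed=42)
```

Each element of `walks` plays the role of one sentence for the subsequent word2vec training.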

Skip-Gram Model
Skip-Gram is one variant of the word2vec model (McCormick, 2016). It uses a node to predict its context, learns vector representations by maximizing the co-occurrence probability of nodes within a window, and ignores the order in which nodes appear in the sequence. Nodes with the same context obtain similar representations: the more frequently two nodes appear together in a sequence, the higher the similarity between them. Under an independence assumption, the co-occurrence probability can be written as a product of conditional probabilities:
$$\Pr\left(\{v_{i-c}, \ldots, v_{i+c}\} \setminus v_i \mid \Phi(v_i)\right) = \prod_{j=i-c,\; j \neq i}^{i+c} \Pr\left(v_j \mid \Phi(v_i)\right)$$
where $v_{i-c}$ and $v_{i+c}$ are the left and right context of the node $v_i$, and $c$ is the size of the window. In addition, each vertex $v_k$ is mapped to its current representation vector $\Phi(v_k) \in \mathbb{R}^{d}$. For each vertex in a sequence, the log-probability of the other nodes in its window given that vertex is computed, and the vector representation of the vertex is updated by stochastic gradient descent.
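To make the role of the window concrete, the sketch below lists the (center, context) pairs that a Skip-Gram model would be trained on for one node sequence and a window of size c. It covers only the pair extraction step, not the embedding update, and the name `skipgram_pairs` and the example walk are hypothetical.

```python
def skipgram_pairs(walk, window):
    """Return (center, context) pairs for every node and its neighbors within `window`."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs

# With window=1, each node is paired with its immediate left and right neighbors.
pairs = skipgram_pairs(["v4", "v3", "v1", "v2"], window=1)
```

Each pair contributes one conditional probability term to the product over the window.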

Classification Models
Classification is one of the important tasks in data mining. The so-called classification is to classify the unknown data into existing categories according to its characteristics or attributes. That is to say, using given categories and known training data to learn classification rules and classifiers, and then predicting the unknown data.

Support Vector Machines
Support Vector Machine (SVM) is a supervised machine learning algorithm that is mainly used for binary classification problems (Suykens and Vandewalle, 1999). In this algorithm, each sample is treated as a point in n-dimensional space (n is the number of features), and each feature value is the value of a specific coordinate. Classification is then carried out by finding the hyper-plane that separates the two classes. In the sample space, the separating hyper-plane can be described by the following linear equation:
$$w^{T}x + b = 0$$
where $w$ is the normal vector of the hyper-plane and $b$ is the offset. Assuming that the samples can be separated and that the labels of the two classes are {+1, −1}, the classifier is
$$f(x) = w^{T}x + b$$
where $f(x) > 0$ indicates the class labeled +1 and otherwise the class labeled −1. To maximize the distance between the hyper-plane and the nearest samples of the two classes on either side of it, we need to find two hyper-planes that are parallel to and equidistant from the separating hyper-plane:
$$w^{T}x + b = +1, \qquad w^{T}x + b = -1$$
The interval between these two hyper-planes is then maximized, which amounts to maximizing $1/\lVert w \rVert$. Thus, SVM can provide good generalization ability for classification problems.
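The decision rule and the margin can be illustrated with a small sketch for the linear case. The weight vector and offset below are hypothetical values chosen for easy arithmetic, not parameters fitted to any data, and the function names are illustrative. The sketch uses the standard result that the distance between the two margin hyper-planes is 2/||w||.

```python
import math

def decision_value(w, b, x):
    """f(x) = w . x + b for a linear separating hyper-plane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Label +1 when f(x) > 0, otherwise -1."""
    return 1 if decision_value(w, b, x) > 0 else -1

def margin_width(w):
    """Distance between the hyper-planes f(x) = +1 and f(x) = -1, i.e. 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = [3.0, 4.0], -5.0  # hypothetical parameters, not fitted to any data
label_far = predict(w, b, [3.0, 4.0])     # f(x) = 20, so the label is +1
label_origin = predict(w, b, [0.0, 0.0])  # f(x) = -5, so the label is -1
```

Because ||w|| = 5 here, the margin width is 0.4; a smaller ||w|| would give a wider margin.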

Random Forest
Random forest is a relatively novel machine learning model. In the 1980s, Breiman (2017) developed the classification tree, which performs classification and regression by repeatedly splitting the data in two and greatly reduces the amount of calculation. In 2001, Breiman combined classification trees into RFs, which randomize the use of variables (columns) and data (rows) to generate many classification trees and then aggregate the results of all the trees (Breiman, 2001). A random forest contains many decision trees, and these trees are built independently of one another. When a new sample is input to the forest, each decision tree judges which category the sample should belong to, and the sample is then assigned to the category chosen by the most trees.
When RF is used to assess feature importance, the score depends on the contribution of each feature to each tree in the forest. The contribution is usually measured by the Gini index or by the error rate on out-of-bag (OOB) data. Assuming that there are $n$ features $f_1, f_2, f_3, \ldots, f_n$, the Gini variable importance measure (VIM) of each feature $f_i$ is built on the Gini index of a tree node $q$, which in its standard form is
$$GI_q = 1 - \sum_{k=1}^{m} p_{qk}^{2}$$
where $m$ is the number of classes and $p_{qk}$ is the proportion of class $k$ in node $q$. The importance of $f_i$ is accumulated from the decrease in the Gini index at the nodes that split on $f_i$.
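The Gini-based contribution of a single split can be sketched as follows. This is the standard impurity-decrease calculation and is offered as an illustration, not as the authors' exact importance measure; the function names and the toy labels are hypothetical.

```python
from collections import Counter

def gini_index(labels):
    """Gini index 1 - sum_k p_k^2 of the class labels that reach a tree node."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Weighted impurity decrease produced by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)
    )
    return gini_index(parent) - weighted_children

# A perfectly separating split removes all impurity from a balanced two-class node.
decrease = gini_decrease([1, 1, 0, 0], [1, 1], [0, 0])  # 0.5
```

Summing such decreases over all nodes that split on a given feature, and averaging over trees, gives one common form of Gini importance.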

Performance Measurement Tools
To assess the effectiveness and stability of the constructed model, we computed five metrics: accuracy (Acc), recall (also called sensitivity, hit rate, or true positive rate, TPR), specificity (also called selectivity or true negative rate, TNR), precision (also called positive predictive value, PPV), and Matthews correlation coefficient (MCC). These metrics are defined as follows:
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
$$\mathrm{TNR} = \frac{TN}{TN + FP}$$
$$\mathrm{PPV} = \frac{TP}{TP + FP}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP is the number of true positives, i.e., true interacting pairs that are correctly predicted. FP is the number of false positives, i.e., true non-interacting pairs that are falsely predicted to interact. TN is the number of true negatives, i.e., true non-interacting pairs that are correctly predicted. FN is the number of false negatives, i.e., true interacting pairs that are falsely predicted to be non-interacting. In addition, a receiver operating characteristic (ROC) curve was plotted to evaluate the performance of the proposed method, and the AUC was calculated to assess the performance of the model.
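The five metrics can be computed from the confusion counts as in the sketch below. The function names and the toy predictions are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch only.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = interacting, 0 = non-interacting)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def _ratio(numerator, denominator):
    # Convention of this sketch: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0

def evaluate(tp, fp, tn, fn):
    """Return Acc, TPR, TNR, PPV, and MCC from the four confusion counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": _ratio(tp + tn, tp + fp + tn + fn),
        "TPR": _ratio(tp, tp + fn),
        "TNR": _ratio(tn, tn + fp),
        "PPV": _ratio(tp, tp + fp),
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }

counts = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])  # (2, 1, 1, 1)
scores = evaluate(*counts)
```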

CONCLUSION
In this study, we investigated the relationships among drug, protein, miRNA, lncRNA, and disease, and developed a novel method to discover potential interactions between drugs and targets on a large scale. We constructed a network, called the MAN, based on these five types of molecules and the nine kinds of pairwise relationships between them. In this network, each node obtains a feature vector from its behavior information, that is, from its relationships with other nodes as captured by the deep walk network embedding method. To our knowledge, this is the first report to predict DTIs from a complex heterogeneous network from an overall view at the cellular level. Experimental results demonstrated that our model achieved good prediction results, making it a promising new approach to DTI prediction. This work has potential applications in drug discovery and repositioning.

DATA AVAILABILITY STATEMENT
The raw data required to reproduce these findings cannot be shared at this time as the data also forms part of an ongoing study. Requests to access the datasets should be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
Z-HC and Z-HY conceived the algorithm, carried out analyses, prepared the data sets, carried out experiments, and wrote the manuscript. Z-HG and H-CY designed and performed the experiments. G-XL and Y-BW analyzed the experiments and checked the manuscript. All authors read and approved the final manuscript.

FUNDING
This work is supported in part by the National Natural Science Foundation of China, under Grants 61373086 and 61572506.