MGRL: Predicting Drug-Disease Associations Based on Multi-Graph Representation Learning

Drug repositioning mines existing drugs for new targets, which quickly uncovers new drug-disease associations and reduces the risk of drug discovery in traditional medicine and biology. It is therefore of great significance to design a computational model with high efficiency and accuracy. In this paper, we propose a novel computational method, MGRL, to predict drug-disease associations based on multi-graph representation learning. More specifically, MGRL first uses a graph convolutional network to learn graph representations of drugs and diseases from their self-attributes. Then, a graph embedding algorithm is used to represent the relationships between drugs and diseases. Finally, the two kinds of graph representation features are fed into a random forest classifier for training. To the best of our knowledge, this is the first work to construct a multi-graph to extract the characteristics of drugs and diseases for predicting drug-disease associations. Experiments show that MGRL achieves an AUC of 0.8506 under five-fold cross-validation, which is significantly better than existing methods. Case study results show the reliability of the proposed method, which is of great significance for practical applications.


INTRODUCTION
In recent years, the long timelines and high costs of developing new drugs have been significant constraints (DiMasi et al., 2003; Adams and Brantner, 2006). Most new drugs cost billions of dollars to develop and take many years to reach the market (Wei et al., 2019). Unfortunately, as the cost of drug development has risen, drug profits have fallen. Identifying potential drug-disease associations is a top priority in drug discovery, and the side effects of some drugs have been confirmed by clinical observation.
Recently, a large number of computational methods for predicting drug-disease associations have been proposed (Huang et al., 2013; Li et al., 2016; Zickenrott et al., 2016; Zhang et al., 2017a; Xue et al., 2018; Yella et al., 2018; Cui et al., 2019; Xuan et al., 2019; Chen et al., 2020; Jarada et al., 2020). Gottlieb et al. (2011) proposed a prediction method built on a similarity framework that computes drug-drug and disease-disease similarities and predicts unknown associations by constructing similarity features from known drug-disease associations. Luo et al. (2018) proposed a drug repositioning recommendation system that predicts new drug-disease associations by constructing a heterogeneous drug-disease interaction network. Wang et al. (2014) designed a computational framework based on a heterogeneous network model that calculates the similarity between drug and disease pairs through heterogeneous graphs of drug-target information. Zhang et al. (2017a) modeled the known drug-disease associations as a bipartite graph and proposed a similarity-based method to predict new drug-disease associations. Liang et al. (2017) proposed a new computational method that integrates the chemical, target domain, and target annotation information of a drug. Jiang et al. (2020) combined various disease and drug characteristics and proposed a method that fuses a sparse autoencoder with rotation forest.
Most existing methods discover potential drug-disease relationships by extracting similarities between drugs and diseases (Li and Lu, 2012; Zhang et al., 2014, 2017b, 2018; Luo et al., 2016). Chen et al. (2020) used network embedding and traditional attributes to predict drug targets by integrating the correlations among various molecules. Graph neural networks have been widely used in related biological and medical fields (Wang et al., 2020; Yue et al., 2020). Wang et al. (2019) proposed a prediction method that embeds drug-disease association networks using graph neural networks. Based on the similarity between drugs and diseases, Yu et al. (2020) introduced graph convolutional neural networks to predict potential drug-disease associations. However, because these methods depend on such information, only a handful of drugs and diseases with rich information can be used for prediction. Solving this challenge is therefore urgent. Inspired by existing research (Yi et al., 2019, 2020), we propose a representation learning method based on multi-graph that learns features from local and global perspectives, respectively.
In this paper, we propose a novel computational model based on multi-graph representation learning (MGRL) to predict drug-disease associations, which consists of three main parts. First, the self-attributes of drugs and diseases are pre-trained with a graph convolutional neural network to generate GCN features. Then, node2vec (Grover and Leskovec, 2016) is used to learn a network representation of the drug-disease associations. Finally, the two sets of features are combined, and latent drug-disease associations are predicted using a Random Forest classifier (Amaratunga et al., 2008). The overall workflow of MGRL is shown in Figure 1. Experimental results show that MGRL achieves higher accuracy and AUC in predicting new drug-disease associations than state-of-the-art methods. The case study shows that MGRL can help medical researchers discover new drug-disease associations.

Datasets
The Comparative Toxicogenomics Database (CTD) (Davis et al., 2017) provides information about the relationships among chemicals, gene products, and diseases. It integrates molecular pathway information to help uncover how chemicals and environmental influences relate to disease etiology and mechanisms. Following the treatment of CTD drug-disease associations by Zhang et al. (2018), we obtained 18,416 drug-disease pairs. We use the DrugBank database (Law et al., 2014) to obtain the chemical structures of drugs. DrugBank is an open and comprehensive drug resource that includes the chemical structures of drugs, drug targets, various proteases, and so on. The disease descriptions were collected from the Medical Subject Headings (MeSH). The benchmark dataset therefore contains 18,416 drug-disease pairs, involving 269 drugs and 598 diseases.

Drug Morgan Molecular Fingerprint
In this paper, the simplified molecular-input line-entry system (SMILES) is adopted (Weininger, 1988) to describe the chemical structure of drug molecules. The characteristics of the molecules are then computed as Morgan molecular fingerprints with RDKit (Landrum, 2013), a toolkit for representing chemical information.

Disease Semantic Description Information
In the experiment, the descriptors in the MeSH database were used to process the disease data (Wang et al., 2010). The data were downloaded from the National Library of Medicine (http://www.nlm.nih.gov/). The MeSH database provides a strict disease classification system, so it plays an essential role in the study of disease attributes and the relationships between diseases. In general, the MeSH descriptors form a directed acyclic graph (DAG) of diseases, in which diseases are represented by nodes. In other words, each disease can be represented as a DAG. For instance, disease $a$ is described by $DAG_a = (a, T_a, E_a)$, in which $T_a$ is the set containing $a$ itself and all of its ancestor nodes, and $E_a$ is the set of links among these nodes. Denoting the contribution of disease $t$ to the semantics of disease $a$ by $D_a(t)$, the following formula can be obtained:
$$D_a(t) = \begin{cases} 1 & \text{if } t = a \\ \max\{\mu \cdot D_a(t') \mid t' \in \text{children of } t\} & \text{if } t \neq a \end{cases}$$
where $\mu$ is the semantic contribution factor of the edge connecting a parent node $t$ to its child node $t'$. The semantic value of disease $a$ can then be defined as:
$$DV(a) = \sum_{t \in T_a} D_a(t)$$
In conclusion, the semantic similarity between two diseases $a$ and $b$ can be calculated from their relative locations in the DAG:
$$S(a, b) = \frac{\sum_{t \in T_a \cap T_b} \left(D_a(t) + D_b(t)\right)}{DV(a) + DV(b)}$$
where $D_a(t)$ and $D_b(t)$ are the semantic values of disease $t$ related to disease $a$ and disease $b$, respectively.
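The DAG-based similarity can be illustrated with a small sketch. It assumes the usual reading of this measure: a disease contributes 1 to its own semantics, each step up to a parent multiplies the contribution by the factor µ, and the largest value over all paths is kept. The value µ = 0.5, the function names, and the toy hierarchy are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List


def semantic_contributions(disease: str,
                           parents: Dict[str, List[str]],
                           mu: float = 0.5) -> Dict[str, float]:
    """Return D_a(t) for the disease itself and every ancestor t."""
    contrib = {disease: 1.0}          # a disease contributes 1 to itself
    frontier = [disease]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, []):
            candidate = mu * contrib[child]
            # Keep the largest contribution over all paths to this ancestor.
            if candidate > contrib.get(parent, 0.0):
                contrib[parent] = candidate
                frontier.append(parent)
    return contrib


def semantic_similarity(a: str, b: str,
                        parents: Dict[str, List[str]],
                        mu: float = 0.5) -> float:
    """Shared-ancestor contributions divided by the two semantic values."""
    d_a = semantic_contributions(a, parents, mu)
    d_b = semantic_contributions(b, parents, mu)
    shared = set(d_a) & set(d_b)
    numerator = sum(d_a[t] + d_b[t] for t in shared)
    return numerator / (sum(d_a.values()) + sum(d_b.values()))


# Hypothetical toy hierarchy (child -> list of parents), not real MeSH data.
toy_parents = {"a": ["p"], "b": ["p"], "p": ["root"], "root": []}
similarity_ab = semantic_similarity("a", "b", toy_parents)
```

In the toy hierarchy the two diseases share the ancestors `p` and `root`, so the numerator is 1.5 and the denominator is 3.5. A disease compared with itself gets similarity 1.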

Graph Convolutional Neural Network
Graph convolutional neural network (GCN) (Kipf and Welling, 2016) is considered as a graph-based semi-supervised learning method for node classification. GCN directly encodes the graph structure by using the neural network model and learns from the supervised target of labeled nodes. Its essence is the first-order local approximation of spectral convolution.
In this work, we consider the multi-layer graph convolutional network as follows:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
where $H^{(l)}$ is the network input of layer $l$ (initialized as $H^{(0)} = X$), $\tilde{A} = A + I$ is the adjacency matrix with added self-loops, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $W^{(l)}$ is the trainable weight matrix of layer $l$, and $\sigma$ is the activation function, for which ReLU is used.
The traditional graph convolutional neural network is an end-to-end system. How to use it to train the attributes of nodes and obtain the node attributes after training is the core problem we need to solve. Therefore, we have designed a dedicated graph convolutional neural network. Specifically, given an adjacency matrix $A_{n \times n}$, where $n$ is the number of all nodes (including drugs and diseases), let $\tilde{A} = A + I$, where $I$ is an identity matrix of size $n \times n$. Then, define the attributes of the nodes as $X_{n \times k} = [x_1, x_2, x_3, \cdots, x_n]^T$, in which $k$ is the attribute dimension of all nodes. Finally, the weight matrix $W_{k \times m}$ is initialized randomly, with $m$ equal to 64. The following formula can be obtained: We used this simplified definition of graph convolution in this work.
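As a rough illustration of one graph-convolution step on these quantities, the sketch below assumes the symmetric-normalized propagation of Kipf and Welling (2016), that is ReLU(D̃^{-1/2} Ã D̃^{-1/2} X W), with a randomly initialized W of width m = 64. The exact simplified form used in MGRL may differ. The helper names and the four-node toy graph are illustrative assumptions.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(P: Matrix, Q: Matrix) -> Matrix:
    """Plain dense matrix product P @ Q."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]


def gcn_layer(A: Matrix, X: Matrix, W: Matrix) -> Matrix:
    """One propagation step: ReLU(D~^-1/2 (A + I) D~^-1/2 X W)."""
    n = len(A)
    # Add self-loops so every node keeps its own attributes.
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    degree = [sum(row) for row in A_tilde]   # at least 1 because of self-loops
    A_hat = [[A_tilde[i][j] / math.sqrt(degree[i] * degree[j])
              for j in range(n)] for i in range(n)]
    Z = matmul(matmul(A_hat, X), W)
    return [[max(0.0, z) for z in row] for row in Z]   # ReLU


random.seed(0)
n, k, m = 4, 3, 64            # nodes, attribute dimension, output dimension
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]            # toy path graph, not the real association graph
X = [[random.random() for _ in range(k)] for _ in range(n)]
W = [[random.gauss(0.0, 0.1) for _ in range(m)] for _ in range(k)]
H = gcn_layer(A, X, W)
print(len(H), len(H[0]))      # 4 64
```

Each row of `H` is a 64-dimensional feature vector for one node, which is the shape a downstream classifier would consume.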

Node2vec
Node2vec (Grover and Leskovec, 2016) is a method that learns a continuous feature representation for each node in a network. It maps each node to a low-dimensional feature space while preserving the network neighborhood of the node as much as possible. Node2vec uses a biased random walk to obtain the neighbor sequence of each vertex, effectively combining DFS (Depth First Search) and BFS (Breadth First Search). Assuming that node $v$ is the current vertex, the probability of visiting the next vertex $x$ is:
$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases}$$
where $\pi_{vx}$ is the unnormalized transition probability between vertices $v$ and $x$, $Z$ is a normalizing constant, and $c_i$ is the $i$-th node in the walk, with $c_0 = u$. Two hyperparameters $p$ and $q$ are introduced to control the strategy of the random walk. Assume that the current random walk has reached vertex $v$ after traversing the edge $(t, v)$. The unnormalized transition probability is then set as $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where:
$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p} & \text{if } d_{tx} = 0 \\ 1 & \text{if } d_{tx} = 1 \\ \dfrac{1}{q} & \text{if } d_{tx} = 2 \end{cases}$$
in which $w_{vx}$ is the weight of the edge between vertices $v$ and $x$, and $d_{tx}$ is the shortest path distance between vertex $t$ and vertex $x$.
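The second-order bias can be sketched in a few lines of plain Python. This is a minimal illustration rather than the reference node2vec implementation: the graph is a dictionary of weighted neighbors, and the function names and the toy graph are assumptions made for the example.

```python
import random
from typing import Dict, List

Graph = Dict[str, Dict[str, float]]   # node -> {neighbor: edge weight}


def search_bias(graph: Graph, t: str, x: str, p: float, q: float) -> float:
    """alpha_pq(t, x), based on the distance between previous node t and x."""
    if x == t:               # distance 0: step back to the previous node
        return 1.0 / p
    if x in graph[t]:        # distance 1: x is also a neighbor of t
        return 1.0
    return 1.0 / q           # distance 2: x moves away from t


def transition_probabilities(graph: Graph, t: str, v: str,
                             p: float, q: float) -> Dict[str, float]:
    """Normalized probabilities of moving from v to each neighbor x."""
    unnormalized = {x: search_bias(graph, t, x, p, q) * w
                    for x, w in graph[v].items()}
    Z = sum(unnormalized.values())
    return {x: pi / Z for x, pi in unnormalized.items()}


def biased_walk(graph: Graph, start: str, length: int,
                p: float, q: float, rng: random.Random) -> List[str]:
    """Generate one walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        neighbors = list(graph[v])
        if not neighbors:
            break
        if len(walk) == 1:   # no previous node yet: use edge weights only
            weights = [graph[v][x] for x in neighbors]
        else:
            probs = transition_probabilities(graph, walk[-2], v, p, q)
            weights = [probs[x] for x in neighbors]
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical undirected toy graph with unit edge weights.
toy_graph = {
    "t": {"v": 1.0, "x1": 1.0},
    "v": {"t": 1.0, "x1": 1.0, "x2": 1.0},
    "x1": {"t": 1.0, "v": 1.0},
    "x2": {"v": 1.0},
}
probs = transition_probabilities(toy_graph, "t", "v", p=2.0, q=0.5)
walk = biased_walk(toy_graph, "t", 6, 2.0, 0.5, random.Random(0))
```

With p = 2 and q = 0.5 the unnormalized weights from `v` are 0.5 for returning to `t`, 1 for `x1`, and 2 for `x2`, so the walk favors moving outward (DFS-like behavior). The resulting walks would then be fed to a skip-gram model to obtain the node embeddings.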

Five-Fold Cross-Validation
Cross-validation is a standard way to evaluate the predictive performance of a model, especially how a trained model performs on new data, and it helps detect overfitting. In five-fold cross-validation, the dataset is randomly split into five parts of roughly equal size; each part serves once as the test set while the remaining four are used for training.
For visualization, the receiver operating characteristic (ROC) curve was used to assess our method. The ROC curve of a good classifier lies close to the upper-left corner of the unit square, whereas a curve that follows the diagonal corresponds to random guessing and indicates no predictive ability. The area under the ROC curve (AUC) was used as an evaluation index; the higher the value, the better the classifier separates positive from negative samples. Moreover, the precision-recall (PR) curve was added to evaluate our model, where AUPR is the area under the PR curve; it directly reflects the precision and recall of the learner over the whole sample and guards against misleading results caused by a small number of positive samples. Although the benchmark dataset is stable, we still hope that these evaluation indexes can provide references for later models. The detailed results under five-fold cross-validation are shown in Table 1 and Figure 2. AUC, AUPR, and the other evaluation indexes show that the proposed model has excellent predictive ability.
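The five-fold protocol can be sketched as follows. This is a generic illustration of k-fold index splitting and not the authors' code; the function name, the seed, and the sample count of 23 are assumptions chosen for the example.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle sample indices and return k (train, test) index pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(23, k=5)
for train, test in splits:
    # Every sample appears in exactly one of the two lists per split.
    assert len(train) + len(test) == 23
```

Each sample is used for testing exactly once, so the five test-set scores together cover the whole dataset.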
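The AUC has a simple pairwise interpretation that can be computed directly: it equals the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below uses this definition on made-up labels and scores; it is an illustration and not the evaluation code behind the reported numbers.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of correctly ordered positive-negative pairs."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5      # ties count as half a correct ordering
    return wins / (len(positives) * len(negatives))


# Made-up example: 7 of the 9 positive-negative pairs are ordered correctly.
example_labels = [1, 1, 0, 0, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.1]
example_auc = roc_auc(example_labels, example_scores)
```

The double loop is quadratic in the number of samples, which is acceptable for a small illustration; a sort-based rank computation gives the same value for large datasets.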

Evaluating the Impact of Different Features
To verify the performance differences between features and the advantages of the proposed method, we compared three feature sets: Attribute, Embedding, and GCN+Embedding. Table 2 and Figure 3 show the benefits of the proposed method under different evaluation indexes. The node attribute features alone perform the weakest, possibly because they capture only a single kind of information. Building a multi-graph for node feature extraction has a decisive advantage.

Comparison With Different Classifiers
The performance of a machine learning classifier can vary from field to field. On the dataset of this paper, we compare several machine learning algorithms, including SVM, Logistic Regression, KNN, Gradient Boosting Decision Tree (GBDT), and Random Forest Classifier.
To better reflect the performance of each classifier on the dataset, all of them go through parameter tuning, and the optimal parameters are chosen for comparison. Here, we used an iterative method to find the optimal parameters. Detailed results of five-fold cross-validation based on different classifiers are shown in Table 3 and

Comparison With Other Association Prediction Methods
To conduct a comprehensive analysis of MGRL, we compare it with the most advanced methods. Here, we compare MGRL with TL-HGBI, DeepDR (Zeng et al., 2019), the resource allocation method (Zhou et al., 2010), and DRRS (Luo et al., 2018) on the benchmark dataset under five-fold cross-validation. The resource allocation method predicts unobserved links in a bipartite graph. The results show that our method improves the AUC by 0.1477, 0.0295, 0.0098, and 0.0077 compared with these existing methods, as shown in Figure 5. The proposed method constructs two kinds of node association graphs, trains the self-attributes of the nodes and the features of the association network separately, and significantly improves the prediction ability for the nodes.

Case Study
To evaluate the performance of our model in practical application, we carried out case studies on five drugs: Doxorubicin, Etoposide, Levodopa, Clonidine, and Ciprofloxacin. We ranked the diseases predicted by the model and selected the top 10 candidate diseases for each drug, as shown in Table 4. Specifically, the five drugs are selected from the benchmark dataset, and interactions between each drug and the remaining diseases (excluding the original drug-disease associations) are established. These drug-disease interactions are used as the test set, and MGRL is then used to make the prediction and obtain the corresponding scores. Finally, prior evidence for each drug-disease pair was searched in the database and the literature. For Doxorubicin and Etoposide, all of the top 10 candidates predicted by our model could be confirmed in CTD. For the remaining drugs, only one case for Clonidine, two cases for Levodopa, and three cases for Ciprofloxacin were unconfirmed. The case studies demonstrate that our method can be used as a tool for predicting drug-disease associations, and it can help biomedical specialists improve efficiency in clinical trials.
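The candidate selection step of the case study amounts to ranking unseen diseases by predicted score. A minimal sketch is given below, assuming the classifier has already produced one score per disease for a given drug. The function name, the placeholder disease identifiers, and the scores are hypothetical and are not results from the paper.

```python
from typing import Dict, List, Set, Tuple


def top_k_candidates(scores: Dict[str, float], known: Set[str],
                     k: int = 10) -> List[Tuple[str, float]]:
    """Rank diseases by predicted score, skipping known associations."""
    candidates = [(disease, score) for disease, score in scores.items()
                  if disease not in known]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:k]


# Hypothetical scores for one drug; the identifiers are placeholders.
predicted_scores = {"disease_A": 0.9, "disease_B": 0.2,
                    "disease_C": 0.7, "disease_D": 0.8}
known_associations = {"disease_A"}
ranked = top_k_candidates(predicted_scores, known_associations, k=2)
```

Here `ranked` holds `disease_D` followed by `disease_C`, because `disease_A` is already a known association and is excluded before ranking.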

CONCLUSION
The increasing cost and duration of new drug development make the repositioning of existing drugs using computational methods a significant focus of medical and biological research. In this paper, we proposed a novel method, MGRL, to predict potential drug-disease associations. The proposed MGRL model builds a high-dimensional feature vector through the deep integration of two graph representations of drugs and diseases, which enriches the feature information of the nodes. The two kinds of graph feature vectors are concatenated to obtain the final input feature vectors. In particular, the attributes of the nodes are further trained through the graph convolutional neural network to improve the local characteristics of the nodes. Experiments show that MGRL achieves high-precision prediction of unobserved drug-disease associations and is significantly better than other advanced methods. In future work, we will build a more complex drug-disease interaction network to mine more characteristic information and further improve the predictive ability of our model.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.