Heterogeneous Graph Convolutional Networks and Matrix Completion for miRNA-Disease Association Prediction

Due to the cost and complexity of biological experiments, many computational methods have been proposed to predict potential miRNA-disease associations by utilizing known miRNA-disease associations and other related information. However, these computational methods face several challenges. First, the relationships between miRNAs and diseases are complex, so a computational model should consider both the local and global influence of neighborhoods in the network. Furthermore, predicting related miRNAs for diseases without any known associations is also very important. This study presents a new computational method that constructs a heterogeneous network composed of a miRNA similarity network, a disease similarity network, and a known miRNA-disease association network. The miRNA similarity considers the miRNAs and their possible families and clusters. The information of each node in the heterogeneous network is obtained by aggregating neighborhood information with graph convolutional networks (GCNs), which can pass the information of a node to its intermediate and distant neighbors. Disease-related miRNAs with no known associations can be predicted with the reconstructed heterogeneous matrix. We apply 5-fold cross-validation, leave-one-disease-out cross-validation, and global and local leave-one-out cross-validation to evaluate our method. The corresponding areas under the curves (AUCs) are 0.9616, 0.9946, 0.9656, and 0.9532, indicating that our approach outperforms the compared state-of-the-art methods. Case studies show that this approach can effectively predict related miRNAs for new diseases without any known miRNAs.


INTRODUCTION
MicroRNAs (miRNAs) are a class of short non-coding single-stranded RNA molecules (approximately 22 nt) encoded by endogenous genes (Ambros, 2001). Studies have shown that miRNAs are involved in the emergence and development of various human diseases (Alvarez-Garcia and Miska, 2005; Jopling et al., 2005). Therefore, finding the associations between miRNAs and diseases could contribute to pathological classifications, individualized diagnoses, and disease treatments.
However, experimental methods for identifying associations between miRNAs and diseases are expensive and time-consuming. Therefore, computational methods for revealing potential associations between miRNAs and diseases have drawn wide attention.
Based on the known miRNA-disease associations, a number of computational methods have been proposed to predict candidate miRNAs for diseases. These methods fall into three main categories: network algorithms, machine learning, and matrix-based methods. Jiang et al. (2010) proposed the first computational method, which integrated a miRNA functional similarity network, disease phenotype similarity, and the known disease-miRNA association network and used the hypergeometric distribution, a discrete probability distribution, to predict potential associations. Xuan et al. (2013) developed a model named HDMP, in which the miRNA functional similarity was calculated according to disease terms and the disease phenotype similarity. However, HDMP could not predict candidate miRNAs for new diseases without any known associated miRNAs. Both methods considered only the local neighbor similarity information of each miRNA, so they did not achieve satisfactory performance. To make full use of network information, Chen et al. (2012) developed the global network method RWRMDA, which implemented random walks on a miRNA functional similarity network. However, this model could not address new diseases with no known associated miRNAs. Many other models have incorporated complex interaction networks to represent the relationship between miRNAs and diseases. For example, Mørk et al. (2014) proposed a model of miRNA-protein-disease (miRPD) association prediction with proteins as mediators. The authors verified the associations between miRNAs and diseases by integrating both miRNA-protein and protein-disease associations.
Recently, some machine-learning-based models have also been developed to predict potential miRNA-disease associations. Based on the K-nearest-neighbor approach for miRNAs and diseases, RKNNMDA ranked the K-nearest neighbors with SVMs and utilized weighted voting for each predicted miRNA-disease association. Zhao et al. (2019) developed a novel model of adaptive boosting for miRNA-disease association prediction (ABMDA). They used decision trees as weak classifiers and combined these weak classifiers, each of which could score samples, into a strong classifier based on the corresponding weights.
Based on the information of known miRNA-disease associations and the similarity matrix, an inductive matrix completion algorithm was used to complete the missing entries of a known miRNA-disease association matrix. Li et al. (2017) released a method of matrix completion for an miRNA-disease association prediction model (MCMDA), which updated the adjacency matrix of known miRNA-disease association networks using matrix completion algorithms. Chen et al. (2018) also developed a model of inductive matrix completion for miRNA-disease association prediction (IMCMDA).
The methods of the three categories mentioned above have their own strengths and limitations. Combining the network algorithm, machine learning, and matrix completion, we developed a matrix completion method based on graph convolutional networks for miRNA-disease association prediction. First, we constructed a heterogeneous network by integrating the miRNA similarity network, disease similarity network, and known miRNA-disease associations. Inspired by Wan et al. (2019), we then obtained new node embeddings by aggregating neighborhood information derived from the heterogeneous network based on graph convolutional operations, which can pass the information of a node to its intermediate and distant neighbors. To preserve the topological information of the heterogeneous network as much as possible, the loss function of reconstructing the entire heterogeneous network (matrix) was minimized during the training process. Finally, by comparing the reconstructed and original matrices, we discovered novel miRNA-disease associations. To evaluate the effectiveness of the proposed method, we implemented 5-fold cross-validation, leave-one-disease-out cross-validation (LODOCV), and global and local leave-one-out cross-validation (LOOCV) and obtained AUCs of 0.9616, 0.9946, 0.9656, and 0.9532, respectively. Furthermore, two types of case studies were carried out. As a result, most of the predicted miRNAs were confirmed by related databases.
In conclusion, the proposed method can effectively predict potential miRNA-disease associations.

MiRNA-Disease Network
To construct the known miRNA-disease network, we downloaded the verified miRNA-disease associations from the HMDD database (Li et al., 2014). We used an adjacency matrix RD to describe the network. The element $RD(i, j)$ is 1 if miRNA $m_i$ is associated with disease $d_j$ and 0 otherwise. We obtained 6,441 associations between 577 miRNAs and 336 diseases after duplicates were removed.
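As a minimal sketch of the adjacency matrix described above, the following Python snippet builds RD from a list of (miRNA, disease) pairs and removes duplicates. The helper name `build_rd` and the toy pairs are illustrative assumptions and are not taken from HMDD.

```python
import numpy as np

def build_rd(pairs):
    """Build a binary miRNA-disease adjacency matrix from (miRNA, disease) pairs."""
    unique_pairs = sorted(set(pairs))                 # drop duplicate associations
    mirnas = sorted({m for m, _ in unique_pairs})
    diseases = sorted({d for _, d in unique_pairs})
    m_idx = {m: i for i, m in enumerate(mirnas)}
    d_idx = {d: j for j, d in enumerate(diseases)}
    rd = np.zeros((len(mirnas), len(diseases)), dtype=int)
    for m, d in unique_pairs:
        rd[m_idx[m], d_idx[d]] = 1                    # RD(i, j) = 1 for a known association
    return rd, mirnas, diseases

# Toy input (hypothetical pairs; the last one duplicates the first).
toy_pairs = [
    ("hsa-mir-21", "breast neoplasms"),
    ("hsa-mir-21", "lung neoplasms"),
    ("hsa-mir-155", "breast neoplasms"),
    ("hsa-mir-21", "breast neoplasms"),
]
rd, mirnas, diseases = build_rd(toy_pairs)
```

Rows and columns follow the sorted miRNA and disease names, so the returned name lists are needed to map matrix indices back to entities.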

Disease Functional Similarity Network
Similar diseases have a high probability of being regulated by similar genes. Therefore, we constructed a disease similarity network based on gene functional information. The data can be downloaded from the HumanNet database (Li et al., 2011), which contains an associated log-likelihood score (LLS) for each interaction between two genes or gene sets. The similarity $DS(i, j)$ between diseases $d_i$ and $d_j$ can be calculated as follows: where $S(d_i)$ represents the gene set related to disease $d_i$; $|S(d_i)|$ represents the cardinality of $S(d_i)$; and $LLS(x, S(d_j))$ is the LLS between gene $x$ and gene set $S(d_j)$.
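To make the roles of $S(d_i)$, $|S(d_i)|$, and $LLS(x, S(d_j))$ concrete, the sketch below computes a gene-set-based disease similarity in Python. It rests on three assumptions that are not stated in the text: (i) the similarity sums $LLS(x, S(d_j))$ over $x \in S(d_i)$, adds the symmetric term, and divides by $|S(d_i)| + |S(d_j)|$; (ii) the score between a gene and a gene set is 1 if the gene belongs to the set and otherwise the largest pairwise score; and (iii) pairwise scores are already scaled to [0, 1]. The function names and the toy scores are hypothetical.

```python
def gene_set_score(gene, gene_set, lls):
    """Score between one gene and a gene set.

    Assumption: 1.0 if the gene is in the set, else the largest pairwise
    score between the gene and any member of the set (0.0 if none is known).
    """
    if gene in gene_set:
        return 1.0
    return max((lls.get(frozenset((gene, other)), 0.0) for other in gene_set),
               default=0.0)

def disease_similarity(genes_i, genes_j, lls):
    """Assumed symmetric form; both gene sets must be non-empty."""
    total = sum(gene_set_score(x, genes_j, lls) for x in genes_i)
    total += sum(gene_set_score(y, genes_i, lls) for y in genes_j)
    return total / (len(genes_i) + len(genes_j))

# Toy pairwise scores, keyed by unordered gene pairs (hypothetical values).
toy_lls = {frozenset(("A", "B")): 0.5, frozenset(("B", "C")): 0.2}
ds_12 = disease_similarity({"A", "C"}, {"B", "C"}, toy_lls)
ds_21 = disease_similarity({"B", "C"}, {"A", "C"}, toy_lls)
ds_11 = disease_similarity({"A", "C"}, {"A", "C"}, toy_lls)
```

Under these assumptions the measure is symmetric and equals 1 for identical gene sets.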

MiRNA Similarity Network
FIGURE 1 | Schematic workflow of the proposed method. (A) A miRNA similarity network, disease similarity network, and miRNA-disease association network are used to construct a heterogeneous network. (B) To extract information from the neighborhood, a neighborhood information aggregation operation (Formula 5) is applied on every node. Then, each node updates its feature representation by concatenating its current representation with the aggregated information (Formula 6). (C) A feature matrix is constructed, each row of which is a new node feature representation (Formula 7). Then, the feature matrix is used to reconstruct the heterogeneous network, and topology-preserving learning is implemented by minimizing the reconstructed error (Formula 8).
MiRNA families feature a common sequence or structure configuration in sets of genes (Kaczkowski et al., 2009). A miRNA cluster is a set of two or more miRNAs that are transcribed from physically adjacent miRNA genes. MiRNAs belonging to the same family or cluster are expected to have similar functions and thus be associated with the same diseases. Therefore, we constructed a miRNA similarity network by combining verified miRNA-target associations, family information, cluster information, and verified miRNA-disease associations. First, the verified miRNA-target associations are downloaded from miRTarBase (Hsu et al., 2014). Two miRNAs are connected if they share common targets. The element value of RST (miRNA similarity based on target) represents the number of shared targets between miRNAs. Second, we obtain the family information of miRNAs from miRBase (Griffiths-Jones et al., 2003). If two miRNAs belong to the same miRNA family, we set their RSF (miRNA similarity based on family) value to 1; otherwise, we set it to 0. Third, the miRNA cluster information is accessible in miRBase (Kozomara and Griffiths-Jones, 2014). If two miRNAs belong to the same cluster, then the RSC (miRNA similarity based on cluster) value is set to 1. Finally, we utilize MISIM, a miRNA similarity network based on verified miRNA-disease associations, to define RSD (miRNA similarity based on disease). Once the data are prepared, we combine the four matrices to calculate the similarity $RS(i, j)$ between miRNA $r_i$ and miRNA $r_j$: where α = 0.2, β = 0.1, γ = 0.2, and θ = 0.5 are set as in Zeng et al. (2018).
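The combination of the four matrices can be sketched as a weighted sum. The snippet below assumes that the weights α, β, γ, and θ pair with RST, RSF, RSC, and RSD in the order in which the matrices are introduced, and that the shared-target counts in RST are scaled to [0, 1] by their maximum before mixing. Both points are assumptions, and `combine_mirna_similarity` is a hypothetical helper name.

```python
import numpy as np

def combine_mirna_similarity(rst, rsf, rsc, rsd,
                             alpha=0.2, beta=0.1, gamma=0.2, theta=0.5):
    """Weighted sum of target-, family-, cluster-, and disease-based similarity."""
    rst = np.asarray(rst, dtype=float)
    peak = rst.max()
    # Assumption: scale shared-target counts to [0, 1] so all four terms are comparable.
    rst_scaled = rst / peak if peak > 0 else rst
    return (alpha * rst_scaled
            + beta * np.asarray(rsf, dtype=float)
            + gamma * np.asarray(rsc, dtype=float)
            + theta * np.asarray(rsd, dtype=float))

# Toy 2 x 2 example (hypothetical values).
rst = [[0, 4], [4, 0]]        # number of shared targets
rsf = [[0, 1], [1, 0]]        # same family
rsc = [[0, 0], [0, 0]]        # not in the same cluster
rsd = [[0, 0.6], [0.6, 0]]    # disease-based similarity
rs = combine_mirna_similarity(rst, rsf, rsc, rsd)
```

With these toy inputs, the off-diagonal entry is 0.2 · 1 + 0.1 · 1 + 0.2 · 0 + 0.5 · 0.6 = 0.6.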

Heterogeneous Network Construction
As shown in Figure 1, we constructed a heterogeneous network based on the miRNA similarity network RS, disease similarity network DS, and miRNA-disease network RD. The heterogeneous network can be represented as $G = (N, E)$ (3), where $N$ is the node set, which contains two kinds of nodes, $N_T$ = {miRNA, disease}, and $E$ is the edge set, which contains three kinds of edges, $E_T$ = {miRNA-miRNA, miRNA-disease, disease-disease}. The three kinds of edges and their weights are given by the miRNA similarity network, miRNA-disease network, and disease functional similarity network, respectively. For $s \in E_T$ and network $A_s \in \{RS, RD, DS\}$, normalization is first implemented before further processing as follows: where $Col(A_s)$ is the size of the column dimension of $A_s$ and $A_s(i, j)$ is the element in the $i$-th row and $j$-th column.
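A minimal sketch of this normalization step, assuming that it divides each entry of $A_s$ by the sum of its row taken over the $Col(A_s)$ columns (row normalization). This form is an assumption, as is the helper name `row_normalize`; rows without any edge are left as zeros to avoid division by zero.

```python
import numpy as np

def row_normalize(a):
    """Divide every row by its sum; all-zero rows stay zero."""
    a = np.asarray(a, dtype=float)
    row_sums = a.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)   # guard isolated nodes
    return a / safe_sums

# Toy matrix: the second node has no neighbors of this edge type.
a_norm = row_normalize([[1, 3], [0, 0]])
```

After this step, the weights of each node's edges of a given type sum to 1, so nodes with many neighbors do not dominate the aggregation.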

Neighborhood Information Aggregation
To take full advantage of the heterogeneous network information, we adopted the neighborhood information aggregation strategy. First, an initial node embedding function $f: N \rightarrow \mathbb{R}^d$ maps each node $u$ to its $d$-dimensional vector representation $f(u)$. In our experiment, $d$ is equal to 1024, and $f$ is a function that outputs random values from a truncated normal distribution. Then, the neighborhood information aggregation can be defined as: where $W_s \in \mathbb{R}^{d \times d}$ and $b_s \in \mathbb{R}^d$ are the parameters trained in the neural network. In addition, $\sigma(\cdot)$ is the activation function in the neural network, and we used the ReLU function here. Based on the graph convolutional operation, we pass the information of a node to its intermediate and distant neighbors and therefore realize the implicit influence among nodes on the network level.
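The aggregation can be illustrated with a small NumPy sketch. It assumes that the aggregated vector of a node is the edge-weighted sum, over every edge type and every neighbor $v$, of $\sigma(W_s f(v) + b_s)$, with the normalized matrices supplying the edge weights. This form and all names below are assumptions for illustration; the toy dimension is 2 rather than 1024, and fixed values replace the random initial embeddings.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def aggregate_edge_type(adj, neighbor_emb, w, b):
    """Edge-weighted sum of transformed neighbor embeddings for one edge type.

    adj:          (num_nodes, num_neighbors) normalized edge weights
    neighbor_emb: (num_neighbors, d) current embeddings f(v), one row per neighbor
    w, b:         (d, d) weights and (d,) bias of this edge type
    """
    return adj @ relu(neighbor_emb @ w.T + b)

d = 2
f_mirna = np.array([[1.0, -1.0], [3.0, 2.0]])    # two miRNA nodes
f_disease = np.array([[2.0, 0.0]])               # one disease node
rs_norm = np.array([[0.5, 0.5], [0.5, 0.5]])     # miRNA-miRNA weights
rd_norm = np.array([[1.0], [0.0]])               # miRNA-disease weights
w_rr, b_rr = np.eye(d), np.zeros(d)              # identity weights keep the toy readable
w_rd, b_rd = np.eye(d), np.zeros(d)

# Aggregated information for the miRNA nodes: sum over both edge types.
a_mirna = (aggregate_edge_type(rs_norm, f_mirna, w_rr, b_rr)
           + aggregate_edge_type(rd_norm, f_disease, w_rd, b_rd))
```

Disease nodes would be handled in the same way, using the disease similarity matrix and the transpose of the miRNA-disease matrix.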

Updating the Node Embedding
After obtaining the aggregated neighbor information $a_u$, the process of updating the node embedding can be defined as: where $f^1(u)$ is the new node embedding, $W_1 \in \mathbb{R}^{d \times 2d}$ is the weight matrix, $b_1 \in \mathbb{R}^d$ is the bias, and $\|\cdot\|_2$ is the $l_2$ norm.
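A sketch of the update step, assuming that the current embedding and the aggregated vector are concatenated (in that order), passed through a linear layer with ReLU, and then divided by the $l_2$ norm of the result. The concatenation order, the use of ReLU in this step, and the function names are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def update_embedding(f, a, w1, b1):
    """Return l2-normalized embeddings computed from [f(u); a_u].

    f, a: (num_nodes, d) current embeddings and aggregated information
    w1:   (d, 2d) weight matrix, b1: (d,) bias
    """
    hidden = relu(np.concatenate([f, a], axis=1) @ w1.T + b1)
    norms = np.linalg.norm(hidden, axis=1, keepdims=True)
    return hidden / np.where(norms == 0, 1.0, norms)   # leave all-zero rows unchanged

# Toy example with d = 2: W1 selects the first entry of f and the last entry of a.
f = np.array([[3.0, 0.0]])
a = np.array([[0.0, 4.0]])
w1 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 1.0]])
b1 = np.zeros(2)
f_new = update_embedding(f, a, w1, b1)
```

Here the hidden vector is (3, 4), so the normalized embedding is (0.6, 0.8) with unit length.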

Topology-Preserving Learning
Considering the equal importance of preserving the known miRNA similarity (RS), disease similarity (DS), and miRNA-disease association (RD), we share all the parameters among these three subnetworks and minimize the loss function of reconstructing the entire heterogeneous network during the training process, as shown in Figure 1C. First, we use $RF \in \mathbb{R}^{m \times d}$ and $DF \in \mathbb{R}^{n \times d}$ to represent the feature matrices of miRNAs and diseases, respectively, where each row of a feature matrix represents a new node embedding $f^1(u)$, $m$ is the number of miRNA nodes, $n$ is the number of disease nodes, and $d$ is the dimension of the new node embedding. Then, topology-preserving learning of the node embedding can be defined as: where $P \in \mathbb{R}^{d \times k}$ and $H \in \mathbb{R}^{k \times d}$ are projection matrices used to extract the principal features from node representations, and $G$ is the graph constructed in Equation (3). We set $k$ to 512 in our experiment. The unknown parameters can be trained in an end-to-end manner by performing gradient descent to minimize the total squared reconstruction error. In the training phase, we iterate for 2,000 epochs to establish the optimal parameters with the minimum error value.
Frontiers in Bioengineering and Biotechnology | www.frontiersin.org
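The reconstruction objective can be sketched as follows, assuming that each subnetwork is reconstructed as a bilinear product of the two feature matrices through the shared projections, for example $RD \approx RF \cdot P \cdot H \cdot DF^T$, and that the loss is the sum of squared errors over RS, DS, RD, and the transposed disease-miRNA block. The dimensions are consistent with $P \in \mathbb{R}^{d \times k}$ and $H \in \mathbb{R}^{k \times d}$, but the exact form is an assumption, as are the function names. The snippet only evaluates the loss; minimizing it by gradient descent would normally be delegated to an automatic differentiation framework.

```python
import numpy as np

def reconstruct(left, right, p, h):
    """Bilinear reconstruction left @ P @ H @ right^T of one subnetwork."""
    return left @ p @ h @ right.T

def reconstruction_loss(rf, df, p, h, rs, ds, rd):
    """Total squared reconstruction error over all subnetworks (shared P and H)."""
    loss = np.sum((rs - reconstruct(rf, rf, p, h)) ** 2)      # miRNA-miRNA
    loss += np.sum((ds - reconstruct(df, df, p, h)) ** 2)     # disease-disease
    loss += np.sum((rd - reconstruct(rf, df, p, h)) ** 2)     # miRNA-disease
    loss += np.sum((rd.T - reconstruct(df, rf, p, h)) ** 2)   # disease-miRNA
    return float(loss)

# Toy setting with d = k = 2 in which the reconstruction is exact.
rf = np.eye(2)                   # two miRNA embeddings
df = np.array([[1.0, 0.0]])      # one disease embedding
p, h = np.eye(2), np.eye(2)
rs, ds = rf @ rf.T, df @ df.T
rd = rf @ df.T
loss_exact = reconstruction_loss(rf, df, p, h, rs, ds, rd)

# Changing one miRNA-disease entry adds one unit of error in each direction.
rd_perturbed = rd.copy()
rd_perturbed[1, 0] = 1.0
loss_perturbed = reconstruction_loss(rf, df, p, h, rs, ds, rd_perturbed)
```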

Interaction Probability Between MiRNA and Disease
Finally, the predicted interaction probability between miRNAs and diseases can be obtained from the reconstructed heterogeneous network as follows:
$$RD_{\mathrm{predicted}} = \frac{RD' + {DR'}^{T}}{2} \quad (10)$$
By comparing the reconstructed $RD_{\mathrm{predicted}}$ and the original RD matrix, we can discover potential miRNA-disease associations. The prediction procedure is summarized in Algorithm 1. The code and data can be obtained online.
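Equation (10) and the comparison with the original RD matrix can be sketched in a few lines of NumPy. The function names and the toy matrices are hypothetical; `rd_rec` and `dr_rec` stand for the reconstructed miRNA-disease and disease-miRNA blocks.

```python
import numpy as np

def predict_scores(rd_rec, dr_rec):
    """Average the reconstructed miRNA-disease block with the transposed disease-miRNA block."""
    return (rd_rec + dr_rec.T) / 2.0

def rank_candidates(scores, rd, disease_idx):
    """Return miRNA indices with no known association to the disease, best score first."""
    candidates = np.flatnonzero(rd[:, disease_idx] == 0)
    order = np.argsort(-scores[candidates, disease_idx], kind="stable")
    return candidates[order].tolist()

# Toy example with 3 miRNAs and 2 diseases (hypothetical values).
rd = np.array([[1, 0], [0, 1], [0, 0]])
rd_rec = np.array([[0.9, 0.2], [0.4, 0.8], [0.1, 0.6]])
dr_rec = np.array([[0.7, 0.2, 0.3], [0.4, 0.6, 0.2]])
scores = predict_scores(rd_rec, dr_rec)
ranking_disease_0 = rank_candidates(scores, rd, 0)
ranking_disease_1 = rank_candidates(scores, rd, 1)
```

Only entries that are 0 in the original RD matrix are ranked, since known associations are not new predictions.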

Baseline Methods
We choose the following state-of-the-art methods as our baseline methods:
5) ... applies label propagation on the similarity networks to obtain relevant scores.
6) Path-based computational model for miRNA-disease association prediction (PBMDA): You et al. (2017) constructed a heterogeneous graph consisting of three interlinked subgraphs and further adopted a depth-first search algorithm to infer potential miRNA-disease associations.
7) Heterogeneous graph inference for miRNA-disease association prediction (HGIMDA): Chen et al. (2016) developed the computational model of HGIMDA to uncover potential miRNA-disease associations by integrating miRNA functional similarity, disease semantic similarity, Gaussian interaction profile kernel similarity, and experimentally verified miRNA-disease associations into a heterogeneous graph. HGIMDA adopts an iterative process to find the optimal solutions based on global network similarity information, which leads to superior performance over local network similarity-based methods.

Performance Evaluation
Considering the uniqueness and limited number of available miRNA and disease samples, we implemented LOOCV, LODOCV, and 5-fold cross-validation to evaluate the performance of our method (Jiao and Du, 2016). In each framework, we selected 5 state-of-the-art baseline models and plotted the receiver operating characteristic (ROC) curves of our method and the selected methods by calculating the false-positive rate (FPR) and true-positive rate (TPR) at varying thresholds. LOOCV is conducted in two different ways: global and local LOOCV. In the framework of global LOOCV, each known miRNA-disease association is left out in turn as a test sample, and the other known associations are regarded as training samples. All the unknown associations in the original RD matrix are candidate samples. We ranked the predicted interaction scores of the test sample and the candidate samples. If the ranking of the test sample was higher than a threshold for a given true-positive rate (TPR), it was marked as positive. In the framework of local LOOCV, only the unknown associations of a specific disease are ranked with the test sample.
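The difference between global and local LOOCV lies in which candidate samples the test sample is ranked against. The sketch below computes that rank from a score matrix. It assumes that the scores come from a model trained with the test association masked (retraining is omitted), and it counts ties against the test sample, which is one possible convention. The function name and toy values are hypothetical.

```python
import numpy as np

def loocv_rank(scores, rd, test_pair, local=False):
    """Rank (1 = best) of a left-out association among the candidate samples.

    scores:    (m, n) predicted scores
    rd:        (m, n) original association matrix; the test pair is 1 in rd,
               so it is never counted among the candidates
    test_pair: (miRNA index, disease index) of the left-out association
    local:     rank only against unknown associations of the same disease
    """
    i, j = test_pair
    if local:
        candidate_scores = scores[rd[:, j] == 0, j]
    else:
        candidate_scores = scores[rd == 0]
    return int(np.sum(candidate_scores >= scores[i, j])) + 1

# Toy example with 3 miRNAs and 2 diseases (hypothetical values).
rd = np.array([[1, 0], [0, 0], [0, 1]])
scores = np.array([[0.90, 0.80], [0.95, 0.10], [0.30, 0.85]])
global_rank = loocv_rank(scores, rd, (2, 1))
local_rank = loocv_rank(scores, rd, (2, 1), local=True)
```

In this toy case, a high-scoring candidate of the other disease pushes the test sample to rank 2 globally, while it ranks first among the candidates of its own disease.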
In 5-fold cross-validation, all the known miRNA-disease associations were randomly divided into five subsets. Each subset was taken as test samples in turn, and the others were considered training samples. All unknown miRNA-disease associations were considered candidate samples.
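The random division into five subsets can be sketched with the Python standard library. The generator name, the fixed seed, and the toy pairs are illustrative assumptions.

```python
import random

def five_fold_splits(known_pairs, n_folds=5, seed=0):
    """Yield (train, test) lists of known associations for each fold."""
    pairs = list(known_pairs)
    random.Random(seed).shuffle(pairs)                    # random division
    folds = [pairs[k::n_folds] for k in range(n_folds)]   # near-equal subsets
    for k in range(n_folds):
        test = folds[k]
        train = [p for idx, fold in enumerate(folds) if idx != k for p in fold]
        yield train, test

# Toy input: 10 hypothetical (miRNA index, disease index) pairs.
toy_pairs = [(i, i % 3) for i in range(10)]
splits = list(five_fold_splits(toy_pairs))
```

In each fold, the test associations would be set to 0 in the training matrix before the network is trained, and all unknown associations serve as candidate samples.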
To further test the performance of our method in predicting associations for diseases without any known related miRNAs, we adopted LODOCV (Fu and Peng, 2017). In this framework, all the known miRNAs associated with a given disease were regarded as test samples.
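The LODOCV setup amounts to hiding every known association of one disease before training. A minimal sketch, with a hypothetical helper name and toy matrix:

```python
import numpy as np

def mask_disease(rd, disease_idx):
    """Hide all known associations of one disease.

    Returns the training matrix (column set to 0) and the indices of the
    miRNAs whose associations with that disease become the test samples.
    """
    test_mirnas = np.flatnonzero(rd[:, disease_idx] == 1)
    train_rd = rd.copy()
    train_rd[:, disease_idx] = 0
    return train_rd, test_mirnas

# Toy example with 3 miRNAs and 2 diseases (hypothetical values).
rd = np.array([[1, 0], [1, 1], [0, 1]])
train_rd, test_mirnas = mask_disease(rd, 0)
```

The masked disease then behaves like a new disease without any known related miRNAs, so its predictions depend only on the similarity networks and the associations of other diseases.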
The area under the curve (AUC) was then calculated to evaluate the performance of our method. As a result, our method obtained AUCs of 0.9656, 0.9532, and 0.9616 in global LOOCV, local LOOCV, and 5-fold cross-validation, respectively, as shown in Figure 2. Our method outperformed the baseline methods. For LODOCV, our method achieved the highest AUC value of 0.9946, which indicates that our method can effectively predict new associations between miRNAs and diseases. We also note that the AUC value of LODOCV was much higher than that of LOOCV. The reason may be that the test samples of LODOCV are from the known miRNA-disease associations, the predicted interaction scores of which can be higher than those of the original unknown associations.
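For reference, the AUC can be computed without plotting the ROC curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample (ties count as one half). The sketch below uses this pairwise definition; the function name and toy values are hypothetical, and this is not necessarily how the reported AUCs were computed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one positive (0.4) is scored below one negative (0.5).
auc_mixed = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
auc_perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.5, 0.1])
```

The pairwise form is quadratic in the number of samples, so a rank-based implementation is preferable for the full set of candidate associations.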

Case Studies
Two types of case studies were conducted to further validate the performance of the proposed method for novel miRNA-disease association prediction.
For the first type of case study, we applied the proposed method to predict novel miRNA-disease associations for three common human diseases (breast neoplasms, lung neoplasms, and prostate neoplasms) based on the known associations from HMDD. For a specific disease, known associations of all diseases were regarded as training samples, and unknown associations with this disease were regarded as candidate samples. After training the network, we ranked the prediction scores of the candidate associations and selected the top 30 candidate associations with this disease. The prediction results were then verified by two databases: dbDEMC v2.0 (Yang et al., 2017) and PhenomiR (Ruepp et al., 2010). As a result, 28 out of the top 30 miRNAs were verified to be associated with breast neoplasms (Table 1), 27 out of the top 30 miRNAs were verified to be associated with lung neoplasms (Table 2), and 27 out of the top 30 miRNAs were verified to be associated with prostate neoplasms (Table 3). The results show that our method can effectively predict potential miRNA-disease associations.
In the second case study, we evaluated the ability of the proposed method to predict new associations for diseases without any known related miRNAs. We selected pancreatic neoplasms as an example in this case study. First, we set the known associations of pancreatic neoplasms as unknown associations, and all miRNAs were considered candidate miRNAs. Then, we implemented our method to obtain the prediction scores of these candidate miRNAs associated with pancreatic neoplasms. We found that 50 out of the top 50 miRNAs were confirmed by at least one of the two databases, dbDEMC v2.0 and PhenomiR v2.0 (Table 4). The results demonstrate that our method can be applied to predict potential associations for diseases without any known related miRNAs.

DISCUSSION
In this paper, we propose a novel method to predict potential associations between miRNAs and diseases. The method constructs a heterogeneous network composed of the miRNA similarity network, disease similarity network, and known miRNA-disease association network. The miRNA similarity depends on the miRNAs and their possible families and clusters. The information of each node in this network is obtained by aggregating neighborhood information through graph convolutional networks. We compared the method with several state-of-the-art baseline methods. The method performed well in four types of cross-validation. Furthermore, two types of case studies were implemented. The results demonstrate that our proposed method is powerful in discovering potential disease-related miRNAs. In addition, the method can be used to predict the related miRNAs of diseases without any known association.
The reliable performance of the proposed method is due mainly to the following several important factors. First, we integrated useful datasets to construct a heterogeneous network. Second, the method made full use of the available information by aggregating neighborhood information derived from the heterogeneous network. Third, the parameters of the neural network were learned by minimizing the error of reconstructing the whole heterogeneous network, rather than that of just the miRNA-disease network.
However, there are still some limitations in our method. First, the datasets we used to construct the network possibly contain noise and outliers. Second, the heterogeneous network we constructed was insufficient to represent the complex relationships between miRNAs and diseases. Thus, our future research will focus on the diverse relationships between miRNAs and diseases.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
RZ, YC, and HW conceived of the presented idea. RZ carried out the experiment and wrote the draft. CJ and YW helped shape the research, analysis, and manuscript. All authors discussed the results and contributed to the final manuscript.