SNFIMCMDA: Similarity Network Fusion and Inductive Matrix Completion for miRNA–Disease Association Prediction

MicroRNAs (miRNAs), a class of non-coding RNAs, have been shown to be closely associated with many complex biological processes and human diseases. In this study, we propose a novel model, Similarity Network Fusion and Inductive Matrix Completion for miRNA-Disease Association Prediction (SNFIMCMDA). We applied the inductive matrix completion (IMC) method to predict possible associations between miRNAs and diseases together with their association scores. IMC was performed on the verified miRNA-disease associations, miRNA similarity, and disease similarity. The miRNA similarity and disease similarity were calculated by similarity network fusion, which can integrate multiple data types into a single similarity network. We integrated miRNA functional similarity and Gaussian interaction profile kernel similarity by similarity network fusion to obtain miRNA similarity, and disease similarity was integrated in the same way. To evaluate the effectiveness of SNFIMCMDA, we applied both global leave-one-out cross-validation and five-fold cross-validation. Furthermore, case studies on three important human diseases were carried out to further assess SNFIMCMDA. The results demonstrated that SNFIMCMDA was effective for predicting possible miRNA-disease associations.


INTRODUCTION
MicroRNAs (miRNAs) are small non-coding RNAs that regulate the expression of their mRNA targets through RNA cleavage or translational repression (Ambros, 2004; Bartel, 2004; Ambros, 2001). In recent years, researchers have discovered a wide variety of miRNAs in many living organisms (Bruce et al., 1993; Calin and Croce, 2006). Because miRNAs control the expression of a large number of target genes, the miRNA pathway is an important mechanism of gene expression control (Xu et al., 2004; Miska, 2005; Bartel, 2009). The dysregulation of miRNAs contributes to the progression of various diseases and to developmental defects (Meola et al., 2009). Hence, identifying disease-associated miRNAs helps in understanding the genetic causes and consequences of complex diseases. Over the past few years, traditional experiments have confirmed a large number of miRNA-disease associations (Thomson et al., 2007; Mohammadi-Yeganeh et al., 2013). Experimental methods such as polymerase chain reaction can reveal the relationship between a miRNA and a disease, but they are time-consuming and costly. Thus, more efficient methods are needed to reveal the many still-unknown relationships between miRNAs and diseases. Researchers have therefore worked to develop effective and accurate prediction methods that can guide future biological experiments toward valid miRNA-disease associations (Han et al., 2014).
In recent years, many computation-based algorithms and methods have been developed to predict possible relationships between miRNAs and diseases (You et al., 2017; Chen et al., 2018a). Based on the assumption that miRNAs with similar functions are likely to be related to phenotypically similar diseases and vice versa (Zeng et al., 2015), Jiang et al. (2010) established a model that identified potential miRNA-disease associations by using the hypergeometric distribution. However, the model had the disadvantage that it only used local similarities between two miRNAs with a large number of shared target genes. In addition, Mørk et al. (2014) constructed the miRPD model to infer miRNA-protein associations and disease-protein associations, which were then used to predict possible relationships between miRNAs and diseases. The Jaccard similarity was first introduced by Chen et al. (2018) in the BLHARMDA model to identify possible miRNA-disease associations. BLHARMDA also introduced the k-nearest neighbors (KNN) method into the bipartite local model.
Global network similarity measures are generally more reliable than local network similarity measures (Köhler et al., 2008; Zhang et al., 2014). Considering this, Chen et al. (2012) constructed the RWRMDA model to infer unknown miRNA-disease associations. RWRMDA showed that global network similarity was more effective than local network similarity at finding potential relationships between miRNAs and diseases, and it therefore outperformed previous local network-based methods. However, RWRMDA was unsuitable for new diseases without any known associated miRNAs. Many researchers have proposed random walk methods to address this problem. Liu et al. (2016) put forward a new method that used the random walk algorithm to find disease-associated miRNAs. The method constructed a heterogeneous graph by integrating various similarities of diseases and miRNAs and then applied random walk with restart on that graph to find unknown miRNA-disease associations. Luo and Xiao (2017) established a heterogeneous network made up of miRNA similarity, disease semantic similarity, and verified miRNA-disease associations. Unlike Liu et al. (2016), they applied an imbalanced bi-random walk to find diseases related to miRNAs. Furthermore, Chen et al. (2016b) presented WBSMDA, which integrated several kinds of miRNA similarity and of disease similarity, respectively, and could also reliably predict possible miRNA-disease associations. Another model, HGIMDA, was also presented by Chen et al. (2016a). In HGIMDA, a heterogeneous graph was generated by combining the verified miRNA-disease association network with the processed similarity networks of miRNAs and diseases.
An iterative equation was used in the model to predict potential miRNA-disease associations accurately. HGIMDA performed better than previous methods, but the choice of its parameters was still not well resolved. To infer reasonable miRNA-disease relationships, Yu et al. (2017) proposed a method that adapted the existing maximum information flow approach and combined miRNA functional similarity with disease semantic similarity and disease phenotypic similarity. Both verified and unknown miRNA-disease associations were incorporated into a phenome-miRNAome network in this method. The NCMCMDA model (Chen et al., 2020) integrated a neighborhood constraint with the matrix completion algorithm to turn the recovery task into an optimization problem, and applied the fast iterative shrinkage-thresholding algorithm to recover missing interactions between miRNAs and diseases.
Recently, a considerable number of machine-learning-based models have been applied to uncover potential miRNA-disease relationships. Xu et al. (2014) introduced a method that prioritized novel disease-related miRNAs based on the miRNA target-dysregulated network (MTDN). In this model, a support vector machine (SVM) classifier was built on features extracted from the network topology to distinguish positive miRNA-disease associations from negative ones. However, because negative samples are hard to obtain, the negative sample sets were usually formed by removing the positive pairs from all miRNA-disease pairs. In addition, Chen and Yan (2014) constructed the RLSMDA model to infer potential disease-related miRNAs. RLSMDA calculated association scores for miRNA-disease pairs and could therefore provide prediction scores for new diseases. Unlike MTDN, RLSMDA did not need negative miRNA-disease associations, which could improve efficiency and give more accurate results. The RBMMMDA method (Chen et al., 2015) was based on the restricted Boltzmann machine. RBMMMDA used a two-layer undirected graph, consisting of a visible layer and a hidden layer, to represent miRNA-disease relationships, and could obtain new miRNA-disease associations with corresponding scores. Furthermore, another model named RKNNMDA applied the KNN method to miRNAs and diseases. A support vector machine ranking model was then used to re-rank the neighbors found by KNN, and the final ranking of candidate miRNA-disease associations was obtained by weighted voting. The disadvantage of this model was that it was biased toward miRNAs with more known associated diseases.
The BHCMDA model (Zhu et al., 2020) used the biased heat conduction (BHC) algorithm to predict unknown associations between miRNAs and diseases by combining the miRNA similarity matrix, the disease similarity matrix, and the miRNA-disease association matrix. The probabilistic matrix factorization (PMF) algorithm was used in the IMIPMF model (Ha et al., 2020) to infer potential miRNA-disease interactions. PMF is widely used in recommender systems, so it could make effective use of all available information to recommend miRNAs that are strongly associated with a given disease.
Because previous models have several limitations, we constructed a new model, Similarity Network Fusion and Inductive Matrix Completion for miRNA-Disease Association Prediction (SNFIMCMDA). We used similarity network fusion (SNF) to obtain miRNA similarity by integrating miRNA functional similarity and Gaussian interaction profile (GIP) kernel similarity of miRNAs. In the same way, we obtained disease similarity by integrating disease semantic similarity and GIP kernel similarity of diseases. After collecting the data and integrating the similarities of miRNAs and diseases, we used the inductive matrix completion (IMC) method to predict possible miRNA-disease associations. Global leave-one-out cross-validation and five-fold cross-validation were carried out to evaluate the effectiveness of SNFIMCMDA. Furthermore, colon neoplasms, lung neoplasms, and breast neoplasms were used as case studies. Of the top 50 miRNAs inferred by SNFIMCMDA for these three diseases, 44, 43, and 43, respectively, were validated according to the HMDD v3.2 (Huang et al., 2019) and dbDEMC v2.0 (Zhen et al., 2017) databases. Experimental results showed that our model was effective and reliable for predicting possible miRNA-disease associations.

Human miRNA-Disease Associations
In this article, we downloaded the verified miRNA-disease association data from the HMDD v2.0 database. The data contain 5,430 experimentally verified miRNA-disease associations. We defined an adjacency matrix $A \in \mathbb{R}^{nd \times nm}$ to describe the verified associations, where $nd$ is the number of diseases and $nm$ is the number of miRNAs. The element $A(i, j)$ is equal to 1 if disease $d_i$ is validated to be related to miRNA $m_j$, and 0 otherwise.
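As a minimal sketch of how such an adjacency matrix could be assembled, the Python snippet below builds $A$ from a list of (disease, miRNA) pairs. The helper name `build_adjacency` and the placeholder identifiers `d1`, `m1`, and so on are hypothetical and do not come from HMDD.

```python
def build_adjacency(pairs, diseases, mirnas):
    """Return the nd x nm 0/1 matrix A with A[i][j] = 1 for each known pair."""
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    A = [[0] * len(mirnas) for _ in diseases]
    for disease, mirna in pairs:
        A[d_index[disease]][m_index[mirna]] = 1
    return A


# Hypothetical toy input: 2 diseases, 3 miRNAs, 3 verified associations.
diseases = ["d1", "d2"]
mirnas = ["m1", "m2", "m3"]
pairs = [("d1", "m1"), ("d2", "m1"), ("d2", "m3")]

A = build_adjacency(pairs, diseases, mirnas)
```

Rows index diseases and columns index miRNAs, matching the $nd \times nm$ orientation used in the text.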

miRNA Functional Similarity
If two miRNAs have similar functions, they are likely to be related to similar diseases, and vice versa (Cui, 2010; Goh et al., 2007). The miRNA functional similarity is derived from this assumption. We downloaded the miRNA functional similarity data from http://www.cuilab.cn/files/images/cuilab/misim.zip. We used the matrix $MF$ to represent the miRNA functional similarity, where the element $MF(m_i, m_j)$ is the similarity between miRNA $m_i$ and miRNA $m_j$.

Disease Semantic Similarity
A directed acyclic graph (DAG) based on MeSH descriptors (Lipscomb, 2000) can be used to describe diseases. The DAG of disease $D$ includes two parts: nodes and edges. The nodes represent $D$ itself and the ancestor nodes of $D$. The edges connect child nodes directly to their parent nodes. We write $DAG(D) = (D, T(D), E(D))$ for the DAG of disease $D$, where $T(D)$ and $E(D)$ denote the node set and edge set, respectively. The semantic score of disease $D$ is calculated according to the following equation:

$$DV(D) = \sum_{d \in T(D)} D_D(d)$$

where the contribution score of disease $d$ to $D$ is obtained by the following formula:

$$D_D(d) = \begin{cases} 1, & d = D \\ \max\{\Delta \cdot D_D(d') \mid d' \in \text{children of } d\}, & d \neq D \end{cases}$$

Here, the semantic contribution factor $\Delta = 0.5$ in our article, based on previous literature (Xuan et al., 2013).
The semantic similarity score between disease $d_i$ and disease $d_j$ is calculated as follows:

$$DS(d_i, d_j) = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left( D_{d_i}(t) + D_{d_j}(t) \right)}{DV(d_i) + DV(d_j)}$$
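The calculation can be illustrated with a short Python sketch. It assumes the widely used recursive definition in which a disease contributes 1 to itself and each ancestor contributes the factor 0.5 times the largest contribution among its children; the four-node hierarchy and the function names are hypothetical and are not taken from MeSH.

```python
DELTA = 0.5  # semantic contribution factor stated in the text


def contributions(disease, parents):
    """Map every node t in T(disease) to its contribution to `disease`."""
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            value = DELTA * contrib[node]
            # Keep the largest contribution reaching this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib


def semantic_similarity(d_i, d_j, parents):
    """Shared-ancestor contributions divided by the two semantic scores."""
    c_i = contributions(d_i, parents)
    c_j = contributions(d_j, parents)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))


# Hypothetical toy hierarchy: child -> list of parents.
parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}

sim_bc = semantic_similarity("B", "C", parents)
sim_db = semantic_similarity("D", "B", parents)
```

In this toy hierarchy, B and C share only the root A, whereas D shares both B and A with B, so D is scored as more similar to B than C is.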

Gaussian Interaction Profile Kernel Similarity
If two miRNAs have similar functions, they are likely to be related to the same or similar diseases, and vice versa (Lu et al., 2008; Sanghamitra et al., 2010). Therefore, the GIP kernel similarity can be used to represent miRNA similarity and disease similarity (Chen et al., 2016; Cheng et al., 2017). First, the interaction profile of disease $d_i$ is the binary vector $K(d_i)$ that records whether $d_i$ has a known association with each miRNA, that is, the $i$-th row of $A$. The interaction profile $K(m_i)$ of miRNA $m_i$ is defined in a similar way as the $i$-th column of $A$. Then, the GIP kernel similarities of diseases and miRNAs are calculated as follows:

$$GKD(d_i, d_j) = \exp\left(-\rho_d \left\| K(d_i) - K(d_j) \right\|^2\right)$$

$$GKM(m_i, m_j) = \exp\left(-\rho_m \left\| K(m_i) - K(m_j) \right\|^2\right)$$

where $GKD$ and $GKM$ represent the GIP kernel similarity of diseases and miRNAs, respectively, and $\rho_d$ and $\rho_m$ regulate the kernel bandwidths. $\rho_d$ is calculated by normalizing the original bandwidth $\rho'_d$:

$$\rho_d = \rho'_d \Big/ \left( \frac{1}{nd} \sum_{i=1}^{nd} \left\| K(d_i) \right\|^2 \right)$$

$\rho_m$ is obtained in a similar way:

$$\rho_m = \rho'_m \Big/ \left( \frac{1}{nm} \sum_{i=1}^{nm} \left\| K(m_i) \right\|^2 \right)$$
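A small Python sketch of the GIP kernel is given below. It assumes the usual Gaussian form, in which the bandwidth is the original bandwidth divided by the mean squared norm of the interaction profiles; the original bandwidth of 1 and the toy matrix are assumptions made for illustration.

```python
import math


def gip_kernel(profiles, raw_bandwidth=1.0):
    """Gaussian interaction profile kernel between all rows of `profiles`.

    raw_bandwidth is the un-normalised bandwidth; 1.0 is an assumed value.
    """
    n = len(profiles)
    mean_sq_norm = sum(x * x for row in profiles for x in row) / n
    rho = raw_bandwidth / mean_sq_norm  # normalised bandwidth

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [[math.exp(-rho * sq_dist(profiles[i], profiles[j]))
             for j in range(n)] for i in range(n)]


# Hypothetical association matrix: rows are diseases, columns are miRNAs.
A = [[1, 0, 0],
     [1, 0, 1]]

GKD = gip_kernel(A)                              # disease profiles = rows
GKM = gip_kernel([list(c) for c in zip(*A)])     # miRNA profiles = columns
```

The function fails with a division by zero if no association is known at all, which a real implementation would need to guard against.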

Similarity Network Fusion to Integrate Similarity
The similarity between miRNAs is described by both the functional similarity and the GIP kernel similarity of miRNAs. Similarly, the similarity between diseases is described by both the semantic similarity and the GIP kernel similarity of diseases. In this section, we introduce the SNF method to obtain the final similarity networks of diseases and miRNAs. Integrating disease similarity with SNF includes the following main steps. First, the normalized weight matrices of the disease similarity networks are obtained by the formulas below:

$$DSP(i, j) = \begin{cases} \dfrac{DS(i, j)}{2 \sum_{k \neq i} DS(i, k)}, & j \neq i \\ \dfrac{1}{2}, & j = i \end{cases}$$

$$KDP(i, j) = \begin{cases} \dfrac{GKD(i, j)}{2 \sum_{k \neq i} GKD(i, k)}, & j \neq i \\ \dfrac{1}{2}, & j = i \end{cases}$$

where $DSP$ denotes the normalized weight matrix of the disease semantic similarity network, and $KDP$ denotes the normalized weight matrix of the GIP kernel similarity for diseases. Then, we used the KNN method to calculate the local relationships of diseases by the following two formulas:

$$DSK(i, j) = \begin{cases} \dfrac{DS(i, j)}{\sum_{k \in N_i} DS(i, k)}, & j \in N_i \\ 0, & \text{otherwise} \end{cases}$$

$$KDK(i, j) = \begin{cases} \dfrac{GKD(i, j)}{\sum_{k \in N_i} GKD(i, k)}, & j \in N_i \\ 0, & \text{otherwise} \end{cases}$$

where $N_i$ denotes the set of nearest neighbors of disease $d_i$, $DSK$ denotes the local relationship matrix of the disease semantic similarity, and $KDK$ denotes the local relationship matrix of the GIP kernel similarity for diseases. Based on the previous literature, the essence of the SNF method can be described as an iterative update of the similarity matrices. After substituting the disease data into the network fusion formula of SNF, the fusion process for each data type is given by the following equations:

$$DSP^{(t+1)} = DSK \times KDP^{(t)} \times DSK^{T}$$

$$KDP^{(t+1)} = KDK \times DSP^{(t)} \times KDK^{T}$$

The final disease similarity matrix that integrates all data types is given by the formula below:

$$S_d = \frac{DSP^{(t)} + KDP^{(t)}}{2}$$

where $S_d$ denotes the final similarity matrix of diseases. Similarity network fusion for miRNAs is defined in the same way, with the miRNA functional similarity $MF$ and the GIP kernel similarity $GKM$ in place of $DS$ and $GKD$; $S_m$ denotes the resulting miRNA similarity matrix.

FIGURE 1 | Flowchart of SNFIMCMDA model.
Frontiers in Cell and Developmental Biology | www.frontiersin.org
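The fusion steps described in this section can be sketched in plain Python as follows. The sketch assumes the standard two-view SNF scheme (a row-normalized full kernel, a KNN-sparsified local kernel, cross-diffusion between the two views, and a final average). The neighborhood size `k`, the number of iterations, the choice to exclude a node from its own neighborhood, and the two toy similarity matrices are all assumptions.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def full_kernel(W):
    """Normalised weights: 1/2 on the diagonal, off-diagonal entries sum to 1/2."""
    n = len(W)
    P = []
    for i in range(n):
        off = sum(W[i][k] for k in range(n) if k != i)
        P.append([0.5 if j == i else W[i][j] / (2 * off) for j in range(n)])
    return P


def local_kernel(W, k):
    """Keep each node's k most similar other nodes and renormalise the row."""
    n = len(W)
    S = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nbrs = sorted(others, key=lambda j: W[i][j], reverse=True)[:k]
        total = sum(W[i][j] for j in nbrs)
        S.append([W[i][j] / total if j in nbrs else 0.0 for j in range(n)])
    return S


def snf(W1, W2, k=2, iterations=10):
    """Fuse two similarity matrices by cross-diffusion and average the result."""
    P1, P2 = full_kernel(W1), full_kernel(W2)
    S1, S2 = local_kernel(W1, k), local_kernel(W2, k)
    for _ in range(iterations):
        # Each view is updated from the other view's previous state.
        P1, P2 = (matmul(matmul(S1, P2), transpose(S1)),
                  matmul(matmul(S2, P1), transpose(S2)))
    n = len(W1)
    return [[(P1[i][j] + P2[i][j]) / 2 for j in range(n)] for i in range(n)]


# Hypothetical similarity matrices for four diseases (two data types).
W1 = [[1.0, 0.8, 0.2, 0.1], [0.8, 1.0, 0.3, 0.2],
      [0.2, 0.3, 1.0, 0.7], [0.1, 0.2, 0.7, 1.0]]
W2 = [[1.0, 0.6, 0.4, 0.1], [0.6, 1.0, 0.2, 0.3],
      [0.4, 0.2, 1.0, 0.9], [0.1, 0.3, 0.9, 1.0]]

S_d = snf(W1, W2)
```

The same function would be called with the two miRNA similarity matrices to obtain the fused miRNA similarity.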

Inductive Matrix Completion
After collecting the data and using SNF to integrate the similarities of miRNAs and diseases, we used the IMC method to obtain the final prediction result. The flowchart of SNFIMCMDA is presented in Figure 1. The IMC method takes as input the verified miRNA-disease association matrix $A \in \mathbb{R}^{nd \times nm}$, the miRNA similarity matrix $S_m \in \mathbb{R}^{nm \times nm}$, and the disease similarity matrix $S_d \in \mathbb{R}^{nd \times nd}$. Here, $S_m$ serves as the feature matrix of the $nm$ miRNAs, and $S_d$ serves as the feature matrix of the $nd$ diseases. The feature vector of miRNA $m_j$ is denoted by $S_m(j)$, and the feature vector of disease $d_i$ is denoted by $S_d(i)$. IMC then approximates $A$ through the low-rank product $UV^T$ applied to these features, $A \approx S_d U V^T S_m^T$, where $U \in \mathbb{R}^{nd \times r}$ and $V \in \mathbb{R}^{nm \times r}$. Here, $r$ is the desired rank, which equals $\min(\mathrm{rank}(U), \mathrm{rank}(V))$. The convergence speed of the IMC algorithm is also affected by $r$. The matrices $U$ and $V$ are the solutions of the following optimization problem:

$$\min_{U, V} \; \frac{1}{2} \left\| A - S_d U V^T S_m^T \right\|_F^2 + \frac{\lambda_1}{2} \left\| U \right\|_F^2 + \frac{\lambda_2}{2} \left\| V \right\|_F^2$$

where $\| \cdot \|_F$ is the Frobenius norm of a matrix; the two penalty terms are included to prevent overfitting, and the regularization parameters $\lambda_1$ and $\lambda_2$ are both equal to 1.
In addition, $U \in \mathbb{R}^{nd \times r}$ and $V \in \mathbb{R}^{nm \times r}$ were initialized as two random dense matrices and then updated iteratively by alternating gradient descent. In our experiment, the iteration stopped when the convergence criterion reached $10^{-6}$, which yielded the final $U$ and $V$. $S(d_i, m_j)$ indicates the predicted association score between $d_i$ and $m_j$ and is calculated from $U$ and $V$ as follows:

$$S(d_i, m_j) = S_d(i) \, U V^T S_m(j)^T$$

Furthermore, if the feature vector of a new disease $newd_i$ is available, $S(newd_i, j)$ gives the association score between this disease and any miRNA $m_j$, so the miRNAs associated with $newd_i$ can be identified.
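A minimal Python sketch of IMC by alternating gradient descent follows. It assumes the squared-error objective with Frobenius-norm penalties, both regularization parameters set to 1, and a stopping rule based on the change in the objective falling below $10^{-6}$. The learning rate, rank, initialization range, and toy matrices are assumptions chosen only to keep the example small, and the gradients were derived for this sketch rather than copied from the article.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def combine(X, Y, alpha):
    """Element-wise X + alpha * Y."""
    return [[x + alpha * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sq_norm(X):
    return sum(v * v for row in X for v in row)


def predict(S_d, S_m, U, V):
    """Score matrix S_d U V^T S_m^T with shape nd x nm."""
    return matmul(matmul(matmul(S_d, U), transpose(V)), transpose(S_m))


def loss(A, S_d, S_m, U, V, lam):
    residual = combine(predict(S_d, S_m, U, V), A, -1.0)
    return 0.5 * sq_norm(residual) + 0.5 * lam * (sq_norm(U) + sq_norm(V))


def imc(A, S_d, S_m, rank=2, lam=1.0, lr=0.02, tol=1e-6, max_iter=5000, seed=0):
    rng = random.Random(seed)
    U = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in S_d]
    V = [[rng.uniform(0.0, 0.1) for _ in range(rank)] for _ in S_m]
    losses = [loss(A, S_d, S_m, U, V, lam)]
    for _ in range(max_iter):
        # Step on U with V fixed: gradient is S_d^T R S_m V + lam * U.
        R = combine(predict(S_d, S_m, U, V), A, -1.0)
        grad_U = combine(matmul(matmul(matmul(transpose(S_d), R), S_m), V), U, lam)
        U = combine(U, grad_U, -lr)
        # Step on V with U fixed: gradient is S_m^T R^T S_d U + lam * V.
        R = combine(predict(S_d, S_m, U, V), A, -1.0)
        grad_V = combine(matmul(matmul(matmul(transpose(S_m), transpose(R)), S_d), U), V, lam)
        V = combine(V, grad_V, -lr)
        losses.append(loss(A, S_d, S_m, U, V, lam))
        if abs(losses[-2] - losses[-1]) < tol:
            break
    return U, V, losses


# Hypothetical toy inputs: 2 diseases, 3 miRNAs.
A = [[1, 0, 1], [0, 1, 0]]
S_d = [[1.0, 0.3], [0.3, 1.0]]
S_m = [[1.0, 0.2, 0.6], [0.2, 1.0, 0.1], [0.6, 0.1, 1.0]]

U, V, losses = imc(A, S_d, S_m)
scores = predict(S_d, S_m, U, V)  # scores[i][j] ranks miRNA j for disease i
```

Scoring a disease with no known associations only requires its feature vector: multiplying that row vector by $U V^T S_m^T$ gives one score per miRNA, which is what makes the method inductive.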

Performance Evaluation
To assess the prediction accuracy of SNFIMCMDA, we compared our model with three previous computational models: IMCMDA (Chen et al., 2018b), GRL$_{2,1}$-NMF, and MSCHLMDA. Based on the verified miRNA-disease associations downloaded from the HMDD v2.0 database, global leave-one-out cross-validation (global LOOCV) and five-fold cross-validation (5-CV) were used to evaluate the performance of these computational models.
In the framework of global LOOCV, each verified miRNA-disease association was selected in turn as the test sample, while the other experimentally confirmed associations formed the training set. The miRNA-disease pairs without evidence of association were treated as candidate samples. After running SNFIMCMDA and calculating all association scores, each test sample was ranked against the candidate samples. If the test sample ranked above a given threshold, the model was considered to have predicted that association successfully. We then drew the receiver operating characteristic (ROC) curve by plotting the true-positive rate against the false-positive rate at different thresholds. Finally, to evaluate the performance of SNFIMCMDA, we calculated the area under the ROC curve (AUC) for all models. The AUC values of SNFIMCMDA, IMCMDA, GRL$_{2,1}$-NMF, and MSCHLMDA were 0.9540, 0.8384, 0.9280, and 0.9287, respectively (Figure 2). SNFIMCMDA thus achieved a higher AUC than the other methods.
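The ranking-based evaluation can be sketched in Python as below. The sketch uses a common shortcut rather than the article's own procedure: when every left-out association is compared with the same number of candidate samples, the AUC equals the average fraction of candidates that score below the test sample, with ties counted as one half. The function name and the scores are hypothetical.

```python
def loocv_auc(folds):
    """AUC from leave-one-out results.

    folds: list of (test_score, candidate_scores), one entry per left-out
    known association.
    """
    total = 0.0
    for test_score, candidates in folds:
        below = sum(1 for c in candidates if c < test_score)
        ties = sum(1 for c in candidates if c == test_score)
        total += (below + 0.5 * ties) / len(candidates)
    return total / len(folds)


# Hypothetical scores: three left-out associations, three candidates each.
folds = [
    (0.9, [0.1, 0.4, 0.8]),  # test sample outranks every candidate
    (0.3, [0.1, 0.4, 0.8]),  # test sample outranks one candidate
    (0.8, [0.1, 0.4, 0.8]),  # test sample ties with the top candidate
]

auc = loocv_auc(folds)
```

In a full evaluation, each entry of `folds` would come from retraining the model with that association set to 0 in the adjacency matrix.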
In the framework of 5-CV, all observed miRNA-disease associations were first randomly divided into five parts. In each round, one part served as the test set and the other four parts formed the training set, so that each part was tested once. As before, the miRNA-disease pairs without evidence of association were treated as candidate samples. After running SNFIMCMDA, we ranked the test samples against the candidate samples. Furthermore, we repeated the random division of the known associations 100 times to reduce the bias introduced by any single random split. Finally, as in global LOOCV, we obtained the ROC curves and AUCs of these models. The AUC values of SNFIMCMDA, IMCMDA, GRL$_{2,1}$-NMF, and MSCHLMDA were 0.9539, 0.8330, 0.9276, and 0.9263, respectively (Figure 3). SNFIMCMDA again achieved a higher AUC than the other methods.
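A short Python sketch of the random five-fold division is shown below. The generator name, the seed handling, and the toy list of associations are hypothetical; repeating the division 100 times would amount to calling the generator with 100 different seeds.

```python
import random


def five_fold_splits(known_pairs, seed=0, n_folds=5):
    """Yield (train, test) lists; each known pair is tested exactly once."""
    pairs = list(known_pairs)
    random.Random(seed).shuffle(pairs)
    folds = [pairs[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


# Hypothetical known associations as (disease index, miRNA index) pairs.
known_pairs = [(d, m) for d in range(3) for m in range(4)]

splits = list(five_fold_splits(known_pairs, seed=0))
```

For each split, the test pairs would be set to 0 in the adjacency matrix before training, and their scores then compared with those of the candidate samples.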

Case Study
In this article, three human diseases, colon neoplasms, breast neoplasms, and lung neoplasms, were used to validate the prediction results of SNFIMCMDA. These diseases pose a great threat to human health. Colon neoplasms are a common malignant tumor of the gastrointestinal tract (Jemal et al., 2011). They have caused a large number of new cases and deaths in recent years (Thackeray et al., 2011). Several miRNAs that relate to

DISCUSSION
Research on inferring possible miRNA-disease relationships provides insight into the pathogenesis of diseases and contributes to their treatment. Therefore, we constructed the SNFIMCMDA model. In SNFIMCMDA, the prediction score of each miRNA-disease pair is calculated by combining the known miRNA-disease associations with the integrated similarities of miRNAs and diseases. Unlike the previously published IMCMDA model, we changed how the similarities of miRNAs and diseases are integrated: SNF was used in place of the integration method of IMCMDA. After adopting SNF, the prediction results improved clearly. In the framework of global LOOCV, the AUC of SNFIMCMDA was 0.9540, higher than the 0.8384 obtained by IMCMDA. In the framework of 5-CV, the AUC of SNFIMCMDA was 0.9539, also higher than the 0.8330 obtained by IMCMDA. Moreover, SNFIMCMDA achieved a higher AUC than the other previous methods in both global LOOCV and 5-CV. Furthermore, case studies on three human diseases confirmed the reliable performance of SNFIMCMDA. Therefore, SNFIMCMDA could serve as a reliable tool for identifying the most promising disease-related miRNAs, thereby improving our understanding of the roles of miRNAs in disease mechanisms and contributing to the prevention, detection, and diagnosis of complex diseases in the future. SNFIMCMDA completes the missing association scores between miRNAs and diseases using feature vectors that succinctly represent each disease and each miRNA. Furthermore, given the feature vector of a disease without any known associated miRNAs, SNFIMCMDA can predict which miRNAs are associated with this disease and obtain the corresponding scores.
In addition, our model is semi-supervised, so it does not require negative samples. It needs only positive and unlabeled samples, which considerably lowers the difficulty of modeling. The role of SNF is to combine different types of experimental data; we applied the SNF algorithm to combine the different types of similarity data for miRNAs and diseases, which makes the prediction results more reliable. Finally, alternating gradient descent was used in the IMC algorithm to find the optimal solution, which ensured the reliability of the feature vectors of miRNAs and diseases.
Some limitations affect the performance of SNFIMCMDA. First, the data we used, including verified miRNA-disease associations, miRNA functional similarity, and disease semantic similarity, may contain noise and outliers. In addition, SNFIMCMDA uses a least-squares error function, which is sensitive to noise and outliers. Furthermore, SNFIMCMDA includes several parameters, and finding their optimal values is a challenge. Therefore, as more verified biological data become available, we will develop optimization strategies to improve the accuracy of our model.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
LL designed the experiment, performed the experiment, and wrote the manuscript. ZG and YW performed the experiment. C-HZ processed the data. Y-TW and J-CN revised the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
This work was supported by the National Natural Science Foundation of China (Nos. U19A2064, 61873001, 61872220, 61861146002, and 11701318). This article is recommended by the 5th Computational Bioinformatics Conference.