LPI-IBNRA: Long Non-coding RNA-Protein Interaction Prediction Based on Improved Bipartite Network Recommender Algorithm

According to recent research, long non-coding RNAs (lncRNAs) play a broad and important role in various biological processes by interacting with proteins. However, identifying whether proteins interact with a specific lncRNA through biological experiments is difficult, costly, and time-consuming. Thus, many computational methods have been proposed to predict lncRNA-protein interactions. In this paper, we propose a novel approach called Long non-coding RNA-Protein Interaction Prediction based on Improved Bipartite Network Recommender Algorithm (LPI-IBNRA). In the proposed method, we implement a two-round resource allocation and appropriately eliminate the second-order correlations on the bipartite network. Experimental results show that LPI-IBNRA outperforms five previous methods, with AUC values of 0.8932 in leave-one-out cross validation (LOOCV) and 0.8819 ± 0.0052 in 10-fold cross validation. In addition, case studies on four lncRNAs were carried out to show the predictive power of LPI-IBNRA.


INTRODUCTION
LncRNAs, a class of non-coding RNAs (ncRNAs) longer than 200 nucleotides that do not encode proteins, have gained increasing scientific interest in recent years (Jorge et al., 2012; Hajjari et al., 2016). Only 2% of RNAs in the human transcriptome can encode proteins while the others, called ncRNAs, cannot. Note that most ncRNAs are lncRNAs. Compared to other ncRNAs, lncRNAs are longer and have complex secondary or higher-order structures (Bonasio and Shiekhattar, 2014), and their genes often have independent regulatory elements such as promoters and enhancers (Ulitsky and Bartel, 2013). There is increasing evidence that lncRNAs are involved in the regulation of gene expression at several levels, including epigenetic regulation, transcriptional regulation, and multiple levels of post-transcriptional regulation (Sarah and Jeff, 2013), but only a few functions and mechanisms of lncRNAs have been studied (Maarabouni et al., 2008; Lee et al., 2016). Moreover, interactions of lncRNAs with other molecules have also become a focus of oncology research in recent years. Studies have found that an important way for lncRNAs to function is by interacting with proteins (Khalil and Rinn, 2011). LncRNAs play a broad and important regulatory role in various processes such as tumorigenesis, cancer progression, and metastasis by interacting with proteins. Thus, the prediction and identification of lncRNA-protein interactions can further reveal lncRNA-related functions and benefits the study of the pathogenesis of complex diseases at the molecular level (Faghihi et al., 2008; Chen and Yan, 2013; Cui et al., 2013; Li et al., 2013; Chen et al., 2016a,c, 2017b, 2018c). Numerous biological experimental methods have been used to confirm protein-related RNAs (Ule et al., 2005; Galgano and Gerber, 2011; Zambelli and Pavesi, 2015). However, such experimental methods are laborious, time-consuming, and costly (Huang et al., 2012).
Recently, various computational methods have been proposed to address challenges in bioinformatics (He et al., 2018a,b; Zou et al., 2018), such as the prediction of lncRNA-protein (Hu et al., 2017; Shen et al., 2019a,b), miRNA-disease (Chen and Huang, 2017; Chen et al., 2018a,b,d,e,f; Jiang et al., 2018a,b; Xie et al., 2018), drug-target (Chen et al., 2016b; Wang et al., 2017; Wu et al., 2018), and microbe-disease associations (Chen et al., 2017a; Peng et al., 2018). The methods for inferring lncRNA-protein associations can roughly be classified into two types: machine learning methods and network-based methods. Machine learning methods usually extract biological features of lncRNAs and proteins and then employ a supervised classifier to identify whether proteins have potential interactions with a specific lncRNA (Zhan et al., 2018). For example, Bellucci et al. (2011) proposed integrating secondary structure, hydrogen bonding, and van der Waals contributions as features, which benefits the inference of the binding propensity between proteins and ncRNAs. Muppirala et al. (2011) used protein and lncRNA sequence information with a support vector machine (SVM) and a random forest (RF). Suresh et al. (2015) proposed an SVM-based method named RPI-Pred, which uses high-order 3D structural features and sequences of the lncRNA and protein. Hu et al. (2018) developed a method called HLPI-Ensembl, adopting an ensemble strategy based on extreme gradient boosting (XGB), SVM, and RF. However, the main drawback of these methods is the insufficiency of negative samples of lncRNA-protein interactions, which may make the performance of the supervised classifier unstable. Moreover, selecting appropriate features to predict lncRNA-protein interactions is not an easy task.
Apart from the aforementioned methods, other approaches predict potential lncRNA-protein interactions with network analysis algorithms. For instance, Li et al. (2015) presented a method called LPIHN, which constructs a heterogeneous network and implements a random walk with restart on it. To improve prediction performance, some recent network-based methods use recommender algorithms to infer lncRNA-protein interactions. For example, Ge et al. (2016) proposed a method called LPBNI, which uses only known lncRNA-protein interactions and implements a two-step propagation on a bipartite network. Zhao et al. (2018b) introduced a bipartite network-based approach called LPI-BNPRA, which infers lncRNA-protein interactions by constructing bias ratings for lncRNAs and proteins using agglomerative hierarchical clustering. By implementing two-round resource allocation on bipartite networks, these approaches achieved impressive results. However, their predictive accuracy remains limited by high-order correlations, which might have a negative effect on lncRNA-protein interaction prediction. For example, proteins directly correlated through the same lncRNA could also be indirectly correlated through other intermediate proteins, resulting in correlation redundancy. Properly eliminating the redundancy induced by the second-order correlation might further enhance the accuracy of the prediction. This inspired us to develop an effective network-based recommender algorithm for lncRNA-protein interaction prediction.
Motivated by the effectiveness of high-order correlation elimination in the study of Qiu et al. (2014), we propose a novel method named LPI-IBNRA for inferring new lncRNA-protein interactions. LPI-IBNRA uses known lncRNA-protein interactions, protein-protein interactions, and lncRNA expression similarity, and then appropriately eliminates second-order correlations on the bipartite network to enhance prediction accuracy. Compared with previous machine learning methods, our method does not require negative samples. Compared with existing network-based methods (Ge et al., 2016; Zhao et al., 2018b), our method yields comparable or better results owing to second-order correlation elimination. Both 10-fold cross validation and LOOCV were carried out to assess the prediction ability of the proposed method. Experimental results show that LPI-IBNRA outperformed five other methods by achieving higher AUC values. In addition, case studies on four lncRNAs further demonstrated the predictive power of LPI-IBNRA. Therefore, we conclude that LPI-IBNRA is feasible and effective for inferring potential lncRNA-protein interactions.

Human LncRNA-Protein Interactions
The known ncRNA-protein interaction dataset was downloaded from the NPInter v2.0 database. We limited the organism to "Homo sapiens" and the type of ncRNAs to "NONCODE", in order to filter ncRNAs and their interacting proteins. The lncRNAs were further filtered from these ncRNAs using a human lncRNA dataset from the NONCODE 4.0 database. We deleted duplicate interactions. Considering the sample requirement of LOOCV, we removed the lncRNAs and proteins that have only one interaction. We then obtained 4,796 distinct experimentally confirmed lncRNA-protein interactions, involving 26 proteins and 1,105 lncRNAs. We denoted $np$ as the number of known proteins, $nl$ as the number of known lncRNAs, and matrix $I \in \mathbb{R}^{np \times nl}$ as the adjacency matrix of protein-lncRNA interactions. The interaction between protein $p_i$ and lncRNA $l_j$ could be denoted as follows:

$$I(p_i, l_j) = \begin{cases} 1, & \text{if protein } p_i \text{ interacts with lncRNA } l_j \\ 0, & \text{otherwise} \end{cases}$$
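As an illustration of how such an adjacency matrix might be assembled, the following minimal sketch builds a binary protein-by-lncRNA matrix from a list of interaction pairs. The function name `build_adjacency` and the toy identifiers are hypothetical, and the removal of lncRNAs and proteins with a single interaction is not shown.

```python
import numpy as np

def build_adjacency(pairs):
    """Build a binary protein-by-lncRNA matrix from (protein, lncRNA) pairs.

    Hypothetical helper for illustration; not the authors' code.
    """
    pairs = set(pairs)  # drop duplicate interactions
    proteins = sorted({p for p, _ in pairs})
    lncrnas = sorted({l for _, l in pairs})
    p_idx = {p: i for i, p in enumerate(proteins)}
    l_idx = {l: j for j, l in enumerate(lncrnas)}
    I = np.zeros((len(proteins), len(lncrnas)))
    for p, l in pairs:
        I[p_idx[p], l_idx[l]] = 1.0  # known interaction
    return I, proteins, lncrnas

# Toy example with made-up identifiers; the duplicated pair is counted once.
I, proteins, lncrnas = build_adjacency(
    [("P1", "L1"), ("P1", "L2"), ("P2", "L1"), ("P1", "L1")]
)
```

Rows index proteins and columns index lncRNAs, matching the $np \times nl$ orientation used in the text.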

Protein-Protein Interaction Score Matrix and Similarity Matrix
Protein-protein interactions (PPIs) were obtained from the STRING 9.1 database (Franceschini et al., 2013), which includes weighted protein-protein interactions derived from co-expression data, genomic context predictions, automated text mining, and high-throughput lab experiments. We then deleted the redundant PPI data and obtained 214 PPIs and the corresponding interaction scores based on the known lncRNA-protein dataset. The symmetric matrix $AP$ was denoted as an interaction score matrix based on PPI data, where $AP(p_i, p_j)$ is the interaction score between proteins $p_i$ and $p_j$. $AP$ could then be standardized as follows: where $R(p_i)$ is the sum of the elements in the $i$th row of $AP$.
Considering the hypothesis that similar proteins tend to exhibit similar interaction and non-interaction patterns with lncRNAs (Zheng et al., 2017), we calculated the protein similarity using Gaussian kernel interaction profiles. We denoted $X(p_i)$ as the $i$th row vector of matrix $I$, in which the nonzero values occur at the indices of the lncRNAs that interact with protein $p_i$. Then the similarity between proteins $p_i$ and $p_j$ based on Gaussian kernel interaction profiles could be calculated as follows: where the adjustment coefficient $\beta_p$ for the kernel bandwidth is defined as follows:
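A minimal sketch of this similarity, assuming the commonly used form of the Gaussian interaction profile kernel, $\exp(-\beta_p \lVert X(p_i) - X(p_j) \rVert^2)$ with $\beta_p = \beta'_p / \big(\tfrac{1}{np}\sum_i \lVert X(p_i) \rVert^2\big)$. The function name `gip_kernel`, the parameter `beta_prime`, and its default of 1 are assumptions made for illustration rather than values taken from the text.

```python
import numpy as np

def gip_kernel(profiles, beta_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Assumes the standard formulation; every row is one interaction profile
    and at least one profile must be nonzero.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Bandwidth normalised by the mean squared norm of the profiles.
    beta = beta_prime / sq_norms.mean()
    # Pairwise squared Euclidean distances via the expansion of |a - b|^2.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-beta * np.maximum(sq_dist, 0.0))

# Two proteins over three lncRNAs (toy data).
K = gip_kernel([[1, 1, 0], [1, 0, 0]])
```

Passing the transpose of the adjacency matrix would give the corresponding lncRNA kernel under the same assumptions.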

LncRNA-LncRNA Similarity Matrix
LncRNA expression profiles were downloaded from the NONCODE 4.0 database. After removing the superfluous data, we obtained the expression profiles of 1,105 lncRNAs in 24 human tissues or cell types. Then the Pearson correlation coefficient (PCC) was applied to calculate the expression similarity between each pair of lncRNA expression profiles (Ganegoda et al., 2013; Tang et al., 2014). We denoted $E(i) = \{e_{i1}, e_{i2}, \ldots, e_{i24}\}$ and $E(j) = \{e_{j1}, e_{j2}, \ldots, e_{j24}\}$ as the expression profiles of $l_i$ and $l_j$. The expression similarity $AL(l_i, l_j)$ between lncRNAs $l_i$ and $l_j$ was calculated as follows:

$$AL(l_i, l_j) = \left| \frac{\mathrm{cov}(E(i), E(j))}{\sigma_{E(i)} \, \sigma_{E(j)}} \right|$$

where $AL(l_i, l_j)$ denotes the absolute value of the PCC between $l_i$ and $l_j$, $\mathrm{cov}(E(i), E(j))$ is the covariance between $E(i)$ and $E(j)$, and $\sigma_{E(i)}$ and $\sigma_{E(j)}$ are the standard deviations of $E(i)$ and $E(j)$, respectively. We denoted $X(l_i)$ as the $i$th column vector of matrix $I$, in which the nonzero values occur at the indices of the proteins that interact with lncRNA $l_i$. Similar to the aforementioned protein case, the Gaussian interaction profile kernel similarity for lncRNAs could be computed from these column vectors.
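The expression similarity described here, the absolute Pearson correlation between expression profiles, can be sketched with NumPy as follows. The function name `expression_similarity` and the toy profiles are illustrative; a profile that is constant across tissues would produce NaN and would need separate handling.

```python
import numpy as np

def expression_similarity(expr):
    """Absolute Pearson correlation between the rows of `expr`.

    Each row is the expression profile of one lncRNA across tissues.
    """
    expr = np.asarray(expr, dtype=float)
    # np.corrcoef treats each row as one variable by default.
    return np.abs(np.corrcoef(expr))

# Three toy lncRNAs over three tissues; rows 0 and 1 are perfectly
# anti-correlated, so their similarity is 1 after taking the absolute value.
AL = expression_similarity([[1, 2, 3], [3, 2, 1], [1, 2, 4]])
```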

Integrated Similarity Matrix for Proteins and LncRNAs
Note that the Gaussian interaction profile kernel similarity is an association information-based measurement, which can be utilized to complement protein-protein interactions and lncRNA expression similarity. Motivated by the study of Chen (2015), we constructed the integrated protein similarity matrix $Sim_P$ and the integrated lncRNA similarity matrix $Sim_L$ as follows:

LPI-IBNRA
The flow chart of LPI-IBNRA is shown in Figure 1. At first, we denoted $S_P \in \mathbb{R}^{np \times nl}$ as the resource score matrix based on protein similarity and $S_L \in \mathbb{R}^{np \times nl}$ as the one based on lncRNA similarity. These two matrices were computed as follows: where $S_P(p_i, l_j)$ represents the score between protein $p_i$ and lncRNA $l_j$ based on protein similarity, and $S_L(p_i, l_j)$ represents the score between protein $p_i$ and lncRNA $l_j$ based on lncRNA similarity. Then the integrated resource score matrix $S_{ini}$ was initialized as the weighted sum of $S_P$ and $S_L$ as follows: where parameter $\gamma \in [0, 1]$ is a scalar controlling the relative contributions of protein similarity and lncRNA similarity in $S_{ini}$. Following the general setting, we set $\gamma = 0.5$ in this paper, making $S_P$ and $S_L$ equally weighted. The final score matrix can be obtained by updating $S_{ini}$ column by column. In other words, the calculation process can be partitioned into $nl$ runs, each of which corresponds to a specific lncRNA. Thus, at the beginning of the $k$th run, the score for protein $p_i$ interacting with the given lncRNA $l_k$ can be initialized as follows: Then, in the first round of our allocation model, the score is allocated from protein $p_i$ to lncRNA $l_k$, which can be calculated as follows: where $d(p_i) = \sum_{x=1}^{nl} S_{ini}(p_i, l_x)$ is the sum of the initial scores of all lncRNAs interacting with protein $p_i$.
The score of lncRNA $l_k$ can be obtained by summing scores over all proteins connected with $l_k$: In the second round, resource scores were allocated in a similar way as in the first round. The score allocated from lncRNA $l_k$ to protein $p_i$ was calculated as follows: where $d(l_k) = \sum_{y=1}^{np} S_{ini}(p_y, l_k)$ is the sum of the initial scores of all proteins interacting with lncRNA $l_k$.
The score of protein $p_i$ was allocated from all lncRNAs that interacted with $p_i$ as follows: As described in Equations (13) to (17), we first initialized the score of protein $p_i$ from the given lncRNA $l_k$ and then updated it by a two-round resource allocation. An example is given in Figure 2. We defined $S_{fin} \in \mathbb{R}^{np \times nl}$ as the final resource score matrix, which can be represented as follows: $S_{fin}(p_i, l_k) = s_2(p_i)$.
$S_{fin}$ can also be computed in a vectorized form as: where $\vec{S}_{fin}$ is a column vector of $S_{fin}$, $\vec{S}_{ini}$ is the corresponding column vector of $S_{ini}$, and $W \in \mathbb{R}^{np \times np}$ is the weight matrix. Then Equation (17) can also be represented as: where

In the lncRNA-protein interaction network, the proteins interacting with the same lncRNA are considered to be directly correlated, i.e., to have a low-order correlation, while higher-order correlations between these proteins might also arise from indirect associations. Such high-order correlations might have a negative effect on lncRNA-protein interaction prediction. Based on the studies of Zhou et al. (2009) and Liu et al. (2010), we eliminated second-order correlations in an appropriate way to further enhance the accuracy of the prediction: where the parameter $\alpha \in (-1, 0)$. The final score matrix for inferring potential lncRNA-protein interactions can then be calculated as follows: After the calculations, we can recommend proteins to the given lncRNA $l_k$ in descending order of the $k$th column of $S'_{fin}$.
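The procedure above can be sketched in NumPy under stated assumptions. The sketch assumes a weighted two-round allocation in which each protein distributes its score to lncRNAs in proportion to $S_{ini}(p_i, l_x)/d(p_i)$ and each lncRNA returns it in proportion to $S_{ini}(p_i, l_x)/d(l_x)$, and it assumes the second-order correction takes the form $W' = W + \alpha W^2$ known from redundant-correlation elimination in recommender systems. The function names and the exact matrix forms are illustrative assumptions, not a reproduction of the paper's equations.

```python
import numpy as np

def allocation_weights(S_ini):
    """Protein-to-protein weight matrix of a two-round weighted allocation.

    Assumed form: W[i, j] = sum_x S[i, x] * S[j, x] / (d_l[x] * d_p[j]),
    where d_p and d_l are row and column sums of S (all assumed positive).
    """
    S = np.asarray(S_ini, dtype=float)
    d_p = S.sum(axis=1)  # d(p_i): row sums
    d_l = S.sum(axis=0)  # d(l_k): column sums
    return (S / d_l) @ (S / d_p[:, None]).T

def final_scores(S_ini, alpha=-0.70):
    """Scores after second-order correlation elimination (assumed W + alpha * W^2)."""
    S = np.asarray(S_ini, dtype=float)
    W = allocation_weights(S)
    W_prime = W + alpha * (W @ W)
    # Each column of S is propagated independently, one run per lncRNA.
    return W_prime @ S

# Toy initial score matrix: 2 proteins by 3 lncRNAs.
S_toy = np.array([[1.0, 2.0, 0.5], [0.5, 1.0, 3.0]])
W_toy = allocation_weights(S_toy)
scores_toy = final_scores(S_toy)
```

Under these assumptions each column of `W_toy` sums to 1, meaning the two-round allocation conserves the total score before the correction is applied; a negative `alpha` then subtracts a fraction of the two-step paths.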

Performance Evaluation
FIGURE 2 | The basic idea of LPI-IBNRA. First, two resource score matrices, which are computed based on protein similarity and lncRNA similarity, respectively, are combined to construct the initial integrated resource score network. Second, each protein gains its initial score from a specific lncRNA. Next, in the two-round resource allocation, the score is allocated from proteins to lncRNAs and then propagated back to proteins. Finally, the weight matrix is optimized by second-order correlation elimination to obtain the final scores of proteins.

We evaluated the classification performance of the proposed LPI-IBNRA method under two validation schemes, i.e., LOOCV and 10-fold cross validation. The performance of LPI-IBNRA was evaluated in terms of several widely used indicators, including precision (PRE), sensitivity (SEN), accuracy (ACC), F1 score, and Matthews correlation coefficient (MCC), expressed as follows:

$$\mathrm{PRE} = \frac{TP}{TP + FP}, \qquad \mathrm{SEN} = \frac{TP}{TP + FN}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{PRE} \times \mathrm{SEN}}{\mathrm{PRE} + \mathrm{SEN}}, \qquad \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. As a popular method for performance evaluation, the receiver operating characteristic (ROC) curve was also utilized in our experiments. An area under the ROC curve (AUC) of 1 indicates perfect performance, while an AUC of 0.5 indicates random performance. The precision-recall curve (PR curve) and the area under the PR curve (AUPR) were also used to reduce the negative influence of false positive data on the measured performance. The larger the AUC and AUPR are, the better the evaluated method performs.
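For concreteness, the five indicators can be computed from confusion-matrix counts as in the following sketch, which uses their standard definitions. The function name `classification_metrics` and the example counts are made up for illustration, and zero denominators are not guarded against.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard PRE, SEN, ACC, F1, and MCC from confusion-matrix counts."""
    pre = tp / (tp + fp)
    sen = tp / (tp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * pre * sen / (pre + sen)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"PRE": pre, "SEN": sen, "ACC": acc, "F1": f1, "MCC": mcc}

# Made-up counts for illustration only.
m = classification_metrics(tp=8, tn=7, fp=2, fn=3)
```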

Comparison With Other Methods
We used the aforementioned 4,796 known human lncRNA-protein interactions to carry out the two cross validation schemes. In each LOOCV trial, one known lncRNA-protein interaction was used as the test sample while the rest were used as training samples. To analyze the influence of parameter α on the performance of LPI-IBNRA, we applied LOOCV for the selection of α. As shown in Figure 3A, the performance of LPI-IBNRA drops sharply when α is smaller than -0.70 and decreases slightly when α is larger than -0.70. Thus, we set α to -0.70, which gave the best performance.
Five previous approaches were used for comparison in the experiments, including collaborative filtering (CF), random walk with restart (RWR), LPBNI, LPIHN, and LPI-BNPRA. LPBNI, LPIHN, and LPI-BNPRA are network-based methods that infer potential lncRNA-protein interactions, while CF and RWR have been used as benchmark methods in Ge et al. (2016) and Wen et al. (2017). RWR is often utilized as a powerful tool in network-based methods to predict associations (Zhao et al., 2018a,c; Zhu et al., 2018), while CF is a well-known recommender algorithm that infers information from similar neighborhoods (Fu et al., 2014; Zeng et al., 2017). In our experiments, RWR was implemented to make predictions based on the protein-protein similarity network, while a simple version of the CF algorithm was adopted to calculate the prediction scores between lncRNAs and proteins.
We re-implemented these methods ourselves and ran them on the same dataset. See Figures 3B,C and Table 1 for the results of LOOCV. We can see from Figure 3B that our proposed method achieved an AUC of 0.8932, a considerable improvement over the five previous methods (i.e., 12.81% for CF, 10.71% for RWR, 1.56% for LPBNI, 2.00% for LPIHN, and 3.39% for LPI-BNPRA). In addition, the comparison of these methods in terms of precision vs. recall is presented in Figure 3C. It can be seen that LPI-IBNRA achieved a higher precision than the other methods at almost every recall value. Moreover, LPI-IBNRA outperformed the other methods in terms of AUPR, PRE, SEN, ACC, F1 score, and MCC, as presented in Table 1. As shown in Figure 3D, in 10-fold cross validation, LPI-IBNRA achieved an AUC of 0.8819 ± 0.0052 and was superior to the comparison methods, including CF (0.7655 ± 0.0069), RWR (0.7800 ± 0.0076), LPBNI (0.8695 ± 0.0047), LPIHN (0.8591 ± 0.0044), and LPI-BNPRA (0.8413 ± 0.0351).
The aforementioned results indicate that in both LOOCV and 10-fold cross validation, LPI-IBNRA outperforms the other methods in terms of AUC. This performance demonstrates that LPI-IBNRA is stable and effective in inferring potential lncRNA-protein interactions. The superior performance of the proposed method could be attributed to second-order correlation elimination, which is well suited to our task and can lead to better prediction performance.

Case Studies
In addition, four case studies were carried out to further evaluate the effectiveness of LPI-IBNRA. The interactions in our benchmark dataset were obtained from NPInter v2.0, which was established in 2013. NPInter was upgraded to NPInter v3.0 in 2016 (Hao et al., 2016), which includes newly discovered lncRNA-protein interactions. Thus, we predicted novel lncRNA-protein interactions based on the known interactions in the benchmark dataset and then checked our predictions against NPInter v3.0. For each lncRNA, the proteins ranked within the top 5 were considered as potential proteins that interact with the given lncRNA. Case studies were carried out on four lncRNAs: DLEU2, CRHR1-IT1, LRRC75A-AS1, and SNHG5. Table 2 shows the prediction results and whether they were confirmed for these lncRNAs. It indicates that five (DLEU2), five (CRHR1-IT1), five (LRRC75A-AS1), and four (SNHG5) of the top five predicted interacting proteins were confirmed by NPInter v3.0. The rankings of these lncRNA-protein interactions in the predictions of the other benchmark methods are also listed in Table 2. It can be observed that several novel interactions did not rank highly in the predictions of the other methods and are therefore likely to be missed by them. Therefore, LPI-IBNRA has great potential to predict new lncRNA-protein interactions.
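A minimal sketch of how the top-ranked candidate proteins for one lncRNA might be selected from a final score matrix. Excluding interactions already present in the benchmark dataset before ranking is an assumption made here because the case studies target novel interactions; the function name `top_k_proteins` and the toy scores are hypothetical.

```python
import numpy as np

def top_k_proteins(scores, known, lncrna_idx, k=5):
    """Indices of the k highest-scoring proteins for one lncRNA column.

    Proteins with a known interaction (nonzero entry in `known`) are
    skipped, which is an assumption made for this illustration.
    """
    col = np.asarray(scores, dtype=float)[:, lncrna_idx].copy()
    col[np.asarray(known)[:, lncrna_idx] > 0] = -np.inf  # mask known pairs
    order = np.argsort(-col)  # descending by score
    return [int(i) for i in order[:k] if np.isfinite(col[i])]

# Toy data: 4 proteins, 1 lncRNA; protein 0 is already known to interact.
toy_scores = [[0.9], [0.2], [0.5], [0.7]]
toy_known = [[1], [0], [0], [0]]
ranked = top_k_proteins(toy_scores, toy_known, lncrna_idx=0, k=2)
```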

DISCUSSION AND CONCLUSION
In this article, we proposed a novel method, LPI-IBNRA, for predicting lncRNA-protein interactions based on known lncRNA-protein interactions, lncRNA expression similarity, and protein-protein interactions. We integrated the known interactions and similarities as the initial resource scores for a two-round resource allocation in a bipartite network recommendation. Furthermore, we optimized the weight matrix by appropriately eliminating second-order correlations to obtain the final lncRNA-protein interaction predictions. The method achieved reliable prediction performance in LOOCV, 10-fold cross validation, and case studies. Thus, we believe that LPI-IBNRA can make reliable predictions and might guide future experimental studies on lncRNA-protein interactions.
LPI-IBNRA has the following improvements over several previous methods for predicting lncRNA-protein interactions. First, with the bipartite network recommender algorithm, we utilized the known lncRNA-protein interactions to construct a bipartite network between lncRNAs and proteins, and then allocated the resource scores via the interaction edges between lncRNA nodes and protein nodes. Therefore, a negative sample set is not required by our method. Second, we assigned weights to each edge of the bipartite network, which distinguishes our method from most former bipartite network methods. Thus, the resource scores are not evenly distributed during the resource allocation process. Finally, we appropriately eliminated second-order correlations on the bipartite network to enhance prediction accuracy.
Although impressive results have been achieved, there is still much room for improvement in our method. First, although more lncRNA-protein interactions are known than before, they are still too few for the proposed method to obtain fully adequate prediction results. Moreover, as the resource allocation of the bipartite network recommendation algorithm is based on known lncRNA-protein interactions, LPI-IBNRA is not suitable for predicting interactions of lncRNAs without any known interacting protein.