LPI-SKF: Predicting lncRNA-Protein Interactions Using Similarity Kernel Fusions

Long non-coding RNAs (lncRNAs) play an important role in several biological activities, including transcription, splicing, translation, and other cellular regulation processes. lncRNAs perform their biological functions by interacting with various proteins. Studies of lncRNA-protein interactions are therefore of great value for understanding lncRNA functional mechanisms. In this paper, we propose a novel model to predict potential lncRNA-protein interactions using the SKF (similarity kernel fusion) and LapRLS (Laplacian regularized least squares) algorithms. We named this method LPI-SKF. Various similarities of both lncRNAs and proteins are integrated in LPI-SKF. LPI-SKF can also be applied to predict potential interactions involving novel proteins or lncRNAs. We obtained an AUROC (area under the receiver operating characteristic curve) of 0.909 in 5-fold cross-validation, which outperforms other state-of-the-art methods. A total of 19 of the top 20 ranked interaction predictions were verified by existing data, which implies that LPI-SKF has great potential for discovering unknown lncRNA-protein interactions accurately. All data and code of this work can be downloaded from a GitHub repository (https://github.com/zyk2118216069/LPI-SKF).

Studying lncRNA-protein interactions is of great value in understanding the functional mechanisms of lncRNAs. However, wet-lab experiments for determining lncRNA-protein interactions are costly and time-consuming. Therefore, it is crucial to develop efficient and accurate computational methods to predict potential lncRNA-protein interactions.
Recently, a number of computational methods have been developed to predict novel lncRNA-protein interactions. Generally, these methods fall into two categories, the supervised binary classification-based methods and semi-supervised learning-based methods. The most significant difference between these two categories is whether the non-interacting lncRNA-protein pairs are regarded as negative samples or unlabeled samples.
In the binary classification methods, the non-interacting lncRNA-protein pairs are regarded as negative instances. Muppirala et al. encoded RNA-protein pairs using sequence information and trained the model RPISeq using SVM (support vector machine) and RF (random forest) classifiers (Muppirala et al., 2011). By encoding RNA-protein pairs in different ways, two more models were built with SVM or RF classifiers in the following years (Suresh et al., 2015;Xiao et al., 2017). Wang et al. applied a novel extended naive-Bayes classifier to sequence-based features to predict potential protein-RNA interactions (Wang et al., 2013). Ensemble learning has been widely applied to combine various machine learning algorithms for predicting lncRNA-protein interactions (Deng et al., 2018;Hu et al., 2018;Wekesa et al., 2019). Despite all these efforts, selecting reliable negative instances is still the most challenging problem in training binary classification-based models. Moreover, datasets for predicting lncRNA-protein interactions are highly imbalanced by nature, which can degrade prediction performance.
In the semi-supervised learning methods, non-interacting lncRNA-protein pairs are considered as unlabeled instances. Lu et al. introduced the matrix multiplication method to score each potential protein-RNA pair (Lu et al., 2013). Li et al. (2015) utilized the RWR (random walk with restart) algorithm on the lncRNA-protein-protein heterogeneous network to predict lncRNA-protein interactions. Several prediction models were established using the MF (matrix factorization) algorithm, which decomposes the adjacency matrix into two latent feature matrices (Liu et al., 2017;Ma et al., 2019;Zhang T. et al., 2020). Zhao et al. integrated the RWR and MF algorithms to construct a prediction model. Label propagation is another common recommendation algorithm, and two models were built on it (Zhang et al., 2018a;Zhu et al., 2019). Meanwhile, other machine learning algorithms were also adapted to the prediction of lncRNA-protein interactions, including feature projection ensemble learning (Zhang et al., 2018b), KATZ scoring schemes (Zhang et al., 2019), the kernel ridge regression algorithm (Shen et al., 2019), and the depth-first search algorithm (Zhang H. et al., 2020).
Although existing computational models have achieved good performance, some problems remain to be solved. With the development of high-throughput sequencing technology, a large number of novel lncRNAs have been discovered. In contrast to lncRNAs that were deposited in databases long ago, little is known about the proteins that interact with these newly identified lncRNAs. As a result, few existing models can infer potential interacting proteins for these lncRNAs (Zhang et al., 2018b;Zhang T. et al., 2020).
In this paper, we propose a new model, LPI-SKF, to predict lncRNA-protein interactions based on the similarity kernel fusion approach. Multiple similarities among lncRNAs and among proteins were first calculated. These similarities were then integrated to obtain a comprehensive similarity. Finally, the Laplacian regularized least squares framework was applied to build the predictive model. Five-fold cross-validation was used to estimate the performance of LPI-SKF in this work. LPI-SKF achieved an AUROC (area under the receiver operating characteristic curve) of 0.909 and an AUPR (area under the precision-recall curve) of 0.685, which indicates that the LPI-SKF method can identify unknown lncRNA-protein interactions accurately. Moreover, LPI-SKF can also be used to identify interacting partners for novel lncRNAs/proteins. A total of 19 of our 20 top-ranked lncRNA-protein interaction predictions were confirmed by existing data.

MATERIALS AND METHODS
In this work, we proposed an lncRNA-protein interaction prediction model, named LPI-SKF. This model can be summarized in four steps, which are shown in Figure 1. Firstly, we collected experimentally verified lncRNA-protein interactions in the NPInter V2.0 database and constructed the heterogeneous network. Secondly, based on the assumption that similar lncRNAs tend to interact with similar proteins and vice versa, we calculated three different pairwise similarities for lncRNAs, and three different pairwise similarities for proteins, respectively. Thirdly, to synthesize the similarity information in different aspects and to also reduce noise, the SKF approach was utilized to integrate the lncRNA similarities and protein similarities. Finally, considering the network structure information, we combined the Laplacian regularization and the least squares method to build our prediction model.

Dataset Curations
NPInter is an integrated database of ncRNA interactions, which includes a vast number of interactions between ncRNAs and biomolecules uncovered by various high-throughput sequencing approaches (Yuan et al., 2014). lncRNA-protein interactions collected in NPInter have been utilized in numerous related studies. For a better comparison, we collected lncRNA-protein interactions from the NPInter V2.0 database following a previous study (Zhang et al., 2018a). Ultimately, 4158 lncRNA-protein interactions involving 990 lncRNAs and 27 proteins were obtained. Afterward, the sequences and expression profiles of lncRNAs and the sequences of proteins were downloaded from the NONCODE database and the SUPERFAMILY database, respectively (Fang et al., 2018;Pandurangan et al., 2019).

FIGURE 1 | The flowchart of the entire work. Known lncRNA-protein interactions were downloaded from the NPInter V2.0 database to form a heterogeneous network. Three different similarities of both lncRNAs and proteins were then calculated. Afterward, the similarity kernel fusion (SKF) approach was utilized to integrate these similarities. Finally, the Laplacian regularized least squares (LapRLS) framework was used to build the prediction model.

Similarities for lncRNAs and Proteins
This work is based on the assumption that similar lncRNAs tend to interact with similar proteins and vice versa. Hence, defining appropriate similarity is of great importance in predicting lncRNA-protein interactions. We employed three different pairwise similarities of lncRNAs, including the interaction similarity, the expression similarity, and the sequence similarity. We also applied three different similarities of proteins, including the interaction similarity, the statistical feature similarity, and the sequence similarity. With all these similarity definitions, we proposed to use the similarity kernel fusion strategy to establish a universal and comprehensive similarity kernel matrix to predict potential lncRNA-protein interactions.

The Interaction Profile Similarities
For the convenience of the reader, we first define the adjacency matrix between lncRNAs and proteins. Let $l_i$ ($i = 1, 2, \ldots, n$) be the $i$-th lncRNA, and $p_j$ ($j = 1, 2, \ldots, m$) the $j$-th protein. The $n \times m$ adjacency matrix $A$ can be defined as follows:
$$A(i, j) = \begin{cases} 1, & \text{if } l_i \text{ interacts with } p_j \\ 0, & \text{otherwise} \end{cases}$$
The interaction profile of the $i$-th lncRNA is the $i$-th row of matrix $A$, denoted $A_{i*}$, while the interaction profile of the $j$-th protein is the $j$-th column of matrix $A$, denoted $A_{*j}$. The interaction similarity between $l_u$ and $l_v$, denoted $s_{l,0}(u, v)$, is defined on the interaction profiles $A_{u*}$ and $A_{v*}$ using the 2-norm $\|\cdot\|$. Similarly, the interaction similarity between $p_u$ and $p_v$, denoted $s_{p,0}(u, v)$, is defined on the interaction profiles $A_{*u}$ and $A_{*v}$.
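As an illustration of a profile-based similarity of this kind, the following minimal sketch assumes the Gaussian interaction profile kernel, a common choice for interaction profiles. The kernel form, the bandwidth rule, and the function name `gip_kernel` are assumptions made for this example and are not taken from the text above.

```python
import numpy as np

def gip_kernel(profiles: np.ndarray) -> np.ndarray:
    """Gaussian interaction profile kernel over the rows of `profiles`.

    ASSUMPTION: the form exp(-gamma * ||a_u - a_v||^2) with
    gamma = 1 / mean(||a_i||^2) is a common convention, used here
    only for illustration.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = 1.0 / mean_sq_norm
    # Squared Euclidean distance between every pair of rows.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dist)

# Toy adjacency matrix: 3 lncRNAs (rows) x 2 proteins (columns).
A = np.array([[1, 0],
              [1, 0],
              [0, 1]])
lnc_sim = gip_kernel(A)      # lncRNA-lncRNA similarity from the rows of A
prot_sim = gip_kernel(A.T)   # protein-protein similarity from the columns of A
```

Under this assumed kernel, two lncRNAs with identical interaction profiles receive a similarity of 1, and the similarity decays as their profiles diverge.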

lncRNA Expression Profile Similarity
The expression profiles of lncRNAs in 24 different tissues can be downloaded from the NONCODE database. The expression profile of the $i$-th lncRNA is denoted $e_i$. The expression profile similarity $s_{l,1}(u, v)$ is defined in terms of $\rho_{u,v}$, the Pearson's correlation coefficient between $e_u$ and $e_v$, which can be calculated as follows:
$$\rho_{u,v} = \frac{\mathrm{cov}(e_u, e_v)}{\sigma(e_u)\,\sigma(e_v)}$$
where $\mathrm{cov}(\cdot)$ is the covariance and $\sigma(\cdot)$ is the standard deviation operator.
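The pairwise Pearson's correlation coefficients can be obtained for all lncRNAs at once. The sketch below uses a small hypothetical expression matrix `E` (the values are invented for illustration) and computes only the correlation matrix; the mapping from the correlation to the final similarity is not reproduced here.

```python
import numpy as np

# Hypothetical expression matrix: 3 lncRNAs (rows) x 4 tissues (columns).
E = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [4.0, 3.0, 2.0, 1.0]])

# np.corrcoef treats each row as one variable, so rho[u, v] is the
# Pearson correlation between the profiles of lncRNA u and lncRNA v.
rho = np.corrcoef(E)
```

A profile that is constant across tissues has zero standard deviation, so its correlations are undefined (NaN) and would need separate handling.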

Protein Pairwise Sequence Alignment Similarity
BLAST+ is a local alignment search tool, which was utilized to calculate the alignment scores of proteins in this work (Camacho et al., 2009). We used BLAST+ to align $p_u$ against $p_v$. The bit score of this alignment is denoted $b_{u,v}$. The pairwise sequence alignment similarity $s_{p,1}(u, v)$ is defined from this bit score. It is worth noting that $s_{p,1}$ is not symmetric; in general, $s_{p,1}(u, v) \neq s_{p,1}(v, u)$.

Sequence Statistical Feature Similarity
RNA is composed of four types of ribonucleotides (A, G, C, U). Following previous work, we calculated the percentages of these four nucleotides and of the 16 dinucleotides (AA, AG, AC, AU, . . . ) to represent each lncRNA as a 20-D vector (Zhang et al., 2018a). For proteins, we employed CTD (composition-transition-distribution) features (Li et al., 2006). The twenty amino acids were divided into three groups according to their hydrophobicity, normalized van der Waals volume, polarity, and polarizability. Each protein was represented as a 504-D vector. Linear neighborhood similarity (LNS), which is based on the hypothesis that each vector can be represented by its k-nearest neighbors, was adopted to compute the similarity between statistical features (Wang and Zhang, 2008;Deng et al., 2020) for lncRNAs and proteins, respectively. The sequence statistical feature similarity between $l_u$ and $l_v$ is denoted $s_{l,2}(u, v)$, while that between $p_u$ and $p_v$ is denoted $s_{p,2}(u, v)$.
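The 20-D lncRNA representation can be sketched as follows. This is a minimal example; counting dinucleotides with an overlapping sliding window and the function name `composition_features` are assumptions of this sketch.

```python
from itertools import product

NUCLEOTIDES = "AGCU"
# Ordered as AA, AG, AC, AU, GA, ... (16 dinucleotides in total).
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]

def composition_features(seq: str) -> list[float]:
    """Return 4 mononucleotide + 16 dinucleotide frequencies (20 values)."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    n = len(seq)
    mono = [seq.count(b) / n if n else 0.0 for b in NUCLEOTIDES]
    # ASSUMPTION: dinucleotides are counted with an overlapping window of width 2.
    pairs = [seq[i:i + 2] for i in range(n - 1)]
    total = len(pairs)
    di = [pairs.count(d) / total if total else 0.0 for d in DINUCLEOTIDES]
    return mono + di

features = composition_features("AUGGCUAA")
```

The first four entries are the fractions of A, G, C, and U, and the remaining sixteen are the dinucleotide fractions, so each group sums to 1 for a sequence made only of these four letters.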

Similarity Kernel Fusion
Three different lncRNA similarities ($s_{l,q}$, $q = 0, 1, 2$) and three different protein similarities ($s_{p,q}$, $q = 0, 1, 2$) were calculated in the above sections. The similarity kernel fusion (SKF) algorithm was then utilized to integrate these similarities and obtain a more comprehensive similarity. We take the lncRNA similarities as an example. Firstly, we normalized the three lncRNA similarities ($s_{l,q}$, $q = 0, 1, 2$); $\theta_{l,q}$ denotes the normalized similarity corresponding to $s_{l,q}$, and these values are collected into a normalized similarity matrix for each $q$. Secondly, we created a neighbor-constrained normalization for each lncRNA similarity. Given $l_u$ and $s_{l,q}$, we collected the $k$ lncRNAs most similar to $l_u$ as a set $N_{l,q}(u, k)$. The neighborhood-constrained normalization of $s_{l,q}$ relies on the indicator function
$$I_{l,q,k}(u, v) = \begin{cases} 1, & l_v \in N_{l,q}(u, k) \\ 0, & l_v \notin N_{l,q}(u, k) \end{cases}$$
and its values are likewise collected into a matrix for each $q$. The three similarity matrices were then integrated using an iterative process, in which $\alpha$ is a weight coefficient between 0 and 1, $T$ is the transpose operator in matrix algebra, $\lambda$ is the iterative round parameter, and the matrix for the $r$-th similarity at round 0 equals its non-iterated counterpart.
After $z$ rounds of the iterative process, we obtained the final integrated similarity matrix $\Theta_l$. Although more information is retained by the similarity fusion, more noise is introduced at the same time. By considering the $k$ most similar lncRNAs of each lncRNA, we defined an indicator function. The final adjusted lncRNA similarity $S_{l,k}$ is defined using this indicator function together with $\theta_l(u, v)$, where $\theta_l(u, v)$ is the element in the $u$-th row and the $v$-th column of the matrix $\Theta_l$. By applying the same procedure, Eqs. (9)-(18), to the protein similarities, we obtained the adjusted protein similarity matrix $S_{p,k}$. The value of $k$ used for the protein similarities is not necessarily the same as that used for the lncRNAs.
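The overall flow of such a fusion (global normalization, neighbor-constrained normalization, iterative mixing, and a final neighbor-based adjustment) can be sketched as below. This is only an illustration under stated assumptions: the row normalization, the update rule inside the loop, the mutual-neighbor weights (1, 0.5, 0), and all function names are choices made for this example, and they may differ from the equations of the original method.

```python
import numpy as np

def row_normalize(S):
    """Scale each row to sum to 1 (rows that sum to 0 are left as zeros)."""
    sums = S.sum(axis=1, keepdims=True)
    return np.divide(S, sums, out=np.zeros_like(S), where=sums > 0)

def knn_mask(S, k):
    """mask[u, v] = 1 if v is among the k items most similar to u, else 0."""
    order = np.argsort(-S, axis=1)[:, :k]
    mask = np.zeros_like(S)
    np.put_along_axis(mask, order, 1.0, axis=1)
    return mask

def fuse_kernels(kernels, k, alpha, z):
    """Iteratively fuse at least two similarity matrices (ASSUMED update rule)."""
    kernels = [np.asarray(S, dtype=float) for S in kernels]
    P0 = [row_normalize(S) for S in kernels]                   # global normalization
    F = [row_normalize(S * knn_mask(S, k)) for S in kernels]   # neighbor-constrained
    P = list(P0)
    m = len(kernels)
    for _ in range(z):
        P_next = []
        for q in range(m):
            # Average of the other kernels, current round and round 0.
            others = sum(P[r] for r in range(m) if r != q) / (m - 1)
            others0 = sum(P0[r] for r in range(m) if r != q) / (m - 1)
            P_next.append(alpha * F[q] @ others @ F[q].T + (1 - alpha) * others0)
        P = P_next
    fused = sum(P) / m
    # ASSUMPTION: weight 1 for mutual neighbors, 0.5 for one-sided, 0 otherwise.
    mask = knn_mask(fused, k)
    weight = (mask + mask.T) / 2.0
    return weight * fused

# Hypothetical toy input: three similarity matrices over 6 items.
rng = np.random.default_rng(0)
X = rng.random((6, 6))
S1 = (X + X.T) / 2.0
fused = fuse_kernels([S1, S1 ** 2, np.sqrt(S1)], k=3, alpha=0.5, z=5)
```

In this sketch, `alpha` controls how much each round relies on neighbor-propagated information versus the initial normalized similarities, and `k` controls how aggressively weak similarities are suppressed.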

Laplacian Regularized Least Squares
In this work, Laplacian regularized least squares (LapRLS) was utilized to construct the prediction model. Since we obtained both the lncRNA similarity matrix and the protein similarity matrix, we could estimate the lncRNA-protein interactions from either of them. Without loss of generality, we take the lncRNA similarity matrix as an example.
Let $L_l$ be the normalized Laplacian matrix of the lncRNA similarity, which is defined from $S_{l,k}$ and $D$, where $D$ is the diagonal matrix of the matrix $S_{l,k}$. We then estimated the adjacency matrix by minimizing an objective function in which $A$ is the adjacency matrix, $F_l$ is the prediction matrix from lncRNA similarities, $\beta_l$ is a weighting parameter, and $\|\cdot\|_F$ is the Frobenius norm operator. We obtained the prediction matrix from lncRNA similarities by setting the derivative of the objective function to zero. Similarly, we applied Eqs. (19)-(21) to the protein similarities to obtain the prediction matrix $F_p$ from protein similarities. Finally, we integrated the two prediction matrices into our final prediction matrix using a weighting coefficient $\delta \in (0, 1)$.
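A compact sketch of this step is given below. It assumes a widely used LapRLS closed form, $F = S (S + \beta L S)^{-1} A$ with $L = D^{-1/2}(D - S)D^{-1/2}$ and $D$ the diagonal matrix of row sums, and it assumes that $\delta$ weights the lncRNA-side prediction. These formulas, the toy matrices, and the function names are assumptions made for illustration rather than the exact equations of this work.

```python
import numpy as np

def normalized_laplacian(S):
    """L = D^{-1/2} (D - S) D^{-1/2}, with D the diagonal matrix of row sums."""
    degree = S.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    positive = degree > 0
    inv_sqrt[positive] = 1.0 / np.sqrt(degree[positive])
    return np.diag(inv_sqrt) @ (np.diag(degree) - S) @ np.diag(inv_sqrt)

def laprls_predict(S, Y, beta):
    """ASSUMED closed form: F = S (S + beta * L S)^{-1} Y."""
    L = normalized_laplacian(S)
    # Solve a linear system instead of forming the inverse explicitly.
    return S @ np.linalg.solve(S + beta * L @ S, Y)

def combine(F_l, F_p, delta):
    """Weighted average of lncRNA-side (n x m) and protein-side (m x n) scores."""
    return delta * F_l + (1.0 - delta) * F_p.T

# Hypothetical toy data: 3 lncRNAs, 2 proteins.
S_l = np.array([[1.0, 0.5, 0.1],
                [0.5, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
S_p = np.array([[1.0, 0.3],
                [0.3, 1.0]])
A = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])

F_l = laprls_predict(S_l, A, beta=2.0 ** -3)    # prediction from lncRNA similarities
F_p = laprls_predict(S_p, A.T, beta=2.0 ** -3)  # prediction from protein similarities
F = combine(F_l, F_p, delta=0.8)
```

With `beta = 0` the assumed closed form returns the adjacency matrix unchanged, so the regularization weight determines how strongly the scores are smoothed over the similarity graph.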

Performance Estimation Protocol
The prediction performance of the LPI-SKF method was estimated using 5-fold cross-validation. We applied the AUROC and the AUPR as the main performance indicators. We also applied three performance statistics, including precision (pre), recall (rec), and the F1-score (f), which can be calculated as follows:
$$\mathrm{pre} = \frac{TP}{TP + FP}, \qquad \mathrm{rec} = \frac{TP}{TP + FN}, \qquad f = \frac{2 \times \mathrm{pre} \times \mathrm{rec}}{\mathrm{pre} + \mathrm{rec}}$$
where TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives, respectively. For predicting potential lncRNA-protein interactions, all interactions in the adjacency matrix were divided randomly into five parts. Four parts were utilized as the training dataset, while the remaining part was used as the testing dataset. Through five rounds of cross-validation, we obtained the interaction score of every interaction.
As for predicting potential proteins for new lncRNAs, all lncRNAs were split into five groups. Four groups were treated as the training set and the remaining one as the testing set. The same protocol was applied to the prediction for new proteins, with the proteins split into five groups.
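The two splitting protocols can be sketched as follows. This is a minimal illustration; the function names, the fixed seed, and the choice to hide held-out entries by setting them to zero in the training matrix are assumptions of this example.

```python
import numpy as np

def interaction_folds(A, n_folds=5, seed=0):
    """Yield (train_matrix, test_pairs) with known interactions split into folds."""
    rng = np.random.default_rng(seed)
    pairs = np.argwhere(A == 1)
    rng.shuffle(pairs)  # shuffles the (row, column) pairs along axis 0
    for fold in np.array_split(pairs, n_folds):
        train = A.copy()
        train[fold[:, 0], fold[:, 1]] = 0  # hide the held-out interactions
        yield train, fold

def new_lncrna_folds(A, n_folds=5, seed=0):
    """Yield (train_matrix, test_rows) with whole lncRNA rows held out."""
    rng = np.random.default_rng(seed)
    rows = rng.permutation(A.shape[0])
    for fold in np.array_split(rows, n_folds):
        train = A.copy()
        train[fold, :] = 0  # a held-out lncRNA has no known partners
        yield train, fold

# Hypothetical toy adjacency matrix: 10 lncRNAs x 4 proteins.
A = (np.arange(40).reshape(10, 4) % 3 == 0).astype(int)
pair_folds = list(interaction_folds(A))
row_folds = list(new_lncrna_folds(A))
```

Passing `A.T` to `new_lncrna_folds` gives the corresponding split for new proteins, since the held-out units are then the columns of `A`.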

Parameter Calibrations
The primary parts of LPI-SKF are SKF and LapRLS. There are three parameters in SKF: the number of iterations $z$, the number of neighbors $k$, and the weighting coefficient $\alpha$. Since SKF was adopted to integrate the lncRNA similarities and the protein similarities separately, we calculated the AUC from lncRNA similarities and from protein similarities separately to find the optimal $\alpha$. Since $\alpha$ lies between 0 and 1, we varied $\alpha$ from 0.1 to 0.9 with a step of 0.1 for computational convenience. The prediction performance was estimated from lncRNAs and proteins separately. As shown in Table 1, the optimal $\alpha$ was 0.9 for lncRNAs and 0.8 for proteins. Considering the numbers of lncRNAs and proteins in our work (990 lncRNAs and 27 proteins), the number of neighbors $k$ for lncRNAs was selected from {33, 99, 150, 300, 600, 900}, and the number of neighbors $k$ for proteins from {3, 6, 9, 15, 20, 25}. To limit computation time while covering a broad range, the number of iterations $z$ was varied from 5 to 30 with a step of 5. As shown in Figure 2, the optimal number of neighbors $k$ was 99 for lncRNAs and 3 for proteins. The optimal number of iterations $z$ was 5 for both lncRNAs and proteins.
The weighting parameters $\beta_l$ and $\beta_p$ are the most important regularization terms in LapRLS and influence the performance directly. In this work, we set $\beta_l$ equal to $\beta_p$ for convenience. To obtain the optimal performance, we searched $\beta_l$ and $\beta_p$ from $2^{-10}$ to $2^{-1}$, following a previous work (Jiang et al., 2018). Since there are many more lncRNAs than proteins, we varied $\delta$ from 0.1 to 0.9 with a step of 0.1. As shown in Figure 3, we chose $\beta_l = \beta_p = 2^{-3}$ and $\delta = 0.8$.

Comparison With Single Similarity
Different types of similarities for both lncRNAs and proteins have been utilized in this work. To demonstrate the benefit of similarity integration, we tested the prediction performance of every single similarity. The results are illustrated in Figure 4. Considering the different numbers of lncRNAs and proteins, the performance using lncRNA similarities was better than that using protein similarities.

Comparison With Other Fusion Methods
Similarity kernel fusion (SKF) was applied in our study to integrate different similarities, which can combine similarity information from different aspects and reduce noise. In this part, we compared SKF with two other similarity fusion methods, similarity network fusion (SNF) and average kernel fusion (AVG). The results, shown in Figure 5, indicate that SKF outperformed the other two methods.

Prediction for Uncovered Interactions
In our study, we compared LPI-SKF with two popular algorithms, RWR (random walk with restart) and CF (collaborative filtering), and three other methods, LPIHN (Li et al., 2015), LPBNI (Ge et al., 2016), and LPI-IBNRA (Xie et al., 2019). We built six prediction models based on the same benchmarking dataset. Subsequently, 5-fold cross-validation (5-fold CV) was applied for the comparison. The result is shown in Figure 6. Meanwhile, we selected the threshold value of each of the six models based on the optimal F1-score. The recall, precision, and F1-score under this threshold were then computed to compare the models in other aspects. For a better comparison, the results of the six models are collected in Table 2. From the table, we can see that both the AUC and the AUPR of LPI-SKF were higher than those of the other models. Specifically, LPI-SKF received an AUC of 0.909, which increased by 10.05, 8.73, 6.69, 8.47, and 4.72%, respectively, compared with RWR's 0.826, CF's 0.836, LPBNI's 0.852, LPIHN's 0.866, and LPI-IBNRA's 0.864. As for the other important index, the AUPR, LPI-SKF obtained an AUPR of 0.685, which was higher than that of all other models: RWR's 0.581, CF's 0.542, LPBNI's 0.625, LPIHN's 0.548, and LPI-IBNRA's 0.684. Meanwhile, the best F1-score of LPI-SKF was also higher than that of the other models. All these evaluation indexes demonstrate that LPI-SKF outperformed the other state-of-the-art methods.
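Selecting a threshold by the optimal F1-score can be sketched as follows. This is a minimal illustration; using every distinct score as a candidate threshold, breaking ties toward the larger threshold, and the function names are assumptions of this example, and the labels and scores are invented.

```python
import numpy as np

def precision_recall_f1(labels, scores, threshold):
    """Precision, recall, and F1 when pairs with score >= threshold are positive."""
    labels = np.asarray(labels).astype(bool)
    predicted = np.asarray(scores) >= threshold
    tp = np.sum(predicted & labels)
    fp = np.sum(predicted & ~labels)
    fn = np.sum(~predicted & labels)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1

def best_f1_threshold(labels, scores):
    """Return (threshold, F1) maximizing F1 over the distinct score values."""
    candidates = np.unique(scores)
    results = [(precision_recall_f1(labels, scores, t)[2], t) for t in candidates]
    best_f1, best_t = max(results)  # ties go to the larger threshold
    return best_t, best_f1

# Hypothetical labels and predicted scores for four lncRNA-protein pairs.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
threshold, f1 = best_f1_threshold(labels, scores)
```

Once the threshold is fixed in this way, precision and recall at that threshold can be reported alongside the F1-score for each model.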

Prediction for Novel lncRNAs/Proteins
Because our model can also predict potential interacting lncRNAs/proteins for novel proteins/lncRNAs, we made a comparison for the prediction of new lncRNAs/proteins as well. As few methods can predict interacting lncRNAs/proteins for novel proteins/lncRNAs, SFPEL-LPI (Zhang et al., 2018b) was selected for the comparison. We evaluated the performance of the two models in the prediction of new lncRNAs and of new proteins, respectively. The result is shown in Figure 7. For a better comparison, the AUC, AUPR, recall, precision, and F1-score of the two models are shown in Table 3. LPI-SKF obtained an AUC of 0.844 and 0.835 in the prediction of new lncRNAs and proteins, respectively. Compared with SFPEL-LPI, LPI-SKF achieved an AUC improvement of 0.016 and 0.229 in the prediction of new lncRNAs and new proteins, respectively.

Case Studies
To evaluate the prediction performance of LPI-SKF more closely, we checked the 20 top-ranked interactions of our model against the NPInter V2.0 database. The result is shown in Table 4. Nineteen of these interactions have been verified in the NPInter V2.0 database, which demonstrates that LPI-SKF performs reliably in actual interaction prediction. Meanwhile, the numbers of correctly predicted interactions among the 50, 100, and 500 top-ranked interactions were 47, 92, and 458, respectively.
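Counting verified interactions among the top-ranked predictions can be sketched as below. The function name `top_k_hits` and the toy matrices are hypothetical and serve only to illustrate the ranking step.

```python
import numpy as np

def top_k_hits(scores, known, k):
    """Count how many of the k highest-scoring pairs are known interactions."""
    # Rank all (lncRNA, protein) pairs by descending score.
    flat_order = np.argsort(-scores, axis=None)[:k]
    rows, cols = np.unravel_index(flat_order, scores.shape)
    return int(known[rows, cols].sum())

# Hypothetical prediction scores and known interactions for a 2 x 2 example.
scores = np.array([[0.9, 0.1],
                   [0.8, 0.7]])
known = np.array([[1, 0],
                  [0, 1]])
hits_top2 = top_k_hits(scores, known, k=2)
hits_top3 = top_k_hits(scores, known, k=3)
```

Applied to a full prediction matrix, the same function could be called with `k` set to 20, 50, 100, or 500 to reproduce this kind of tally.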

CONCLUSION
This paper proposed a novel model, named LPI-SKF (lncRNA-protein interaction prediction based on similarity kernel fusion), to predict potential lncRNA-protein interactions. Several similarities of both lncRNAs and proteins were integrated by the SKF method to obtain a comprehensive similarity matrix. The LapRLS framework was then applied to build the prediction model. LPI-SKF obtained an AUC of 0.909 and an AUPR of 0.685 in the 5-fold CV framework, which demonstrates that LPI-SKF can infer uncovered lncRNA-protein interactions accurately.
To evaluate the performance of LPI-SKF, several state-of-the-art methods were compared with LPI-SKF on the same benchmarking dataset. LPI-SKF received an AUC of 0.909 and an AUPR of 0.685 in the 5-fold cross-validation framework, both higher than those of the other models. More importantly, LPI-SKF can also predict potential interacting proteins/lncRNAs for novel lncRNAs/proteins. For a better comparison, we also compared LPI-SKF with another model, SFPEL, on the same database and with the same random seed. The result showed that LPI-SKF performed much better in the prediction for both new lncRNAs and new proteins.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found at: https://github.com/zyk2118216069/LPI-SKF.

AUTHOR CONTRIBUTIONS
Y-KZ curated the dataset, designed and implemented the algorithm, performed the experiments, and collected the results. JH, Z-AS, and W-YZ helped in collecting the data and calibrating the parameters of the algorithm. P-FD directed the whole study, conceptualized the algorithm, analyzed the results, and wrote the manuscript. All authors contributed to the article and approved the submitted version.