Predicting lncRNA–Protein Interactions With miRNAs as Mediators in a Heterogeneous Network Model

Long non-coding RNAs (lncRNAs) play important roles in various biological processes, in which lncRNA–protein interactions are usually involved. Identifying lncRNA–protein interactions is therefore of great significance for understanding the molecular functions of lncRNAs. Since experiments to identify lncRNA–protein interactions are costly and time-consuming, computational methods have been developed as alternative approaches. However, existing lncRNA–protein interaction predictors usually require prior knowledge of lncRNA–protein interactions with experimental evidence, and their performance is limited by the number of known lncRNA–protein interactions. In this paper, we explored a novel way to predict lncRNA–protein interactions without direct prior knowledge. MiRNAs were chosen as mediators to estimate potential interactions between lncRNAs and proteins. When validated against known lncRNA–protein interactions, our method achieved an AUROC (area under the receiver operating characteristic curve) of 0.821, which is comparable to state-of-the-art methods. Moreover, our method achieved an improved AUROC of 0.852 when the training dataset was further expanded. We believe that our method can be a useful supplement to existing methods, as it provides an alternative way to estimate lncRNA–protein interactions in a heterogeneous network without direct prior knowledge. All data and code of this work can be downloaded from GitHub (https://github.com/zyk2118216069/LncRNA-protein-interactions-prediction).

Long non-coding RNAs are ncRNAs longer than 200 nt (Kapranov et al., 2007). Experiments show that lncRNA-protein interactions play important roles in many biological processes, such as splicing, polyadenylation, and translation (Singh, 2002; Lukong et al., 2008; Kishore et al., 2010; Licatalosi and Darnell, 2010). Therefore, studying interactions between lncRNAs and proteins helps us to understand a wide variety of biological processes.
Although we can now obtain RNA-protein interactions (RPIs) through large-scale experiments such as RNAcompete (Ray et al., 2009), RIP-Chip (Keene et al., 2006), HITS-CLIP (Licatalosi et al., 2008), and PAR-CLIP (Hafner et al., 2010), all these experiments are costly and time-consuming. Therefore, computational prediction has been recognized as an efficient alternative approach. Muppirala et al. (2011) proposed the RPISeq method for predicting RNA-protein interactions using only sequence information. Wang et al. (2013) extracted sequence-based features to represent each protein-RNA pair and used a naive Bayes classifier to predict protein-RNA interactions. Lu et al. (2013) introduced a new method named lncPro, which scored each RNA-protein pair by encoding RNA and protein sequences into numerical vectors. Suresh et al. (2015) presented an SVM-based method, named RPI-Pred, to predict protein-RNA interaction pairs based on their sequences and structures. Li et al. (2015) developed a heterogeneous network model (LPIHN) and a random walk with restart algorithm to predict novel lncRNA-protein interactions. Ge et al. (2016) constructed an lncRNA-protein bipartite network and scored candidate proteins for each lncRNA based on the bipartite network projection algorithm. Yang et al. (2016) constructed another lncRNA-protein bipartite network, where the HeteSim algorithm was employed to evaluate the relevance between lncRNAs and proteins. Zheng et al. (2017) applied the HeteSim algorithm on the fusion of multiple protein-protein similarity networks to predict lncRNA-protein interactions. Hu et al. (2017) presented transformation-based semi-supervised link prediction (LPI-ETSLP) to predict lncRNA-protein interactions. Xiao et al. (2017) proposed a computational method named PLPIHS for predicting lncRNA-protein interactions using HeteSim scores. Hu et al. (2018) presented a model named HLPI-Ensemble, which integrates three mainstream machine learning algorithms for predicting human lncRNA-protein interactions. Zhang et al. (2018b) combined multiple similarities and features with a feature projection ensemble learning framework to predict lncRNA-protein interactions. Zhang et al. (2018a) proposed a linear neighborhood propagation method (LPLNP) to calculate the linear neighborhood similarity of lncRNA-protein interactions. Zhang et al. (2019c) proposed the KATZLGO method to predict lncRNA-protein interactions based on the KATZ measure, which utilizes the information of all paths between a pair of nodes.
All existing methods rely on known lncRNA-protein interactions to construct the predictor. However, the number of experimentally verified lncRNA-protein interactions is limited, which affects the prediction performances of all existing methods. To expand the spectrum of predictable lncRNA-protein interactions, we took miRNAs as intermediates in predicting lncRNA-protein interactions.
MiRNAs are short RNA molecules with a length of 19 to 25 nucleotides (Lu and Rothenberg, 2018). Some miRNAs can regulate both lncRNAs and proteins. For example, PTEN (phosphatase and tensin homolog) is a tumor suppressor gene that is critical for maintaining cellular homeostasis (Poliseno et al., 2010). miR-21 regulates the translation of PTEN, as well as the expression of PTENpg1, an lncRNA transcribed from the PTEN pseudogene. Meanwhile, the PTENpg1 alpha isoform affects the transcription process of PTEN by competing transcription factors (Johnsson et al., 2013). We hypothesized that this triangular regulatory pattern may be common in gene regulation. To validate this assumption, we collected lncRNA-miRNA interactions and protein-miRNA interactions from the RAID v2.0 database. We found that lncRNA-protein interactions are significantly enriched among the lncRNAs and proteins that share a common set of interacting miRNAs (chi-square test, $p < 10^{-16}$).
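As a rough illustration of the kind of enrichment test mentioned above, the following Python sketch runs a Pearson chi-square test on a 2×2 contingency table (rows: the lncRNA-protein pair shares a miRNA or not; columns: the pair has a known interaction or not). The counts are invented toy numbers, not values from RAID v2.0, the function name `chi_square_2x2` is hypothetical, and whether the original analysis used a continuity correction is not stated, so none is applied here.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test with 1 degree of freedom, no continuity correction.

    Table layout (rows: shares a miRNA yes/no; columns: interacts yes/no):
        a  b
        c  d
    Returns (chi-square statistic, p-value).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column must have a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 d.o.f.: P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy counts, assumed purely for illustration.
chi2, p = chi_square_2x2(a=300, b=700, c=100, d=900)
print(f"chi2 = {chi2:.2f}, p = {p:.3g}")
```

A small p-value in such a table would indicate that sharing a miRNA and having a known interaction are not independent, which is the property the mediator idea relies on.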
In light of this observation, miRNAs were taken as mediators to predict novel lncRNA-protein interactions in this work. Both lncRNA-miRNA interactions and miRNA-protein interactions were used as the basis for predicting lncRNA-protein interactions. To improve the prediction performance, the similarity of lncRNAs and of proteins was calculated in several ways, based on the assumption that similar lncRNAs or proteins tend to have similar interactions. Our method provides a way to explore novel lncRNA-protein interactions without prior knowledge of direct lncRNA-protein interactions. Since existing methods always require direct lncRNA-protein interactions as training data, our method may provide a useful supplement to state-of-the-art methods.

Dataset Curation
Biomolecule interactions have become a hot research topic in computational biology. RAID v2.0 is a large database for biomolecule interaction information, which contains more than 5.27 million RNA-associated interactions, including over 4 million RNA-RNA interactions and 1.2 million RNA-protein interactions, involving nearly 130,000 genes across 60 species (Yi et al., 2017). We downloaded the protein-miRNA interactions and lncRNA-miRNA interactions as our training dataset from this database. LncRNA-protein interactions were also obtained as our independent testing dataset simultaneously.
We downloaded 2,862 lncRNA-miRNA interactions and 2,521 protein-miRNA interactions, all experimentally verified, from the RAID v2.0 database (Yi et al., 2017). To ensure that each lncRNA is linked to at least one protein via a miRNA, and vice versa, the miRNAs common to both sets of interactions were extracted. Altogether 360 miRNAs were included in our dataset.
Subsequently, the lncRNA-miRNA interactions and the protein-miRNA interactions were selected according to the common interacting miRNAs. We kept 1,356 lncRNA-miRNA interactions and 1,156 protein-miRNA interactions in our dataset. These interactions are among 331 lncRNAs, 360 miRNAs, and 103 proteins. The sequences of lncRNAs and proteins were obtained from the NCBI Gene database (Brown et al., 2015) and the UniProt database (The UniProt Consortium, 2017), respectively. For lncRNAs that cannot be found in the NCBI Gene database, the sequence was retrieved from the Ensembl database (Hunt et al., 2018).
In order to evaluate the performance of our predictive model, we obtained experimentally verified lncRNA-protein direct interactions from the RAID v2.0 database according to the lncRNAs and proteins in our dataset. Subsequently 1,925 lncRNA-protein interactions were chosen as our independent testing dataset, which are formed by 268 lncRNAs and 58 proteins. The interactions from the RAID database are listed in the Supplementary Materials (Table S1, Table S2 and Table S3).
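The miRNA-based filtering described in this section can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the function name `restrict_to_common_mirnas`, the edge-list layout, and the identifiers are hypothetical placeholders, and the actual RAID v2.0 file format is not modeled.

```python
def restrict_to_common_mirnas(lnc_mir, prot_mir):
    """Keep only the interactions whose miRNA occurs in both edge lists.

    lnc_mir:  list of (lncRNA, miRNA) pairs
    prot_mir: list of (protein, miRNA) pairs
    """
    common = {m for _, m in lnc_mir} & {m for _, m in prot_mir}
    kept_lnc = [(l, m) for l, m in lnc_mir if m in common]
    kept_prot = [(p, m) for p, m in prot_mir if m in common]
    return common, kept_lnc, kept_prot


# Toy edge lists; all identifiers are placeholders.
lnc_mir = [("lncA", "miR-1"), ("lncA", "miR-2"), ("lncB", "miR-3")]
prot_mir = [("protX", "miR-1"), ("protY", "miR-4"), ("protY", "miR-2")]

common, kept_lnc, kept_prot = restrict_to_common_mirnas(lnc_mir, prot_mir)
print(sorted(common))
print(kept_lnc)
print(kept_prot)
```

After this step, every remaining lncRNA reaches at least one protein through a shared miRNA, and every remaining protein reaches at least one lncRNA.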

Similarity Measures
Previous studies (Gong et al., 2019; Zhang et al., 2019a; Zhang et al., 2019b) have demonstrated the usefulness of similarities for network models. For convenience, let $L$ be the set of lncRNAs, $M$ the set of miRNAs, and $P$ the set of proteins, i.e., $L = \{l_1, l_2, \ldots, l_x\}$, $M = \{m_1, m_2, \ldots, m_y\}$, and $P = \{p_1, p_2, \ldots, p_z\}$, where $x$ denotes the number of different lncRNAs, $y$ the number of common miRNAs, and $z$ the number of different proteins.
The lncRNA-miRNA interaction network can be represented using a bipartite graph $G_1$, as follows:
$$G_1 = (L, M, E_1),$$
where $E_1$ is the set of edges in this bipartite graph, and $L$ and $M$ are as defined above. Each edge in $E_1$ represents an interaction between one lncRNA and one miRNA. A part of the lncRNA-miRNA interaction network is illustrated in Figure 1.
Similarly, we used another bipartite graph $G_2$ to represent the protein-miRNA interaction network, as follows:
$$G_2 = (P, M, E_2),$$
where $E_2$ is the edge set of the protein-miRNA interaction network, and $P$ and $M$ are as defined above. Each protein-miRNA interaction corresponds to an edge in $E_2$. A part of the protein-miRNA interaction network is illustrated in Figure 2.
Based on these two bipartite graphs, similarities between lncRNAs and similarities between proteins were each calculated in three different ways, which are elaborated in the following sections.

Network Similarity
For a given miRNA $m_k \in M$ ($k = 1, 2, \ldots, y$), we define the set of its interacting lncRNAs as $L(m_k)$, which is a subset of $L$:
$$L(m_k) = \{\, l_i \in L \mid (l_i, m_k) \in E_1 \,\}.$$
We also define $P(m_k)$, which is a subset of $P$, as follows:
$$P(m_k) = \{\, p_j \in P \mid (p_j, m_k) \in E_2 \,\}.$$
FIGURE 1 | A part of the lncRNA-miRNA interaction network. Ten lncRNAs and 10 miRNAs formed this part of the interaction network. The network is a bipartite graph. One lncRNA can interact with multiple miRNAs and vice versa.
FIGURE 2 | A part of the protein-miRNA interaction network. Ten proteins and 10 miRNAs formed this part of the interaction network. The network is a bipartite graph. One protein can interact with multiple miRNAs and vice versa.
For each miRNA in $M$, its network contribution in the lncRNA-miRNA interaction network and in the protein-miRNA interaction network can be calculated, respectively, as follows: where $c_1(m_k)$ is the network contribution of miRNA $m_k$ in the lncRNA-miRNA interaction network, $c_2(m_k)$ the network contribution of miRNA $m_k$ in the protein-miRNA interaction network, and $|\cdot|$ the cardinality operator on a set. For convenience, $M(l_i)$ and $M(p_j)$, which are both subsets of $M$, are defined as follows:
$$M(l_i) = \{\, m_k \in M \mid (l_i, m_k) \in E_1 \,\}, \qquad M(p_j) = \{\, m_k \in M \mid (p_j, m_k) \in E_2 \,\},$$
where $M(l_i)$ and $M(p_j)$ represent the sets of miRNAs that interact with a given lncRNA and a given protein, respectively.
With all the definitions above, the network similarity between two lncRNAs $l_u$ and $l_v$ ($u, v = 1, 2, \ldots, x$) can be defined as follows: where $n_1(l_u, l_v)$ is the network similarity between $l_u$ and $l_v$. Similarly, given two proteins $p_u$ and $p_v$, the network similarity between them can be defined as follows: where $n_2(p_u, p_v)$ is the network similarity between $p_u$ and $p_v$.
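The neighbor sets used in this section are easy to build from an edge list. The Python sketch below constructs the mappings corresponding to $M(l_i)$ and $L(m_k)$ (or $M(p_j)$ and $P(m_k)$ when protein edges are passed in). Because the contribution-weighted similarity is not reproduced here, the `jaccard` function is an explicitly unweighted stand-in, not the network similarity of this paper; all function names and identifiers are hypothetical.

```python
from collections import defaultdict


def neighbour_sets(edges):
    """Build both neighbor maps of a bipartite edge list of (item, miRNA) pairs.

    Returns (item -> set of miRNAs, miRNA -> set of items).
    """
    mirnas_of = defaultdict(set)  # plays the role of M(l_i) or M(p_j)
    items_of = defaultdict(set)   # plays the role of L(m_k) or P(m_k)
    for item, mirna in edges:
        mirnas_of[item].add(mirna)
        items_of[mirna].add(item)
    return dict(mirnas_of), dict(items_of)


def jaccard(a, b):
    """Unweighted stand-in similarity: shared miRNAs over all miRNAs of the pair."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy lncRNA-miRNA edges; identifiers are placeholders.
edges = [("lncA", "miR-1"), ("lncA", "miR-2"), ("lncB", "miR-2"), ("lncB", "miR-3")]
M_of, L_of = neighbour_sets(edges)
print(sorted(L_of["miR-2"]))
print(jaccard(M_of["lncA"], M_of["lncB"]))
```

A weighted variant would replace the set sizes with sums of per-miRNA contributions, which is where $c_1(m_k)$ and $c_2(m_k)$ would enter.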

Sequence Similarity
The sequence similarity was calculated by the Smith-Waterman algorithm. Given two lncRNAs, the sequence similarity between the two lncRNA sequences is defined as follows: where $e_1(l_u, l_v)$ is the sequence similarity, $w(l_u, l_v)$ the Smith-Waterman score between $l_u$ and $l_v$, and $|l_u|$ and $|l_v|$ the lengths of the lncRNAs $l_u$ and $l_v$, respectively. Given two proteins, the sequence similarity between the two protein sequences is defined similarly as follows: where $e_2(p_u, p_v)$ is the sequence similarity, $w(p_u, p_v)$ the Smith-Waterman score between $p_u$ and $p_v$, and $|p_u|$ and $|p_v|$ the lengths of the proteins $p_u$ and $p_v$, respectively.
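For readers unfamiliar with the raw score $w(\cdot, \cdot)$, the following Python sketch computes a Smith-Waterman local alignment score with a linear gap penalty. The scoring parameters (match +3, mismatch -3, gap -2) are assumptions chosen for illustration; the scoring scheme actually used in this work is not stated in the text, and the length normalization that turns $w$ into $e_1$ or $e_2$ is not reproduced.

```python
def smith_waterman_score(a: str, b: str, match: int = 3,
                         mismatch: int = -3, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty).

    Only two rows of the dynamic-programming matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TGTTACGG", "GGTTGACTA"))
```

For long lncRNA transcripts this quadratic-time pure-Python version would be slow, so an optimized alignment library would be the practical choice.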

Statistical Feature Similarity
Pseudo-amino acid composition (PseAAC), which was proposed by Chou in 2001 (Chou, 2001), has been widely applied in all branches of computational and functional proteomics (Chou, 2011; Chou, 2015). Pseudo-k nucleotide composition (PseKNC), which extends the PseAAC concept to the analysis of nucleotide sequences, was introduced more recently (Chen et al., 2014). Because of its simplicity and effectiveness, PseKNC has quickly been adopted across major topics in functional genomics, at both the genome and transcriptome levels (Chen et al., 2015a; Chen et al., 2015b). The computational procedures for PseAAC and PseKNC have been elaborated in many publications (Chou, 2011; Chen et al., 2013; Qiu et al., 2017) and some recent reviews (Chen et al., 2015a; Zhao et al., 2018).
In this work, we employed pseudo di-nucleotide composition (PseDNC), which is the special form of PseKNC with k = 2, to represent lncRNA sequences, and PseAAC to represent protein sequences. For simplicity, we do not describe the computational details of the PseDNC and PseAAC algorithms here. We only describe how we apply PseDNC and PseAAC in our work.
Given a lncRNA, its PseDNC representation can be described as a numerical vector with $16 + l$ dimensions as follows: where $V_1(l_i \mid l, w_1, H)$ is the PseDNC representation of $l_i$, $l$ and $w_1$ are two parameters in computing the PseDNC representation (the parameter $l$ is distinct from the lncRNA symbols $l_i$), and $H$ is a set of di-nucleotide physicochemical properties that are applied in computing the PseDNC representations. The similarity between two lncRNAs can be defined as follows: where $f_1(l_u, l_v)$ is the feature similarity between two lncRNAs, and $\|\cdot\|$ the operator that takes the length of a vector. Similarly, given a protein, its PseAAC representation can be described as a numerical vector with $20 + t$ dimensions as follows: where $V_2(p_j \mid t, w_2, H)$ is the PseAAC representation of $p_j$, $t$ and $w_2$ are two parameters in computing the PseAAC representation, and $H$ is a set of amino acid physicochemical properties that are used in computing the PseAAC representations. The similarity between two proteins can be defined as follows: where $f_2(p_u, p_v)$ is the feature similarity between two proteins. We utilized the online web server Pse-In-One to generate PseDNC and PseAAC in our work.

Heterogeneous Network Model
By integrating the bipartite graphs $G_1$ and $G_2$, we can construct a heterogeneous network model, where lncRNAs, miRNAs, and proteins are connected together. A part of this network is illustrated in Figure 3. Given a lncRNA $l_i$ and a protein $p_j$, the whole network correlation contributed by a miRNA $m_k$ can be defined as follows: The whole network correlation function between lncRNA $l_i$ and protein $p_j$ can be defined as follows: The whole network correlation matrix can be established as $K = \{k(l_i, p_j)\}$, $i = 1, 2, \ldots, x$ and $j = 1, 2, \ldots, z$.
For two given lncRNAs $l_u$ and $l_v$, the similarity between them is denoted as $s_1(l_u, l_v)$. Similarly, for two given proteins $p_u$ and $p_v$, the similarity between them is denoted as $s_2(p_u, p_v)$. The similarity between two lncRNAs or two proteins can be measured in various ways, which have been elaborated in the section above.
With all the above definitions, we can establish the final scoring matrix $W$ as follows:
FIGURE 3 | A part of the lncRNA-miRNA-protein association network. Five proteins, 15 miRNAs, and 24 lncRNAs formed this part of the interaction network. Every miRNA can interact with multiple lncRNAs, and multiple proteins as well.
The prediction of lncRNA-protein interactions is made based on the scores in $W$. If a value in $W$ is larger than a given threshold, the corresponding lncRNA and protein are predicted to interact. Otherwise, no interaction is predicted.
The whole flowchart of our method is illustrated in Figure 4. Three different similarity measures were applied to lncRNAs and proteins, respectively. Since they can be chosen independently of each other, there are nine different combinations of the similarity choices.
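The mediator idea behind this model, together with the thresholding rule, can be illustrated with a deliberately simplified Python sketch. It scores each lncRNA-protein pair by the number of miRNAs the two share and predicts an interaction when the score exceeds a threshold. The shared-miRNA count is a stand-in chosen for illustration: it is neither the whole network correlation $k(l_i, p_j)$ nor the final score in $W$, and the function names and toy identifiers are hypothetical.

```python
def shared_mediator_counts(mirnas_of_lnc, mirnas_of_prot):
    """Return (lncRNAs, proteins, matrix) with matrix[i][j] = number of shared miRNAs."""
    lncs = sorted(mirnas_of_lnc)
    prots = sorted(mirnas_of_prot)
    matrix = [
        [len(mirnas_of_lnc[l] & mirnas_of_prot[p]) for p in prots]
        for l in lncs
    ]
    return lncs, prots, matrix


def predict_pairs(lncs, prots, matrix, threshold):
    """Pairs whose score is strictly larger than the threshold are predicted to interact."""
    return [
        (l, p)
        for i, l in enumerate(lncs)
        for j, p in enumerate(prots)
        if matrix[i][j] > threshold
    ]


# Toy neighbor sets; identifiers are placeholders.
mirnas_of_lnc = {"lncA": {"miR-1", "miR-2"}, "lncB": {"miR-3"}}
mirnas_of_prot = {"protX": {"miR-1", "miR-2"}, "protY": {"miR-2", "miR-3"}}

lncs, prots, matrix = shared_mediator_counts(mirnas_of_lnc, mirnas_of_prot)
print(matrix)
print(predict_pairs(lncs, prots, matrix, threshold=1))
```

In the full model, the raw mediator evidence is additionally combined with the lncRNA and protein similarities before thresholding.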

Performance Evaluation
Given a threshold, a set of lncRNA-protein interactions can be predicted from the matrix $W$. By comparing this set against the testing dataset, the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) can be obtained. Five statistical measures can then be calculated as follows:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}, \quad \mathrm{Pre} = \frac{TP}{TP + FP}, \quad \mathrm{Rec} = \frac{TP}{TP + FN}, \quad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN},$$
where TPR is the true positive rate, FPR the false positive rate, Pre the precision, Rec the recall, and Acc the accuracy.
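A minimal Python sketch of this evaluation step is given below. It assumes the standard definitions of the five measures, represents scores and known interactions as dictionaries keyed by (lncRNA, protein) pairs, and treats every pair without a known interaction as a negative; the function name and data layout are hypothetical.

```python
def confusion_metrics(scores, known, threshold):
    """Compute TPR, FPR, Pre, Rec and Acc for one threshold.

    scores: dict mapping (lncRNA, protein) -> score
    known:  set of (lncRNA, protein) pairs with a known interaction
    """
    tp = fp = tn = fn = 0
    for pair, score in scores.items():
        predicted = score > threshold  # strictly larger than the threshold
        actual = pair in known
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as 0.0.
        return num / den if den else 0.0

    return {
        "TPR": ratio(tp, tp + fn),
        "FPR": ratio(fp, fp + tn),
        "Pre": ratio(tp, tp + fp),
        "Rec": ratio(tp, tp + fn),
        "Acc": ratio(tp + tn, tp + tn + fp + fn),
    }


# Toy scores; identifiers are placeholders.
scores = {("lncA", "p1"): 0.9, ("lncA", "p2"): 0.8,
          ("lncB", "p1"): 0.3, ("lncB", "p2"): 0.1}
known = {("lncA", "p1"), ("lncB", "p1")}
print(confusion_metrics(scores, known, threshold=0.5))
```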
By varying the threshold from the maximum value to the minimum value in $W$, a receiver operating characteristic (ROC) curve can be plotted using the TPR and the FPR values. Likewise, a precision-recall (PR) curve can be obtained using the precision and the recall values. Because the negatives far outnumber the positives in this task, both the area under the ROC curve (AUROC) and the area under the PR curve (AUPR) are used as primary performance measures of our method.
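The threshold sweep can be sketched as follows in Python. Scores are visited from high to low, tied scores are handled in one step, and both areas are approximated with the trapezoidal rule. The PR curve is anchored at recall 0 with the precision of the first point, which is one common convention and an assumption here; the exact procedure used to compute AUROC and AUPR in this work is not stated in the text.

```python
def roc_pr_points(scores, labels):
    """Return ROC points (FPR, TPR) and PR points (recall, precision).

    scores: list of floats; labels: list of 0/1 with at least one of each.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    roc, pr = [(0.0, 0.0)], []
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        # Consume all items tied at the current score before emitting a point.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]]:
                tp += 1
            else:
                fp += 1
            j += 1
        roc.append((fp / neg, tp / pos))
        pr.append((tp / pos, tp / (tp + fp)))
        i = j
    return roc, pr


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
roc, pr = roc_pr_points(scores, labels)
auroc = trapezoid_area(roc)
aupr = trapezoid_area([(0.0, pr[0][1])] + pr)
print(auroc, aupr)
```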

Parameter Calibrations
In our work, parameters must be set when the PseDNC and the PseAAC sequence representations are generated. We used a grid search strategy to find the optimal parameters for PseDNC and PseAAC. The parameter $l$ varies from 10 to 20 with a step of 1, $w_1$ from 0.1 to 1 with a step of 0.1, $t$ from 10 to 20 with a step of 1, and $w_2$ from 0.05 to 0.5 with a step of 0.05. We finally chose $l = 10$, $w_1 = 0.1$, $t = 11$, and $w_2 = 0.5$. The physicochemical properties in the PseDNC are Rise, Tilt, Twist, Slide, Shift, and Roll, which are defined in Pse-In-One. The physicochemical properties in the PseAAC are HOPT810101, JOND750101, ZIMJ680104, KRIW790103, TAKK010101, ROSM880104, BLAS910101, and KRIW790101, which are all defined in AAIndex (Kawashima et al., 2008).
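The parameter grids described above can be enumerated as in the following Python sketch. The ranges and steps come from the text; everything else is an assumption: the `evaluate` callback is a hypothetical placeholder for whatever performance measure drives the search, and the text does not say whether the PseDNC and PseAAC parameters were searched jointly or separately, so two independent grids are shown.

```python
from itertools import product


def frange(start, stop, step):
    """Inclusive float range built from integer steps to avoid rounding drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n + 1)]


# (l, w1) candidates for PseDNC and (t, w2) candidates for PseAAC.
LNC_GRID = list(product(range(10, 21), frange(0.1, 1.0, 0.1)))
PROT_GRID = list(product(range(10, 21), frange(0.05, 0.5, 0.05)))


def grid_search(grid, evaluate):
    """Return the parameter pair with the highest value of evaluate(pair)."""
    return max(grid, key=evaluate)


print(len(LNC_GRID), len(PROT_GRID))

# Hypothetical scoring function, used only to show the call pattern.
best = grid_search(LNC_GRID, lambda p: -abs(p[0] - 12) - abs(p[1] - 0.3))
print(best)
```

Each grid holds 11 × 10 = 110 candidates, so even an exhaustive joint search over both grids (12,100 combinations) stays small.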

Performance Analysis
We compared the prediction performance under different combinations of similarity matrices. Figure 5 illustrates the ROC and PR curves of our method with nine different similarity combinations. The AUROC and AUPR values are collected in Table 1. According to these values, the prediction performance of our method is best when the network similarity measure is applied to both lncRNAs and proteins. Under this condition, the AUROC reached 0.821 and the AUPR reached 0.376. The AUPR appears low. However, by analyzing the PR curve, we found that the precision is low when the recall is in the range of (0.1, 0.4). That is to say, some lncRNA-protein pairs with a high correlation score have no experimentally verified interaction between them. This may be because these interactions have not been discovered yet.
Because the negatives far outnumber the positives in predicting lncRNA-protein interactions, the testing dataset is highly imbalanced. To provide a set of useful prediction results in practical applications, we recommend a threshold of 2.147, which balances the TPR and FPR and yields an accuracy of 74.3%.

Effects of the Two Similarity Matrices
To analyze the effect of each similarity matrix individually, we combined each similarity matrix alone with the whole network correlation matrix. In other words, either $Q_1$ or $Q_2$ is removed from Eq. (21) to see the effect of the other matrix alone. The ROC and PR curves of all six configurations are illustrated in Figure 6. The network similarity performs the best among the three similarities for lncRNAs. For proteins, the best similarity measure is also the network similarity. This result is consistent with the other results in our work. Therefore, we conclude that the network similarity best suits our method. In particular, the network similarity matrix for proteins alone achieved a prediction performance very close to that of the complete form of our model. Since the numbers of proteins and lncRNAs are imbalanced in our dataset, the number of interactions from miRNAs to proteins is, on average, far larger than that to lncRNAs. This may be the reason why the network similarity matrix for proteins can achieve a very promising performance when combined only with the whole network correlation matrix.

Comparison With Existing Methods
HeteSim is a widely applied measure that quantifies the correlation of nodes in a heterogeneous network (Shi et al., 2014). It has been used in predicting various types of interactions and connections (Shi et al., 2014). Because our method and existing methods work on different principles, it is difficult to perform a completely fair comparison. We compared our method to Yang's work (Yang et al., 2016), where HeteSim is employed to measure the correlation between lncRNAs and proteins. To make the comparison as fair as possible, we obtained protein-protein interactions from the STRING database (Mering et al., 2003) to satisfy the requirements of Yang's work. The same testing dataset was applied to evaluate the prediction performance of Yang's work and of our method. However, due to the different mechanisms of the two methods, we tested our method using the independent testing dataset, while fivefold cross-validation was applied to Yang's method with the same dataset. Since fivefold cross-validation may produce overestimated performance values, we believe that our method achieved a comparable performance in this comparison (Figure 7).
FIGURE 5 | ROC and PR curves of nine similarity combinations. The horizontal axis of the ROC curve (left panel) is the FPR and the vertical axis the TPR. The horizontal axis of the PR curve (right panel) is the recall and the vertical axis the precision. Net stands for network similarity, Seq for sequence similarity, and Feat for statistical feature similarity. The first part in the legend is the similarity measure for lncRNAs and the latter part that for proteins. For example, Net+Net means that we used network similarity for both lncRNAs and proteins. Net+Seq means that we used network similarity for lncRNAs and sequence similarity for proteins.

Prediction of Novel Interactions
To evaluate the practical prediction ability of our method, we selected the 20 top-ranked interactions in our results. These interactions are recorded in Table 2. Fifteen out of the 20 interactions in Table 2 had been verified by CLIP-seq (Ule et al., 2005) in RAID v2.0. Two of the remaining five had been verified by eCLIP (Van Nostrand et al., 2016) in the NPInter database (Teng et al., 2019). Since our method does not require any prior knowledge of direct lncRNA-protein interactions, and all data in our method came only from the RAID v2.0 database, our method should have a good performance. Although the other three interactions are not verified, it is possible that they are undiscovered interactions that occur under certain conditions.

Prediction Based on Interactions of Whole Database
Since only experimentally verified interactions were used to compose our benchmarking dataset, a large number of predicted interactions in the RAID v2.0 database were discarded. We incorporated these predicted interactions to optimize our method. A total of 20,425 lncRNA-miRNA interactions and 1,349 protein-miRNA interactions that share a common set of miRNAs were extracted. These interactions are among 1,133 lncRNAs, 464 miRNAs, and 113 proteins. We also collected 2,803 lncRNA-protein interactions as our independent testing dataset. Altogether 615 lncRNAs and 65 proteins were included in this testing dataset. Our method achieved an AUROC of 0.852 on this dataset (Figure 8). Since our method can work with only known lncRNA-miRNA interactions and miRNA-protein interactions, it can be used as a supplement to state-of-the-art methods that use direct lncRNA-protein interactions.

Database Coverage Analysis
Due to the mechanism of our method, we restricted the lncRNA-protein interactions to those lncRNAs and proteins that share at least one interacting miRNA. This restriction narrowed the range of applicable data in the database. There are 2,862 experimentally verified lncRNA-miRNA interactions in the RAID v2.0 database, including 358 lncRNAs and 1,208 miRNAs; 1,356 lncRNA-miRNA interactions between 331 lncRNAs and 360 miRNAs [...]
Due to the limited number of known miRNA-protein interactions and miRNA-lncRNA interactions, the coverage of proteins in the whole database is low. We admit that this will limit the application scope of our method. However, we believe this will improve as the number of available miRNA-protein interactions increases, because the statistical test has already shown that lncRNA-protein interactions are significantly enriched among the lncRNAs and proteins that share a common set of miRNAs.
FIGURE 6 | ROC and PR curves of single similarity matrices. Since we use the similarity matrix for either lncRNAs or proteins, but not both at the same time, there are only six curves in each panel. The axes in both panels have the same meaning as in Figure 5. "Net," "Seq," and "Feat" have the same meaning as in the legend of Figure 5.

CONCLUSION
LncRNAs can affect biological processes at various levels. It is of great importance to study the molecular functions of lncRNAs. Meanwhile, lncRNAs perform their roles mostly through their interactions with proteins. Therefore, lncRNA-protein interactions should be studied in detail. In this paper, we proposed a method to predict lncRNA-protein interactions without prior knowledge of existing lncRNA-protein interactions. Instead, we utilized the lncRNA-miRNA interactions and the miRNA-protein interactions as the basis of our prediction. The miRNAs are used as mediators to connect the realm of lncRNAs and the realm of proteins. This is based on the hypothesis that a lncRNA and a protein may interact if they share interacting miRNAs. By quantitatively modelling the heterogeneous network that is formed by lncRNAs, miRNAs, and proteins, we developed a simple, yet effective, method to predict lncRNA-protein interactions. The best similarity measure in our method is the network similarity, which does not rely on sequence information. This gives our method a unique capability to predict lncRNA-protein interactions without comprehensive sequence information of both interactors. By comparing our predictions to the known lncRNA-protein interactions, we can conclude that our method has, at least, a prediction performance comparable to state-of-the-art methods. Since our method does not rely on prior knowledge of lncRNA-protein interactions, it is a helpful supplement to existing methods.
FIGURE 8 | The ROC curve including interactions without experimental evidence in the RAID v2.0 database. In other words, interactions of the whole RAID database were utilized to train our model. The network similarity of both lncRNAs and proteins, which performed the best in the former experiment, was selected to generate our final scoring matrix.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
Y-KZ curated the dataset, designed the algorithm, implemented the algorithm, and calibrated the parameters. Z-AS and HY performed the experiments and collected the results. TL, YG, and P-FD investigated the question, designed the whole study, conceptualized the algorithm, analyzed the results, and wrote the manuscript.