Predicting lncRNA–Protein Interaction With Weighted Graph-Regularized Matrix Factorization

Long non-coding RNAs (lncRNAs) have attracted wide attention because of their close associations with many key biological activities. Although the precise functions of most lncRNAs are unknown, research shows that lncRNAs usually exert their biological functions by interacting with the corresponding proteins. The experimental validation of interactions between lncRNAs and proteins is costly and time-consuming. In this study, we developed a weighted graph-regularized matrix factorization (LPI-WGRMF) method to find unobserved lncRNA–protein interactions (LPIs) based on the lncRNA similarity matrix, the protein similarity matrix, and known LPIs. We compared the proposed LPI-WGRMF method with five classical LPI prediction methods, that is, LPBNI, LPI-IBNRA, LPIHN, RWR, and collaborative filtering (CF). The results demonstrate that LPI-WGRMF achieves high prediction performance, obtaining an AUC of 0.9012 and an AUPR of 0.7324. The case study showed that SFPQ, SNHG3, and PRPF31 may associate with Q9NUL5, Q9NUL5, and Q9UKV8, respectively, with the highest linking probabilities; these associations need further experimental validation.


INTRODUCTION
Long non-coding RNAs (lncRNAs) are closely associated with many key biological processes, for example, immune response, embryonic stem cell pluripotency, and cell cycle regulation (Chen et al., 2016; Agirre et al., 2019; Gil and Ulitsky, 2020). lncRNAs regulate cellular activities and achieve their biological functions through interactions with proteins (Chen and Yan, 2013; Zhang et al., 2018b). Therefore, finding potential lncRNA-protein interactions (LPIs) is important for uncovering lncRNA-related biological activities. Wet-lab experiments have identified a few LPIs; however, experimental methods are costly and time-consuming. Thus, computational methods have been developed to identify possible associations between lncRNAs and proteins (Bester et al., 2018; Chen et al., 2018). LPI prediction methods can be roughly classified into two groups: network-based methods and machine learning-based methods. Network-based LPI identification methods integrate various biological data with network propagation methods (Peng et al., 2019). Li et al. (2015) used random walk with restart on a constructed lncRNA-protein heterogeneous network to find LPI candidates. Zhang et al. (2018a) developed a linear neighborhood propagation method to score lncRNA-protein pairs. Ge et al. (2016), Zhao et al. (2018a), and Xie et al. (2019) applied bipartite network projection-based recommendation methods to compute the association probabilities between lncRNAs and proteins.
Machine learning-based methods mainly comprise matrix factorization-based and ensemble learning-based LPI prediction methods. Matrix factorization methods have been widely applied in various association prediction areas (Peng et al., 2020). Liu et al. (2017), Zhang T. et al. (2018), Zhao et al. (2018a), and Shen et al. (2019) used matrix factorization methods to predict possible LPIs. Hu et al. (2018) and Zhang et al. (2018b) built ensemble learning frameworks to discover potential LPIs based on constructed benchmark datasets. These computational methods effectively revealed possible associations between lncRNAs and proteins. However, their prediction performance remains limited and can be further improved.
In this study, we first integrated lncRNA similarity, protein similarity, and known LPIs. We then developed a novel LPI prediction method based on weighted graph-regularized matrix factorization (LPI-WGRMF). To measure its performance, LPI-WGRMF was compared with five state-of-the-art LPI prediction methods [LPBNI, LPI-IBNRA, LPIHN, RWR, and collaborative filtering (CF)]. LPI-WGRMF obtained an AUC value of 0.9057 and an AUPR value of 0.7324. The results showed that LPI-WGRMF is a useful tool for identifying LPIs. Case study analysis suggests possible links between SFPQ and Q9NUL5, SNHG3 and Q9NUL5, and PRPF31 and Q9UKV8.

MATERIALS AND METHODS
In this manuscript, we developed an LPI prediction model, LPI-WGRMF. The method can be summarized in three steps. First, experimentally validated LPIs from the NPInter 2.0 database were collected. Second, the lncRNA similarity matrix and the protein similarity matrix were computed based on the assumption that similar lncRNAs tend to associate with similar proteins and vice versa. Finally, the lncRNA similarity, protein similarity, and LPI matrix were integrated into the weighted graph-regularized matrix factorization model to compute an association score for each lncRNA-protein pair.

LPI Data
We obtained an experimentally validated LPI dataset, which was provided by Zhang et al. (2018a). After preprocessing, the dataset contains 4158 LPIs between 990 lncRNAs and 27 proteins.
The LPI matrix between $n$ lncRNAs and $m$ proteins was denoted as $Y \in \mathbb{R}^{n \times m}$.

lncRNA Similarity Matrix
The sequence and expression information of lncRNAs can be downloaded from the NONCODE database. We computed the lncRNA similarity matrix by integrating the sequence similarity, expression similarity, and interaction profile similarity with the similarity kernel fusion technique.

Sequence statistical similarity
Each lncRNA was described as a 20-dimensional vector based on the method provided by Zhang et al. (2018b). Based on the assumption that each vector can be represented by its $k$ nearest neighbors, the linear neighborhood similarity between two lncRNAs $l_i$ and $l_j$ can be computed and denoted as $s^{l,0}_{i,j}$.

Expression similarity
Suppose that the expression profile of the $i$th lncRNA is represented as $e_i$. The expression similarity between two lncRNAs $l_i$ and $l_j$ is defined in terms of $\rho_{i,j}$, the Pearson's correlation coefficient between the two expression profiles $e_i$ and $e_j$:

$$\rho_{i,j} = \frac{\mathrm{cov}(e_i, e_j)}{\sigma(e_i)\,\sigma(e_j)}$$

where $\mathrm{cov}(\cdot)$ denotes the covariance and $\sigma$ denotes the standard deviation.
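The Pearson's correlation coefficient between expression profiles can be computed for all lncRNA pairs at once. The following is a minimal sketch with NumPy, assuming one expression profile per row; the function name `pearson_matrix` and the toy profiles are hypothetical, and the mapping from $\rho_{i,j}$ to the final expression similarity is not reproduced here.

```python
import numpy as np

def pearson_matrix(E):
    """Pairwise Pearson correlation between the rows of E.

    E holds one expression profile per row (lncRNAs x samples).
    Profiles with zero variance would cause a division by zero.
    """
    E = np.asarray(E, dtype=float)
    centered = E - E.mean(axis=1, keepdims=True)
    # cov(e_i, e_j) for every pair of rows (population covariance)
    cov = centered @ centered.T / E.shape[1]
    # sigma(e_i) is the square root of the variance on the diagonal
    sigma = np.sqrt(np.diag(cov))
    return cov / np.outer(sigma, sigma)

# Hypothetical toy profiles: rows 0 and 1 are proportional, row 2 is reversed.
profiles = np.array([[1.0, 2.0, 3.0, 4.0],
                     [2.0, 4.0, 6.0, 8.0],
                     [4.0, 3.0, 2.0, 1.0]])
rho = pearson_matrix(profiles)
```

The result agrees with `np.corrcoef(profiles)`, which can be used directly when no custom handling is needed.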

Interaction profile similarity
Suppose that the interaction profile of the $i$th lncRNA is represented as the $i$th row $Y_{i\cdot}$ of the LPI matrix $Y$. The interaction profile similarity between two lncRNAs $l_i$ and $l_j$ is defined from the two profiles $Y_{i\cdot}$ and $Y_{j\cdot}$, where $\|\cdot\|$ denotes the 2-norm.

Protein Similarity Matrix

Sequence alignment similarity
The sequences of proteins were downloaded from the SUPERFAMILY database. The alignment score of the $u$th protein against the $v$th protein can be computed by BLAST and denoted as $b_{u,v}$. The sequence similarity between two proteins $p_u$ and $p_v$ is then defined from these alignment scores.
Frontiers in Genetics | www.frontiersin.org

Sequence statistical feature similarity
Each protein can be represented as a 504-dimensional vector based on the method provided by Zhou et al. (2020). The linear neighborhood similarity between two proteins $p_u$ and $p_v$ can be computed and denoted as $s^{p,1}_{u,v}$.

Interaction profile similarity
Suppose that the interaction profile of the $u$th protein is represented as the $u$th column $Y_{\cdot u}$ of the LPI matrix $Y$. The interaction profile similarity between two proteins $p_u$ and $p_v$ is defined from the two profiles $Y_{\cdot u}$ and $Y_{\cdot v}$, analogously to that of lncRNAs.

Similarity Kernel Fusion
In the above sections, three lncRNA similarity measurements and three protein similarity measurements were proposed. The similarity kernel fusion method provided by Zhou et al. (2020) was applied to integrate this similarity information into a more comprehensive similarity. The procedure for lncRNAs is as follows.
First, each of the three lncRNA similarity matrices $s^{l,q}$ was normalized.
Second, for an lncRNA $l_i$ and a similarity $s^{l,q}$, the $k$ most similar lncRNAs were collected as a set $N^{l,q}(i, k)$, and $s^{l,q}$ was normalized under the constraint of this neighborhood information, which yields a neighborhood-constrained normalized matrix.
Third, the three normalized matrices were integrated through an iterative process, where $\alpha$ was a weight parameter between 0 and 1, $T$ denoted the transpose of a matrix, $\lambda$ represented the iterative parameter, and the iteration was initialized with the normalized matrices from the first step.
We computed the integrated similarity matrix after $z$ rounds of iteration. To account for data noise, we defined an indicator function based on the $k$ most similar lncRNAs for each lncRNA and used it to obtain the final lncRNA similarity matrix, in which $\vartheta^{l}_{i,j}$ denotes the $(i, j)$th element of the matrix entering that definition.

Nearest Neighbor Information
Based on graph regularization theory, similar lncRNAs should tend to interact with similar proteins and vice versa in an LPI network; thus, we first extract the nearest neighbor information for lncRNAs and proteins. Given the lncRNA similarity matrix $S^l$, we constructed a $p$-nearest neighbor graph $N$, where $N_p(i)$ denotes the set of $p$ nearest neighbors of lncRNA $l_i$. $N$ is applied to increase the sparsity of the lncRNA similarity matrix $S^l$, which yields the sparse lncRNA similarity matrix $\hat{S}^l$. Similarly, the sparse protein similarity matrix $\hat{S}^p$ can be computed.
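The sparsification step can be illustrated with a short NumPy sketch. The weighting rule used here is an assumption borrowed from the graph-regularized matrix factorization literature (weight 1 for mutual neighbors, 0.5 when only one is a neighbor of the other, 0 otherwise) and is not confirmed by the text above; the function name `sparsify_similarity` and the exclusion of self-similarity are likewise hypothetical choices.

```python
import numpy as np

def sparsify_similarity(S, p):
    """Keep only p-nearest-neighbor entries of a similarity matrix S (p < n).

    Assumed rule: N_ij = 1 if i and j are mutual p-nearest neighbors,
    0.5 if only one is a neighbor of the other, and 0 otherwise.
    The sparse matrix is the element-wise product N * S.
    """
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    masked = S.copy()
    np.fill_diagonal(masked, -np.inf)          # a node is not its own neighbor
    nn = np.argsort(-masked, axis=1)[:, :p]    # indices of the p most similar nodes
    member = np.zeros((n, n), dtype=bool)      # member[i, j]: j is in N_p(i)
    member[np.repeat(np.arange(n), p), nn.ravel()] = True
    N = 0.5 * (member.astype(float) + member.T.astype(float))
    return N * S

# Hypothetical toy similarity between three lncRNAs.
S_l = np.array([[1.0, 0.9, 0.5],
                [0.9, 1.0, 0.1],
                [0.5, 0.1, 1.0]])
S_hat = sparsify_similarity(S_l, p=1)
```

With $p = 1$, lncRNAs 0 and 1 are mutual neighbors and keep their full similarity, whereas the pair (0, 2) is a one-sided neighbor relation and is down-weighted by one half. The result stays symmetric, which the graph Laplacian requires.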

Low-Rank Approximation
Based on the idea of low-rank approximation, the LPI matrix $Y \in \mathbb{R}^{n \times m}$ can be decomposed into two low-rank latent feature matrices $A \in \mathbb{R}^{n \times k}$ (for lncRNAs) and $B \in \mathbb{R}^{m \times k}$ (for proteins) by minimizing the following low-rank approximation objective:

$$\min_{A,B} \left\| Y - AB^{T} \right\|_F^2$$

where $\|\cdot\|_F$ denotes the Frobenius norm and $k$ is the rank of matrices $A$ and $B$, that is, the number of features in $A$ and $B$.
We decomposed $Y \in \mathbb{R}^{n \times m}$ into $U \in \mathbb{R}^{n \times k}$, $S_k \in \mathbb{R}^{k \times k}$, and $V \in \mathbb{R}^{m \times k}$ so that $U S_k V^{T}$ is the closest rank-$k$ approximation to $Y$, where $U$ and $V$ are matrices with orthonormal columns, $S_k$ is a diagonal matrix, and $k_{\max} = \min(n, m)$ is the largest possible rank. Thus, the feature matrices $A$ and $B$ can be represented as $A = U S_k^{1/2}$ and $B = V S_k^{1/2}$.
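The SVD-based construction of the latent feature matrices can be sketched with `numpy.linalg.svd`. This is a minimal illustration under the assumption that the square root of the singular values is split evenly between the two factors; the function name `svd_init` and the random toy matrix are hypothetical.

```python
import numpy as np

def svd_init(Y, k):
    """Rank-k latent factors A (n x k) and B (m x k) with A @ B.T ~= Y."""
    U, s, Vt = np.linalg.svd(np.asarray(Y, dtype=float), full_matrices=False)
    root = np.sqrt(s[:k])          # square roots of the k largest singular values
    A = U[:, :k] * root            # scale each left singular vector
    B = Vt[:k, :].T * root         # scale each right singular vector
    return A, B

# Hypothetical toy matrix standing in for the LPI matrix Y.
rng = np.random.default_rng(0)
Y_toy = rng.random((6, 4))
A_full, B_full = svd_init(Y_toy, k=4)   # k = min(n, m) reproduces Y
A_low, B_low = svd_init(Y_toy, k=2)     # k = 2 gives the closest rank-2 matrix
```

For $k = \min(n, m)$ the product `A_full @ B_full.T` recovers the input up to rounding; for smaller $k$ the Frobenius error equals the root of the sum of the squared discarded singular values.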

Graph-Regularized Matrix Factorization
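To boost generalization ability and prevent overfitting, we minimize the following GRMF objective function, obtained by adding Tikhonov and graph regularization terms to the above low-rank approximation:

$$\min_{A,B} \left\| Y - AB^{T} \right\|_F^2 + \lambda_f \left( \|A\|_F^2 + \|B\|_F^2 \right) + \lambda_l \sum_{i,r=1}^{n} \hat{S}^{l}_{ir} \left\| a_i - a_r \right\|^2 + \lambda_p \sum_{j,q=1}^{m} \hat{S}^{p}_{jq} \left\| b_j - b_q \right\|^2$$

where $\lambda_f$, $\lambda_l$, and $\lambda_p$ are positive parameters, $a_i$ and $b_j$ are the $i$th and $j$th rows of $A$ and $B$, respectively, and $n$ and $m$ are the numbers of lncRNAs and proteins, respectively. The first term makes the model approximate the matrix $Y$. The second term (Tikhonov regularization) minimizes the norms of $A$ and $B$. The third and final terms are the lncRNA graph regularization and the protein graph regularization, respectively. These two terms minimize the distance between the feature vectors of two neighboring lncRNAs or proteins. Based on graph regularization, the above model can be rewritten (with constant factors absorbed into $\lambda_l$ and $\lambda_p$) as

$$\min_{A,B} \left\| Y - AB^{T} \right\|_F^2 + \lambda_f \left( \|A\|_F^2 + \|B\|_F^2 \right) + \lambda_l \mathrm{Tr}\left( A^{T} L^{l} A \right) + \lambda_p \mathrm{Tr}\left( B^{T} L^{p} B \right) \tag{4}$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix, $L^{l} = D^{l} - \hat{S}^{l}$ and $L^{p} = D^{p} - \hat{S}^{p}$ represent the graph Laplacians of $\hat{S}^{l}$ and $\hat{S}^{p}$, respectively, and $D^{l}$ and $D^{p}$ are diagonal matrices with $D^{l}_{ii} = \sum_{r} \hat{S}^{l}_{ir}$ and $D^{p}_{jj} = \sum_{q} \hat{S}^{p}_{jq}$. To improve LPI prediction performance, we normalize the graph Laplacians as $\tilde{L}^{l} = (D^{l})^{-1/2} L^{l} (D^{l})^{-1/2}$ and $\tilde{L}^{p} = (D^{p})^{-1/2} L^{p} (D^{p})^{-1/2}$. Equation (4) can be rewritten as

$$\min_{A,B} \left\| Y - AB^{T} \right\|_F^2 + \lambda_f \left( \|A\|_F^2 + \|B\|_F^2 \right) + \lambda_l \mathrm{Tr}\left( A^{T} \tilde{L}^{l} A \right) + \lambda_p \mathrm{Tr}\left( B^{T} \tilde{L}^{p} B \right) \tag{5}$$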
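The normalized graph Laplacian used in the regularization terms can be computed directly from a sparse similarity matrix. The sketch below assumes a symmetric, non-negative input; the function name `normalized_laplacian` and the treatment of isolated nodes (zero degree mapped to zero rows and columns) are illustrative assumptions.

```python
import numpy as np

def normalized_laplacian(S_hat):
    """Return D^{-1/2} (D - S_hat) D^{-1/2} for a symmetric similarity matrix."""
    S_hat = np.asarray(S_hat, dtype=float)
    degree = S_hat.sum(axis=1)                 # D_ii = sum_r S_hat[i, r]
    L = np.diag(degree) - S_hat                # unnormalized Laplacian D - S_hat
    inv_sqrt = np.zeros_like(degree)
    positive = degree > 0                      # isolated nodes keep a zero scale
    inv_sqrt[positive] = 1.0 / np.sqrt(degree[positive])
    # Scale row i by D_ii^{-1/2} and column j by D_jj^{-1/2}
    return inv_sqrt[:, None] * L * inv_sqrt[None, :]

# Hypothetical path graph over three nodes: 0 - 1 - 2.
S_path = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])
L_norm = normalized_laplacian(S_path)
```

The normalized Laplacian is symmetric and positive semidefinite, so the trace terms in the objective are never negative.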

Weighted Graph-Regularized Matrix Factorization
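To prevent unknown lncRNA-protein pairs from affecting the factorization of $Y$, we add a weight matrix $W$ to the objective function as follows:

$$\min_{A,B} \mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{m} W_{ij} \left( Y_{ij} - a_i b_j^{T} \right)^2 + \lambda_f \left( \|A\|_F^2 + \|B\|_F^2 \right) + \lambda_l \mathrm{Tr}\left( A^{T} \tilde{L}^{l} A \right) + \lambda_p \mathrm{Tr}\left( B^{T} \tilde{L}^{p} B \right) \tag{6}$$

Model (6) can be solved with the alternating least squares method provided by Ezzat et al. (2016). Setting $\partial \mathcal{L} / \partial a_i = 0$ and $\partial \mathcal{L} / \partial b_j = 0$, we alternately run the following two update rules until convergence:

$$a_i = \left( \sum_{j=1}^{m} W_{ij} Y_{ij} b_j - \lambda_l (\tilde{L}^{l})_{i*} A \right) \left( \sum_{j=1}^{m} W_{ij} b_j^{T} b_j + \lambda_f I_k \right)^{-1}, \quad \forall i = 1, 2, \ldots, n \tag{7}$$

$$b_j = \left( \sum_{i=1}^{n} W_{ij} Y_{ij} a_i - \lambda_p (\tilde{L}^{p})_{j*} B \right) \left( \sum_{i=1}^{n} W_{ij} a_i^{T} a_i + \lambda_f I_k \right)^{-1}, \quad \forall j = 1, 2, \ldots, m \tag{8}$$

where $(\tilde{L}^{l})_{i*}$ and $(\tilde{L}^{p})_{j*}$ are the $i$th and $j$th row vectors of $\tilde{L}^{l}$ and $\tilde{L}^{p}$, respectively, and $I_k$ is the $k \times k$ identity matrix. We can obtain $A$ and $B$ based on Eqs 7 and 8. Finally, the interaction probability between the $i$th lncRNA and the $j$th protein can be computed by

$$\hat{Y}_{ij} = a_i b_j^{T}$$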
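The alternating updates can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the row updates solve the stationarity conditions of a weighted objective with Tikhonov and normalized-Laplacian terms, the Laplacian term uses the current factor matrix, the factors are initialized from the SVD of the masked matrix, and a fixed number of sweeps replaces a convergence test. All names (`wgrmf`, `lam_f`, `lam_l`, `lam_p`, `norm_lap`) and the toy data are hypothetical.

```python
import numpy as np

def wgrmf(Y, W, L_l, L_p, k, lam_f=0.1, lam_l=0.1, lam_p=0.1, n_iter=100):
    """Alternating row updates for a weighted graph-regularized factorization."""
    Y = np.asarray(Y, dtype=float)
    W = np.asarray(W, dtype=float)
    n, m = Y.shape
    # Initialise A and B from the truncated SVD of the masked matrix W * Y.
    U, s, Vt = np.linalg.svd(W * Y, full_matrices=False)
    root = np.sqrt(s[:k])
    A = U[:, :k] * root
    B = Vt[:k, :].T * root
    I_k = np.eye(k)
    for _ in range(n_iter):
        for i in range(n):
            # lhs = sum_j W_ij b_j^T b_j + lam_f I (symmetric k x k matrix)
            lhs = (B.T * W[i]) @ B + lam_f * I_k
            # rhs = sum_j W_ij Y_ij b_j - lam_l (L_l)_{i*} A
            rhs = (W[i] * Y[i]) @ B - lam_l * (L_l[i] @ A)
            A[i] = np.linalg.solve(lhs, rhs)
        for j in range(m):
            lhs = (A.T * W[:, j]) @ A + lam_f * I_k
            rhs = (W[:, j] * Y[:, j]) @ A - lam_p * (L_p[j] @ B)
            B[j] = np.linalg.solve(lhs, rhs)
    return A, B

def norm_lap(S):
    """Normalized Laplacian I - D^{-1/2} S D^{-1/2}; assumes positive degrees."""
    d = S.sum(axis=1)
    return np.eye(len(d)) - S / np.sqrt(np.outer(d, d))

# Hypothetical toy data: two groups of lncRNAs, each bound to its own protein pair.
Y = np.kron(np.eye(2), np.ones((3, 2)))
W = np.ones_like(Y)
W[0, 1] = 0.0                                     # hide one known interaction
S_l = np.kron(np.eye(2), np.ones((3, 3))) - np.eye(6)
S_p = np.kron(np.eye(2), np.ones((2, 2))) - np.eye(4)
A, B = wgrmf(Y, W, norm_lap(S_l), norm_lap(S_p), k=2)
scores = A @ B.T                                  # predicted interaction scores
```

In this toy setting the hidden pair (lncRNA 0, protein 1) receives a clearly higher score than pairs across the two groups, which is the behavior the weight matrix is meant to enable. For large regularization weights the fixed-point form of the Laplacian term may not converge, so smaller values or a convergence check would be needed in practice.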

Evaluation Metrics
Precision, recall, F1 score, accuracy, AUC, and AUPR are widely applied to measure the performance of machine learning methods for association prediction. In this study, we used these six measurements to evaluate the performance of our proposed LPI-WGRMF. AUC is the area under the receiver operating characteristic curve. AUPR is the area under the precision-recall curve. The other four criteria are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
Although the accuracy of LPI-WGRMF was lower than that of LPBNI, LPI-WGRMF obtained better precision, recall, and AUC. More importantly, AUC and AUPR are more representative than the other evaluation metrics and can therefore be applied more effectively to evaluate the performance of LPI prediction models. LPI-WGRMF is a powerful tool for LPI identification because of its better precision, recall, AUC, and AUPR. Figures 1, 2 show the AUC and AUPR values obtained by the six LPI prediction methods. The results show that LPI-WGRMF obtained the best AUC value, which demonstrates its strong LPI prediction capability.
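The threshold-based metrics and the AUC used in this evaluation can be sketched as follows. This is a minimal NumPy illustration; the function names, the 0.5 threshold, and the toy labels and scores are hypothetical, and the AUC is computed by the pairwise-ranking definition rather than by integrating a curve.

```python
import numpy as np

def threshold_metrics(y_true, y_score, threshold=0.5):
    """Precision, recall, accuracy, and F1 after thresholding the scores."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_score, dtype=float) >= threshold
    tp = int(np.sum(y_pred & y_true))
    fp = int(np.sum(y_pred & ~y_true))
    tn = int(np.sum(~y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

def auc(y_true, y_score):
    """Probability that a random positive outscores a random negative."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    diff = y_score[y_true][:, None] - y_score[~y_true][None, :]
    # Ties between a positive and a negative count as one half
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Hypothetical labels (1 = known LPI) and predicted scores.
labels = [1, 1, 0, 0, 1, 0]
pred_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
metrics = threshold_metrics(labels, pred_scores)
auc_value = auc(labels, pred_scores)
```

Unlike precision, recall, accuracy, and F1, the AUC does not depend on the chosen threshold, which is one reason ranking-based measures are often preferred for comparing association prediction models.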

Case Study
We further conducted four case studies after confirming the performance of LPI-WGRMF. The lncRNAs in the four cases are Splicing Factor Proline and Glutamine Rich (SFPQ), FOrkhead boX protein D2-Adjacent Opposite Strand RNA 1 (FOXD2-AS1), Small Nucleolar RNA Host Gene 3 (SNHG3), and Pre-mRNA-Processing Factor 31 (PRPF31), respectively. We predicted possible LPIs based on lncRNA similarities, protein similarities, known LPIs, and LPI-WGRMF. Table 2 lists the top five predicted proteins associated with the above four lncRNAs. SFPQ is a multifunctional nuclear protein that participates in several cellular activities, including RNA transport, apoptosis, and DNA repair. SFPQ is closely associated with several diseases, including renal cell carcinoma, Xp11-associated tumor, and dyslexia. More importantly, the expression level of SFPQ affects the sensitivity of ovarian cancer cells to PT-induced death (Gao et al., 2019; Pellarin et al., 2020). Table 2 shows that SFPQ is linked with Q9NUL5 (ranked 2). Moreover, the association between SFPQ and Q9NUL5 is ranked 1 by all five other LPI identification methods. This suggests that SFPQ possibly links with Q9NUL5.

FIGURE 1 | The AUC values of six LPI prediction methods.

FOXD2-AS1 is an RNA gene that is abnormally expressed in a variety of malignant tumors. FOXD2-AS1 has close associations with many diseases, for example, nasopharyngeal carcinoma, esophageal cancer, bladder cancer, multiple pterygium syndrome, Escobar variant, and ulcerative colitis (Bao et al., 2018; Chen et al., 2018; Su et al., 2018; Huang et al., 2020). FOXD2-AS1 was predicted to be closely linked with O00425, Q9NZI8, Q9Y6M1, and Q9NUL5, which were ranked 1, 2, 3, and 4, respectively. All these connections were ranked among the top five associations by the other five LPI prediction models. Therefore, FOXD2-AS1 is likely associated with O00425, Q9NZI8, Q9Y6M1, and Q9NUL5. SNHG3 is a newly found lncRNA that was discovered as a biomarker of malignant cancers, for example, ovarian cancer, hepatocellular carcinoma, colorectal cancer, lung cancer, and glioma (Huang et al., 2017; Lu et al., 2019; Liu and Tao, 2020). The case study results showed that SNHG3 tends to link with Q9NUL5 (ranked 1) and has the highest association score with this protein in LPBNI, LPIHN, and CF. Thus, SNHG3 may be linked with Q9NUL5.
PRPF31 is a retinitis pigmentosa-causing gene. Its genetic variants are linked with variation in the response to metformin in patients with type 2 diabetes (Kiser et al., 2019). In our predicted results, PRPF31 was found to be closely associated with Q9UKV8 (ranked 1). More importantly, the association between PRPF31 and Q9UKV8 was ranked 1, 1, 2, and 1 in LPBNI, LPIHN, RWR, and CF, respectively. PRPF31 obtained the highest association score with Q9UKV8 in five models.

DISCUSSION AND CONCLUSION
In this manuscript, we developed a novel method, LPI-WGRMF, for identifying possible LPIs based on lncRNA similarity, protein similarity, known LPIs, and weighted graph-regularized matrix factorization. We first integrated the similarity information and known LPIs as the initial resource. We then proposed a weighted graph-regularized matrix factorization model to compute the association scores for lncRNA-protein pairs. LPI-WGRMF was compared with five classical LPI prediction methods, that is, LPBNI, LPI-IBNRA, LPIHN, RWR, and CF. Cross-validation experiments were conducted 20 times. The results showed the strong performance of LPI-WGRMF. We conducted four case study analyses after confirming the accuracy of LPI-WGRMF. The results suggest possible close associations between SFPQ and Q9NUL5, SNHG3 and Q9NUL5, and PRPF31 and Q9UKV8, which need further experimental validation.
In the future, other sources of LPI-related data may be used to improve the prediction performance, for example, using multiple kernels and designing a multiple kernel learning-based algorithm to effectively integrate the abundant lncRNA and protein information.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
FL and JY conceived, designed, and managed the study. XS and LC designed the LPI-WGRMF method, ran LPI-WGRMF, and wrote the original manuscript. JL and CX revised the original draft. XS, JL, and CX discussed the proposed method and suggested further research. All authors read and approved the final manuscript.