A Novel Method for Identifying Essential Proteins Based on Non-negative Matrix Tri-Factorization

Identification of essential proteins is very important for understanding the basic requirements to sustain a living organism. In recent years, there has been increasing interest in computational methods that predict essential proteins from protein–protein interaction (PPI) networks or by fusing multiple sources of biological information. However, existing PPI data contain both false negatives and false positives. Fusing multiple sources of biological information can reduce the influence of false data in PPI networks, but it inevitably introduces additional noise. In this article, we propose a novel non-negative matrix tri-factorization (NMTF)-based model (NTMEP) to predict essential proteins. First, a weighted PPI network is established using only the topological features of the network, so as to avoid introducing more noise. To reduce the influence of false data in the PPI network on the identification of essential proteins, the NMTF technique, which is widely used in recommendation algorithms, is applied to reconstruct an optimized PPI network containing more potential protein–protein interactions. Then, we use the PageRank algorithm to compute the final ranking score of each protein, with subcellular localization and homologous information of the proteins used to calculate the initial scores. Extensive experiments on publicly available datasets indicate that NTMEP performs better at predicting essential proteins than state-of-the-art methods. This investigation demonstrates that introducing non-negative matrix tri-factorization can effectively improve the quality of the protein–protein interaction network and thereby reduce the negative impact of noise on the prediction. This finding also offers a new perspective for other applications based on protein–protein interaction networks.


INTRODUCTION
Essential proteins play an indispensable role in the survival of organisms, and the criticality of a protein is mainly determined by its biological functions. Studies have shown that essential proteins are involved in functions such as translation, transcription, and replication (Glass et al., 2009). Predicting essential proteins can provide important reference information for biology and medicine and has wide application prospects in disease diagnosis and drug design. Researchers have proposed a variety of experimental methods to identify essential proteins, such as single-gene knockout (Kobayashi et al., 2003). However, these experimental methods are costly and time-consuming. Therefore, there is an urgent need to improve the prediction performance of computational methods for identifying essential proteins.
In recent years, researchers have proposed many computational methods to identify essential proteins, relying on different ideas and technologies. Many classic algorithms predict essential proteins from the topological characteristics of the PPI network, such as degree centrality (DC) (Hahn and Kern, 2005), information centrality (IC) (Björnsdottir, 2001), closeness centrality (CC) (Wuchty and Stadler, 2003), betweenness centrality (BC) (Joy et al., 2014), subgraph centrality (SC) (Estrada and Rodríguez-Velázquez, 2005), and sum of edge clustering coefficient centrality (NC). Li et al. (2018) found that, in the PPI network, essential proteins appear in triangular structures significantly more often than non-essential proteins. Based on this discovery, they proposed a new pure centrality measure, Neighborhood Closeness Centrality (NCC). Although this type of approach allows direct identification of essential proteins in the absence of known essential proteins, it has limitations. First, the existing PPI data are incomplete and contain a large number of false positives and false negatives, which affects the accuracy of predicting essential proteins. Second, most of these methods use only the topological properties of the network and ignore other properties of essential proteins.
In order to make up for the limitations of incomplete protein interaction networks, many research groups have in recent years combined PPI networks with other biological information to improve the accuracy of essential protein identification. Tew et al. (2007) proposed a method called NFC, which defines the functional similarity between two proteins based on GO term similarity and scores each protein by the sum of its functional similarities to its neighboring proteins. Zhang et al. (2018) proposed an essential protein prediction method named TEO that combines network topology characteristics, gene expression information, and GO annotation information; a weighted protein interaction network is established by calculating the edge clustering coefficient (ECC), the Pearson correlation coefficient (PCC), and functional similarity, so as to identify essential proteins. Lei et al. (2019) proposed an essential protein identification method called RWEP. First, a weighted PPI network is established using network topology, gene expression, and GO annotations; then, each protein in the network is identified according to subcellular localization and protein complexes. Finally, a random walk with restart algorithm is used to iteratively calculate the protein scores in the weighted network. Because essential proteins cluster strongly, Ren et al. (2011) proposed a centrality method that combines PPI network topology and protein complex information to identify essential proteins. By fusing the topological features of PPI networks with gene expression information, Zhang et al. (2013) and Li et al. (2012) proposed two different models to predict essential proteins, called CoEWC and PeC, respectively. Based on the modular characteristics of essential proteins, Zhao et al. (2014) proposed an essential protein identification method called POEM. In POEM, a highly reliable weighted network is established from network topological characteristics and gene expression information, and overlapping functional modules with high cohesion and low coupling are mined from it. Finally, each protein is scored according to the weighted density of the modules to which it belongs. Peng et al. (2012) observed that essential proteins are more conserved than non-essential proteins and often bind to each other. They proposed an iterative method, ION, that combines direct homology and PPI networks to predict essential proteins. The probability transfer matrix is established from the edge clustering coefficient (ECC) and the interaction network, and the initial score vector of the proteins is established from homology information. According to the similarities of the active PPI networks at each time point, Peng et al. (Zhang et al., 2019) established a novel PPI network. Then, based on this network and the orthologous information of proteins, they developed a dynamic protein-protein interaction network-based model called FDP. Zhong et al. (2021) proposed a new measure called JDC, which uses a dynamic threshold to binarize gene expression data and combines the Jaccard similarity index with degree centrality to predict essential proteins. However, methods based on multisource data integrate the data in a relatively simple way, which not only conceals the complex relationships among the multisource data but may also introduce artificial noise.
In this article, we use non-negative matrix tri-factorization (NMTF) to deal with the challenges introduced above and propose a novel method named NTMEP for identifying essential proteins. NTMEP focuses on three important aspects. First, it is well known that multiple kinds of biological data about proteins can be integrated to construct a weighted PPI network. However, the more types of data are used, the more artificial noise is inevitably introduced. Considering this problem, NTMEP constructs the weighted PPI network using only the original protein-protein interaction information. Second, the NMTF algorithm is used extensively in applications such as pattern recognition, text mining, and DNA gene expression analysis, and it has also been extended to community detection and recommendation systems. Hence, to mine more potential protein-protein associations, the NMTF algorithm is introduced into our process. It takes the latent likelihood of associations between proteins into account, which contributes to a more reliable prediction model that excludes noisy candidates. Third, distinct from previous approaches, we employ homologous and subcellular localization information when ranking proteins, which can effectively improve the accuracy of predicting essential proteins.

MATERIALS AND METHODS
Our purpose is to develop a novel method that improves the accuracy of predicting essential proteins. We first constructed a weighted PPI network to represent the complex relationships between proteins. Next, a prediction method based on NMTF was applied to this network to find potential associations between proteins. Finally, the PageRank algorithm was used to identify the essential protein candidates by integrating subcellular localization and homologous information.
Let $G(V, E)$ be the PPI network with node set $V = \{p_1, p_2, \ldots, p_n\}$ ($n$ is the number of proteins) and edge set $E = \{(p_1, p_2, w_1), (p_2, p_3, w_2), \ldots, (p_i, p_j, w_m)\}$, where $(p_i, p_j, w_m)$ is the interaction between proteins $p_i$ and $p_j$ with weight $w_m$, which is set to 1 in the original protein-protein interaction information.

Protein Association Measurement
In this subsection, a weighted PPI network is constructed in which the association value of two proteins is calculated from their topological characteristics. Analyses of the topological characteristics of PPI networks have shown that they are small-world and scale-free networks. Therefore, the topological features of the PPI network can be used to predict essential proteins. In recent years, the common neighbors of two proteins in the PPI network have been used in many algorithms for predicting essential proteins. These studies show that the more common neighbors two proteins share, the stronger the association between them. In this article, if proteins $p_i$ and $p_j$ share at least one common neighbor, we assume that $p_i$ and $p_j$ interact. This kind of connection between proteins is called the co-neighbor (CoN) relationship and is calculated by Eq. (1), where $S_{Nei}(i)$ and $S_{Nei}(j)$ denote the neighborhood sets of $p_i$ and $p_j$, respectively. By Eq. (1), the CoN value of any two proteins lies between 0 and 1.
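To make the co-neighbor idea concrete, the following is a minimal sketch rather than the authors' implementation. The exact form of Eq. (1) is not reproduced here, so the sketch assumes a Jaccard index over the two neighborhood sets as a stand-in; this choice only shares with Eq. (1) the properties stated in the text (a weight exists only when at least one neighbor is shared, and it lies between 0 and 1). The function names and the toy protein labels are hypothetical.

```python
from itertools import combinations


def neighbor_sets(edges):
    """Build the neighborhood set of every protein from an undirected edge list."""
    nei = {}
    for a, b in edges:
        if a == b:  # ignore self-interactions
            continue
        nei.setdefault(a, set()).add(b)
        nei.setdefault(b, set()).add(a)
    return nei


def co_neighbor_weights(edges):
    """Weight every protein pair that shares at least one common neighbor.

    ASSUMPTION: the weight is the Jaccard index of the two neighborhood
    sets, used only as a stand-in for Eq. (1) of the article.
    """
    nei = neighbor_sets(edges)
    weights = {}
    for p, q in combinations(sorted(nei), 2):
        common = nei[p] & nei[q]
        if common:  # pairs without a shared neighbor get no weight
            weights[(p, q)] = len(common) / len(nei[p] | nei[q])
    return weights


# Toy network with hypothetical protein names
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
weights = co_neighbor_weights(edges)
```

In this toy network, A and D are not directly connected but share the neighbor C, so the pair receives a weight, whereas C and D interact directly but share no neighbor and therefore receive none.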

Reconstruction of the Weighted PPI Network Based on NMTF
Non-negative matrix tri-factorization is a general technique that compresses a data matrix into a compact latent space. It has been used to model topics in text data (Hua et al., 2011), to predict cancer driver genes from clinical data (Xi et al., 2018), and to detect disease-disease associations (Žitnik et al., 2013). It is an efficient data representation technique that has been widely used in recommender systems (Hernando et al., 2016; Luo et al., 2016). These properties should help to improve the prediction accuracy for essential proteins. To take full advantage of NMTF, we apply it to the weighted PPI network ($P_{CoN}$) to mine potential interactions between proteins. In contrast to classic non-negative matrix factorization (Lee and Seung, 1999), where the input matrix is factorized into two parts, NMTF factorizes the input matrix into three latent matrices. Here, we consider that the input adjacency matrix $P_{CoN} \in \mathbb{R}^{n \times n}$ has missing records, that is, interactions between proteins that have not yet been discovered. By using NMTF, a new matrix $Y \in \mathbb{R}^{n \times n}$ containing some new records is constructed according to Eq. (2). NMTF approximates the matrix $P_{CoN}$ by a product of three non-negative latent matrices $F \in \mathbb{R}^{n \times k}$, $S \in \mathbb{R}^{k \times k}$, and $G \in \mathbb{R}^{n \times k}$, where the parameter $k$ denotes the factorization rank, that is, the number of latent vectors that span the column and row spaces. For a given non-negative data matrix $P_{CoN}$, the factors are obtained by solving the optimization problem of Eq. (3), in which $\| \cdot \|_F$ is the Frobenius norm.
Since the objective function in Eq. (3) is jointly non-convex in $F$, $S$, and $G$, we solve it with multiplicative update rules derived using auxiliary functions. The squared Frobenius norm can be written as $\|X\|_F^2 = \mathrm{Tr}(X^T X)$, so Eq. (3) can be rewritten in trace form, from which the partial derivatives with respect to $F$, $S$, and $G$ are obtained. A stationary point can be characterized using the Karush-Kuhn-Tucker (KKT) complementarity conditions. The KKT condition is first written for factor $F$ and is satisfied when the corresponding derivative term vanishes; the updating rules for $G$ and $S$ are derived in the same way. The resulting multiplicative iteration rules are given in Eq. (9), from which the optimal matrix $Y$, which is closest to $P_{CoN}$, can be computed. Finally, to recover the symmetry of the protein-protein interactions, we transform the matrix $Y$ into a symmetric transition probability matrix $P^*_{CoN}$ according to Eq. (10).

The NMTF-Based Model for Identifying Essential Proteins
Following the algorithm described above, an optimized weighted PPI network can be established from the information of the original PPI network alone. We can then use an iterative method to rank protein scores. This method consists of two parts, the calculation of the initial score and the calculation of the ranking score, which are detailed in the following two subsections.
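Before turning to the scoring, the NMTF reconstruction step described earlier in this section can be illustrated with a small sketch. It is not the authors' code: it assumes the plain objective $\|P_{CoN} - F S G^T\|_F^2$ with non-negativity constraints only, uses the standard multiplicative updates for that objective, and symmetrizes $Y$ by averaging it with its transpose, which is an assumed stand-in for Eq. (10). All function names are hypothetical, and the implementation uses plain Python lists so that it stays self-contained.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def scaled(m, num, den, tiny=1e-9):
    """Element-wise multiplicative update m * num / den (tiny avoids 0/0)."""
    return [[v * n / (d + tiny) for v, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(m, num, den)]


def nmtf(x, k, iters=200, seed=0):
    """Factor a non-negative n x n matrix x into F (n x k), S (k x k), G (n x k)."""
    rng = random.Random(seed)
    n = len(x)

    def rand(rows, cols):
        return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

    f, s, g = rand(n, k), rand(k, k), rand(n, k)
    xt = transpose(x)
    for _ in range(iters):
        gst = matmul(g, transpose(s))  # G S^T
        # F <- F * (X G S^T) / (F S G^T G S^T)
        f = scaled(f, matmul(x, gst), matmul(f, matmul(transpose(gst), gst)))
        fs = matmul(f, s)  # F S
        # G <- G * (X^T F S) / (G S^T F^T F S)
        g = scaled(g, matmul(xt, fs), matmul(g, matmul(transpose(fs), fs)))
        ft = transpose(f)
        # S <- S * (F^T X G) / (F^T F S G^T G)
        num = matmul(matmul(ft, x), g)
        den = matmul(matmul(matmul(ft, f), s), matmul(transpose(g), g))
        s = scaled(s, num, den)
    return f, s, g


def reconstruct(f, s, g):
    """Y = F S G^T, averaged with its transpose (ASSUMED symmetrization)."""
    y = matmul(matmul(f, s), transpose(g))
    n = len(y)
    return [[(y[i][j] + y[j][i]) / 2 for j in range(n)] for i in range(n)]


def sq_error(x, y):
    """Squared Frobenius distance between two matrices."""
    return sum((a - b) ** 2 for rx, ry in zip(x, y) for a, b in zip(rx, ry))


# Toy symmetric weight matrix standing in for P_CoN
x = [[0.0, 0.5, 0.5, 0.0],
     [0.5, 0.0, 0.25, 0.5],
     [0.5, 0.25, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0]]
p_star = reconstruct(*nmtf(x, k=2))
```

Because the factors start strictly positive and every update multiplies by a ratio of non-negative terms, non-negativity is preserved without any explicit projection. Entries of `p_star` that are zero in `x` but clearly positive after reconstruction play the role of the "potential interactions" discussed above.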

Computation of Initial Scores
In this part, we assign an initial score to each protein in the PPI network using homologous and subcellular localization information.
Taking the Saccharomyces cerevisiae PPI network as an example, Tang et al. (2018) analyzed whether the proteins in this network had direct homologous proteins in 99 reference species. They concluded that the more homologs a protein has in the reference species, the more likely it is to be an essential protein. For a given protein $p_i$ in the PPI network $G = (V, E)$, we use the homology information to calculate its homology score $S_H(p_i)$, that is, the conservation score of $p_i$, by Eq. (11), where $H(p_i)$ is the number of times that the protein $p_i$ has direct homologous proteins in the reference species. Subcellular localization is another important feature of proteins. By studying the characteristics of protein subcellular localization, researchers (Zhao et al., 2016; Lei et al., 2018) found that essential proteins are more likely to appear in specific subcellular locations. Based on this, we calculate the subcellular localization score $S_L$ of each protein from the subcellular localization information. If the protein $p_i$ exists in the final subcellular localization dataset $R$, the frequency of each subcellular location $r$ is calculated by Eq. (12), where $SN$ represents the relationship between the proteins and the subcellular location dataset, $SN(r)$ is the number of proteins corresponding to the subcellular location $r$, and $n$ is the number of subcellular locations. For a given protein $p_i$, the subcellular localization score $S_L(p_i)$ is the highest score among all of its subcellular locations (Eq. 13), where $C(p_i)$ represents the subcellular locations corresponding to the protein $p_i$. Finally, based on Eq. (11)-(13), the initial score $S_I(p_i)$ of protein $p_i$ is given by Eq. (14).
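The initial scoring described in this subsection can be sketched as follows. Since Eq. (11)-(14) are not reproduced here, every formula in the sketch is an assumption chosen only to match the verbal description: the homology score is the homolog count divided by the largest count, the frequency of a location is its protein count divided by the largest count over all locations, the localization score is the highest frequency among a protein's locations, and the two scores are combined by multiplication. Function names and toy values are hypothetical.

```python
def homology_scores(hom_counts):
    """S_H: homolog count H(p) scaled by the largest count (ASSUMED form of Eq. 11)."""
    top = max(hom_counts.values(), default=0)
    return {p: (c / top if top else 0.0) for p, c in hom_counts.items()}


def localization_scores(locations):
    """S_L: highest location frequency among a protein's locations.

    locations maps each protein to the set of its subcellular locations.
    ASSUMPTION: frequency(r) = SN(r) / max over locations of SN (Eq. 12),
    and S_L(p) = max over r in C(p) of frequency(r) (Eq. 13).
    """
    sn = {}
    for locs in locations.values():
        for r in locs:
            sn[r] = sn.get(r, 0) + 1  # SN(r): number of proteins at location r
    top = max(sn.values(), default=0)
    freq = {r: c / top for r, c in sn.items()}
    return {p: max((freq[r] for r in locs), default=0.0)
            for p, locs in locations.items()}


def initial_scores(hom_counts, locations):
    """S_I = S_H * S_L (ASSUMED combination standing in for Eq. 14)."""
    s_h = homology_scores(hom_counts)
    s_l = localization_scores(locations)
    return {p: s_h[p] * s_l.get(p, 0.0) for p in s_h}


# Toy data with hypothetical proteins and homolog counts
hom_counts = {"A": 99, "B": 33, "C": 0, "D": 66}
locations = {"A": {"Nucleus"}, "B": {"Nucleus", "Golgi"},
             "C": {"Golgi"}, "D": {"Nucleus"}}
s_init = initial_scores(hom_counts, locations)
```

Under these assumptions, a protein with no homologs in the reference species or with no localization record starts with a score of zero and can only gain score from its neighbors during the ranking iteration.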

Computation of Ranking Scores
The ranking score of protein $p_i$ is denoted $S_F(p_i)$, and $\sum_{p_j \in S_{CoN}(i)} P^*_{CoN}(p_i, p_j) S_F(p_j)$ is its neighbor-induced score. Based on this, the ranking score of each protein in the PPI network is calculated by Eq. (15), in which the parameter $\alpha$ ($0 \le \alpha < 1$) adjusts the weight of the two scores in the final ranking score. The ranking score of a protein is thus a linear combination of its initial score and the neighbor-induced score obtained over its edges in the network. Eq. (15) can be rewritten in matrix-vector form as Eq. (16). In our study, the Jacobi iterative method is used to solve Eq. (16), as given in Eq. (17), where $S_F^t$ is the vector of protein scores obtained in the $t$-th iteration. The overall framework of the NMTF-based model for the identification of essential proteins (NTMEP) is summarized in Algorithm 1.
Frontiers in Genetics | www.frontiersin.org

Algorithm 1 | NTMEP
Input: A PPI network $G$, subcellular localization information, homologous protein information, stopping error $\varepsilon$, parameters $k$, $\alpha$, and $K$
Output: Top $K$ proteins sorted by $S_F$ in descending order
Step 1: Calculate the adjacency matrix $P_{CoN}$ of the weighted PPI network according to Eq. (1)
Step 2: Reconstruct the matrix $P_{CoN}$ to $P^*_{CoN}$ by Eq. (2)-(10)
Step 3: Initialize the initial vector $S_I$ with $S_F^0 = S_I$ and $t = 0$
Step 4: Compute $S_F^t$ according to Eq. (17)
Step 5: If $\|S_F^t - S_F^{t-1}\| < \varepsilon$, then $S_F = S_F^t$ and terminate the algorithm. Otherwise, let $t = t + 1$ and repeat Step 4
Step 6: Sort proteins by the value of $S_F$ in descending order
Step 7: Output the top $K$ sorted proteins
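The iteration in Steps 3 to 7 can be sketched as below. This is an illustration under an assumed update, not the authors' code: Eq. (17) is taken to be $S_F^{t} = (1 - \alpha) S_I + \alpha P^*_{CoN} S_F^{t-1}$, which is consistent with the statement that $\alpha$ weights the neighbor-induced score against the initial score. The function names, the `max_iter` safeguard, and the toy matrix are hypothetical.

```python
def rank_proteins(p_star, s_init, alpha=0.2, eps=1e-6, max_iter=1000):
    """Iterate S^t = (1 - alpha) * S_I + alpha * P* S^(t-1) until the change is below eps.

    ASSUMPTION: this update is the presumed form of Eq. (17).
    """
    n = len(s_init)
    s_prev = list(s_init)  # Step 3: S_F^0 = S_I
    for _ in range(max_iter):
        # Step 4: one sweep using only the scores of the previous iteration
        s_next = [(1 - alpha) * s_init[i]
                  + alpha * sum(p_star[i][j] * s_prev[j] for j in range(n))
                  for i in range(n)]
        # Step 5: Euclidean norm of the change between two iterations
        diff = sum((a - b) ** 2 for a, b in zip(s_next, s_prev)) ** 0.5
        s_prev = s_next
        if diff < eps:
            break
    return s_prev


def top_k(names, scores, k):
    """Steps 6 and 7: sort proteins by score in descending order and keep the first k."""
    order = sorted(range(len(names)), key=lambda i: scores[i], reverse=True)
    return [names[i] for i in order[:k]]


# Toy example: three hypothetical proteins and a small symmetric weight matrix
names = ["A", "B", "C"]
p_star = [[0.0, 0.5, 0.5],
          [0.5, 0.0, 0.0],
          [0.5, 0.0, 0.0]]
s_init = [1.0, 0.2, 0.0]
scores = rank_proteins(p_star, s_init, alpha=0.2, eps=1e-9)
ranking = top_k(names, scores, 2)
```

In the toy example, protein C has an initial score of zero but still ends with a positive ranking score because it inherits score from its neighbor A. The iteration converges when $\alpha$ times the spectral radius of the weight matrix is below 1, which holds for this toy matrix.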

Experimental Data
In the experiments, we use four datasets: a protein-protein interaction set, an experimentally verified essential protein set, a subcellular location set, and a homologous protein information set. We downloaded the protein interactions from the DIP database (Xenarios et al., 2002), which includes 1,167 essential proteins and a total of 24,743 interactions among 5,093 proteins after removing self-interactions and duplicate interactions. These data are used to construct the weighted protein network based on its topological structure. The experimentally verified essential protein dataset, with 1,285 essential proteins, is derived from MIPS (Mewes et al., 2006), SGD (Cherry et al., 1998), DEG (Lin, 2009), and SGDP (Saccharomyces Genome Deletion Project, 2012). From the COMPARTMENTS database (Binder et al., 2014), we obtained the subcellular location data, which cover 11 categories (Endoplasmic, Nucleus, Cytoskeleton, Golgi, Cytosol, Vacuole, Plasma, Mitochondrion, Endosome, Peroxisome, and Extracellular) (Peng et al., 2015). The homologous protein information comes from the seventh edition of the InParanoid database (Ostlund et al., 2010), which includes pairwise comparisons of 100 whole genomes (99 eukaryotes and one prokaryote).

Parameter α Sensitivity Analysis
In NTMEP, the parameter α in Eq. (16), which weighs the contributions of the neighbor-induced score and the initial score, was set to 0, 0.1, 0.2, ..., and 1. When only the neighbor-induced score is considered, α is 1; when only the initial score is considered, α is 0. The impact of the parameter α on the performance of NTMEP is presented in Table 1. After the ranking scores of proteins were calculated with each value of α, we counted the number of true essential proteins in the top 100, 200, 300, 400, 500, and 600 candidates. Table 1 shows that the performance of NTMEP is very poor when α is set to 0 or 1 and that the settings 0.1 and 0.2 give better prediction results. In particular, the best performance in the top 100 candidates was achieved when α was set to 0.1. In this article, α was set to 0.2 so that NTMEP obtains good performance.
A higher number of essential proteins within the top k of the ranking list means that more real essential proteins are predicted successfully. The parameter k, which is set to 100, 200, 300, 400, 500, and 600, denotes the number of essential protein candidates selected. The number of real essential proteins within the top k candidates is shown in Figure 1. NTMEP consistently outperformed the other competitive methods at the various k cutoffs: 92, 85.5, 78.7, 73.8, 69.4, and 65.2% of its top 100, 200, 300, 400, 500, and 600 candidates, respectively, are real essential proteins. In particular, for the top 100 essential protein candidates, the prediction accuracy of NTMEP is 46, 48, 55, 48, 51, 37, 18, 19, 11, and 12% higher than that of DC, IC, CC, BC, SC, NC, PeC, CoEWC, POEM, and JDC, respectively.

Validated by Precision-Recall Curves
To obtain a fair and convincing comparison, the precision-recall (PR) curve is used to evaluate the essential protein prediction performance of our method and other state-of-the-art methods. The cutoff k ranges from 1 to 5,093. We compute the scores of all proteins with each algorithm and sort them in descending order. The top k proteins are selected as the positive set, namely, essential protein candidates, and the others form the negative set, namely, non-essential protein candidates. Figure 2 compares the results obtained from the different methods. As shown in Figure 2A, the PR curve of NTMEP shows a significantly higher capability for identifying essential proteins than those of DC, IC, BC, CC, SC, and NC. The results obtained from our method and from PeC, CoEWC, POEM, and JDC are presented in Figure 2B. In the first part of the PR curves in Figure 2B, our method clearly has the highest precision among these methods. To give a quantitative comparison, the area under the curve (AUC) of each PR curve was computed, as shown in Table 2. On the whole, NTMEP substantially outperformed the competing methods.
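The construction of a PR curve from a ranked list, as described above, can be sketched as follows. The article does not state how the AUC was integrated, so the trapezoidal rule used here is an assumption, as are the function names and the toy ranking.

```python
def pr_curve(ranked, essential):
    """Return (recall, precision) at every cutoff k = 1..len(ranked)."""
    total = sum(1 for p in ranked if p in essential)
    if total == 0:
        raise ValueError("no essential protein in the ranked list")
    points, tp = [], 0
    for k, protein in enumerate(ranked, start=1):
        if protein in essential:
            tp += 1  # true positives among the top k candidates
        points.append((tp / total, tp / k))
    return points


def pr_auc(points):
    """Area under the PR curve by the trapezoidal rule (ASSUMED integration scheme)."""
    area, prev_r, prev_p = 0.0, 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area


# Toy ranking with hypothetical protein names; P1 and P3 are essential
ranked = ["P1", "P2", "P3", "P4"]
essential = {"P1", "P3"}
points = pr_curve(ranked, essential)
auc = pr_auc(points)
```

With 5,093 proteins the same function yields 5,093 points per method, one per cutoff k, which is the range of cutoffs stated above.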

Validated by Jackknife Methodology
In this subsection, we employ jackknife curves to assess the performance of NTMEP and the other existing methods (DC, BC, CC, SC, IC, NC, PeC, CoEWC, POEM, and JDC) as the number of top-ranked proteins taken as candidates varies. The jackknife curves of all the methods are displayed in Figure 3, where the horizontal axis denotes the number of top-ranked proteins, sorted in descending order by each method, and the vertical axis is the cumulative number of real essential proteins among them. Figures 3A,B compare the jackknife curves of the competitive methods with that of NTMEP. As seen in Figure 3A, the curve of NTMEP contains a higher number of real essential proteins than those of the existing centrality measures DC, BC, CC, SC, IC, and NC. As shown in Figure 3B, NTMEP is also better than PeC, CoEWC, POEM, and JDC. To give a quantitative comparison, the AUC of each jackknife curve was computed, as shown in Table 3. From Figure 3 and Table 3, it is clear that NTMEP outperforms the other 10 essential protein prediction methods.
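A jackknife curve as defined above is simply the running count of real essential proteins along the ranked list. The sketch below illustrates this; treating the area under the curve as the sum of the curve heights (unit-width steps) is an assumption, since the article does not give the integration rule, and the names and toy data are hypothetical.

```python
from itertools import accumulate


def jackknife_curve(ranked, essential):
    """Cumulative number of real essential proteins among the top 1, 2, ..., n candidates."""
    return list(accumulate(1 if p in essential else 0 for p in ranked))


def curve_area(curve):
    """Area under the step curve with unit-width steps (ASSUMED integration rule)."""
    return sum(curve)


# Toy ranking with hypothetical protein names
ranked = ["P1", "P2", "P3", "P4", "P5"]
essential = {"P1", "P2", "P4"}
curve = jackknife_curve(ranked, essential)
area = curve_area(curve)
```

A method that places essential proteins earlier in the ranking produces a curve that rises sooner and therefore encloses a larger area.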
In summary, these results demonstrated the powerful ability of NTMEP in identifying essential proteins. This finding is reasonable because our method adopts NMTF to find the potential interactions between proteins, which could provide additional interaction information and help to improve the prediction results by a large margin.

Analysis of the Differences Between NTMEP and Other 10 Competitive Prediction Methods
This subsection analyzes the differences between NTMEP and the other prediction methods through experimental results. First, 11 protein sets were constructed by NTMEP and the other 10 prediction methods (DC, IC, CC, BC, SC, NC, PeC, CoEWC, POEM, and JDC); each set contains the top 100 essential protein candidates predicted by one method. The number of proteins shared by NTMEP and each other method and the number of proteins that differ are shown in Table 4.
In Table 4, Mi refers to one of the 10 prediction methods (DC, IC, CC, BC, SC, NC, PeC, CoEWC, POEM, and JDC); |Mi∩NTMEP| represents the number of proteins predicted by both Mi and NTMEP in the top 100 ranked proteins. {Mi-NTMEP} refers to the difference set in the top 100 ranked proteins, that is, the proteins selected as essential by Mi but not by NTMEP, and |Mi-NTMEP| is the number of proteins in this set. Similarly, {NTMEP-Mi} denotes the set of proteins selected by NTMEP but not by Mi, and its size is denoted by |NTMEP-Mi|. The second row of Table 4 shows that 85 of the top 100 essential protein candidates predicted by DC differ from those predicted by NTMEP, and 32 of these 85 candidates are true essential proteins; thus, the percentage of essential proteins in this difference set is 37.6%. Among the top 100 candidates predicted by NTMEP, 85 differ from those predicted by DC, and 78 of them are true essential proteins; thus, the percentage of essential proteins in this difference set is 91.8%. These data show that most of the top 100 candidates predicted by NTMEP differ from those predicted by DC and that NTMEP predicts far more true essential proteins than DC. This indicates not only that NTMEP is a different method from DC but also that it is much better than DC at distinguishing essential proteins from non-essential ones. Similarly, the other rows of the table show that NTMEP maintains this advantage over all the other prediction methods.
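The quantities reported in Table 4 are plain set operations on two top-100 lists. The following sketch shows how one row could be computed; the function name, dictionary keys, and toy lists are hypothetical and are not taken from the article.

```python
def compare_top_sets(top_m, top_ntmep, essential):
    """Overlap and difference statistics between two top-K candidate lists."""
    m, n = set(top_m), set(top_ntmep)
    only_m, only_n = m - n, n - m  # {Mi-NTMEP} and {NTMEP-Mi}

    def essential_share(candidates):
        # fraction of true essential proteins in a difference set
        return len(candidates & essential) / len(candidates) if candidates else 0.0

    return {
        "common": len(m & n),            # |Mi ∩ NTMEP|
        "only_m": len(only_m),           # |Mi-NTMEP|
        "only_ntmep": len(only_n),       # |NTMEP-Mi|
        "essential_share_only_m": essential_share(only_m),
        "essential_share_only_ntmep": essential_share(only_n),
    }


# Toy top-4 lists with hypothetical protein names
top_m = ["P1", "P2", "P3", "P4"]
top_ntmep = ["P1", "P5", "P6", "P7"]
essential = {"P1", "P2", "P5", "P6", "P7"}
row = compare_top_sets(top_m, top_ntmep, essential)
```

The percentages quoted for DC follow the same arithmetic: 32 / 85 ≈ 37.6% and 78 / 85 ≈ 91.8%.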

CONCLUSION
Previous studies have developed many computational methods to predict essential proteins effectively. However, these methods do not take full account of the false-positive and false-negative noise generated by high-throughput experimentation and by the construction of the weighted PPI network. To make full use of the complex associations between proteins, NMTF is introduced into our proposed method. Moreover, subcellular localization and homologous protein information are used in the final scoring stage instead of the stage of establishing the weighted network. Comprehensive experiments show that our new method performs better than the other methods. A possible explanation for these results is that there are deep relationships between proteins that are not found by high-throughput experimentation, and that fusing multiple data sources raises the cost and reduces the overall efficiency of the process. These results add to the rapidly expanding field of computational methods for predicting essential proteins. However, this study did not solve the problem of noise generated by multisource data fusion, which is an important issue for future research.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.

AUTHOR CONTRIBUTIONS
ZZ and MJ obtained the protein-protein interaction data, benchmark essential protein dataset, subcellular location data, and homologous protein information. ZZ, MJ, and XQ designed the new method, NTMEP, and analyzed the results. ZZ, DW, and WZ drafted and revised the manuscript together. All authors have read and approved the manuscript.