A Novel Model for Identifying Essential Proteins Based on Key Target Convergence Sets

In recent years, many computational models have been designed to detect essential proteins based on protein-protein interaction (PPI) networks. However, due to the incompleteness of PPI networks, the prediction accuracy of these models is still not satisfactory. In this manuscript, a novel key target convergence sets based prediction model (KTCSPM) is proposed to identify essential proteins. In KTCSPM, a weighted PPI network and a weighted (Domain-Domain Interaction) network are constructed first based on known PPIs and PDIs downloaded from benchmark databases. And then, by integrating these two kinds of networks, a novel weighted PDI network is built. Next, through assigning a unique key target convergence set (KTCS) for each node in the weighted PDI network, an improved method based on the random walk with restart is designed to identify essential proteins. Finally, in order to evaluate the predictive effects of KTCSPM, it is compared with 12 competitive state-of-the-art models, and experimental results show that KTCSPM can achieve better prediction accuracy. Considering the satisfactory predictive performance achieved by KTCSPM, it indicates that KTCSPM might be a good supplement to the future research on prediction of essential proteins.


INTRODUCTION
With the deepening of researches on proteins, accumulating evidences have demonstrated that proteins are closely related to most of the life activities. Moreover, different proteins are of different importance to different life activities. Among these proteins, essential proteins, as a kind of important proteins, are essential for the survival, and development of life. Therefore, in recent years, detection and recognition of essential proteins has become a hot issue in the research and development of disease treatment. However, it is very timeconsuming and expensive to identify essential proteins by traditional biological experiments, which leads to the emergence and development of computational prediction methods. For instance, Zhao et al. (2019) designed a new random walk wandering based prediction model to detect key proteins based on a heterogeneous network consisting of proteins and protein domains. Jeong et al. (2001) found that PPI networks are scale-free and proposed a center-lethal rule for PPI networks. Based on which, lots of methods including the information centrality (IC) (Stephenson and Zelen, 1989), betweenness centrality (BC) (Joy et al., 2014), degree centrality (DC) (Hahn and Kern, 2004), Closeness Centrality (CC) (Wuchty and Stadler, 2003), subgraph centrality (SC) (Estrada and Rodriguez-Velazquez, 2005), and neighbor centrality (NC)  had been put forward successively. In addition, Yu et al. (2007) proposed the importance of network bottlenecks. Estrada (2006) found that a small number of binary proteins were mostly essential proteins. Chua et al. (2006) proposed to identify essential proteins by calculating the weights of indirect neighbor nodes. Li et al. (2012) designed a method to predict essential proteins by combining PPI networks with gene expression data of proteins. Li et al. (2011) find essential proteins by analyzing the relationship between proteins and their neighbors, and define the method as LAC. Peng et al. (2012) combined orthology information of proteins with PPI networks to predict key proteins. Zhao et al. (2014) found that combination of gene expression profiles and PPI networks was of great help to the prediction accuracy of essential proteins. Min et al. (2017) discovered that the complex information of proteins can improve prediction accuracy and precision of potential essential proteins. Zhao et al. (2016) proposed a basic protein identification method based on protein gene time expressions and protein domains. Zhang et al. (2019) proposed a new protein prediction method called TEGS, which can identify essential proteins by fusing the introduced multiple biological information data. Lei et al. (2018) found that it can achieve good results to adopt artificial fish swarm optimization algorithm into key protein prediction. Peng et al. (2015) discovered that combination of protein domain features and protein interaction networks can effectively predict potential essential proteins. Li et al. (2019) proposed a target convergence set (TCS) based method for predicting potential lncRNA-disease associations. Athira and Gopakumar (2020) proposed a multiplex network to identifying essential proteins. Zhang et al. (2018) designed a novel method by combining network topology, gene expression profile and GO information to identifying essential proteins. Fan et al. (2017) proposed a modified PageRank algorithm based on subcellular information. Meng et al. (2021) predict the essential protein by constructing a new weighted protein and protein domain network, and performing a local random walk on this basis. Xenarios et al. (2002) introduced a public database called DIP for studying cellular networks of protein interactions. Gavin et al. (2006) provided a complete and comprehensive eukaryotic machine and biological data integration and modeling platform.
Inspired by above methods, in this manuscript, a computational model named KTCSPM was proposed to predict essential proteins. In KTCSPM, a weighted PDI network was first constructed by integrating a weighted PPI network and a weighted domain-domain interaction (DDI) network. And then, each node in the weighted PDI network would be assigned a unique key target convergence sets (KTCS) according to the network distance information of the weighted PDI network, which could reflect the specificity of different nodes in the process of random walk with restart and improve the predictive performance of KTCSPM. Next, for an arbitrarily selected walker, considering that there may still be some nodes that are essential proteins but not included in KTCS while KTCS reached the final convergence state, each node in the heterogeneous network would be further assigned a unique Intact Set (IS) to ensure that the predicted results would not be omitted as far as possible. Next, we will construct a random walk probability matrix and calculate the stable walk probability of all nodes, and then rank each protein based on the initial protein score vector. Finally, in order to evaluate the predictive performance of KTCSPM, we compared it with 12 advanced predictive methods based on two kinds of yeast PPI networks, and experimental results showed that KTCSPM can achieve reliable predictive accuracy of 90. 19, 81.96, 70.72, 62.04, 55.83, and 51.13% in top 1, top 5, top 10, top 15, top 20, and top 25% of predicted key proteins separately, which are better than all these 12 competing predictive models.

Construction of the Weighted PPI Network
In this section, we will download known PPI data from two different public databases such as the DIP database (Xenarios et al., 2002) and the Gavin database (Gavin et al., 2006), respectively. Obviously, based on these known PPI network downloaded from any given public database, an original PPI network PPIN =< D PP , E PP > can be constructed as follows: Let D PP = {p 1 , p 2 , . . . , p N P } represent the set of newly downloaded proteins and E PP denote the set of edges between proteins in PPIN, here, for any two given proteins p i and p j in D PP , if and only if there is a known interaction between them, then we define that there is an edge between them in PPIN. Thereafter, based on the newly constructed original PPI network PPIN, we can further obtain an N P × N P dimensional adjacency matrix M PPIN as follows: for any two given protein nodes p i and p j in PPIN, if and only if there is an edge between them in PPIN, there is In previous studies, the Gaussian interaction profile kernel similarity has been widely used to measure the similarity between similar nodes (Chen et al., 2016). In this section, for any two given proteins p i and p j in M PPIN , we define the Gaussian interaction profile kernel similarity between them as follows: Here, IP(p t ) represents the vector of elements in the t-th row of the matrix M PPIN , and γ p denotes the normalized kernel bandwidth based on the bandwidth parameter γ p . In addition, according to the methodology proposed by Vanunu et al. (2010), we will further optimize above Gaussian interaction profile kernel similarity of protein by introducing a logistics function as follows: LGKS(p i , p j ) = 1 1 + e (−12GKS(i,j)+log9999) This logistic function can make the calculated results of Gaussian interaction porofile kernel similarity more influential in the identification of essential proteins. Additionally, considering that while analyzing the topology structure of PPI network, the PPI network can be weighted to show the interaction between proteins, therefore, based on above newly obtained matrixLGKS, for any two given proteins p i and p j , we can weigh the relationship between them as follows: Here, N(p i ) and N(p j ) represent the sets of protein nodes directly adjacent to p i and p j in PPIN, respectively, and N(p i ) ∩ N(p j ) denotes the set of protein nodes adjacent to both p i and p j in PPIN. Obviously, based on above Equation (4), we can obtain a weighted PPI network WPIN = < D PP ,E WPP > easily by taking W PP (p i , p j ) as the weight of the edge between nodes p i and p j in WPIN, where D PP and E WPP denote the sets of nodes and edges in WPIN separately.

Construction of the Weighted DDI Network
In this section, we will first download known domain data from the Pfam database (Peng et al., 2012;Bateman et al., 2014), and for convenience, let D DD = {d 1 , d 2 , . . . , d N D } represent the set of newly downloaded domains, then for any given protein p i ∈ D PP , and domain d j ∈ D DD , it is obvious that we can estimate the relationship between them as follows: Here, |d j | represents the number of different proteins belonging to d j . Furthermore, according to above Equation (5), for any two given domains d i and d j in D DD , we can calculate the relationship between them as follows: Obviously, based on above Equation (6), we can easily construct a weighted DDI network WDIN = < D DD , E DD > as follows: Let E DD denote the set of edges between domains in WDIN, here, for any two given domains d i and d j in D DD , if and only if there is W DD (d i , d j ) > 0, we define that there is an edge between them in WDIN, and at the same time, the weight of the edge between

Construction of the Weighted PDI Network
Based on above Equations (4)-(6), it is obvious that we can construct a new (N P + N D )×(N D + N P ) dimensional matrix M PD as follows: Here, W T PD is a transport matrix of W PD . Based on above matrix M PD , we can easily construct a novel weighted PDI network WPDIN = < D PD , E WPD > as follows: Let D PD = {pd 1 , pd 2 , . . . , pd N P , pd N P +1 , pd N P 2 , . . . , pd N P + N D } = {p 1 , p 2 , . . . , p N P , d 1 , d 2 , . . . , d N D } represent the set of nodes in WPDIN, and E WPD denote the set of edges in WPDIN, then, for any two given nodes pd i and pd j in D PD , if and only if there is there is an edge between them in E WPD , and moreover, the weight of the edge between them is M PD (pd i , pd j ).

Calculation of Initial Scores for Proteins
For any given protein node pi in WPDIN, in this section, we will assign an initial score for it based on the functional features extracted from the subcellular localization information of proteins, and the conservative features provided by orthologous information of proteins. Firstly, we will download the orthologous information of proteins from the InParanoid database (Mewes et al., 2006;Gabriel et al., 2010) and the subcellular localization information of proteins from the COMPART-MENTS database (Binder et al., 2014;Min et al., 2017). And then, for convenience, let Np(i) represent the total number of proteins relating to the i-th subcellular localization, N L denote the total number of different subcellular localizations downloaded above, and S(p i ) represent the set of subcellular locations associating with p i . Hence, we can calculate a score for p i based on the subcellular localization information as follows: Where, Next, let Hom(p i ) denote the score of p i in the downloaded homologous information and N H denote the total number of proteins with homologous information, then, we can calculate another score for p i based on the homologous information as follows: Finally, through integrating above two kinds of scores together, we can obtain an initial score for p i as follows:

Establishment of the Key Target Convergence Sets
Before implementing random walk with restart on WPDIN, as shown in Figure 1, each node in WPDIN will establish a unique KTCS first according to the following steps: Step 1: For any given protein node p i in WPDIN, we define its original KTCS as the set of all domain nodes associating with p i , that is the original KTCS of p i is Similarly, for any given protein domain node d j , we can define its original KTCS as Step 2: For any given protein node p i in WPDIN, ∀d k ∈ KTCS 0 (p i ) and ∀d t ∈ D DD , we define the network distance between d k and d t in WPDI as follows: Similarly, for any given domain node d i in WPDIN, ∀p k ∈ KTCS 0 (d i ) and ∀p t ∈ D PP , we can define the network distance between p k and p t in WPDI as follows: Step 3: According to the above Equations (13, 14), for any given protein node p i or domain node d j in WPDIN, we define the KTCS (d j ) of d j as the set of first 200 protein nodes in WPDIN that have the minimum average network distance to nodes in KTCS 0 (d j ), and the KTCS (p i ) of p i as the set of first 200 domain nodes in WPDIN that have the minimum average network distance to nodes in KTCS 0 p i . Therefore, it easy to know that these 200 protein nodes in KTCS (d j ) may belong to KTCS 0 d j or may not belong to KTCS 0 d j , and these 200 domain nodes in KTCS(p i ) may belong to KTCS 0 p i or may not belong to KTCS 0 p i as well.

Random Walk With Restart in WPDIN
The transition process of a walker from a starting node in the network to other nodes with a given probability is called the method of Random walk. Based on the assumption that there is a correlation between essential proteins and domains, as shown in Figure 2, the random walk process of KTCSPM can be mainly divided into the following steps: Step 1: For a walker, before it starts to walk randomly in WPDIN, we can first obtain a transition probability matrix W for it as follows: Step 2: Moreover, for any given node pd i in WPDIN, we can as well obtain an initial probability vector R i (0) for the walker as follows: Step 3: Next, while starting a walk, the walker will select a node (for convenience, let it be pd 0 ) in WPDIN randomly as its initial location of this walk, where pd 0 may be a protein node or a domain node. Supposing that after walking t-1 hops, the walker reaches the current node pd i in WPDIN, then, we can further calculate a new walking probability vector R i (t) for it as follow: Here, α(0 < α < 1) is a parameter for adjusting weights between R i (0) and R i (t-1). Moreover, for convenience, let R i (t) = (R i,1 (t), R i,2 (t), . . . , R i,j (t), . . . , R i,N P N D (t)) T , where R i,j (t) denotes the walking probability that the walker will walk from its current location pd i to the node pd j at its t-th hop.
Here, it is worth noting that for the starting node pd 0 , since the walker can be considered to reach pd 0 from pd 0 after zero hops, therefore, for the starting node pd 0 , the walker can obtain an initial probability vector R 0 (0), and a walking probability vector R 0 (1).
Step 4: Assuming that the walker has walked from a node pd i to a current node pd j after t-1 hops during its random walk in WPDIN, the walk probability vectors calculated by the walker at pd i and pd j are R i (t-1) and R j (t), respectively, if the L1 norm between R i (t-1) and R j (t) satisfies ||R j (t) − R i (t − 1)|| 1 ≤ 10 −6 , then we define that the walking probability vector R j (t) has reached a stable state at its current location. Moreover, after the walker having obtained a stable walking probability at each node in WPDIN, for convenience, we will define the stable probability obtained by the walker at any given node pd k in WPDIN as R k (∞), and then, we can construct a stable walking probability matrix K (∞) as follows: where, K 1 is a N P × N P dimensional matrix, K 2 is a N P × N D dimensional matrix, K 3 is a N D × N P dimensional matrix, and K 4 is a N D × N D dimensional matrix. Thereafter, it is obvious that K 2 and K 3 will be the final result matrices, which can be adopted to predict potential essential proteins. According to above steps of KTCSPM, it is easy to see that, for any node pd i in WPDIN, a stable walking probability vector R i (∞) = (R i,1 (∞), R i,2 (∞), . . . , R i,j (∞), . . . , R i,N P +N D (∞)) T will be obtained by the walker. For convenience, we denote the node set DPD in WPDIN as the IS. Therefore, we can redefine the stable probability R i (∞) as R IS i (∞). However, through observing R IS i (∞), it is easy to find that the walker will stop its random walking only after the walking probability vector calculated at each node in WPDIN is stable. In the face of Frontiers in Genetics | www.frontiersin.org large data, this mechanism is obviously very time-consuming. Hence, in order to speed up the convergence speed of KTCSPM and reduce the experimental execution time, based on the concept of KTCS defined above, when constructing the vector R i (t) = (R i,1 (t), R i,2 (t), . . . , R i,j (t), . . . , R i,N P +N D (t)) T at the node pd i , if the j-th node pd j ∈ KTCS (pd i ) in WPIND, then R i,j (t) will be remained unchanged, otherwise we will redefine R i,j (t) =0. Thus, the walking probability vector at pd i will be changed to R i KTCS (t) and the stable walking probability at pd i will be changed to R i KTCS (∞). Obviously, the stable state of R i KTCS (∞) can be achieved faster than that of R i IS (∞). However, considering that there may be some nodes not belonging to KTCS (pd i ) but relating to the target, therefore, in order to avoid any omissions, at any given node pd i in WPIND, we will construct a novel final stable walking probability vector R ANS Step 5: For any protein node p i in WPIND, according to the final stable walking probability vector R ANS i (∞) and the initial protein score Initial_Score (p i ) obtained above, it is obvious that a novel final feature score Final_Score (p i ) can be calculated as follows:

Algorithm KTCSPM Input
Original PPI network, original protein-domain network, domain data, subcellular data, orthologous data, and the proportion regulation parameters α.

Output
Proteins final score Final_Score(p i ).

Experimental Data
In this section, extensive experiments will be done to compare KTCSPM with representative methods. And during experiments, the domain data is downloaded from the Pfam database (Bateman et al., 2014). The subcellular location data is derived from the COMPARTMENTS database (Binder et al., 2014), in   which, the following classifications of the subcellular interstitium related to the basic proteins of eukaryotic cells are included: Golgi bodies, endoplasm, cytoplasm, cytoskeleton, vacuoles, endosomes, mitochondria, plasma, peroxomes, and nuclei, etc. Besides, the reference bases of the essential genes of Scerevisiae are collected from MIPS (Mewes et al., 2006), SGD (Cherry et al., 1998), DEG (Zhang and Lin, 2009), and SGDP (Saccharomyces Genome Deletion Project, 2012). In the dataset downloaded from the DIP database, there are 5,093 proteins in total, in which, 1,167 are essential and 3,526 are non-essential. In the dataset downloaded from the GAVIN database, there are a total of 1,855 proteins, in which, 714 are essential proteins.

Comparison Between KTCSPM and Competitive Methods
In order to verify the predictive performance of KTCSPM, in this section, we will compare it with several representative methods such as DC (2001) Figure 3 shows the comparison results between KTCSPM and these competitive methods. From observing Figure 3, it is obvious that the prediction accuracy of KTCSPM is significantly better than that of all these competing methods in from top 1 to 25% predicted essential proteins. In particular, KTCSPM can achieve a reliable prediction accuracy rate of 90.21% in the top 1% ranked key proteins.

Validation With Jackknife Methodology
For a comprehensive and accurate comparison, in this section, we will adopt the Jackknife methodology (Holman et al., 2009) to compare the predictive performances between KTCSPM and above mentioned competing methods. Experimental results are shown in Figure 4, from which, it can be clearly seen that the jackknife curve of KTCSPM is higher than that of all these state-of-the-art predictive methods. Although in Figure 4B, the jackknife curves of KTCSPM and TEGS have multiple intersections, however, when the number of ranked proteins is bigger than 600, the predictive results of KTCSPM will become continuously higher than that of TEGS. Therefore, according to both Figures 4A,B, we can draw a conclusion that KTCSPM can achieve better predictive performance than all these representative methods in predicting essential proteins.

Validation by Precision-Recall Curves and ROC Curves
In this section, ROC curve (receiver operating characteristic) and precision-recall curves (PR) will be adopted to measure the performance of KTCSPM. Researches show that the larger the area under the ROC curve (AUC), the better the model performance, and in addition, when AUC = 0.5, the model  performance will be in a random state. Moreover, when PR curves are adopted to evaluate predictive models, more comprehensive feedbacks on performances of predictive models can be obtained by using different validation methods. And as a result, Figure 5 shows the comparisons of performance between KTCSPM and 12 competitive prediction models under the PR curve and ROC curve separately. From the Figures 5A1,2,B1,2, it can be seen that when KTCSPM is compared with SC, EC, DC, IC, BC, CC, NC, PeC, the area under the PR curve (AUC), and ROC curve display results show that KTCSPM is superior. For these methods, by observing a3 and b3, it can be seen that when KTCSPM is compared with TEGS and POEM methods, the gap becomes smaller and there is overlap, but even so, the prediction performance of KTCSPM is still better than the 12 methods.

Analysis of the Differences Between KTCSPM and Competitive Methods
It can be seen from above descriptions that KTCSPM can achieve satisfactory predictive effects. In this section, we will further analyze the differences between KTCSPM and 12 competing methods by calculating the number of overlaps of first 200 predicted proteins. comparison results are shown in Figure 6, where Mi represents one of these 12 competitive methods, | KTCSPM-Mi| denotes the number of proteins detected by KTCSPM but not by Mi, | Mi-KTCSPM| means the number of proteins detected by Mi but not by KTCSPM. Obviously, according to the curve trends in Figure 6, we can see that the ratio of essential proteins predicted by KTCSPM is much higher than that predicted by anyone of these 12 competing methods, which means that KTCSPM can screen out more essential proteins not found by Mi, and demonstrates that KTCSPM can achieve much better predictive performance as well.

Prediction Performance of KTCSPM Based on the Gavin Database
In this section, in order to further verify the adaptability of KTCSPM, we will further compare it with 11 competitive methods based on the Gavin database, and comparison results are shown in the following Table 1.

Effects of Parameters on Performance of KTCSPM
In this section, we will estimate the effects of parameters on the prediction performance of KTCSPM. First, as for the parameter γ p in Equation (1), we will set its value to one based on precedents (Twan et al., 2011). However, as for the parameter in Equation (17), as illustrated in Table 2, we will set its value from 0.1 to 0.9, and evaluate its impacts on the prediction performance of KTCSPM. Through observing Table 2, it is easy to see that when is set to 0.3, KTCSPM can achieve the best prediction effect. Moreover, it can be clearly seen that KTCSPM remains robust to different values of, which means that KTCSPM is not sensitive to the values of α.

DISCUSSION
It is time consuming and energy consuming to predict essential proteins through traditional biological experiments, so it has become a hot topic in the field of bioinformatics to build mathematical models to predict essential proteins. In this manuscript, a new prediction model called KTCSPM is proposed, in which, a weighted PDI network constructed by integrating a weighted PPI network and a weighted DDI network first, and then, based on the concepts of KCS and IS, a predictive method is further designed to infer potential key proteins in the weighted PDI network based on the random walk with restart.
Finally, extensive experiments have demonstrated the predictive superiority of KTCSPM. At present, some methods have been proposed to infer potential disease related miRNAs such as RWRMDA (Chen et al., 2012), RLSMDA (Chen and Yan, 2014) and RBMMMDA (Chen et al., 2015), in the future, KTCSPM may also be applied to predict potential associations between miRNAs, and diseases.

CONCLUSION
In this manuscript, the main contributions are as follows: (1) A novel weighted PDI network is designed by combining a weighted PPI network with a weighted DDI network.
(2) The concept of network distance is introduced, and the KTCS and the IS are established for nodes in the weighted PDI network.
(3) Based on the concepts of KTCS and IS, an improved random walk with restart algorithm is proposed to recognize essential proteins. By comparing with existing state-of-the-art predictive models, it is proved that KTCSPM can achieve better predictive performance in detecting essential proteins.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
JP conceived this research. JP, LW, and LK were improved on the basis of the original design model. JP and ZZ wrote the manuscript. YT and ZC provided advice and supervision on the research. All authors contributed to the article and approved the submitted version.