ORIGINAL RESEARCH article

Front. Genet., 29 July 2021
Sec. Computational Genomics
https://doi.org/10.3389/fgene.2021.721486

A Novel Model for Identifying Essential Proteins Based on Key Target Convergence Sets

Jiaxin Peng1,2, Linai Kuang1*, Zhen Zhang2, Yihong Tan2, Zhiping Chen2 and Lei Wang1,2*
  • 1College of Computer, Xiangtan University, Xiangtan, China
  • 2College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China

In recent years, many computational models have been designed to detect essential proteins based on protein-protein interaction (PPI) networks. However, due to the incompleteness of PPI networks, the prediction accuracy of these models is still not satisfactory. In this manuscript, a novel key target convergence sets based prediction model (KTCSPM) is proposed to identify essential proteins. In KTCSPM, a weighted PPI network and a weighted (Domain-Domain Interaction) network are constructed first based on known PPIs and PDIs downloaded from benchmark databases. And then, by integrating these two kinds of networks, a novel weighted PDI network is built. Next, through assigning a unique key target convergence set (KTCS) for each node in the weighted PDI network, an improved method based on the random walk with restart is designed to identify essential proteins. Finally, in order to evaluate the predictive effects of KTCSPM, it is compared with 12 competitive state-of-the-art models, and experimental results show that KTCSPM can achieve better prediction accuracy. Considering the satisfactory predictive performance achieved by KTCSPM, it indicates that KTCSPM might be a good supplement to the future research on prediction of essential proteins.

Introduction

With the deepening of researches on proteins, accumulating evidences have demonstrated that proteins are closely related to most of the life activities. Moreover, different proteins are of different importance to different life activities. Among these proteins, essential proteins, as a kind of important proteins, are essential for the survival, and development of life. Therefore, in recent years, detection and recognition of essential proteins has become a hot issue in the research and development of disease treatment. However, it is very time-consuming and expensive to identify essential proteins by traditional biological experiments, which leads to the emergence and development of computational prediction methods. For instance, Zhao et al. (2019) designed a new random walk wandering based prediction model to detect key proteins based on a heterogeneous network consisting of proteins and protein domains. Jeong et al. (2001) found that PPI networks are scale-free and proposed a center-lethal rule for PPI networks. Based on which, lots of methods including the information centrality (IC) (Stephenson and Zelen, 1989), betweenness centrality (BC) (Joy et al., 2014), degree centrality (DC) (Hahn and Kern, 2004), Closeness Centrality (CC) (Wuchty and Stadler, 2003), subgraph centrality (SC) (Estrada and Rodriguez-Velazquez, 2005), and neighbor centrality (NC) (Wang et al., 2012) had been put forward successively. In addition, Yu et al. (2007) proposed the importance of network bottlenecks. Estrada (2006) found that a small number of binary proteins were mostly essential proteins. Chua et al. (2006) proposed to identify essential proteins by calculating the weights of indirect neighbor nodes. Li et al. (2012) designed a method to predict essential proteins by combining PPI networks with gene expression data of proteins. Li et al. (2011) find essential proteins by analyzing the relationship between proteins and their neighbors, and define the method as LAC. Peng et al. (2012) combined orthology information of proteins with PPI networks to predict key proteins. Zhao et al. (2014) found that combination of gene expression profiles and PPI networks was of great help to the prediction accuracy of essential proteins. Min et al. (2017) discovered that the complex information of proteins can improve prediction accuracy and precision of potential essential proteins. Zhao et al. (2016) proposed a basic protein identification method based on protein gene time expressions and protein domains. Zhang et al. (2019) proposed a new protein prediction method called TEGS, which can identify essential proteins by fusing the introduced multiple biological information data. Lei et al. (2018) found that it can achieve good results to adopt artificial fish swarm optimization algorithm into key protein prediction. Peng et al. (2015) discovered that combination of protein domain features and protein interaction networks can effectively predict potential essential proteins. Li et al. (2019) proposed a target convergence set (TCS) based method for predicting potential lncRNA-disease associations. Athira and Gopakumar (2020) proposed a multiplex network to identifying essential proteins. Zhang et al. (2018) designed a novel method by combining network topology, gene expression profile and GO information to identifying essential proteins. Fan et al. (2017) proposed a modified PageRank algorithm based on subcellular information. Meng et al. (2021) predict the essential protein by constructing a new weighted protein and protein domain network, and performing a local random walk on this basis. Xenarios et al. (2002) introduced a public database called DIP for studying cellular networks of protein interactions. Gavin et al. (2006) provided a complete and comprehensive eukaryotic machine and biological data integration and modeling platform.

Inspired by above methods, in this manuscript, a computational model named KTCSPM was proposed to predict essential proteins. In KTCSPM, a weighted PDI network was first constructed by integrating a weighted PPI network and a weighted domain-domain interaction (DDI) network. And then, each node in the weighted PDI network would be assigned a unique key target convergence sets (KTCS) according to the network distance information of the weighted PDI network, which could reflect the specificity of different nodes in the process of random walk with restart and improve the predictive performance of KTCSPM. Next, for an arbitrarily selected walker, considering that there may still be some nodes that are essential proteins but not included in KTCS while KTCS reached the final convergence state, each node in the heterogeneous network would be further assigned a unique Intact Set (IS) to ensure that the predicted results would not be omitted as far as possible. Next, we will construct a random walk probability matrix and calculate the stable walk probability of all nodes, and then rank each protein based on the initial protein score vector. Finally, in order to evaluate the predictive performance of KTCSPM, we compared it with 12 advanced predictive methods based on two kinds of yeast PPI networks, and experimental results showed that KTCSPM can achieve reliable predictive accuracy of 90.19, 81.96, 70.72, 62.04, 55.83, and 51.13% in top 1, top 5, top 10, top 15, top 20, and top 25% of predicted key proteins separately, which are better than all these 12 competing predictive models.

Materials and Methods

Construction of the Weighted PPI Network

In this section, we will download known PPI data from two different public databases such as the DIP database (Xenarios et al., 2002) and the Gavin database (Gavin et al., 2006), respectively. Obviously, based on these known PPI network downloaded from any given public database, an original PPI network PPIN = < DPP, EPP > can be constructed as follows: Let DPP = {p1, p2, …, pNP} represent the set of newly downloaded proteins and EPP denote the set of edges between proteins in PPIN, here, for any two given proteins pi and pj in DPP, if and only if there is a known interaction between them, then we define that there is an edge between them in PPIN. Thereafter, based on the newly constructed original PPI network PPIN, we can further obtain an NP × NP dimensional adjacency matrix MPPIN as follows: for any two given protein nodes pi and pj in PPIN, if and only if there is an edge between them in PPIN, there is MPPIN(pi, pj) = 1, otherwise there is MPPIN(pi, pj) = 0.

In previous studies, the Gaussian interaction profile kernel similarity has been widely used to measure the similarity between similar nodes (Chen et al., 2016). In this section, for any two given proteins pi and pj in MPPIN, we define the Gaussian interaction profile kernel similarity between them as follows:

GKS(i,j)=exp(-γp||IP(pi)-IP(pj)||2)(1)
γp=γp/kNP=1||IP(pk)||2(2)

Here, IP(pt) represents the vector of elements in the t-th row of the matrix MPPIN, and γp denotes the normalized kernel bandwidth based on the bandwidth parameter γp. In addition, according to the methodology proposed by Vanunu et al. (2010), we will further optimize above Gaussian interaction profile kernel similarity of protein by introducing a logistics function as follows:

LGKS(pi,pj)=11+e(-12GKS(i,j)+log9999)(3)

This logistic function can make the calculated results of Gaussian interaction porofile kernel similarity more influential in the identification of essential proteins. Additionally, considering that while analyzing the topology structure of PPI network, the PPI network can be weighted to show the interaction between proteins, therefore, based on above newly obtained matrixLGKS, for any two given proteins pi and pj, we can weigh the relationship between them as follows:

WPP(pi,pj)=LGKS(pi,pj)+|N(pi)N(pj)|2(|N(pi)+1|)*(|N(pj)+1|)2(4)

Here, N(pi)and N(pj) represent the sets of protein nodes directly adjacent to pi and pj in PPIN, respectively, and N(pi)∩N(pj) denotes the set of protein nodes adjacent to both pi and pj in PPIN. Obviously, based on above Equation (4), we can obtain a weighted PPI network WPIN = < DPP,EWPP> easily by taking WPP(pi, pj) as the weight of the edge between nodes pi and pj in WPIN, where DPP and EWPP denote the sets of nodes and edges in WPIN separately.

Construction of the Weighted DDI Network

In this section, we will first download known domain data from the Pfam database (Peng et al., 2012; Bateman et al., 2014), and for convenience, let DDD = {d1, d2, …, dND} represent the set of newly downloaded domains, then for any given protein piDPP, and domain djDDD, it is obvious that we can estimate the relationship between them as follows:

WPD(pi,dj)=pkdjWPP(pi,pk)|dj|(5)

Here, |dj| represents the number of different proteins belonging to dj. Furthermore, according to above Equation (5), for any two given domains di and dj in DDD, we can calculate the relationship between them as follows:

WDD(di,dj)=pxdiWPD(px,dj)+pydjWPD(py,di)|di|+|dj|(6)

Obviously, based on above Equation (6), we can easily construct a weighted DDI network WDIN = < DDD, EDD > as follows: Let EDD denote the set of edges between domains in WDIN, here, for any two given domains di and dj in DDD, if and only if there is WDD(di,dj) > 0, we define that there is an edge between them in WDIN, and at the same time, the weight of the edge between di and dj is WDD(di, dj).

Construction of the Weighted PDI Network

Based on above Equations (4)–(6), it is obvious that we can construct a new (NP + ND) × (ND + NP) dimensional matrix MPD as follows:

MPD=[WPPWPDWPDTWDD](7)

Here, WPDT is a transport matrix of WPD. Based on above matrix MPD, we can easily construct a novel weighted PDI network WPDIN = < DPD, EWPD > as follows: Let DPD={pd1,pd,2,pdNP,pdNP+1,pdNP2,,pdNP+ND} = {p1, p2, …, pNP, d1, d2, …, dND} represent the set of nodes in WPDIN, and EWPD denote the set of edges in WPDIN, then, for any two given nodes pdi and pdj in DPD, if and only if there is WPP(pdi, pdj) > 0 or WPD(pdi, pdj) > 0 or WDD(pdi, pdj) > 0, there is an edge between them in EWPD, and moreover, the weight of the edge between them is MPD(pdi, pdj).

Calculation of Initial Scores for Proteins

For any given protein node pi in WPDIN, in this section, we will assign an initial score for it based on the functional features extracted from the subcellular localization information of proteins, and the conservative features provided by orthologous information of proteins. Firstly, we will download the orthologous information of proteins from the InParanoid database (Mewes et al., 2006; Gabriel et al., 2010) and the subcellular localization information of proteins from the COMPART-MENTS database (Binder et al., 2014; Min et al., 2017). And then, for convenience, let Np(i) represent the total number of proteins relating to the i-th subcellular localization, NL denote the total number of different subcellular localizations downloaded above, and S(pi) represent the set of subcellular locations associating with pi. Hence, we can calculate a score for pi based on the subcellular localization information as follows:

Subcell_Score(pi)=maxjS(pi)Subcell(j)(8)

Where,

Subcell(j)=Np(j)max1kNL(Np(k))(9)

Next, let Hom(pi) denote the score of pi in the downloaded homologous information and NH denote the total number of proteins with homologous information, then, we can calculate another score for pi based on the homologous information as follows:

Hom_Score(pi)=Hom(pi)max1jNHHom(pi)(10)

Finally, through integrating above two kinds of scores together, we can obtain an initial score for pi as follows:

Initial_Score(pi)=Subcell_Score(pi)+Hom_Score(pi)2(11)

Construction of the Prediction Model KTCSPM

Establishment of the Key Target Convergence Sets

Before implementing random walk with restart on WPDIN, as shown in Figure 1, each node in WPDIN will establish a unique KTCS first according to the following steps:

FIGURE 1
www.frontiersin.org

Figure 1. Flow chart of constructing KTCS for any given protein node pi in WPDIN.

Step 1: For any given protein node pi in WPDIN, we define its original KTCS as the set of all domain nodes associating with pi, that is the original KTCS of pi is KTCS0(pi) = {dk |MPD(dk, pi) = 1, dkDDD}. Similarly, for any given protein domain node dj, we can define its original KTCS as KTCS0(dj) = {pk|MPD(dj, pk) = 1, pkDPP}.

Step 2: For any given protein node pi in WPDIN, ∀dkKTCS0(pi) anddtDDD, we define the network distance between dk and dt in WPDI as follows:

AD(dk,dt)=1WDD(dk,dt)(12)

Similarly, for any given domain node di in WPDIN, ∀pkKTCS0(di) andptDPP, we can define the network distance between pk and pt in WPDI as follows:

AD(pk,pt)=1WPP(pk,pt)(13)

Step 3: According to the above Equations (13, 14), for any given protein node pi or domain node dj in WPDIN, we define the KTCS (dj) of dj as the set of first 200 protein nodes in WPDIN that have the minimum average network distance to nodes in KTCS0(dj), and the KTCS (pi) of pi as the set of first 200 domain nodes in WPDIN that have the minimum average network distance to nodes in KTCS0(pi). Therefore, it easy to know that these 200 protein nodes in KTCS (dj) may belong to KTCS0(dj) or may not belong to KTCS0(dj), and these 200 domain nodes in KTCS(pi) may belong to KTCS0(pi) or may not belong to KTCS0(pi) as well.

Random Walk With Restart in WPDIN

The transition process of a walker from a starting node in the network to other nodes with a given probability is called the method of Random walk. Based on the assumption that there is a correlation between essential proteins and domains, as shown in Figure 2, the random walk process of KTCSPM can be mainly divided into the following steps:

FIGURE 2
www.frontiersin.org

Figure 2. Schematic diagram of the construction process of KTCSPM, where green nodes represent proteins and red nodes represent domains.

Step 1: For a walker, before it starts to walk randomly in WPDIN, we can first obtain a transition probability matrix W for it as follows:

W(i,j)=MPD(i,j)k=1NP+NDMPD(i,k)(14)

Step 2: Moreover, for any given node pdi in WPDIN, we can as well obtain an initial probability vector Ri(0) for the walker as follows:

Ri(0)=(Ri,1(0),Ri,2(0),Ri,j(0),Ri,NP+ND(0))(15)
Ri,j(0)=W(i,j),j=1,2,3,NP+ND(16)

Step 3: Next, while starting a walk, the walker will select a node (for convenience, let it be pd0) in WPDIN randomly as its initial location of this walk, where pd0 may be a protein node or a domain node. Supposing that after walking t-1 hops, the walker reaches the current node pdi in WPDIN, then, we can further calculate a new walking probability vector Ri(t) for it as follow:

Ri(t)=(1-α)*WT*Ri(t-1)+α*Ri(0)(17)

Here, α(0 < α < 1) is a parameter for adjusting weights between Ri(0) and Ri(t-1). Moreover, for convenience, let Ri(t) = (Ri,1(t), Ri,2(t), …, Ri,j(t), …, Ri,NPND(t))T, where Ri,j(t) denotes the walking probability that the walker will walk from its current location pdi to the node pdj at its t-th hop. Here, it is worth noting that for the starting node pd0, since the walker can be considered to reach pd0 from pd0 after zero hops, therefore, for the starting node pd0, the walker can obtain an initial probability vector R0(0), and a walking probability vector R0(1).

Step 4: Assuming that the walker has walked from a node pdi to a current node pdj after t-1 hops during its random walk in WPDIN, the walk probability vectors calculated by the walker at pdi and pdj are Ri(t-1) and Rj(t), respectively, if the L1 norm between Ri(t-1) and Rj(t) satisfies ||Rj(t)−Ri(t−1)||1 ≤ 10−6, then we define that the walking probability vector Rj(t) has reached a stable state at its current location. Moreover, after the walker having obtained a stable walking probability at each node in WPDIN, for convenience, we will define the stable probability obtained by the walker at any given node pdk in WPDIN as Rk(∞), and then, we can construct a stable walking probability matrix K (∞) as follows:

K()=[K1K2K3K4]=(R1(),R2(),R3().....,RNPND())T(18)

where, K1 is a NP × NP dimensional matrix, K2 is a NP × ND dimensional matrix, K3 is a ND × NP dimensional matrix, and K4 is a ND × ND dimensional matrix. Thereafter, it is obvious that K2 and K3 will be the final result matrices, which can be adopted to predict potential essential proteins.

According to above steps of KTCSPM, it is easy to see that, for any node pdi in WPDIN, a stable walking probability vector Ri(∞) = (Ri,1(∞),Ri,2(∞), …, Ri,j(∞), …, Ri,NP + ND(∞))T will be obtained by the walker. For convenience, we denote the node set DPD in WPDIN as the IS. Therefore, we can redefine the stable probability Ri(∞) as RISi(∞). However, through observing RISi(∞), it is easy to find that the walker will stop its random walking only after the walking probability vector calculated at each node in WPDIN is stable. In the face of large data, this mechanism is obviously very time-consuming. Hence, in order to speed up the convergence speed of KTCSPM and reduce the experimental execution time, based on the concept of KTCS defined above, when constructing the vector Ri(t) = (Ri,1(t),Ri,2(t), …, Ri,j(t), …, Ri,NP + ND(t))T at the node pdi, if the j-th node pdjKTCS (pdi) in WPIND, then Ri,j(t) will be remained unchanged, otherwise we will redefine Ri,j(t) = 0. Thus, the walking probability vector at pdi will be changed to RiKTCS(t) and the stable walking probability at pdi will be changed to RiKTCS(). Obviously, the stable state of RiKTCS() can be achieved faster than that of RiIS(). However, considering that there may be some nodes not belonging to KTCS (pdi) but relating to the target, therefore, in order to avoid any omissions, at any given node pdi in WPIND, we will construct a novel final stable walking probability vector RiANS()=(Ri,1ANS(),Ri,2ANS(),,Ri,jANS(),,Ri,NP+NDANS())T by combining RiIS() with RiKTCS() as follows:

RiANS()=RiIS()+RiKTCS()2(19)

Step 5: For any protein node pi in WPIND, according to the final stable walking probability vector RiANS() and the initial protein score Initial_Score (pi) obtained above, it is obvious that a novel final feature score Final_Score (pi) can be calculated as follows:

Final_Score(pi)=j=1NP+NDRi,jANS()+Initial_Score(pi)2(20)

Algorithm KTCSPM

Input

Original PPI network, original protein-domain network, domain data, subcellular data, orthologous data, and the proportion regulation parameters α.

Output

Proteins final score Final_Score(pi).

Step 1: Establishing the heterogeneous network according to formulas (1–7);

Step 2: Calculating proteins initial score by orthologous data and subcellular data according to formulas (8–11);

Step 3: Establishing the KTCS according to formulas (12, 13);

Step 4: Establishing the transition probability matrix W according to formula (14);

Step 5: Calculating a stable walking probability vector Ri(t) according to formulas (15–17);

Step 6: Establishing stable walking probability matrix K (∞) according to formula (18); and

Step 7: Outputting the final score of protein according to formula (19);

Results

Experimental Data

In this section, extensive experiments will be done to compare KTCSPM with representative methods. And during experiments, the domain data is downloaded from the Pfam database (Bateman et al., 2014). The subcellular location data is derived from the COMPARTMENTS database (Binder et al., 2014), in which, the following classifications of the subcellular interstitium related to the basic proteins of eukaryotic cells are included: Golgi bodies, endoplasm, cytoplasm, cytoskeleton, vacuoles, endosomes, mitochondria, plasma, peroxomes, and nuclei, etc. Besides, the reference bases of the essential genes of Scerevisiae are collected from MIPS (Mewes et al., 2006), SGD (Cherry et al., 1998), DEG (Zhang and Lin, 2009), and SGDP (Saccharomyces Genome Deletion Project, 2012). In the dataset downloaded from the DIP database, there are 5,093 proteins in total, in which, 1,167 are essential and 3,526 are non-essential. In the dataset downloaded from the GAVIN database, there are a total of 1,855 proteins, in which, 714 are essential proteins.

Comparison Between KTCSPM and Competitive Methods

In order to verify the predictive performance of KTCSPM, in this section, we will compare it with several representative methods such as DC (2001), IC (1989), And So-called Centrality (CC) (2014), Bee-tweenness Centrality (BC) (2005), SC (2003), NC (2005), PeC (2012),LAC (2011), CoEWC (2014), POEM (2017), and TEGS (2019) based on the DIP database and the Gavin database separately. Figure 3 shows the comparison results between KTCSPM and these competitive methods. From observing Figure 3, it is obvious that the prediction accuracy of KTCSPM is significantly better than that of all these competing methods in from top 1 to 25% predicted essential proteins. In particular, KTCSPM can achieve a reliable prediction accuracy rate of 90.21% in the top 1% ranked key proteins.

FIGURE 3
www.frontiersin.org

Figure 3. (A) Top 1% ranked proteins. (B) Top 5% ranked proteins. (C) Top 10% ranked proteins. (D) Top 15% ranked proteins. (E) Top 20% ranked proteins. (F) Top 25% ranked proteins. In this Figure shows the predictive accuracy between KTCSPM and 12 competitive methods including IC, CC, BC, SC, NC, LAC, EC, PeC, CoEWC, POEM, and TEGS.

Validation With Jackknife Methodology

For a comprehensive and accurate comparison, in this section, we will adopt the Jackknife methodology (Holman et al., 2009) to compare the predictive performances between KTCSPM and above mentioned competing methods. Experimental results are shown in Figure 4, from which, it can be clearly seen that the jackknife curve of KTCSPM is higher than that of all these state-of-the-art predictive methods. Although in Figure 4B, the jackknife curves of KTCSPM and TEGS have multiple intersections, however, when the number of ranked proteins is bigger than 600, the predictive results of KTCSPM will become continuously higher than that of TEGS. Therefore, according to both Figures 4A,B, we can draw a conclusion that KTCSPM can achieve better predictive performance than all these representative methods in predicting essential proteins.

FIGURE 4
www.frontiersin.org

Figure 4. The comparison results between key target convergence sets based prediction model (KTCSPM) and 12 competitive methods based on the DIP database under the jackknife methodology. (A) Comparison results between KTCSPM and DC, IC, EC, SC, BC, and CC. (B) Comparison results between KTCSPM and NC, PeC, CoWEC, POEM, LAC, and TEGS.

Validation by Precision-Recall Curves and ROC Curves

In this section, ROC curve (receiver operating characteristic) and precision-recall curves (PR) will be adopted to measure the performance of KTCSPM. Researches show that the larger the area under the ROC curve (AUC), the better the model performance, and in addition, when AUC = 0.5, the model performance will be in a random state. Moreover, when PR curves are adopted to evaluate predictive models, more comprehensive feedbacks on performances of predictive models can be obtained by using different validation methods. And as a result, Figure 5 shows the comparisons of performance between KTCSPM and 12 competitive prediction models under the PR curve and ROC curve separately. From the Figures 5A1,2,B1,2, it can be seen that when KTCSPM is compared with SC, EC, DC, IC, BC, CC, NC, PeC, the area under the PR curve (AUC), and ROC curve display results show that KTCSPM is superior. For these methods, by observing a3 and b3, it can be seen that when KTCSPM is compared with TEGS and POEM methods, the gap becomes smaller and there is overlap, but even so, the prediction performance of KTCSPM is still better than the 12 methods.

FIGURE 5
www.frontiersin.org

Figure 5. Comparisons of performance between key target convergence sets based prediction model (KTCSPM) and 12 competitive models under the PR curve and ROC curve based on the DIP database. Panels (A1–C1) are comparison results of PR curves between KTCSPM and 12 competitive models. Panels (A2–C2) are comparison results of ROC curves between KTCSPM and 12 competitive models.

Analysis of the Differences Between KTCSPM and Competitive Methods

It can be seen from above descriptions that KTCSPM can achieve satisfactory predictive effects. In this section, we will further analyze the differences between KTCSPM and 12 competing methods by calculating the number of overlaps of first 200 predicted proteins. comparison results are shown in Figure 6, where Mi represents one of these 12 competitive methods, | KTCSPM-Mi| denotes the number of proteins detected by KTCSPM but not by Mi, | Mi-KTCSPM| means the number of proteins detected by Mi but not by KTCSPM. Obviously, according to the curve trends in Figure 6, we can see that the ratio of essential proteins predicted by KTCSPM is much higher than that predicted by anyone of these 12 competing methods, which means that KTCSPM can screen out more essential proteins not found by Mi, and demonstrates that KTCSPM can achieve much better predictive performance as well.

FIGURE 6
www.frontiersin.org

Figure 6. Comparison results between KTCSPM and 12 competing methods, where the X axis denotes the competing methods including DC, IC, EC, SC, BC, CC, NC, PEC, COEWC, POEM, LAC, TEGS, and the Y axis represents the proportion of true essential proteins predicted by each method.

Prediction Performance of KTCSPM Based on the Gavin Database

In this section, in order to further verify the adaptability of KTCSPM, we will further compare it with 11 competitive methods based on the Gavin database, and comparison results are shown in the following Table 1.

TABLE 1
www.frontiersin.org

Table 1. Comparison results between KTCSPM and 11 state-of-the-art methods including DC, IC, CC, BC, NC, EC, PeC, CoEWC, ION, and POEM based on the Gavin database, where the Gavin database consists of 1,855 essential proteins.

Effects of Parameters on Performance of KTCSPM

In this section, we will estimate the effects of parameters on the prediction performance of KTCSPM. First, as for the parameter γp in Equation (1), we will set its value to one based on precedents (Twan et al., 2011). However, as for the parameter in Equation (17), as illustrated in Table 2, we will set its value from 0.1 to 0.9, and evaluate its impacts on the prediction performance of KTCSPM. Through observing Table 2, it is easy to see that when is set to 0.3, KTCSPM can achieve the best prediction effect. Moreover, it can be clearly seen that KTCSPM remains robust to different values of, which means that KTCSPM is not sensitive to the values of α.

TABLE 2
www.frontiersin.org

Table 2. Influence of the parameter α on prediction accuracy of KTCSPM based on the DIP database.

Discussion

It is time consuming and energy consuming to predict essential proteins through traditional biological experiments, so it has become a hot topic in the field of bioinformatics to build mathematical models to predict essential proteins. In this manuscript, a new prediction model called KTCSPM is proposed, in which, a weighted PDI network constructed by integrating a weighted PPI network and a weighted DDI network first, and then, based on the concepts of KCS and IS, a predictive method is further designed to infer potential key proteins in the weighted PDI network based on the random walk with restart. Finally, extensive experiments have demonstrated the predictive superiority of KTCSPM. At present, some methods have been proposed to infer potential disease related miRNAs such as RWRMDA (Chen et al., 2012), RLSMDA (Chen and Yan, 2014) and RBMMMDA (Chen et al., 2015), in the future, KTCSPM may also be applied to predict potential associations between miRNAs, and diseases.

Conclusion

In this manuscript, the main contributions are as follows: (1) A novel weighted PDI network is designed by combining a weighted PPI network with a weighted DDI network. (2) The concept of network distance is introduced, and the KTCS and the IS are established for nodes in the weighted PDI network. (3) Based on the concepts of KTCS and IS, an improved random walk with restart algorithm is proposed to recognize essential proteins. By comparing with existing state-of-the-art predictive models, it is proved that KTCSPM can achieve better predictive performance in detecting essential proteins.

Data Availability Statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

Author Contributions

JP conceived this research. JP, LW, and LK were improved on the basis of the original design model. JP and ZZ wrote the manuscript. YT and ZC provided advice and supervision on the research. All authors contributed to the article and approved the submitted version.

Funding

This research was supported by the National Natural Science Foundation of China (No. 61873221), the Research Foundation of Education Bureau of Hunan Province (No. 20B080), and the Natural Science Foundation of Hunan Province (Nos. 2018JJ4058 and 2019JJ70010).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.721486/full#supplementary-material

References

Athira, K., and Gopakumar, G. (2020). An integrated method for identifying essential proteins from multiplex network model of protein-protein interactions. J. Bioinform. Comput. Biol. 18:2050020. doi: 10.1142/S0219720020500201

PubMed Abstract | CrossRef Full Text | Google Scholar

Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffithsjones, S., et al. (2014). The pfam protein families database nucleic acids res. Nucleic Acids Res. 32, D138–D141. doi: 10.1093/nar/gkh121

PubMed Abstract | CrossRef Full Text | Google Scholar

Binder, J. X., Sune, P. F., Kalliopi, T., Christian, S., O’DonoghueSeán, I., Reinhard, S., et al. (2014). Compartments: unifification and visualization of protein subcellular localization evidence. Database J. Biol. Datab. Curat. 2014:bau012. doi: 10.1093/database/bau012

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, X., and Yan, G. Y. (2014). Semi-supervised learning for potential human microRNA-disease associations inference. Sci. Rep. 4:5501. doi: 10.1038/srep05501

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, X., Huang, Y. A., You, Z. H., Yan, G. Y., and Wang, X. S. (2016). A novel approach based on KATZ measure to predict associations of human microbiota with non-infectious diseases. Bioinformatics 33, 733–739. doi: 10.1093/bioinformatics/btw715

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, X., Liu, M. X., and Yan, G. Y. (2012). RWRMDA: predicting novel human microRNA–disease associations. Mol. Biosyst. 8, 2792–2798. doi: 10.1039/c2mb25180a

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, X., Yan, C. C., Zhang, X., Li, Z., Deng, L., Zhang, Y., et al. (2015). RBMMMDA: predicting multiple types of disease-microRNA associations. Sci. Rep. 5:13877. doi: 10.1038/srep13877

PubMed Abstract | CrossRef Full Text | Google Scholar

Cherry, J. M., Adler, C., Ball, C., Chervitz, S. A., Dwight, S. S., and Hester, E. T. (1998). SGD: saccharomyces genome database. Nucleic Acids Res. 26, 73–79. doi: 10.1093/nar/26.1.73

PubMed Abstract | CrossRef Full Text | Google Scholar

Chua, H. N., Sung, W. K., and Wong, L. (2006). Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 24:452. doi: 10.1093/bioinformatics/btm609

CrossRef Full Text | Google Scholar

Estrada, E. (2006). Protein bipartivity and essentiality in the yeast protein-protein interaction network. J. Proteome. Res. 5:2177–2184. doi: 10.1021/pr060106e

PubMed Abstract | CrossRef Full Text | Google Scholar

Estrada, E., and Rodriguez-Velazquez, J. A. (2005). Subgraph centrality in complex networks. Phys. Rev. E Statist. Nonlin. Soft Mat. Phys. 71:056103. doi: 10.1103/PhysRevE.71.056103

PubMed Abstract | CrossRef Full Text | Google Scholar

Fan, Y., Tang, X., Hu, X., Wu, W., and Ping, Q. (2017). Prediction of essential proteins based on subcellular localization and gene expression correlation. Bmc Bioinform. 18:470.

Google Scholar

Gabriel, O., Thomas, S., Kristoffffer, F., Tina, K., David, N. M., Sanjit, R., et al. (2010). In Paranoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 38, D196–D203. doi: 10.1093/nar/gkp931

PubMed Abstract | CrossRef Full Text | Google Scholar

Gavin, A. C., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., et al. (2006). Proteome survey reveals modularity of the yeast cell machinery. Nature 440, 631–636. doi: 10.1038/nature04532

PubMed Abstract | CrossRef Full Text | Google Scholar

Hahn, M. W., and Kern, A. D. (2004). Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Mol. Biol. Evol. 22, 803–806. doi: 10.1093/molbev/msi072

PubMed Abstract | CrossRef Full Text | Google Scholar

Holman, A. G., Davis, P. J., Foster, J. M., Carlow, C. K., and Kumar, S. (2009). Computational prediction of essential genes in an unculturable endosymbiotic bacterium, Wolbachia of Brugia malayi. BMC Microbiol. 9:243. doi: 10.1186/1471-2180-9-243

PubMed Abstract | CrossRef Full Text | Google Scholar

Jeong, H., Mason, S., and Barabási, A. (2001). Lethality and centrality in protein networks. Nature 411, 41–42. doi: 10.1038/35075138

PubMed Abstract | CrossRef Full Text | Google Scholar

Joy, M. P., Brock, A., Ingber, D. E., and Huang, S. (2014). High-betweenness proteins in the yeast protein interaction network. J. Biomed. Biotechnol. 2005:96. doi: 10.1155/JBB.2005.96

PubMed Abstract | CrossRef Full Text | Google Scholar

Lei, X., Yang, X., and Wu, F. (2018). Artificial fish swarm optimization based method to identify essential proteins. IEEE ACM Transact. Comput. Biol. Bioinform. 18:1. doi: 10.1109/TCBB.2018.2865567

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, J., Li, X., Feng, X., Wang, B., Zhao, B., and Wang, L. (2019). A novel target convergence set based random walk with restart for prediction of potential LncRNA-disease associations. BMC Bioinform. 20:3216. doi: 10.1186/s12859-019-3216-4

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, M., Wang, J., Chen, X., Wang, H., and Pan, Y. (2011). A local average connectivity-based method for identifying essential proteins from the network level. Comput. Biol. Chem. 35, 143–150. doi: 10.1016/j.compbiolchem.2011.04.002

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, M., Zhang, H., Wang, J. X., and Pan, Y. (2012). A new essential protein discovery method based on the integration of protein-protein interaction and gene expression data. Bmc Syst. Biol. 6:509. doi: 10.1186/1752-0509-6-15

PubMed Abstract | CrossRef Full Text | Google Scholar

Meng, Z., Kuang, L., Chen, Z., Zhang, Z., Tan, Y., Li, X., et al. (2021). Method for essential protein prediction based on a novel weighted protein-domain interaction network. Front. Genet. 12:645932.

Google Scholar

Mewes, H. W., Frishman, D., Mayer, K. F. X., Munsterkotter, M., Noubibou, O., Pagel, P., et al. (2006). MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res. 34, D169–D172. doi: 10.1093/nar/gkj148

PubMed Abstract | CrossRef Full Text | Google Scholar

Min, L., Yu, L., Niu, Z., and Wu, F. X. (2017). United complex centrality for identification of essential proteins from PPI networks. IEEE ACM Transact. Comput. Biol. Bioinform. (TCBB) 14, 370–380. doi: 10.1109/TCBB.2015.2394487

PubMed Abstract | CrossRef Full Text | Google Scholar

Peng, W., Wang, J., Wang, W., Liu, Q., Wu, F. X., and Pan, Y. (2012). Iteration method for predicting essential proteins based on orthology and protein-protein interaction networks. Bmc Syst. Biol. 6:87. doi: 10.1186/1752-0509-6-87

PubMed Abstract | CrossRef Full Text | Google Scholar

Peng, W., Wang, J., Cheng, Y., Lu, Y., Wu, F., and Pan, Y. (2015). UDoNC: an algorithm for identifying essential proteins based on protein domains and protein-protein interaction networks. Comput. Biol. Bioinform. 12, 276–288. doi: 10.1109/TCBB.2014.2338317

PubMed Abstract | CrossRef Full Text | Google Scholar

Ren, Z., and Lin, Y. (2009). DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res. 37, D455–D458. doi: 10.1093/nar/gkn858

PubMed Abstract | CrossRef Full Text | Google Scholar

Saccharomyces Genome Deletion Project (2012). Available online at: http://yeastdeletion.stanford.edu/ (accessed June 20, 2012).

Google Scholar

Stephenson, K., and Zelen, M. (1989). Rethinking centrality: methods and examples. Soc. Netw. 11, 1–37. doi: 10.1016/0378-8733(89)90016-6

CrossRef Full Text | Google Scholar

Twan, V. L., Nabuurs, S. B., and Elena, M. (2011). Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics 21:3036. doi: 10.1093/bioinformatics/btr500

PubMed Abstract | CrossRef Full Text | Google Scholar

Vanunu, O., Magger, O., Ruppin, E., Shlomi, T., and Sharan, R. (2010). Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol. 6:e1000641. doi: 10.1371/journal.pcbi.1000641

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, J. X., Li, M., Wang, H., and Pan, Y. (2012). Identification of essential proteins based on edge clustering coefficient. IEEE ACM Trans. Comput. Biol. Bioinform. 9, 1070–1080. doi: 10.1109/TCBB.2011.147

PubMed Abstract | CrossRef Full Text | Google Scholar

Wuchty, S., and Stadler, P. F. (2003). Centers of complex networks. J. Theoret. Biol. 223, 45–53. doi: 10.1016/S0022-5193(03)00071-7

CrossRef Full Text | Google Scholar

Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Sul-Min, K., and Eisenberg, D. (2002). DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30, 303–305. doi: 10.1093/nar/30.1.303

PubMed Abstract | CrossRef Full Text | Google Scholar

Yu, H., Kim, P. M., Sprecher, E., Trifonov, V., and Gerstein, M. (2007). The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput. Biol. 3:e59. doi: 10.1371/journal.pcbi.0030059

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, R., and Lin, Y. (2009). DEG 5.0.A database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res. 37, D455–D458.

Google Scholar

Zhang, W., Xu, J., and Zou, X. (2019). Predicting essential proteins by integrating network topology, subcellular localization information, gene expression profile and go annotation data. IEEE ACM Transact. Comput. Biol. Bioinform. 17, 2053–2061. doi: 10.1109/TCBB.2019.2916038

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, W., Xu, J., Li, Y., and Zou, X. (2018). Detecting essential proteins based on network topology, gene expression data, and gene ontology information. IEEE ACM Transact. Comput. Biol. Bioinform. 15, 109–116. doi: 10.1109/tcbb.2016.2615931

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhao, B., Wang, J., Li, M., Wu, F. X., and Pan, Y. (2014). Prediction of essential proteins based on overlapping essential modules. IEEE Transact. Nano Biosci. 13, 415–424. doi: 10.1109/TNB.2014.2337912

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhao, B., Wang, J., Li, X., and Wu, F. X. (2016). Essential protein discovery based on a combination of modularity and conservatism. Methods 16, 54–63. doi: 10.1016/j.ymeth.2016.07.005

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhao, B., Zhao, Y., Zhang, X., Zhang, Z., Zhang, F., Wang, L., et al. (2019). An iteration method for identifying yeast essential proteins from heterogeneous network. BMC Bioinform. 20:355.

Google Scholar

Keywords: protein-protein interaction, essential protein, heterogeneous network, random walk with restart, key target convergence set

Citation: Peng J, Kuang L, Zhang Z, Tan Y, Chen Z and Wang L (2021) A Novel Model for Identifying Essential Proteins Based on Key Target Convergence Sets. Front. Genet. 12:721486. doi: 10.3389/fgene.2021.721486

Received: 07 June 2021; Accepted: 30 June 2021;
Published: 29 July 2021.

Edited by:

Quan Zou, University of Electronic Science and Technology of China, China

Reviewed by:

Yuhua Yao, Hainan Normal University, China
Xing Chen, China University of Mining and Technology, China
Bing Wang, Anhui University of Technology, China

Copyright © 2021 Peng, Kuang, Zhang, Tan, Chen and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Linai Kuang, kla@xtu.edu.cn; Lei Wang, wanglei@xtu.edu.cn

Download