Method for Essential Protein Prediction Based on a Novel Weighted Protein-Domain Interaction Network

In recent years, a number of computational models based on protein-protein interaction (PPI) networks have been proposed. However, because PPI networks contain false positives and false negatives and remain incomplete, designing computational models that infer key proteins with satisfactory predictive accuracy is still challenging. This study proposes a prediction model called WPDINM for detecting key proteins based on a novel weighted protein-domain interaction (PDI) network. In WPDINM, a weighted PPI network is first constructed by combining the gene expression data of proteins with topological information extracted from the original PPI network. Simultaneously, a weighted domain-domain interaction (DDI) network is constructed based on the original PDI network. Next, by integrating the newly obtained weighted PPI network and weighted DDI network with the original PDI network, a weighted PDI network is constructed. Then, based on topological features and biological information, including the subcellular localization and orthologous information of proteins, a novel PageRank-based iterative algorithm is designed and applied to the newly constructed weighted PDI network to estimate the criticality of proteins. Finally, to assess the prediction performance of WPDINM, we compared it with 12 competitive measures. Experimental results show that WPDINM achieves predictive accuracies of 90.19, 81.96, 70.72, 62.04, 55.83, and 51.13% in the top 1%, top 5%, top 10%, top 15%, top 20%, and top 25% of ranked proteins, respectively, which exceeds the prediction accuracy achieved by the competing state-of-the-art measures. Given this identification performance, WPDINM may contribute to the further development of key protein identification.


INTRODUCTION
Accumulating evidence indicates that proteins have a tremendous impact on almost all life activities. Essential proteins not only maintain normal biological processes but also ensure the integrity of cell functions. With the development of biotechnology (Lu et al., 2019, 2020), more and more essential proteins have been discovered by biological experiments in recent years. However, because biological experiments are costly and time-consuming, an increasing number of computational models have been proposed to identify essential proteins based on the topological features of PPI networks. For instance, based on the centrality-lethality rule (Jeong et al., 2001), researchers have proposed a series of prediction models to infer potential critical proteins. These include Information Centrality (IC) (Stephenson and Zelen, 1989), Degree Centrality (DC) (Hahn and Kern, 2004), Subgraph Centrality (SC) (Ernesto and Rodriguez-Velazquez, 2005), Closeness Centrality (CC) (Wuchty and Stadler, 2003), Betweenness Centrality (BC) (Jop et al., 2005), Neighbor Centrality (NC), and local average connectivity (LAC) (Li et al., 2015). Wang et al. (2011) designed a predictive model named SoECC that combines the features of edges and nodes and exploits the edge clustering coefficient. Lin et al. (2008) introduced two prediction models, the Maximum Neighborhood Component (MNC) and the Density of Maximum Neighborhood Component (DMNC), to infer essential proteins. However, these prediction models cannot achieve high identification accuracy owing to the incompleteness of current PPI networks (Chen and Yuan, 2006).
Hence, to address this problem, methods based on both the biological information of proteins and the topological properties of PPI networks have been proposed to detect essential proteins. For example, Li et al. (2012) proposed a method called Pec that combines gene expression data with the centrality-lethality rule to identify key proteins from PPI networks. Zhang et al. (2013) presented a method that integrates the topological features of PPI networks with the co-expression of proteins. Peng et al. (2012) designed a prediction method called ION based on topological features extracted from the PPI network and the orthologous information of proteins. Additionally, inspired by Degree Centrality, Tang et al. (2014) developed an identification model for predicting essential proteins by combining the Pearson correlation coefficient (PCC) and the edge clustering coefficient (ECC) with the gene expression data of proteins. Kim (2012) proposed a method for predicting key proteins by applying a machine learning algorithm to both Gene Ontology and topological information of PPI networks. Luo et al. (2015) developed a computational model that integrates the local interaction density with protein complexes to detect key proteins. Li et al. (2016) designed a method for identifying essential proteins by adopting subcellular localization and orthologous information. Luo and Kuang (2014) proposed a prediction model called CDLC to detect essential proteins by employing the dynamic local average connectivity and the in-degree of proteins in complexes. Zhang et al. (2018) introduced an algorithm named TEO for inferring essential proteins by integrating gene ontology annotation information and gene expression data with PPI networks. Zhong et al. (2013) designed a learning algorithm to predict essential proteins by combining the biological information of proteins with PPI networks. Shang et al. (2016) introduced a strategy to detect essential proteins by integrating an RNA-Seq dataset and the biological information of proteins with dynamic PPI networks. Zhang et al. (2016) introduced a prediction measure named PINs for identifying essential proteins based on gene expression profiles and PPI networks by integrating five approaches: DC, BC, SC, CC, and the eigenvector centrality (EC) (Bonacich, 1987).
This study proposes a novel prediction model called WPDINM that detects key proteins by combining a weighted protein-domain interaction (PDI) network with biological information, namely the subcellular localization and orthologous information of proteins. WPDINM is based on the original PPI network and the original PDI network, which are built from known protein-protein interactions (PPIs) and known protein-domain associations downloaded from benchmark databases. In this prediction model, a weighted PPI network and a weighted domain-domain interaction (DDI) network are established first, based on the gene expression data of proteins and the topological information of the original networks, respectively. Then, a weighted PDI network is constructed by combining these two newly constructed weighted networks. Next, based on the weighted PDI network, initial scores are assigned to proteins using their subcellular localization and orthologous information, and a novel iterative method is implemented to estimate the criticality of proteins.
Unlike traditional prediction models, WPDINM applies the Discrete Fourier transform (DFT) to the gene expression profiles of proteins to calculate the weights between proteins, which translates the gene expression profiles from the time domain to the frequency domain. A novel weighted PDI network is then constructed by integrating a weighted DDI network and a weighted PPI network. Moreover, by taking into account the associations between proteins, a new directed distribution network is designed to calculate the rankings of proteins iteratively, based on the weighted PDI network. Finally, to evaluate its prediction performance, WPDINM is compared with other competitive measures including SC (Ernesto and Rodriguez-Velazquez, 2005), DC (Hahn and Kern, 2004), IC (Stephenson and Zelen, 1989), CC (Wuchty and Stadler, 2003), BC (Jop et al., 2005), NC, EC (Bonacich, 1987), Pec, CoEWC, TEGS (Zhang et al., 2019), ION (Peng et al., 2012), and POEM (Zhao et al., 2014). Experimental results indicate that WPDINM achieves better prediction accuracy than the competing prediction models, reaching 90.19, 81.96, 70.72, 62.04, 55.83, and 51.13% in the top 1%, top 5%, top 10%, top 15%, top 20%, and top 25% of predicted proteins, respectively.

Experimental Data
To construct the original PPI networks, we first download known PPIs from three different databases: the DIP database (Xenarios et al., 2002), the Gavin database (Gavin et al., 2006), and the Krogan database (Krogan et al., 2006). After removing duplicated interactions, we obtain three datasets: the DIP-based dataset, consisting of 24,743 known PPIs between 5,093 proteins; the Krogan-based dataset, consisting of 14,317 known PPIs between 3,672 proteins; and the Gavin-based dataset, consisting of 7,669 known PPIs between 1,855 proteins. Next, we download known domains from the Pfam database (Bateman et al., 2004) and, after preprocessing, obtain a dataset consisting of 1,107 different domains. Based on the three datasets obtained from the DIP database, the Gavin database, and the Krogan database, we construct three original PPI networks and corresponding matrices with dimensions of 5,093 × 1,107, 1,855 × 1,107, and 3,672 × 1,107, respectively. The gene expression data are provided by Tu et al. (2005) and consist of 6,776 gene expression sequences, each of length 36.
To obtain the initial scores of proteins, we download the subcellular localization data from the COMPARTMENTS database (Binder et al., 2014). As a result, we obtain a dataset consisting of 11 kinds of subcellular localizations (Extracellular, Peroxisome, Nucleus, Plasma, Endosome, Mitochondrion, Vacuole, Cytosol, Golgi, Cytoskeleton, and Endoplasmic) that are closely linked with the downloaded known key proteins. We also download the orthologous information of proteins from the InParanoid database (Gabriel et al., 2010). Furthermore, the set of essential proteins in Saccharomyces cerevisiae was downloaded from four different databases: DEG (Zhang and Lin, 2009), MIPS (Mewes et al., 2006), SGD (Cherry et al., 1998), and SGDP (Saccharomyces Genome Deletion Project, 2012).
As shown in Figure 1, the flowchart of WPDINM consists of the following four major steps:
Step 1: First, based on known PPIs downloaded from any given benchmark database, an original PPI network is obtained. Then, a weighted PPI network is constructed by applying the DFT to the gene expression data of proteins and extracting topological features from the original PPI network.
Step 2: Based on known PPIs and known protein-domain interactions (PDIs) downloaded from given benchmark databases, a weighted DDI network is then constructed. Thereafter, a weighted PDI network is further established by integrating the weighted DDI network with the weighted PPI network.
Step 3: Then, by combining the weighted PDI network with biological information, including the orthologous information and subcellular information of proteins, each protein in the weighted PDI network is assigned an initial score.
Step 4: Finally, a novel prediction method based on the PageRank algorithm is designed and applied to the weighted PDI network to iteratively compute the final criticality scores of all proteins.

Construction of the Weighted PPI Network
In this section, based on the datasets of known PPIs downloaded from three different databases, namely the DIP database (Xenarios et al., 2002), the Gavin database (Gavin et al., 2006), and the Krogan database (Krogan et al., 2006), we construct three original PPI networks. For convenience, let $OppiN = \{N_P, E_P\}$ represent a newly constructed original PPI network, where $N_P = \{p_1, p_2, \ldots, p_O\}$ is the set of protein nodes in OppiN and $E_P$ is the set of edges between protein nodes in OppiN. Here, for two given proteins $p_i$ and $p_j$ in $N_P$, there is an edge $ed(p_i, p_j)$ between them in $E_P$ if and only if there is a known interaction between these two proteins. Based on the original PPI network OppiN, we can further obtain an $O \times O$ dimensional adjacency matrix OppiM as follows: for any two proteins $p_i$ and $p_j$ in $N_P$, $OppiM(i, j) = 1$ if there is a known interaction between $p_i$ and $p_j$, and $OppiM(i, j) = 0$ otherwise.
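The adjacency-matrix definition above can be illustrated with a minimal sketch. The function name `build_adjacency`, the protein identifiers, and the choice to skip self-loops and unknown proteins are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def build_adjacency(proteins, interactions):
    """Return (index, OppiM) for an undirected PPI network."""
    index = {p: i for i, p in enumerate(proteins)}
    oppi_m = np.zeros((len(proteins), len(proteins)), dtype=int)
    for a, b in interactions:
        # Assumption: self-loops and proteins outside the node set are ignored.
        if a == b or a not in index or b not in index:
            continue
        i, j = index[a], index[b]
        # Interactions are undirected, so the matrix is symmetric;
        # duplicated pairs simply overwrite the same entries with 1.
        oppi_m[i, j] = oppi_m[j, i] = 1
    return index, oppi_m

# Toy data with hypothetical identifiers and one duplicated interaction.
proteins = ["P1", "P2", "P3", "P4"]
interactions = [("P1", "P2"), ("P2", "P1"), ("P2", "P3"), ("P3", "P1")]
index, oppi_m = build_adjacency(proteins, interactions)
```

Here `oppi_m` plays the role of OppiM, and a protein without known interactions (P4 in the toy data) has an all-zero row.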
Next, based on OppiN, for any given protein $p$ with a known gene expression sequence in $N_P$, let $Gep_p = \langle Gep_{p,1}, Gep_{p,2}, \ldots, Gep_{p,M} \rangle$ represent the gene expression sequence of $p$, where $Gep_{p,t}$ is the gene expression level at the $t$-th time point. Since $Gep_p$ is a time series of length $M$, we can adopt the DFT to convert it from the time domain to the frequency domain: when $N \geq M$, the $N$-point DFT transforms the $1 \times M$ dimensional time series vector $Gep_p$ into a $1 \times N$ dimensional spectrum vector $DF_p = \langle DGep(0), DGep(1), \ldots, DGep(N-1) \rangle$ as follows:
Thereafter, by combining the above formulas with the Gaussian kernel interaction profiles, for any two given proteins $p_i$ and $p_j$ with gene expression sequences in $N_P$, we can estimate the probability of an association between them by calculating the spectral similarity as follows:
Here, $\alpha_p$ is the adjustment coefficient for the kernel bandwidth, which is defined as follows:
In the above formula (4), $P_{Gep}$ is the set of proteins with gene expression sequences in OppiN. Additionally, for any two given proteins $p_i$ and $p_j$ without gene expression sequences in OppiN, we adopt the topological features extracted from the original PPI network OppiN to calculate the possibility of an association between them. Thus, the weight between $p_i$ and $p_j$ can be calculated as follows:
Here, $Np(p_i)$ and $Np(p_j)$ denote the sets of neighboring protein nodes of $p_i$ and $p_j$ in OppiN, respectively, $Com(p_i, p_j)$ represents the set of common neighbors of $p_i$ and $p_j$ in OppiN, and $|X|$ denotes the number of distinct elements in the set $X$.
By integrating formula (3) and formula (5), for any two given proteins $p_i$ and $p_j$ in OppiN, we can calculate the possibility of an association between them as follows:
In the above formula (7), β is a scaling parameter with a value from 0 to 1.
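As a rough illustration of the spectral-similarity step in this section, the sketch below computes an N-point DFT of each expression profile with NumPy and compares the spectra with a Gaussian kernel. The exact kernel and bandwidth formulas are not reproduced in this text, so the normalisation used here (bandwidth equal to the reciprocal of the mean squared norm of the spectra, as is common for Gaussian interaction profile kernels) is an assumption, as are the function names and the toy profiles.

```python
import numpy as np

def spectrum(gep, n_points):
    """N-point DFT of one expression profile (zero-padded when n_points > len(gep))."""
    return np.fft.fft(np.asarray(gep, dtype=float), n=n_points)

def spectral_similarity(profiles, n_points):
    """Gaussian-kernel similarity between the DFT spectra of all profiles."""
    spectra = np.array([spectrum(g, n_points) for g in profiles])
    # Assumed bandwidth: reciprocal of the mean squared norm of the spectra.
    alpha = 1.0 / np.mean(np.sum(np.abs(spectra) ** 2, axis=1))
    # Pairwise squared distances between complex spectra.
    diff = spectra[:, None, :] - spectra[None, :, :]
    sq_dist = np.sum(np.abs(diff) ** 2, axis=2)
    return np.exp(-alpha * sq_dist)

# Toy profiles of length M = 4, transformed with N = 8 points.
profiles = [[1, 2, 3, 4], [1, 2, 3, 4], [4, 3, 2, 1]]
sim = spectral_similarity(profiles, n_points=8)
```

Identical profiles receive a similarity of 1, and the similarity decreases as the spectra move apart. The topological weight for proteins without expression data and the combination controlled by β are not sketched, because their formulas are not available here.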

Construction of the Original PDI Network
In this section, based on the dataset of known PDIs downloaded from the Pfam database (Bateman et al., 2004), we construct an original PDI network $OpdiN = \{N_{PD}, E_{PD}\}$, where $N_{PD} = N_P \cup N_D$ is the set of nodes in OpdiN, $N_D = \{d_1, d_2, \ldots, d_Q\}$ is the set of domain nodes in OpdiN, and $E_{PD}$ is the set of edges between protein nodes in $N_P$ and domain nodes in $N_D$. Here, for a given protein $p_i$ and a given domain $d_j$ in $N_{PD}$, there is an edge between them in $E_{PD}$ if and only if $p_i$ belongs to $d_j$. Based on the original PDI network OpdiN, we can further obtain an $O \times Q$ dimensional adjacency matrix OpdiM as follows: for a given protein node $p_i$ and a given domain node $d_j$, $OpdiM(i, j) = 1$ if there is an edge between them in $E_{PD}$, and $OpdiM(i, j) = 0$ otherwise.

Construction of the Weighted DDI Network
For any two given domains $d_i$ and $d_j$ in OpdiN, in this section we further obtain a $Q \times Q$ dimensional matrix WddiM by adopting the Gaussian kernel interaction profiles to estimate the association between $d_i$ and $d_j$ as follows:
Here, $IP_d(d_l)$ denotes the vector at the $l$-th column of the matrix OpdiM, and $\delta_d$ is the adjustment coefficient for the kernel bandwidth, based on a new bandwidth parameter, which is defined as follows:
Based on the above formula (8), it is easy to construct a weighted DDI network WddiN.
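A minimal sketch of a Gaussian interaction profile kernel over the columns of OpdiM is given below. Since the formulas for the kernel and its bandwidth are missing from this text, the sketch assumes the commonly used form in which the bandwidth parameter (`bandwidth`, a hypothetical name) is divided by the mean squared norm of the domain profiles; the toy matrix is also invented for illustration.

```python
import numpy as np

def domain_gip_kernel(opdi_m, bandwidth=1.0):
    """Gaussian interaction profile kernel between domains (columns of OpdiM)."""
    profiles = np.asarray(opdi_m, dtype=float).T  # one row per domain
    sq_norms = np.sum(profiles ** 2, axis=1)
    # Assumed normalisation: bandwidth divided by the mean squared profile norm.
    delta = bandwidth / sq_norms.mean()
    # Squared Euclidean distance via |a|^2 + |b|^2 - 2 a.b
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (profiles @ profiles.T)
    return np.exp(-delta * np.maximum(sq_dist, 0.0))

# Toy protein-domain matrix: 4 proteins (rows) x 3 domains (columns).
opdi_m = [[1, 0, 0],
          [1, 1, 0],
          [0, 1, 0],
          [0, 0, 1]]
wddi_m = domain_gip_kernel(opdi_m)
```

In this toy example the first two domains share a protein and therefore obtain a higher weight than either does with the third domain.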

Construction of the Weighted PDI Network
In this section, by combining the weighted PPI network WppiN and the original PDI network OpdiN with the weighted DDI network WddiN, we calculate two $O \times Q$ dimensional matrices WpdiM and WdpiM as follows, where formulas (10) and (11) are piecewise definitions that include the case $t_i \in N_P$ and $t_j \in N_D$:
Thereafter, for any two given nodes $t_i$ and $t_j$ in OpdiN, we can obtain a new $O \times Q$ dimensional matrix WPDIM as follows:
According to the above formula (12), it is easy to construct a weighted PDI network WpdiN.

Calculation of the Initial Scores of Proteins
First, based on the weighted PDI network WpdiN, for a given protein $p_i$ and a given domain $d_j$ in $N_{PD}$, we can obtain a $Q \times O$ dimensional allocation probability matrix APM as follows:
Next, for simplicity, we assign an initial score of 1 to each domain in WpdiN, so that the initial score vector for all domains is $S_d = \langle 1, 1, \ldots, 1 \rangle^T$. Then, based on the allocation probability matrix APM, we can distribute the initial scores of domains to all proteins in WpdiN in the following way:
Here, PSD is an $O$-dimensional vector, and $PSD(i)$ denotes the score that the $i$-th protein node $p_i$ obtains from all domain nodes in WpdiN.
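The score-distribution step can be sketched as follows. The definition of APM is not reproduced in this text, so the sketch assumes that each domain splits its score of 1 among proteins in proportion to the weights in WPDIM (that is, APM is the transposed weight matrix with each row normalised to sum to 1). The function names and the toy weights are hypothetical.

```python
import numpy as np

def allocation_matrix(wpdi_m):
    """Q x O allocation matrix: each domain's weights normalised to sum to 1 (assumption)."""
    w = np.asarray(wpdi_m, dtype=float).T  # Q x O
    totals = w.sum(axis=1, keepdims=True)
    # Domains with no weighted link to any protein keep an all-zero row.
    return np.divide(w, totals, out=np.zeros_like(w), where=totals > 0)

def protein_scores_from_domains(wpdi_m):
    """Distribute an initial score of 1 per domain to the proteins."""
    apm = allocation_matrix(wpdi_m)
    s_d = np.ones(apm.shape[0])  # initial score vector of the domains
    return apm.T @ s_d           # O-dimensional vector PSD

# Toy weighted protein-domain matrix: 3 proteins (rows) x 2 domains (columns).
wpdi_m = [[0.5, 0.0],
          [0.5, 1.0],
          [0.0, 0.0]]
psd = protein_scores_from_domains(wpdi_m)
```

Under this assumption the total score is conserved: the entries of `psd` sum to the number of domains that have at least one weighted link.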
To calculate the score of the subcellular localization feature, let $N_{SL}$ represent the number of all subcellular localizations, and $N_{SL}(j)$ denote the number of proteins related to the $j$-th subcellular localization. Let s_avg denote the average number of proteins associated with a subcellular localization. Then, the score of the $j$-th subcellular localization can be computed as follows:
Where:
Hence, for any given protein $p_i$, its subcellular localization feature score can be calculated as follows:
Where $SL(p_i)$ is the set of subcellular localizations related to the protein $p_i$. In addition, because triangles have the characteristic of stability, we further adopt the topological feature of triangles extracted from OppiN to calculate a feature score for each protein $p_i$. Here, for a given protein $p_i$, its set of neighbor nodes is represented as $Np(p_i)$, then there is:
Therefore, the number of triangles for protein $p_i$ is computed as follows:
Where $TRI(p_i)$ is the set of triangles related to the protein $p_i$ and $|Np(p_i)|$ represents the degree of the protein $p_i$. According to the triangle numbers calculated above for each protein, we compute the triangle feature score for $p_i$:
Based on the orthologous information obtained from the InParanoid database (Gabriel et al., 2010), for any given protein $p_i$, let $f_{oth}(p_i)$ be its score of orthologous information; then we can calculate an orthologous feature score for $p_i$ as follows:
Based on the above formulas (14)∼(18), for any given protein $p_i$, we can obtain its feature score as follows:
$$FS(p_i) = \phi \cdot FS_{SL}(p_i) + \theta \cdot FS_{TRI}(p_i) + \tau \cdot FS_{ORT}(p_i) \tag{24}$$
Where ϕ, θ, and τ are proportion parameters, which are used to adjust the ratio of the feature scores for proteins and satisfy ϕ + θ + τ = 1. Finally, according to the above formula (13) and formula (19), for any given protein $p_i$, we can obtain its initial score as follows:
Here, ω is a proportion parameter.
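Two parts of this section lend themselves to a short sketch: counting the triangles that pass through each protein, and combining the three feature scores with weights that sum to 1. The triangle count uses the standard identity that the diagonal of the cubed adjacency matrix equals twice the number of triangles through each node. How the paper turns triangle counts into a feature score is not recoverable from this text, so only the counting and the weighted sum are shown; the function names, the toy graph, and the weights 0.2, 0.3, and 0.5 are hypothetical.

```python
import numpy as np

def triangle_counts(adj):
    """Number of triangles through each node of a simple undirected graph."""
    a = np.asarray(adj, dtype=int)
    # diag(A^3)[i] counts closed walks of length 3 from i, i.e. 2 x triangles.
    return np.diag(a @ a @ a) // 2

def feature_score(fs_sl, fs_tri, fs_ort, phi, theta, tau):
    """Weighted sum of the three feature scores; the weights must sum to 1."""
    if abs(phi + theta + tau - 1.0) > 1e-9:
        raise ValueError("phi + theta + tau must equal 1")
    return phi * fs_sl + theta * fs_tri + tau * fs_ort

# Toy graph: nodes 0, 1, 2 form a triangle; node 3 hangs off node 2.
adj = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
tri = triangle_counts(adj)

# Toy feature scores for two proteins, with hypothetical weights.
fs = feature_score(np.array([1.0, 0.0]), np.array([0.5, 0.5]),
                   np.array([0.0, 1.0]), phi=0.2, theta=0.3, tau=0.5)
```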

Construction of the Prediction Model WPDINM
According to the weighted PPI network WppiN, let $N(p_i)$ and $N(p_j)$ be the sets of neighboring nodes of $p_i$ and $p_j$, respectively; then $N(p_i) \cap N(p_j) = \{p_1, p_2, \ldots, p_T\}$ is the set of common neighbors of $p_i$ and $p_j$. Supposing that $WppiM(p_1, p_j) \leq WppiM(p_2, p_j) \leq \ldots \leq WppiM(p_T, p_j)$, we define the allocation possibility of weight from $p_i$ to $p_j$ as follows:
Similarly, supposing that $WppiM(p_1, p_i) \leq WppiM(p_2, p_i) \leq \ldots \leq WppiM(p_T, p_i)$, we define the allocation possibility of weight from $p_j$ to $p_i$ as follows:
Hence, based on the above formulas, for any two given protein nodes $p_i$ and $p_j$ in WppiN, we can obtain an allocation possibility matrix of weights between them as follows:
Where ρ is an adjustment parameter with a value between 0 and 1. Based on the above allocation possibility matrix WAPM, let $S_{t+1}$ denote the vector of protein scores at the $(t+1)$-th iteration; then we can calculate the ranks of proteins iteratively as follows:
Where $\mu \in (0, 1)$ is a scale parameter for adjusting the proportion of the current score vector $S_t$ and the initial score vector $S_0$. Based on the above descriptions, the algorithm WPDINM can be briefly described as follows.
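The iteration described above follows the general pattern of PageRank with a restart term. The exact update formula is missing from this text, so the sketch below assumes the form $S_{t+1} = \mu \cdot WAPM^T S_t + (1 - \mu) S_0$; which of the two terms carries µ, the orientation of WAPM, the stopping rule, and the toy matrix are all assumptions. The default µ = 0.4 is the value reported as most competitive in the parameter analysis.

```python
import numpy as np

def iterate_scores(wapm, s0, mu=0.4, tol=1e-8, max_iter=1000):
    """PageRank-style iteration mixing propagated scores with the initial scores."""
    wapm = np.asarray(wapm, dtype=float)
    s0 = np.asarray(s0, dtype=float)
    s = s0.copy()
    for _ in range(max_iter):
        # Assumed update rule: propagate along WAPM, then blend with S_0.
        s_next = mu * (wapm.T @ s) + (1.0 - mu) * s0
        if np.abs(s_next - s).sum() < tol:  # L1 change below tolerance
            return s_next
        s = s_next
    return s

# Toy row-stochastic allocation matrix for three proteins.
wapm = np.array([[0.0, 0.5, 0.5],
                 [1.0, 0.0, 0.0],
                 [0.5, 0.5, 0.0]])
s0 = np.array([1.0, 0.5, 0.2])
scores = iterate_scores(wapm, s0)
ranking = np.argsort(-scores)  # protein indices from highest to lowest score
```

With a row-stochastic matrix and µ below 1, this update is a contraction, so the scores converge to a unique fixed point regardless of the starting vector.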

Comparison of Twelve Essential Proteins Prediction Measures
The bar charts illustrate that the identification performance of WPDINM exceeds that of the other measures when the prediction accuracy is compared from the top 1% to the top 25% of ranked proteins. As Figure 2 shows, 90.19% of the top 1% of proteins ranked by WPDINM are true key proteins. For the top 5% of proteins, the identification precision of WPDINM reaches 81.96%. For the top 10% of proteins, the percentage of essential proteins identified by WPDINM is 70.58%. The prediction accuracies of WPDINM are 27.4, 19.6, 15.3, 13.2, 10.3, and 8.4% higher than those of the NC method from the top 1% to the top 25%, respectively. Compared with the TEGS method, the precision of WPDINM is 3.6% higher for the top 25% of proteins. Figure 3 shows the identification accuracy on the Krogan database. Among the top 1% of proteins, the true essential proteins predicted by WPDINM make up 95%. Among the top 5% of proteins, 145 essential proteins are detected. For the top 10% of proteins, the proportion of essential proteins detected by WPDINM is 5.7% higher than that of TEGS. For the top 15% and top 20% of proteins, WPDINM achieves identification accuracies of 66.7% and 60%, respectively. In particular, in the top 25% candidate proteins, the prediction accuracy of

Validated by Jackknife Methodology
To further assess the prediction performance of WPDINM, the jackknife methodology is adopted to compare WPDINM with the other methods. Figure 4 shows the comparison between WPDINM and the other methods for the top 600 ranked proteins in the DIP dataset. Figure 4A shows that WPDINM outperforms six prediction methods: IC, DC, CC, NC, BC, and EC. Figure 4B indicates that the performance of WPDINM also exceeds that of the other six methods: SC, Pec, CoEWC, POEM, ION, and TEGS. Figure 5 shows the comparison between WPDINM and the other measures for the top 600 ranked proteins in the Krogan dataset. From Figure 5A, it can be seen that the curve of WPDINM lies above the curves of the competing methods DC, IC, EC, BC, CC, and NC. From Figure 5B, we can observe that WPDINM is superior to the six methods SC, Pec, CoEWC, POEM, ION, and TEGS.
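Both evaluation views used so far, the accuracy within a top fraction of ranked proteins and the jackknife curve, can be computed from a ranking and a set of known essential proteins. The sketch below assumes that the jackknife curve is the cumulative count of true essential proteins among the top-ranked candidates, which is the usual reading in this literature; the function names, the rounding-up of the cut-off, and the toy data are assumptions.

```python
import math

def top_fraction_precision(scores, essential, fraction):
    """Share of true essential proteins among the top `fraction` of ranked proteins."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, math.ceil(len(ranked) * fraction))  # assumed: round the cut-off up
    hits = sum(1 for p in ranked[:k] if p in essential)
    return hits / k

def jackknife_curve(scores, essential, top_n):
    """Cumulative number of true essential proteins among the top_n ranked proteins."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    curve, hits = [], 0
    for p in ranked:
        hits += p in essential
        curve.append(hits)
    return curve

# Toy scores and essential set with hypothetical protein names.
scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6,
          "E": 0.5, "F": 0.4, "G": 0.3, "H": 0.2}
essential = {"A", "C", "D", "H"}
precision_25 = top_fraction_precision(scores, essential, 0.25)
curve_5 = jackknife_curve(scores, essential, 5)
```

A method whose curve lies above another's at every rank finds more essential proteins among the same number of candidates.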

Differences Between WPDINM and Other Methods
To examine the differences between WPDINM and the other methods, we select the top 500 ranked proteins and compare WPDINM with the 11 methods. The results of the comparison are shown in Tables 1 and 2. Table 1 shows the differences between the WPDINM method and the eleven methods in the DIP dataset. The second column of the table shows that the numbers of proteins identified by both WPDINM and each of DC, IC, CC, BC, and EC are fewer than 200. For NC, the number of common proteins detected by both WPDINM and NC is just under half. The proportions of overlapping proteins predicted by WPDINM and Pec, CoEWC, and POEM are not more than half. Table 2 reflects the differences between the WPDINM method and the other methods in the Krogan database. From Table 2 we can see that the proportion of key proteins in {WPDINM-Me} is higher than one of the methods.
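The overlap analysis above reduces to set operations on two top-ranked lists. In the sketch below, `only_wpdinm` corresponds to the set written {WPDINM-Me} in the text (proteins ranked in the top list by WPDINM but not by the other method Me); the function name, the dictionary keys, and the toy sets are hypothetical.

```python
def compare_top_sets(wpdinm_top, other_top, essential):
    """Overlap and difference statistics between two sets of top-ranked proteins."""
    def essential_ratio(candidates):
        # Share of true essential proteins in a candidate set (0.0 if empty).
        return len(candidates & essential) / len(candidates) if candidates else 0.0

    only_wpdinm = wpdinm_top - other_top   # {WPDINM-Me}
    only_other = other_top - wpdinm_top    # {Me-WPDINM}
    return {
        "common": len(wpdinm_top & other_top),
        "only_wpdinm": len(only_wpdinm),
        "only_wpdinm_essential_ratio": essential_ratio(only_wpdinm),
        "only_other_essential_ratio": essential_ratio(only_other),
    }

# Toy sets with hypothetical protein names.
stats = compare_top_sets({"A", "B", "C", "D"}, {"C", "D", "E", "F"},
                         essential={"A", "B", "C", "E"})
```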
We further employ the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve to test the prediction ability of the WPDINM model. The larger the area under the ROC curve (AUC), the better the prediction performance of the measure. The AUC values for all methods are collected in Table 3. Figures 6 and 7 show the ROC curves and PR curves of the WPDINM method and the other methods based on the DIP database and the Krogan database, respectively. As depicted in Figure 6F, although the ROC curves of WPDINM and ION overlap slightly, the AUC of WPDINM in Table 3 is higher than that of the ION model. Figure 7 shows that the ROC curve of WPDINM lies above those of the other competitive measures in the Krogan database. As shown in Table 4, when compared with the other 12 measures, the prediction accuracy of WPDINM is the highest from the top 1% to the top 25%. This indicates that the identification performance of the WPDINM model is better than that of the 12 competing methods and that the WPDINM method is widely applicable.
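For reference, the AUC can be computed directly from scores and labels as the probability that a randomly chosen essential protein is scored above a randomly chosen non-essential one, with ties counted as one half. The sketch below uses this pairwise definition, which is simple but quadratic in the number of proteins; the function name and the toy data are hypothetical, and this is not necessarily how the paper computed its AUC values.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores with labels (1 = essential, 0 = non-essential).
auc = roc_auc([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 0, 0])
```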

The Analysis of Parameters
To improve the prediction precision, we introduce a proportion parameter $\mu \in (0, 1)$ in the iterative formula (29). As Table 5 demonstrates, in the DIP dataset different values of µ influence the experimental results. The table reports the prediction accuracy for the top 1% to the top 25% of proteins when µ is set to different values. It can be seen that the prediction accuracy fluctuates slightly as the value of µ increases. We repeat the same procedure on the Krogan database; Table 6 presents the prediction performance on the Krogan database as the parameter changes. Finally, because the prediction result is most competitive when the value of µ is 0.4, we use this setting when comparing WPDINM with the other methods.
To achieve higher prediction accuracy, we set a series of parameters. When calculating the weighted PPI network, we add two parameters, β and γ, to formula (7); they regulate the ratio of the two kinds of similarity between proteins. When β and γ are both set to 0.5, the WPDINM method obtains the best prediction effect. In formula (19), the parameters ϕ, θ, and τ are employed to adjust the proportions of three features: subcellular localization, triangle features, and orthologous information. The best experimental result is obtained by setting ϕ, θ, and τ to […]
[…] based on the original PPI network and gene expression data processed by DFT. Next, the weighted domain-domain network is established based on the original protein-domain network. Then, by integrating the weighted domain-domain network with the weighted PPI network, the new weighted protein-domain network is constructed. After that, we assign an initial score to each protein by combining topological features with biological information such as orthologous information and subcellular localization. Finally, we design a novel iterative algorithm based on the PageRank algorithm to compute protein scores. To test the performance of the WPDINM algorithm, the method is applied to three datasets: the DIP database, the Krogan database, and the Gavin database. The experimental results show that WPDINM achieves better identification performance than the competing methods.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in the online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary Material.

AUTHOR CONTRIBUTIONS
ZM and LW conceived and designed the study. ZM, ZC, and LK obtained and processed datasets. ZM and LK wrote the manuscript. YT, ZZ, and XL provided suggestions and supervised the research. All authors contributed to the article and approved the submitted version.