An Improved Memetic Algorithm for Detecting Protein Complexes in Protein Interaction Networks

Identifying the protein complexes in protein-protein interaction (PPI) networks is essential for understanding cellular organization and biological processes. To address the high false positive/negative rates of PPI networks and detect protein complexes with multiple topological structures, we developed a novel improved memetic algorithm (IMA). IMA first combines the topological and biological properties to obtain a weighted PPI network with reduced noise. Next, it integrates various clustering results to construct the initial populations. Furthermore, a fitness function is designed based on the five topological properties of the protein complexes. Finally, we describe the rest of our IMA method, which primarily consists of four steps: selection operator, recombination operator, local optimization strategy, and updating the population operator. In particular, IMA is a combination of genetic algorithm and a local optimization strategy, which has a strong global search ability, and searches for local optimal solutions effectively. The experimental results demonstrate that IMA performs much better than the base methods and existing state-of-the-art techniques. The source code and datasets of the IMA can be found at https://github.com/RongquanWang/IMA.


INTRODUCTION
Many complex systems in the real world are often modeled with complex networks, such as computer networks, social networks, and biological networks. The detection of community structure is an essential property of complex networks, and it helps us understand the structure and functionally of complex networks. Protein complexes mined from the PPI network are representative of detecting community structure in complex networks. In proteomics, proteins rarely act alone, and often organize together to form protein complexes to perform specific biological functions cooperatively (Spirin and Mirny, 2003). Therefore, accurately identifying protein complexes from PPI networks can contribute to the study of the mechanisms of cellular functions and organization (Gavin et al., 2002) in the post-genomic era. Although some experimental methods such as yeast two-hybrid and tandem affinity purification can detect protein complexes, they have limitations. Specifically, they are expensive and time-consuming. With the development of high-throughput experimental technologies, many PPI networks are now available. The computational methods developed complement the experimental techniques in identifying protein complexes. As a result, many computational methods have been proposed for the identification of protein complexes from PPI networks, which is a type of cluster analysis, which consists of grouping patterns into clusters based on the similarity, and it is a valuable technology in many areas such as bioinformatics, machine learning, and computer vision.
To date, a variety of computational methods for detecting protein complexes in PPI networks have been proposed. Based on our work, we provide a summary of the related works by classifying the existing methods into three types: methods based on reducing noise, methods based on different topological structures, and methods based on evolutionary algorithms.
Because the PPI networks are derived from high-throughput experiments, the high false positive and false negative rates in the PPI networks are high (Von Mering et al., 2002;Samanta and Liang, 2003;Liu et al., 2009;Srihari, 2012;Zaki et al., 2013;Lei et al., 2018), which substantially affects the accuracy of protein complex identification. To reduce the influence of noise, some methods utilize various network topological properties for strong interactions and identify protein complexes. These algorithms include PEWCC (Zaki et al., 2013), ProRank+ (Hanna and Zaki, 2014), and EWCA (Wang et al., 2019a). Alternatively, some studies have attempted to integrate the gene expression data (Keretsu and Sarmah, 2016), gene ontology data (Zhang et al., 2013a;Wang et al., 2019b), and subcellular localization data (Wang et al., 2020) to improve the reliability of the interactions. These studies mainly use biological data to weight protein interactions and compensate the PPI network, which better reflects the real protein interactions. DPCT (SabziNezhad and Jalili, 2020) uses TAP and GO data to construct a weighted PPI network and to reduce the noise of PPI, and a memetic algorithm to detect protein complexes. Valdeolivas et al. (2019) extended the Random walk with restart (RWR) algorithm to multiplex and heterogeneous networks. This framework performs better as compared to the aggregation of the different interaction sources and the multiplex framework is more efficient than network aggregations to extract communities from biological networks. Blatti et al. (2020) presented Knowledge Engine for Genomics (KnowEnG), it is a free-to-use computational system for analysis of genomics data sets, it provides the standard clustering pipeline to cluster a collection of samples. However, further reducing the impact of random noise in PPI networks, and improving the performance of protein complex detection methods remain urgent problems to be solved.
In recent years, various computational methods based on different topological structures have been developed to detect protein complexes in PPI networks. Among these methods, different types of topological structures are commonly assumed to be protein complexes. Some methods partition proteins into many non-overlapping clusters by using partition functions or principles. For example, Markov Clustering (MCL) (Van Dongen, 2000) is a simulated random walk method, which mainly uses expansion and inflation operators to manipulate the adjacency matrix and mine protein complexes from the PPI networks. Meanwhile, RNSC (King et al., 2004) first moves the proteins randomly among the clusters to optimize the cost function, and then a post process, based on cluster size, density, and functional homogeneity is carried out. Some methods aim to find cliques. In 2006, CFinder (King et al., 2004) was developed to cluster proteins in the PPI network, and it used the concept of k-clique to discover protein complexes. In 2009, CMC (Liu et al., 2009) tried to enumerate cliques in the PPI network for discovering protein complexes, but it was too strict for most protein complexes. Therefore, several methods based on density have been designed to identify dense subgraphs in the PPI networks, where subgraphs with densities above a predefined threshold were considered as protein complexes. For example, MCODE (Bader and Hogue, 2003) first weighted every node by local neighborhood density and then extended locally dense nodes to detect protein complexes. Later, Li et al. (2008) improved the seed-extended method by modifying the DPClus algorithm based on the diameter and density of the local graph. Furthermore, Gavin et al. (2006) proposed that protein complexes have the core-attachment structure, and some identification methods based on the core-attachment structure. For example, COACH (Wu et al., 2009) and WPNCA  have been proposed to find protein complexes. They first extracted the protein complex cores from the neighborhood graphs of the proteins, and protein complex cores were further extended to form complete protein complexes. Finally, some protein complexes with a high overlap were merged. There are some variants of network topological features that are used to detect protein complexes; these studies (Nepusz et al., 2012;Giurgiu et al., 2019) have shown that proteins in a protein complex commonly display strong interactions within the core of the protein complex, and weak interactions with the proteins outer surface of the protein complex, ClusterONE (Nepusz et al., 2012) starts from a seed node and inserts neighbors into it to form overlapping protein complexes by using cohesiveness. Subsequently, Wang et al. (2019b) proposed a novel seedexpand algorithm called SE-DMTG to identify protein complexes with a combinatorial function from the weighted PPI networks. Additionally, based on the 3-sigma principle (Wang et al., 2013), MPC-C (Wang et al., 2020) identifies the active points of proteins in a time serial of gene expression data and generates a series of time-sequenced subnetworks to identify static and dynamic protein complexes. In 2021, Liu et al. (2021) proposed a protein complex detection methods based on a semisupervised model to detect protein complexes with clear module structures. A number of computational methods only consider single topological properties to identify protein complexes, and they recover protein complexes with other types of topological structures.
Intensive studies on evolutionary algorithms have also been conducted. In recent years, some researchers have provided new ideas for solving protein complex identification problems by using optimized algorithms, by employing the characteristics of highly adaptive and good optimization abilities. Some successful methods have been applied to tackle the problems of identifying protein complexes and efficiency. In 2015, Ramadan et al. (2016) introduced a genetic algorithm to detect protein complexes. Subsequently, in 2016, Lei et al. (2016) presented F-MCL based on Markov clustering and the firefly method, which automatically determines the parameters by using the firefly method. In the same year, a novel fruit fly optimization clustering method was designed to detect dynamic protein complexes (Lei et al., 2016). In 2017, Zhang et al. (2017) proposed a new firefly clustering algorithm for transforming the protein complex detection problem into an optimization problem. Zhao et al. (2017) proposed a novel improved cuckoo search clustering method for discovering protein complexes in dynamic weighted PPI networks. In 2019, Lei et al. (2019b) used a nature-inspired optimization method to detect protein complexes. In 2019, a moth-flame-optimization-based protein complex detection method was presented (Lei et al., 2019a). In 2020, an evolutionary algorithm based on a heuristic biological operator was introduced to detect protein complexes (Abduljabbar et al., 2020). These evolutionary methods have a strong global search ability, but they have difficulty in locating the local optima efficiently.
To solve these issues, we present a novel algorithm, named IMA, which uses an improved memetic algorithm we designed to detect protein complexes from the PPI network. First, we constructed a weighted PPI network by using the common neighbor, gene expression data, GO-slim data, and subcellular location data to reduce the impact of noise on our IMA. Second, many highquality initial individuals, including protein complexes with different topological structures, are generated using EWCA (Wang et al., 2019a), SE-DMTG (Wang et al., 2019b), and MPC-C (Wang et al., 2020). We propose a fitness function to identify protein complexes with various topological properties. Third, a new improved memetic algorithm is proposed to mine the protein complexes by optimizing this fitness function in the weighted PPI network. Remarkably, its selection, recombination, and updating population operators are used for the global search of the best individual, and a local optimization strategy is designed to locate the local optima individually. Finally, our IMA was applied to four different yeast PPI networks and compared with the 12 existing excellent methods. The experimental results illustrate that the IMA achieves state-of-the-art performance of computational metrics and biological relevance metrics in identifying protein complexes.
In the materials and methods section, we introduce the datasets and standard protein complexes used in the evaluation of IMA and define all phases of IMA separately. In the experiments and results section, we evaluated the proposed method and compare it with the state-of-the-art methods. The case study and discussion section shows some examples of protein complexes detected by IMA and we conclude this paper in conclusion section.
Furthermore, two sets of standard protein complexes from the literature (Wang et al., 2020) were used to evaluate the performance of the protein complex detection methods, and their information is shown in Table 2. Here, standard protein complexes 1 consists of the known protein complexes from MIPS (Mewes et al., 2004), SGD (Hong et al., 2007), TAP06 (Gavin et al., 2006), ALOY (Aloy et al., 2004), CYC 2008 (Pu et al., 2009) and NEWMIPS (Friedel et al., 2009). Meanwhile, standard protein complexes 2 also is a combined protein complexes dataset (Ma et al., 2017). It consists of the Wodak database (Pu et al., 2009), PINdb and GO complexes (Ma et al., 2017). Additionally, we also used CYC2008 protein complexes (Pu et al., 2009) and MIPS protein complexes (Mewes et al., 2004), and they come from other people's work, and they are shown in https:// github.com/RongquanWang/IMA/Additional file 3.
The GO-slim data can explain the biological function of proteins, and it can be downloaded from https://downloads. yeastgenome.org/curation/literature/go_slim_mapping.tab. The gene expression data were obtained from https://www.ncbi. nlm.nih.gov/sites/GDSbrowser. Additionally, the subcellular localization data set for yeast proteins was obtained from https://compartments.jensenlab.org/Downloads.

Preliminaries
Since PPI networks are defined using graph-theoretic concepts, we first provide some of the terminologies used in our paper and then describe the IMA method in detail.
A PPI network can be represented as an undirected graph G(V, E), where V is the set of vertices (individual proteins) and E is the set of edges (protein interactions) between the vertices. The neighbors of v in G, denoted by N(v), are the set of vertices adjacent to v.
Biologically, protein complexes are groups of proteins that interact with each other at the same time and place, forming a single multi-molecular machine. However, due to the inherent topological structures of protein complexes (Nepusz et al., 2012) in PPI networks, protein complexes are usually assumed to be the subgraphs of PPI networks. Let C (V C , E C , W C ) be a subgraph of G. The neighbors of C are defined by Eq.1: As a result, the task of identifying protein complexes can be formulated as mining connected clusters that are densely connected inside and well separated from the rest of the PPI networks, and the clusters correspond to the protein complexes. The protein complex detection method obtains a set of clusters, P (P 1 , P 2 , . . ., P t ).

Constructing a Weighted PPI Network
Many studies (Lei et al., 2018(Lei et al., , 2019bWang et al., 2020) have shown that the performance of protein complex identification methods can be improved by integrating multiple data sources into a single weighted PPI network, which enhances the confidence of interactions in PPI networks. Thus, we integrated the topological structures and multiple biological data to weight the interactions in the PPI networks. Our goal was to weight the edges of the PPI network to reflect the reliability of the corresponding interactions. Graph clustering algorithms can use these weights to reduce the influence of noisy edges and yield meaningful clusters.
2.2.2.1.1 Protein Common Neighbor similarity. The larger the number of common neighbors between the two proteins, the stronger the interaction of the two proteins, and they are more likely to participate in the same protein complex. Some common neighbor similarity measures (Zaki et al., 2013;Wang et al., 2019a;Wang et al., 2019b) were used to calculate the similarity of protein pairs. This paper defines a higher-order common neighbor (HCN) similarity measure to estimate the reliability of the interaction between two proteins, v and u. The HCN is defined by Eq. 2.
where NCN(v, u) N(v) ∩ N(u) is the number of common neighbors between proteins v and u. N(v) and N(u) represent the number of neighbors that proteins v and u are connected with, respectively. HCN(v, u) can further balance the comprehensive connectivity of the two interacting proteins, which may consist of the same protein complex.
2.2.2.1.2 Protein Co-Expression similarity. Next, the gene expression data describes proteins under various conditions in a biological process (Zhang et al., 2013b;Wang et al., 2014). The gene expression vector of each protein comprised of a series of expression values within the period. If two proteins have a similar degree of expression at the same time interval, they have a high co-expression value, and then they are more likely to form a protein complex. The gene expression profiles of a pair of proteins v and u in a PPI network, their gene expression profiles are v {v 1 , v 2 , . . ., v n } and u {u 1 , u 2 , . . ., u n }, respectively. Here, we use the person correlation coefficient (Wang et al., 2013) to calculate the co-expression value PCC(v, u), as defined in Eq. 3: where v and u are the average gene expressions of proteins v and u in n time points, respectively. PCC(v, u) indicates the coexpression of the vector representation between two interacting proteins. As the value of PCC(v, u) ranges from −1 to 1, we set PCC(v, u) PCC(v,u)+1.0 2 to set PCC(v, u) in [0,1]. The higher the value of PCC(v, u), the larger the probability of coexpression of proteins v and u and formation of a protein complex.
2.2.2.1.3 Protein Functional Similarity. From the perspective of protein function, we used GO-slim data to reflect the functional similarity of proteins. Moreover, we generated an attribute matrix O ∈ R N×M , where N denotes the number of proteins in the PPI network and M denotes the number of GO slim attributes. Based on matrix O, we constructed a protein attribute affinity matrix, S ∈ R N×N . Each entry FS(v, u) reflects the GO slim attribute similarity between proteins v and u, as defined in Eq. 4: (4)

Protein Subcellular Location
Similarity. Generally, if two interacting proteins have the same subcellular location, the interaction between the proteins is more reliable. Proteins in the protein complex should be localized in the same inner cellular compartment. Here, we defined the subcellular location similarity SL(v, u), and is defined in Eq. 5: where |SL(v)| and |SL(u)| denote the number of subcellular localizations of proteins v and u, respectively. |SL(v) ∩ SL(u)| represents the number of common subcellular localization attributes between proteins v and u. The edges whose weight is 0 are seen as noise and are removed from the PPI network, and the rest of edges whose weight W(v, u) are expressed by Eq. 6: . (6) Finally, the weighted PPI networks were constructed, and the reliability of the PPI networks was enhanced.

Memetic Algorithm
A memetic algorithm (Li et al., 2014) is typically a hybrid-local heuristic search method used for optimization. Generally, memetic algorithms consist of a genetic algorithm and local optimization strategy. Here, the genetic algorithm is the global search method, which can explore a reliable estimate of the global optimum, but it does not obtain an optimal individual in the explored search space quickly. Therefore, the local optimal search strategy is typically used to accelerate searching and find the best individual in the local search space. In this paper, we present an improved memetic algorithm that can create new individuals that are located in new promising regions in the global search space and search around the newly generated individuals to optimize individuals of better quality (Gach and Hao, 2012).

The IMA Algorithm for Complex Detection
The memetic algorithm is a valuable framework for dealing with combinatorial optimization problems. It provides a good balance between searching for diversification and optimization by employing a genetic algorithm and the local optimization strategy (Žalik and Žalik, 2018). We designed a fitness function to evaluate the quality of the clusters. In addition, we propose a recombination operator and local optimization strategy for detecting protein complexes in the PPI network. Our goal is to maximize the individual's FS fitness (see Eq. 13) using an improved memetic algorithm. Algorithm 1 is the main framework of IMA. The primary input of the IMA algorithm was the PPI network. A flowchart of Algorithm 1 is presented in Figure 1. IMA begins with initial populations (line 1, see section generating the initial population) and then repeats an iterative procedure many times (generations) (lines 3-18). Two parent individuals in the current population are selected using a tournament selection strategy for each generation (line 4, see section selection operator). Two new children individuals were generated based on two selected parents using the recombination operator (line 5, see section recombination operator). Then, we select the children with the highest fitness function (line 6, see section fitness function: Eq. 13), and if its fitness function is larger than the 80% fitness of the individual with the maximum fitness, it is further improved using the local optimization strategy we proposed (line 7-11, see section local optimization strategy). Finally, a roulette wheel selection strategy is used to update the population (line 12, see section updating population operator). The individual with the highest fitness function (line 13, Eq. 13) mined during the search process is always recorded (lines 14-16). The entire IMA stops when iter consecutive generations (line 3, Max Iter ). In the following subsections, we provide more details on the parts of the IMA algorithm.
2.2.2.3.2 Generating Initial Population. As we know, most of existing protein complex detection methods are based on unsupervised learning, and they can only identify protein complexes with a single topology. However, real protein complexes have a variety of topologies. That is why we proposed the IMA. In fact, IMA algorithm is a typical swarm intelligence optimization algorithm, which needs to build an initial population P. In order to build an initial population, we should follow two basic requirements: 1) The individuals in the initial population should be high quality. It means that the protein complex detection method should produce the high accuracy initial population; 2) Individuals in the initial population need to be diverse. Obtaining initial population method should identify the protein complexes with different topological structures. Based on the above two points, we choose EWCA (Wang et al., 2019a), SE-DMTG (Wang et al., 2019b) and MPC-C (Wang et al., 2020) as the methods to generate the initial population.
In fact, according to their references, we can see that these three algorithms not only have high protein complex detection accuracy, but also can identify the protein complexes with different topological structures. And their advantageous than other existing methods are shown in their literatures (Wang et al., 2019a;b, 2020). For example, EWCA can identify protein complexes with core-attachment structure, SE-DMTG can identify protein complexes with high density and modularity and MPC-C can predict static protein complexes and dynamic protein complexes with community structure. Moreover, the identification accuracy of these three algorithms is also excellent in the existing algorithms. Therefore, they can ensure the generation of protein complexes with high quality and Frontiers in Genetics | www.frontiersin.org December 2021 | Volume 12 | Article 794354 5 different topological structures. That is why we chose these algorithms.
The first approach is the EWCA, which generates o individuals by using different values of ss (structural similarity) threshold and ranges from 0.3 to 0.62 with 0.04 increment. The second approach is the SE-DMTG algorithm, which generates o individuals by randomly constructing seed queues, and then uses its seed-extended algorithm to generate individuals. The last method is the MPC-C algorithm, which is used to construct o individuals by setting different values of the filter Score(C) cutoff (from 0.1 to 0.8 with a step size of 0.1). Finally, we obtain Pop 3 × o individuals and combine them to create the initial population, where the parameter o is 8.
Algorithm 1. The pseudo-code of general framework of IMA.

Fitness Function.
We first define the fitness function of the IMA method; this fitness function should reflect the topological properties of protein complexes in PPI networks. Generally, a high-quality protein complex is a group of proteins that are densely interconnected but only sparsely connected with the rest of the PPI network (Li et al., 2008;Nepusz et al., 2012;Wang et al., 2019bWang et al., , 2020. Meanwhile, in identifying various topological properties of protein complexes, the combination of multiple single objective functions can compensate for the shortcomings of a single objective function, which leads to an improved quality of the identified protein complexes. Therefore, we propose a multi-objective function (Eq. 12) by integrating five objective functions (Eqs 7-11 to describe the topological properties of protein complexes: cohesiveness(C), density(C), AIEW(C), ABEW(C), and AWM(C).
C (V C , E C , W C ), where V C is the set of proteins in cluster C, E C is the set of interactions in cluster C, and W C is the set of weights between the pair of proteins. According to previous studies (Nepusz et al., 2012;Wang et al., 2019b), the cohesiveness score is defined as Eq. 7: where W in (C) is the sum of the weights of all edges among C, and W out (C) is the sum of the weights of the edges connecting nodes in C to other nodes in the rest of the PPI network. (Li et al., 2008;Liu et al., 2009;Wang et al., 2019b) hypothesized that the higher the density of the cluster is, the more likely the cluster is a protein complex in the PPI network. The weighted density of cluster C is defined by Eq. 8: where W in (C) is the sum of the weights of the edges between them, and V C is the number of nodes in C.
In this paper, we propose three measures to estimate the likelihood of a local cluster C being a protein complex. First, it is the average inner edge weight (AIEW), and it can estimate the reliability of the internal edges of the cluster C. This is defined in Eq. 9: where W in (C) is the sum of the weights of the edges among them, and |E C | is the number of edges in cluster C. AIEW(C) is the average weight of the inner edges in cluster C. Second, it is the average border edge weight (ABEW), and it can measure the reliability of the border edges of cluster C. This is defined in Eq. 10: where |BE C | {(u, v)|u ∈ C, v∉C} is the number of border edges that connect the cluster C with the rest of the PPI network, and W out (C) is the sum of the weights of the edges connecting nodes in C to the neighbor of the cluster C. ABEW(C) is the average weight of the border edges in cluster C. Third, it is the average weighted modularity (AWM), which indicates that the cluster C is highly average weight connected among them and has a low average weight interaction with the rest of the network. This is defined in Eq. 11: Based on these objective functions, we propose a fitness function that combines these single objective functions to assess the possibility of a cluster C being a protein complex. This fitness function is denoted by Eq. 12: here, the density(C) and AIEW(C) seek a dense intra-connection topological structure, whereas ABEW(C) identifies the sparse topological structure inter-connecting with the rest of the PPI network. cohesiveness(C) and AWM(C) are used to identify the topological structures with densely interconnected nodes that are sparsely connected to the rest of the PPI network. As a result, FF(C i ) is used to identify protein complexes with various topological structures in the individual FS. Finally, the fitness function (FS fitness ) of the individual is the sum of FF(C i ), which is defined in Eq. 13: where C i represents the ith cluster in the individual FS. FF(C i ) is a multi-objective function that is designed to capture the community structure of protein complexes (a group of nodes has better internal connectivity than external connectivity). k is the total number of protein complexes found in an individual FS.
In this study, the goal of our IMA method is to find the individual with the maximum value of FS fitness . Generally, the higher the FS fitness of an individual FS, and the better the quality of the individual. Therefore, the protein complex detection problem can be regarded as an optimization problem by maximizing the value of FS fitness in a PPI network.

Selection
Operator. The selection operator is an essential operation used in the memetic algorithm. The main idea of the selection strategy is that the better an individual, the higher is its chance of being a parent. In the IMA algorithm, the recombination operator and updating population operator need individual selection strategies.
In the parent selection operator, a selection strategy is required to select two parent individuals from the current population to generate individual children. The fundamental principle of this operation is that individuals with a higher fitness function are more likely to be selected as parent individuals. Here, we use a binary tournament selection strategy to select the parent individuals. Meanwhile, the recombination operator also uses a binary tournament selection strategy to select protein complexes from composite parent individuals to create individual children.
Lastly, the population updating strategy needs to update the population according to the fitness of individuals in current population and generate children individuals. In this updating process, we use the roulette wheel selection strategy to update the current population to balance the relatively good individuals and avoid precocity.

Recombination
Operator. The recombination operator is the critical diversification mechanism of memetic algorithm. An effective recombination operator should generate not only diversified solutions but also transfer significant components from parents to children (Hao, 2012;Li et al., 2014). The recombination operator is responsible for combining the genetic material of several individuals (usually two individuals) to create a new children (Spears and De Jong, 1995). New children inherit many high-quality protein complexes from their parents. Additionally, the recombination operator plays a vital role in the effectiveness of the memetic algorithm in the global search space. Traditional recombination operators, having uniform crossover and two-point crossover, are challenging to convey the excellent protein complexes of two parents to a new children simultaneously. This method is less suitable for protein complex detection. Therefore, we present a recombination operator based on the fitness function, whose children can inherit the better protein complexes of their parents. This operator plays an essential role in preventing the algorithm from being trapped in an optimal local solution and exploring the global search space. The main idea of this operator is to take the protein complexes from two parents as the genetic material and try to retain the high-quality protein complexes in parents for the new children. The recombination operator is described in Algorithm 2.
Let parent1 and parent2 represent the parents, and let len(parent1) and len(parent2) be the number of protein complexes in each parent. Parents parent1 and parent2 are merged into a composite parent. Note that if there are redundant protein complexes, and we only leave one. As a result, we obtain a composite parent individual compositeparent in line 2. Based on the length of parent1 (len(parent1)) and the length of parent2 (len(parent2)), we determine the length of the new children (children len ) by randomly generating a number between len low and len up in lines 3-5. The multi-objective function (Eq. 12) of each protein complex in the compositeparent individual is calculated and sorted in line 6. Next, according to their multi-objective function (Eq. 12) and compositeparent, we use the binary tournament selection strategy to create two new offspring in line 7.

Local Optimization strategy.
To improve the quality of the generated offspring, we presented a new local optimization strategy to obtain better offspring. This strategy is different from the general sense of local search strategies, such as the hillclimbing strategy and simulated annealing strategy. The purpose of the local optimization strategy is to obtain offspring of relatively high quality. Here, each protein complex in children is optimized using a multi-objective function (Eq. 12) and a local optimization strategy. The local optimization strategy of the IMA is applied to new children as shown in Algorithm 3.
In this process, for each protein complex Childrens i in the Childrens max , we optimize it using the following steps in lines 3-22. First, we find inner nodes (Innernodes) that belong to the Childrens i and connect at least one protein in the rest of the PPI networks in line 9. Then, we find the inner_max by improving the multi-objective function (see Eq. 12) maximum in Innernodes in line 10, and then we remove the inner_max from Clusterdel in line 11. Second, we find boundary nodes (Boundarynodes), which is the set of proteins that connect at least one inner protein of the current Childrens i in line 15. Then, we detect the boundary max by increasing the multi-objective function (see Eq. 12) maximum in line 16, and we insert the boundary max into Clusteradd in line 17. We repeat the above two steps until the protein complex Childrens i does no change, and if it is not changed (it is Frontiers in Genetics | www.frontiersin.org December 2021 | Volume 12 | Article 794354 considered a locally optimal cluster) or iteration > 20 in lines 19-22, it is an identified protein complex Opcs i in line 24. Next, we use the local optimization strategy to optimize the rest of the protein complexes in Childrens max in line 2-25. If the fitness function of the optimized individual Opcs is not larger than that of the Childrens max , for the individual Childrens max , its local optimization strategy is deemed invalid in lines 26-29. Finally, we output this optimal child, Opcs, in line 30.
Algorithm 3. The local optimization strategy.

Updating Population Operator.
Population diversity is also a vital issue for memetic algorithms to effectively avoid prematurity. When a new child individual is produced with the recombination operator and local optimization strategy, the fitness function (Eq. 13) of the new children and the current population are calculated, respectively. All of these are sorted by their fitness functions. Moreover, to avoid premature convergence, we employ a roulette wheel selection strategy to update the population. Here, the size of the new population in each iteration is the same as that of the original population. The roulette wheel selection strategy can balance the diversity of the population and guarantee that individuals with higher fitness always have a greater probability of being retained in the population.

EXPERIMENTS AND RESULTS
In the experiment, our operational environment was a windows 10 operating system with an Intel(R) Core(TM) i7-9700 CPU with a physical memory of 16 GB, and a speed of the processor was 3.60 GHz. The IMA was run on PyCharm Community Edition 2017.2.2. The IMA was implemented using the Python 3.

Evaluation Metrics
There are several statistical matching-based metrics that estimate the quality of the detected protein complexes based on different protein complex detection methods. Meanwhile, biological relevance-based metrics, which are supplementary to statistical matching-based metrics, are used to evaluate the biological significance of identified protein complexes. If a detected protein complex ipc and a known protein complex kpc contain common proteins each other, their overlapping score (OS(ipc, kpc)) is calculated using Eq. 14: where ipc and kpc are the protein set of ipc and the protein set of kpc, respectively. If OS(ipc, kpc) ≥ λ, ipc is matched with kpc, where λ is a threshold. The F-measure is the harmonic mean of precision and recall, and it can be calculated using Eq. 15: For more details, please see (Lei et al., 2019a). An identified protein complex is considered to match a standard protein complex where the overlap score OS(ipc, kpc) is larger than 0.2 (Lei et al., 2019a). The coverage rate (CR) was used to measure the number of proteins in the standard protein complexes that could be covered by the identified protein complexes . This is defined in Eq. 16: For more details on these parameters, please refer to reference . Generally, a higher Sn indicates that the identified protein complexes cover the proteins in the standard protein complexes better. In contrast, a higher PPV indicates that the identified protein complexes are more likely to be actual protein complexes. Accuracy (ACC) is the geometric average of PPV and Sn, which is denoted by Eq. 17: The maximum matching ratio (MMR) (Nepusz et al., 2012) can measure the overlap matching between standard protein complexes and detected protein complexes based on maximal one-to-one mapping. It can deal with the case that a known protein complex split into different parts in the identified protein complexes, because only one part is matched with the known protein complex. Jaccard (Wang et al., 2019b) was used to quantify the overlap between the detected protein complexes and known protein complexes. In fact, Jaccard is defined as the harmonic mean of the JaccardC of the identified protein complexes and the JaccardG of standard protein complexes, and it is used to evaluate the clustering results. Jaccard is calculated using Eq. 18: Frontiers in Genetics | www.frontiersin.org December 2021 | Volume 12 | Article 794354 8 As a result, the performance of the detection method is evaluated by the total score, which is calculated using Eq. 19 (Wang et al., 2020): In this paper, the p-value is used to estimate the biological relevance of the identified protein complexes, and it is denoted by Eq. 20: For a more detailed explanation of these parameters, please refer to references (Lei et al., 2019a;Wang et al., 2020). If the p-value of the protein complex is less than 0.01, it means that the protein complex has biological significance.
The co-localization score (CL) (Krumsiek et al., 2008) is denoted as the maximal fraction of proteins in a protein complex that is found at the same location (Friedel et al., 2009). For all the detected protein complexes by different methods, the average co-localization score was computed using Eq. 21: where l i,j is the number of proteins in the detected protein complex j allocated to the localization group i, N j is the number of proteins in the detected protein complex j, and m and n are the number of detected protein complexes and localization groups, respectively. The final localization score was calculated as the geometric mean of the co-localization scores based on the Huh datasets (Huh et al., 2003).  MaxIter determines the number of iterations of population optimization. Figures 2, 3 show how performance is influenced by these parameters in three PPI networks and two standard protein complexes. Effect of Pop. Figure 2 shows how Total score changes with the value of Pop. We can see that Total score is not very sensitive to Pop, especially when Pop falls in [6,10]. So in our experiments, Pop is set to 8 by default.
Effect of MaxIter. Figure 3 shows the changing trend of Total score when MaxIter increases from 40 to 80. We can see that for the three PINs, the fluctuations of Total score are not significant. In our experiments, we set MaxIter 60 by default.
According to Figures 2, 3, we can see that the Total score is not very sensitive to the changing trend of MaxIter and Pop, so we only set MaxIter 60 and Pop 8 as the default value. To avoid evaluation bias and overestimation of the performance, we do not tune the parameter to a particular dataset and set them as the default value in the all experiments. For more details on parameters setting, please see https://github.com/ RongquanWang/IMA/Additional file 1.rar.

Comparison With Competitive Algorithms
To demonstrate the performance of the IMA, we compared it with 12 state-of-the-art methods. These include MCODE (Bader and Hogue, 2003), MCL (Van Dongen, 2000), IPCA (Li et al., 2008), COACH (Wu et al., 2009), ClusterONE (Nepusz et al., 2012), PEWCC (Zaki et al., 2013), ProRank+ (Hanna and Zaki, 2014), WPNCA , WEC (Keretsu and Sarmah, 2016), EWCA (Wang et al., 2019a), SE-DMTG (Wang et al., 2019b), and MPC-C (Wang et al., 2020). The total score is used for a more comprehensive evaluation of the different methods. We obtained the software implementations for all the compared methods, and their parameters are shown in Table 3. Although better results could probably be obtained by fine-tuning these parameters, we only use default or suggestion thresholds to maintain the fairness of different algorithms.
The performance of the methods was compared on four PPI networks based on two standard protein complexes. The experimental results of our IMA and other methods on these PPI networks are listed in Tables 4, 5, and the highest value of each metric of each PPI network is in bold.
First, we compared them with the standard protein complexes 1. As shown in Table 4, IMA outperformed the other algorithms on the Krogan dataset. IMA obtained a F-measure of 0.6272, CR of 0.3917, MMR of 0.3266, Jaccard of 0.4373, and total score of 2.0688, which were obviously superior to other detection algorithms. MCL predicted 370 protein complexes and achieved the highest ACC of 0.3192. Second, we compared the 13 approaches using the DIP dataset. IMA detected 1338 protein complexes, and achieved the highest F-measure, MMR, Jaccard, and Total scores, respectively. ClusterONE found 904 protein complexes and achieved a CR of 0.5062 and ACC of 0.3270, the best performance in terms of CR and ACC. However, it only achieved a F-measure of 0.5118, MMR of 0.1467, Jaccard of 0.3297, and total score of 1.8214, which were lower than those obtained using the IMA method. Second, we compared our IMA and other methods using the combined6 dataset. Table 4 shows than the results obtained by using combined6 dataset are similar to those obtained using the DIP dataset. IMA detected 1054 protein complexes and achieved the highest F-measure, MMR, Jaccard, and total score with values of 0.7256, 0.3364, 0.4869, and 2.3052, respectively. IPCA found 2160 protein complexes and achieved a better CR of 0.5106. ClusterONE predicted 648 protein complexes, achieving a ACC of 0.3306, which was the highest. Finally, we also used the WI-PHI dataset to evaluate the performance of all methods, and IMA identified 2561 protein complexes, and IMA scores of F-measure, CR, MMR, Jaccard, and total score were higher than those determined by the other methods, and they were 0.7503, 0.6223, 0.3965, 0.4828, and 2.5579, respectively. As for the ACC, which is among the top three, only lower than SE-DMTG and ClusterONE. From the above analysis, we found that the IMA algorithm achieved the best performance in the most evaluation metrics, with the exception of CR and ACC in some cases. Therefore, these results demonstrate that the IMA outperforms the base and could be an excellent approach to detect protein complexes in PPI networks. More evaluation metrics are made available in the https://github.com/ RongquanWang/IMA/Additional file 2.    The comparison results obtained using standard protein complexes 2 in Table 5 are basically consistent with those obtained using standard protein complexes 1 in Table 4. This means that the performance of the proposed IMA is relatively stable. The IMA algorithm performs significantly well on four PPI networks, and it is competitive with the other algorithms in term of

Comparison With Functional Enrichment Analysis
We needed to conduct a multi-angle analysis for this statistic, because the p-value of the identified protein complexes is closely related to the size of the identified protein complexes (Wang et al., 2019b). For this purpose, the number of detected protein complexes (Num), the number of significantly identified protein complexes, and the percentage of significantly identified protein complexes with different p-values from 1E-2 to 1E-20 were used to analyze their functional enrichment. We used a p-value test to analyze the protein complexes discovered by the IMA, MCODE, MCL, IPCA, COACH, ClusterONE, PEWCC, ProRank+, WPNCA, WEC, EWCA, SE-DMTG, and MPC-C. The results of the p-values of these methods are shown in Table 6. As shown in Table 6, the number of protein complexes that could be significantly detected by IMA was higher than that determined by the other methods in the four PPI networks. This means that IMA can detect more protein complexes with biological significance compared to other methods. Although some detected protein complexes do not match standard protein complexes currently, they are likely to be real protein complexes. As for the percentage of significantly detected protein complexes at different thresholds of the p-value from E-2 to E-20 in Table 6, we can conclude that IMA could detect a relatively higher proportion of protein complexes with biologically significance in most PPI FIGURE 5 | The 130 th protein complex in standard protein complexes 2 detected by different methods based on WI-PHI PPI network. Note that the red nodes represent the correctly identified proteins, and the blue nodes represent the wrongly identified proteins by method.
Frontiers in Genetics | www.frontiersin.org December 2021 | Volume 12 | Article 794354 networks. The above analysis demonstrates that the IMA method could be a promising method for discovering new protein complexes with biological significance.

Comparison With Subcellular Location Score
According to the definition of colocalization score, it is based on the average colocalization of all detected protein complexes. It should be noted that the lower the number of detected protein complexes, the higher the colocalization score. Here, we used the ProCope tool (Krumsiek et al., 2008) to calculate the colocalization score. Table 7 shows the average co-localization scores of protein complexes detected using various methods on localization dataset, (Huh et al., 2003). In Krogan, the best co-localization score of 0.7442 is obtained by the MCODE method, but MCODE only detected 39 predicted protein complexes, which was beneficial for achieving high the co-localization score, and IMA obtained a score of 0.7351, lower than MCODE. In DIP, ProRank + detected 357 protein complexes and obtained a co-localization score of 0.6933, which was better than that of all the other methods. In combined6, IMA method detected 1054 detected protein complexes and achieved the highest the co-localization score of 0.7440. In WI-PHI, IMA achieved the highest co-localization score, and the number of protein complexes was 2561. Based on the co-localization score of the detected protein complexes by IMA, it indicates that the proteins of protein complexes predicted by IMA have better localization consistency; these proteins in the same protein complex tend to carry out a similar function.

CASE STUDY AND DISCUSSION
IMA algorithm can detect protein complexes with multiple topological structures. Figure 4 shows some examples of the detected protein complexes with different topological structures by using the IMA algorithm. Note that the standard protein complexes 1 and 2 are also detected by the IMA algorithm. These protein complexes contain the number of proteins between 3 and 14. These protein complexes with different topological structures include linear, triangular, starlike, rectangular, k-clique, dense subgraph, and coreattachment structure, and hybrid structure. Proteins that have binary interactions are connected by dashed lines. More examples can be found them at https://github.com/ RongquanWang/IMA/Examples. Figure 5 visualizes an example of the 130 th protein complex in standard protein complexes 1 in the WI-PHI dataset so as to display the detection result more clearly. Figure 5A shows that our IMA successfully detected all proteins correctly. The red nodes represent the correctly identified proteins, and the blue nodes represent the wrongly identified proteins.
From Figure 5, we can see that SE-DMTG correctly identifies 14 proteins, but misidentifies a protein. Moreover, the other methods have inaccurately proteins. Our IMA can correctly identify almost all proteins, which suggests that the IMA algorithm is superior to the other comparative methods.

CONCLUSION
In this paper, we present a novel IMA method for identifying protein complexes in PPI network. The key idea of IMA is enabled us to design an improved memetic algorithm to optimize a fitness function for identifying protein complexes in PPI networks based on existing contending methods and a weighted PPI network. Here, an improved memetic algorithm is the cooperation of a genetic algorithm with a local optimization strategy. A genetic algorithm is used to improve the diversity of the population, and the local optimization strategy helps to locate better solutions more quickly. Furthermore, we designed a fitness function to overcome the limitations of a single objective function in estimating an individual's fitness. The experimental results show that IMA significantly outperforms the existing outstanding algorithms in various metrics. We will use graph neural networks (Zhang et al., 2021) and other evolutionary algorithms to improve the accuracy of protein complexes identified in the future.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/RongquanWang/IMA.