A Local Search Algorithm for the Influence Maximization Problem

How to select a set of k nodes (called seeds) in a social network so that the spread of influence under a given diffusion model is maximized is a major issue in social network analysis. This problem is known as the Influence Maximization Problem (IMP). Because the IMP is NP-hard, designing a good algorithm for it is very challenging. In this paper, we propose an efficient local search algorithm called DomIM to solve the IMP, which involves two main ideas. The first is an approach to constructing an initial solution based on a dominating set, and the second is a degree-based greedy strategy in the local search phase. DomIM is evaluated on three real-world networks under three widely used diffusion models: the independent cascade (IC) model, the weighted cascade (WC) model, and the linear threshold (LT) model. Experimental results show that DomIM is competitive and efficient, and that under all of these diffusion models it obtains the best performance (in terms of solution quality) on the networks we consider.


INTRODUCTION
A social network is an interconnected structure that consists of a set of socially relevant nodes (e.g., individuals, groups, organizations, or related systems) connected by one or more relations, such as shared ideas, social contacts, financial stock exchanges, and affinities [1,2]. Exploring valuable information about the nodes and revealing the relations between them is therefore of considerable interest. To this end, many topics have been introduced to analyze social networks from different perspectives; please refer to [3,4] for an overview of social network analysis.
One of the most studied problems in social network analysis is the influence maximization problem (IMP), whose task is to select a set of k nodes from a given social network, called the seed set, so that the number of influenced nodes under a given diffusion model is maximized. Due to its potential applications in practice, the IMP has attracted widespread attention. In particular, with the rapid development of communication technology, social networks are becoming increasingly large. As a consequence, information exchange among users of social networks has become an indispensable part of daily life, and a large number of users may be influenced by such information diffusion. Accordingly, there is a growing body of literature analyzing influence and information propagation in social networks [5][6][7][8].
The best-known application of the IMP is viral marketing, which aims to exploit the network value of customers, i.e., the potential influence of a customer who may recursively influence his neighbors (e.g., family members, colleagues, friends, friends' colleagues, friends' friends, and so on) to buy a product through "word-of-mouth" propagation [9]. A small number of highly influential customers can then be targeted for marketing, so that the expected profit is maximized. Besides viral marketing, there are many other applications, e.g., analyzing human behavior [10], targeted advertising [11], rumor blocking [12], and social recommendation [13]. In practice, the spread of influence is described by means of diffusion models. Three widely used diffusion models are the independent cascade (IC) model, the weighted cascade (WC) model, and the linear threshold (LT) model, where the WC model is a special case of the IC model [14]; see Section 2 for a detailed discussion of these models.

Related Works
The IMP in social networks was first studied in 2001 by [9], who regarded it as an algorithmic problem. Since then, it has been studied extensively, especially after the work of [14], who formulated it as a combinatorial optimization problem and proved that it is NP-hard. Nevertheless, solving the IMP remains challenging because of two difficulties: the first is how to accurately measure the influence of a given seed set, which has been shown to be #P-hard; the second is how to select a seed set with the maximum influence [15,16]. We review related work on the IMP from two aspects: greedy based approaches and heuristic approaches. For more detailed categorizations of this problem, please see the survey papers [17][18][19].

Greedy Based Approach
It is widely accepted that the first work using a greedy idea to solve the IMP is due to [14], who proposed a simple hill-climbing greedy algorithm for the IMP under the IC model and the LT model. Although the algorithm guarantees a solution whose influence spread is within a factor of about $1 - 1/e$ (roughly 63%) of the optimum, it is very time-consuming, because it has to examine every node of the network and run tens of thousands of Monte Carlo simulations. To improve the efficiency of the simple greedy algorithm, [20] proposed an improved greedy algorithm called CELF, with an approximation ratio of $\frac{1}{2}(1 - 1/e)$, which selects influential nodes by leveraging the submodularity property. They showed that CELF can be up to 700 times faster than the simple greedy algorithm. However, CELF performs poorly on large networks, since it has to compute the marginal influence spread of each candidate node repeatedly [21]. [22] designed new schemes to optimize the greedy algorithm under the IC model, from which they obtained a faster greedy algorithm based on CELF. [23] developed an improved version of CELF, called CELF++, and showed that it is 35-55% faster than CELF. [24] proposed a deprecation based greedy algorithm for the IMP, called DGS. This algorithm first orders the nodes of a social network by applying three heuristic influence functions, and then selects the most influential nodes from the list of pre-ordered vertices. Although DGS takes less time than CELF, it is still time-consuming on large networks. In [25], Heidari et al. proposed a fast greedy algorithm, SMG, to solve the IMP. By reducing the calculations needed for counting the traversed nodes and for Monte Carlo graph construction, SMG improves the efficiency of greedy algorithms. To deal with the time-consuming drawback of greedy algorithms, [26] proposed the CascadeDiscount algorithm for solving the IMP. The algorithm uses PageRank to measure the initial influence of nodes, measures a node's marginal gain of influence spread by considering the influence loss on its neighbors, and then selects the most influential nodes based on a greedy strategy. [27] proposed a community-based framework for the IMP, which was further improved in [28] by designing an objective function to evaluate the influence spread and then deriving an efficient greedy algorithm to find the influential nodes.
A simple greedy algorithm can yield nearly optimal solutions, but it is often time-consuming, which limits its application to large-scale networks. As a useful technique for dealing with NP-hard problems, heuristic approaches have been widely used for a variety of problems, such as the partition coloring problem [29], network immunization [30], and the dominating set problem [31]. Heuristic algorithms for the IMP have likewise been proposed in succession.

Heuristic Approach
To address the low efficiency of simple greedy algorithms, [22] proposed in 2009 a degree discount heuristic to improve influence spread, which considers the degree discount of a candidate node caused by its seed neighbors. However, although it reduces the running time, this algorithm is less accurate than greedy algorithms. Later on, [32] introduced a heuristic approach called MIP to measure a node's influence from other nodes, on the basis of which a heuristic algorithm called PMIA was developed to solve the IMP on large-scale social networks. The drawback of PMIA is that different thresholds have to be chosen for different networks and there is no uniform method for setting them, which may affect the accuracy of the algorithm. Since then, a large number of heuristic algorithms for the IMP have been developed. In 2011, [33] designed a simulated annealing based algorithm for the IMP under the IC model, which integrates two heuristic approaches to optimize the convergence process and a method to speed up the selection of the most influential nodes. [34] proposed a scalable influence approximation algorithm, IPA, for the IMP under the IC model, which uses independent influence paths to estimate the influence of nodes. With the aim of bridging theory and practice in influence maximization, [35] proposed an algorithm called TIM. They showed that TIM runs in $O((k + \ell)(n + m)\log n / \varepsilon^2)$ time and guarantees an approximation ratio of $1 - 1/e - \varepsilon$ with probability at least $1 - n^{-\ell}$. By combining a genetic approach with the strength greedy algorithm, [36] proposed an efficient algorithm for solving the IMP in social networks. Based on evolutionary methods, [37] introduced a simple genetic algorithm for the IMP. In [38], Kim proposed a Random Walk and Rank Merge based algorithm, which uses a random walk method to speed up the algorithm. [39] analyzed why greedy approaches have low efficiency and proposed a degree-descending search strategy, based on which they designed an evolutionary algorithm. By eliminating the time-consuming simulations of a greedy algorithm, the efficiency of the algorithm is improved significantly. Recently, [16] proposed a discrete shuffled frog-leaping algorithm for the IMP, which selects influential nodes based on network topology characteristics. In [21], Qin et al. introduced a discount-degree descending technique and a lazy-forward technique to identify a set of candidate nodes, based on which they designed a two-stage selection algorithm for the IMP in social networks. [6] proposed a path-based approach, which uses the degree and the independent influence path to estimate the influence spread and uses a heuristic method to reduce the amount of computation.
Heuristic algorithms usually have better running time and scalability, but they cannot provide any performance guarantee.

Contribution
In this paper, we propose an efficient local search algorithm named DomIM to solve the IMP in social networks. Our contributions mainly include the following three aspects.
(1) We propose a mechanism to construct a high-quality initial solution based on a dominating set, and an approach to building a candidate set. (2) A degree-based greedy strategy is introduced in the local search. (3) DomIM is evaluated on three real-world graphs under the IC model, the WC model, and the LT model. Compared with four heuristic algorithms, DomIM is competitive and efficient, and obtains the best performance on these graphs.
The remainder of the paper is organized as follows. Section 2 introduces basic definitions, including the influence maximization problem and three diffusion models. Section 3 gives a brief overview of the dominating set problem and of a heuristic algorithm for finding a minimum dominating set that we will use. Section 4 describes our DomIM algorithm. Section 5 presents experimental results, and Section 6 concludes the paper and discusses future work.

PRELIMINARIES
To study the IMP, we often abstract a social network as a graph, where the vertex set represents the set of nodes in the social network and edge set represents the social ties among nodes. From now on, we use the term "graphs" to replace "social networks", and follow the standard terminologies in graph theory.
All graphs considered in this paper are simple undirected graphs. Let G = (V, E) be a graph with vertex set V and edge set E.
We use a 2-length string uv to denote an edge connecting two vertices u and v. The two vertices u, v are called the endpoints of edge uv. An edge is said to be incident with its two endpoints, and the two endpoints of an edge are said to be adjacent to each other. A vertex is called a neighbor of another vertex if they are adjacent in G. We write $N_G(v)$ for the set of neighbors of a vertex v and $d_G(v) = |N_G(v)|$ for its degree. For a subset S ⊆ V, we use G[S] to denote the subgraph of G induced by S, i.e., the graph obtained from G by deleting all vertices in V \ S and their incident edges.

Influence Maximization Problem
Given a graph G = (V, E) and a positive integer k, the task of the IMP is to find a set S of k vertices (called the seed set) such that the influence spread of S [denoted by σ(S)], i.e., the number of influenced vertices triggered by S, is maximized under a given diffusion model. This problem was formulated as an optimization problem by [14] as follows:
$$S^* = \arg\max_{S \subseteq V,\ |S| = k} \sigma(S). \quad (1)$$
In Equation 1, the maximum is taken over all seed sets S and S* is the best one that can maximize the spread of influence. Now, we describe three widely-adopted diffusion models that we will use for the IMP.

Independent Cascade Model
As the simplest of the dynamic cascade models, the IC model was first investigated by [40]. In this model, the influence spread, starting with a set of active vertices (the seed set), follows a randomized rule: an active vertex can activate its inactive neighbors only when it first becomes active. Specifically, let u be a vertex activated at step t. Then, for each inactive neighbor $v \in N_G(u)$, there is a single chance for v to be activated by u, with probability $p_{u,v}$ (a parameter independent of all previous attempts to activate v). If u succeeds, then v becomes active at step t + 1; otherwise, v remains inactive. Note that whether or not v is activated, u cannot attempt to activate v again at subsequent steps. If at step t an inactive vertex u has more than one newly activated neighbor, then they attempt to activate u one by one in an arbitrary order. The diffusion process stops when no further vertices can be activated.
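The cascade rule above can be illustrated with a short simulation. The following is a minimal sketch, not the authors' implementation: it assumes the graph is given as an adjacency dictionary `adj`, uses a single uniform probability `p` for every edge, and estimates σ(S) by averaging over repeated runs. The function names `simulate_ic` and `estimate_spread` are hypothetical.

```python
import random

def simulate_ic(adj, seeds, p, rng):
    """Run one IC cascade and return the set of vertices active at the end."""
    active = set(seeds)
    frontier = list(seeds)  # vertices that became active in the previous step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                # u gets exactly one chance to activate each inactive neighbor v
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def estimate_spread(adj, seeds, p, runs=1000, seed=0):
    """Monte Carlo estimate of sigma(S): mean size of the final active set."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(adj, seeds, p, rng)) for _ in range(runs))
    return total / runs
```

Because only newly activated vertices are placed in the frontier, each active vertex attempts each neighbor at most once, which matches the single-chance rule of the model.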

Weighted Cascade Model
The WC model is a special case of the IC model [14], in which a newly activated vertex u activates its inactive neighbor v with a probability determined by the degree of v, i.e., $p_{u,v} = 1/d_G(v)$. Clearly, a high-degree vertex is activated by each of its active neighbors with low probability. In a certain sense, this reflects actual interpersonal relationships. If a person has only one friend, then suggestions from that friend will play a very important role in his decisions. In contrast, if a person has many friends, then suggestions from any one of them may be less important to his decisions.
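To make the degree-dependent probability concrete, the sketch below computes $p_{u,v} = 1/d_G(v)$ for every ordered pair of adjacent vertices and runs one cascade with these probabilities. It is an illustration under the assumption that the graph is an adjacency dictionary; `wc_probabilities` and `simulate_wc` are hypothetical names.

```python
import random

def wc_probabilities(adj):
    """Map each ordered pair (u, v) of adjacent vertices to p_{u,v} = 1/d_G(v)."""
    return {(u, v): 1.0 / len(adj[v]) for u in adj for v in adj[u]}

def simulate_wc(adj, seeds, seed=0):
    """Run one cascade under the WC model and return the final active set."""
    rng = random.Random(seed)
    prob = wc_probabilities(adj)
    active, frontier = set(seeds), list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                # same single-chance rule as the IC model, with p depending on d_G(v)
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active
```

On a star graph, for instance, the center activates every leaf with probability 1 (each leaf has degree 1), whereas a single leaf activates the center only with probability 1 divided by the number of leaves.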

Linear Threshold Model
The LT model differs from the IC model and the WC model in that it describes the spread process using vertex-specific thresholds [14]. In this model, an inactive vertex v is influenced by each of its active neighbors u with a weight $b_{v,u}$, subject to the constraint $\sum_{u} b_{v,u} \le 1$, where the sum is taken over all active neighbors u of v. This constraint is natural, since the probability that v is activated is at most one.
The dynamic process can be described as follows. We randomly preassign a threshold $\theta_v \in [0, 1]$ to each vertex v. Then, starting with a seed set as the initial set of active vertices, in step t (t ≥ 2) each vertex that was active in step t−1 remains active, and an inactive vertex v is activated if the total weight of its active neighbors is at least $\theta_v$, i.e.,
$$\sum_{u \in N_G(v),\ u \text{ active}} b_{v,u} \ge \theta_v.$$
Intuitively, the thresholds represent the different tendencies of vertices to become active. Since these tendencies are not known, we assign the same threshold to all vertices in the experiments.
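The threshold rule can be sketched as follows. This is an illustrative example rather than the authors' code: it assumes a common threshold `theta` for all vertices and equal weights $b_{v,u} = 1/d_G(v)$, which is the setting the experiments later report for the LT model, and it updates all vertices synchronously in each step. The name `simulate_lt` is hypothetical.

```python
def simulate_lt(adj, seeds, theta=0.5):
    """Deterministic LT diffusion with a common threshold and weights 1/d_G(v)."""
    active = set(seeds)
    while True:
        newly_active = []
        for v in adj:
            if v in active or not adj[v]:
                continue  # already active, or isolated and hence never influenced
            # total weight of the active neighbors of v, each weighing 1/d_G(v)
            weight = sum(1 for u in adj[v] if u in active) / len(adj[v])
            if weight >= theta:
                newly_active.append(v)
        if not newly_active:
            return active
        # vertices activated in this step only influence others from the next step on
        active.update(newly_active)
```

With a fixed threshold and fixed weights the process is deterministic, so a single run suffices to evaluate the influence spread of a seed set in this setting.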

DOMINATING SET
A dominating set of a given graph G = (V, E) is a subset S of vertices such that $V \setminus S \subseteq N_G(S)$, where $N_G(S) = \{v \mid v \text{ has a neighbor in } S\}$. The minimum dominating set problem (MDS) aims to find a dominating set of minimum cardinality. The MDS is a classic NP-hard problem, which has been widely studied from both theoretical and applied points of view [41,42], especially with regard to designing efficient approximation algorithms [43,44]. Given that the vertices of a dominating set are adjacent to all remaining vertices of the graph, we have reason to believe that vertices from a (minimum) dominating set can have high influence. So, we can construct an initial solution based on a dominating set of a given social network.
Our algorithm DomIM, presented in the next section, relies on an algorithm for finding a small dominating set. For this purpose we use the algorithm ScBppw proposed by [31]. Here we present the local search framework of ScBppw for the reader's convenience.
Notice that Algorithm 1 integrates two sub-procedures, InitDS and ExchangeVertices, where InitDS is a simple greedy strategy to generate an initial solution and ExchangeVertices is an exchanging procedure based on a proposed tabu strategy. For more information about this algorithm, please refer to paper [31].
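As a small illustration of the definition, the sketch below checks whether a vertex set is dominating and builds a dominating set with a plain greedy rule. The greedy rule is only a stand-in chosen for brevity; it is not ScBppw, and it is an assumption that it resembles the InitDS step. The function names are hypothetical.

```python
def is_dominating(adj, S):
    """Return True if every vertex is in S or has a neighbor in S."""
    S = set(S)
    return all(v in S or any(u in S for u in adj[v]) for v in adj)

def greedy_dominating_set(adj):
    """Repeatedly add the vertex that dominates the most undominated vertices."""
    undominated = set(adj)
    D = set()
    while undominated:
        # gain of v = number of still-undominated vertices in its closed neighborhood;
        # iterating over sorted labels makes ties go to the smallest label
        best = max(sorted(adj),
                   key=lambda v: len(({v} | set(adj[v])) & undominated))
        D.add(best)
        undominated -= {best} | set(adj[best])
    return D
```

The greedy rule always terminates, because any undominated vertex has gain at least one, but it gives no guarantee of finding a minimum dominating set.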

THE DOMIM ALGORITHM
We develop a local search algorithm for the IMP named DomIM (Algorithm 2), which is based on a dominating set and a degree-related rule for selecting vertices.
In the beginning, the algorithm finds a small dominating set, from which an initial solution is constructed. Since the MDS problem is NP-hard, we adopt a fast heuristic algorithm (Algorithm 1) to find an approximately minimum dominating set of the input graph (line 2). A key concept of our algorithm is the uncorrelated degree.
For a fixed set S, the vertices with the maximum S-uncorrelated degree may have higher influence in some sense, since they can directly influence more vertices when the vertices in S are not considered. This observation is used to construct an initial solution and to design an approach to improving solutions by exchanging vertices.
The algorithm uses a greedy strategy (related to the uncorrelated degree) based on the dominating set to construct an initial solution. When the dominating set D contains fewer than k vertices, the algorithm selects k−|D| vertices from V \ D (in decreasing order of S-uncorrelated degree) and adds them to D to obtain an initial solution (lines 3-5); otherwise, the algorithm chooses the k vertices of D with the largest uncorrelated degrees as an initial solution (line 7).
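The two cases of this construction can be sketched as below. Two points are assumptions made for illustration and are not confirmed by the text: the S-uncorrelated degree of a vertex is taken to be the number of its neighbors outside S, and the dominating set D itself is used as the reference set in both cases. Ties are broken by the smallest vertex label, and the names `uncorrelated_degree` and `initial_solution` are hypothetical.

```python
def uncorrelated_degree(adj, v, S):
    """Assumed definition: the number of neighbors of v that lie outside S."""
    return sum(1 for u in adj[v] if u not in S)

def initial_solution(adj, D, k):
    """Build a k-vertex seed set from a dominating set D."""
    D = set(D)
    if len(D) < k:
        # D is too small: pad it with vertices outside D of largest uncorrelated degree
        outside = sorted((v for v in adj if v not in D),
                         key=lambda v: (-uncorrelated_degree(adj, v, D), v))
        return D | set(outside[:k - len(D)])
    # D is large enough: keep its k vertices of largest uncorrelated degree
    ranked = sorted(D, key=lambda v: (-uncorrelated_degree(adj, v, D), v))
    return set(ranked[:k])
```

Under this assumed definition, a vertex whose neighbors are mostly already in the reference set ranks low, which reflects the idea that it adds little direct influence beyond that set.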
After the construction of an initial solution S, the algorithm computes the minimum S-correlated degree of vertices in S, according to which a candidate set T is constructed for the local search phase (line 9). Note that T may not be large enough to improve the current solution S by repeatedly exchanging vertices between T and S; if this happens, i.e., |T| < α|V|, the algorithm selects ⌈α|V|⌉ − |T| vertices from V \ (S ∪ T) with S-uncorrelated degree as high as possible and adds them to T, where α is a real number in the interval (0,1) related to the value of |V| and the type of the diffusion model (lines 10-12).
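The padding step for the candidate set can be sketched as follows. The sketch covers only the case |T| < α|V| and takes T as given, since how T is first derived is not reproduced here. It assumes, for illustration, that the S-uncorrelated degree of a vertex is the number of its neighbors outside S; `pad_candidates` is a hypothetical name.

```python
import math

def pad_candidates(adj, S, T, alpha):
    """Enlarge T to ceil(alpha * |V|) vertices when it is smaller than alpha * |V|."""
    S, T = set(S), set(T)
    # note: alpha * |V| is a float product, so rounding can affect the ceiling
    target = math.ceil(alpha * len(adj))
    if len(T) >= target:
        return T
    # candidates outside S and T, by decreasing assumed S-uncorrelated degree
    pool = sorted((v for v in adj if v not in S and v not in T),
                  key=lambda v: (-sum(1 for u in adj[v] if u not in S), v))
    return T | set(pool[:target - len(T)])
```

Since |T| is an integer, the test `len(T) >= target` is equivalent to |T| ≥ α|V|, so the function leaves T unchanged exactly when no padding is required.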
Subsequently, a loop (lines 13-24) is executed until a given termination condition is reached, and DomIM returns the best seed set S* found (line 25). In each iteration of the loop, a local search step exchanges vertices between the current solution S and the candidate set T in order to improve the current solution (starting with the initial solution). Specifically, the algorithm chooses a vertex u ∈ S with the minimum S-uncorrelated degree (selecting one at random when more than one vertex attains the minimum value). Note that u may already have been selected in an earlier iteration; if so, u is reselected randomly (lines 15-16). Then, the algorithm removes u from S, selects a vertex v from T at random (for diversity), and adds it to S (lines 17-18). If the exchange produces more influence, it is regarded as a valid move, and the algorithm updates S* to S and T to T ∪ {u} (for the diversity of solutions) (lines 19-21); otherwise, S is restored to its previous state (lines 22-24).
Frontiers in Physics | www.frontiersin.org October 2021 | Volume 9 | Article 768093
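The exchange loop described above can be sketched as follows. This is a simplified illustration, not Algorithm 2 itself: the influence estimator is passed in as a callable `spread`, the termination condition is a fixed iteration count instead of a time cutoff, the random reselection of a previously chosen u is omitted, and v is removed from T after an accepted swap so that S and T stay disjoint. The S-uncorrelated degree is again assumed to be the number of neighbors outside S, and `local_search` is a hypothetical name.

```python
import random

def local_search(adj, S, T, spread, iterations=100, seed=0):
    """Swap vertices between S and T, keeping a swap only if the spread increases."""
    rng = random.Random(seed)
    S, T = set(S), set(T)
    best_val = spread(S)
    for _ in range(iterations):
        if not T:
            break
        # u: a vertex of S with minimum assumed S-uncorrelated degree (random tie-break)
        deg = {x: sum(1 for w in adj[x] if w not in S) for x in S}
        low = min(deg.values())
        u = rng.choice(sorted(x for x in S if deg[x] == low))
        v = rng.choice(sorted(T))  # random candidate, for diversity
        candidate = (S - {u}) | {v}
        val = spread(candidate)
        if val > best_val:
            # valid move: accept the swap and return u to the candidate set
            S, best_val = candidate, val
            T = (T - {v}) | {u}
        # otherwise S is left unchanged, which rolls the swap back
    return S, best_val
```

Because S changes only on a strict improvement, the set returned is always the best one encountered, so it plays the role of S* in this sketch. In practice `spread` would be a simulation-based estimate under the chosen diffusion model, and it dominates the cost of each iteration.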

EXPERIMENTS
We evaluate DomIM on three real-world (undirected) networks under the three diffusion models described in Section 2, i.e., the LT model, the IC model, and the WC model. The data come from two databases: Network Repository and SNAP (Stanford Large Network Dataset Collection).
ia-email-univ (IEU) [45]: This network, from Network Repository, is an email communication network at the University Rovira i Virgili in Tarragona, in the south of Catalonia, Spain. It has 1,133 vertices and 5,451 edges. Each vertex represents a user, and an edge between two users indicates that one sent at least one email to the other.
soc-wiki-Vote (SWV) [45]: This network, also from Network Repository, covers all Wikipedia voting data from the inception of Wikipedia until January 2008. The graph contains 889 vertices and 2,914 edges, where vertices represent Wikipedia users and a directed edge from vertex i to vertex j indicates that user i voted on user j. In our experiments, we consider only the underlying undirected graph.
feather-lastfm-social (FLS) [46]: This is a social network of LastFM users which was collected from the public API in March 2020. This graph is from SNAP, consisting of 7,624 vertices and 27,806 edges, where vertices represent LastFM users from Asian countries and edges are mutual follower relationships between them.

Experiment Setup
DomIM is implemented in C++ and compiled with g++ 8.2.0. All experiments are run on a computer with an Intel i7-8565U 1.80 GHz CPU and 16 GB RAM under Windows 10.
We compare the overall performances of DomIM with four heuristic algorithms, including Degree [14], Random [14], CELFGreedy [20], and TreeCore [47]. Degree is a simple algorithm that selects high-degree vertices. Random chooses vertices randomly. CELFGreedy is a greedy algorithm with lazy-forward optimization, in which for each candidate seed set, it executes 10,000 simulations to obtain an accurate estimation of influence spread. Therefore, CELFGreedy is time-consuming. TreeCore is an approach based on a network connectivity parameter called tree coritivity.
For each instance, all algorithms are executed 3 times for each seed set size from 1 to 50, and we report the best solution found in each case. The time limit of each run depends on the size of the network and is at most 90 s.

Results on Real World Social Networks
Experimental results under the three diffusion models are shown in three groups of figures (Figures 1-3), where each group contains three panels corresponding to the three networks we consider [(a) for the IEU network, (b) for the SWV network, and (c) for the FLS network]. In each panel, the x-axis represents the size of the seed set (labeled seed set size, ranging from 1 to 50) and the y-axis represents the number of vertices that are active at the end of the diffusion process (labeled influence spread). Each panel depicts the results obtained by five different algorithms (represented by distinct colors), where CELFGreedy, Degree, Random, and TreeCore are the four approaches mentioned above, and DomIM is our algorithm.

Results Under the LT Model
Under the LT model, all vertices are assigned the same threshold 0.5 in the experiment, and we assume that an inactive vertex v is influenced by each of its neighbors u with the same weight $b_{v,u} = 1/d_G(v)$. In Figure 1 (a), α is set to 0.1 and the cutoff is 10 s; in Figure 1 (b), α is set to 0.05 and the cutoff is 15 s; and in Figure 1 (c), α is set to 0.06 and the cutoff is 50 s.
In each panel, the influence spread tends to rise as the seed set size increases, although some exceptions may occur due to the random selection in the local search phase. Of all these approaches, Random performs worst on these instances. The reason is simply that Random does not consider any network properties and does not use any strategy to improve the solution; we include it only for the sake of comparison. Overall, our algorithm DomIM performs best in terms of solution quality, while Degree and TreeCore are worse on the IEU instance and CELFGreedy is worse on the SWV instance. In particular, for the IEU and SWV instances, DomIM is clearly better than the other algorithms. For the FLS instance, CELFGreedy performs slightly worse than DomIM, and Degree and TreeCore perform worse still.

Results Under the IC Model
Under the IC model, for every two adjacent vertices u and v such that u is active and v is inactive, the probability $p_{u,v}$ that v is activated by u is set to the same value 0.05 (this choice is based on the consideration that the networks we use are sparse). In Figure 2 (a), α is set to 0.01 and the cutoff is 10 s; in Figure 2 (b), α is set to 0.01 and the cutoff is 20 s; and in Figure 2 (c), α is set to 0.0015 and the cutoff is 20 s.
As shown in Figure 2, all of these algorithms (except for Random) obtain a good influence spread, and they perform very similarly on all three instances. Our algorithm DomIM slightly outperforms Degree, TreeCore, and CELFGreedy (especially as the size of the seed set increases), with TreeCore and CELFGreedy performing very close to DomIM. Note that the result obtained by the simple approach Degree is also not bad. The reason is that the diffusion probability is small, which limits the propagation depth of an active vertex, so that high-degree vertices can influence many more neighbors. This shows that selecting high-degree vertices as the seed set can produce a good influence spread in this case.

Results Under the WC Model
For the WC model, in Figure 3 (a), α is set to 0.01 and the cutoff is 20 s; in Figure 3 (b), α is set to 0.01 and the cutoff is 20 s; and in Figure 3 (c), α is set to 0.0065 and the cutoff is 90 s.
As shown in Figure 3, all algorithms (except for Random) achieve a similar influence spread. For the IEU instance, DomIM has the best performance for almost all seed set sizes; for the SWV and FLS instances, DomIM and CELFGreedy are better than the other algorithms and perform similarly to each other. However, DomIM is efficient, whereas CELFGreedy is inefficient and takes a long time to obtain a better solution.

Analysis of Underlying Strategies
We also study the effectiveness of the key strategies of our algorithm. We modify DomIM to obtain two alternative approaches, denoted by DomIM$_1$ and DomIM$_2$, where DomIM$_1$ uses the standard degree in place of the uncorrelated degree and DomIM$_2$ is obtained from DomIM by removing the local search procedure.
The comparison of DomIM and its alternatives on the three real-world instances is carried out under the LT model, and the results are presented in Figure 4, from which we see that DomIM is better than DomIM$_1$ and DomIM$_2$. This implies that these two strategies play an important role in our algorithm DomIM.

CONCLUSION
We proposed a local search algorithm, DomIM, for the IMP. Compared with four distinct types of algorithms, DomIM is efficient and robust, and obtains the best performance on all graphs and under all diffusion models we use. However, in order to obtain an improved solution in the local search phase, our algorithm has to compute the influence of a newly constructed seed set in each iteration. This may slightly affect the efficiency of DomIM. We would like to consider this issue in our future work.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
EZ performed the measurements, analyzed the results, and wrote the paper. LY performed the experiments and analyzed the results. YX analyzed the results and refined the manuscript.