Identifying Multiple Influential Spreaders in Complex Networks by Considering the Dispersion of Nodes

Identifying multiple influential spreaders, which relates to finding k (k > 1) nodes with the most significant influence, is of great importance in both theoretical and practical applications. It is usually formulated as a node-ranking problem and addressed by sorting spreaders by their influence, as measured from the topological structure of interactions or the propagation process of spreaders. However, ranking-based algorithms may not guarantee that the selected spreaders have the maximum influence, as these nodes may be adjacent and thus play redundant roles in the propagation process. We propose three new algorithms to select multiple spreaders by taking into account the dispersion of nodes in the following ways: (1) improving a well-performing local index rank (LIR) algorithm by extending its key concept of the local index (an index that measures how many of a node’s neighbors have a higher degree) from first- to second-order neighbors; (2) combining LIR with the independent set (IS) method, which generalizes the coloring problem to complex networks and ensures that selected nodes of the same color are non-adjacent; (3) combining the improved second-order LIR method with the IS method so as to make the selected spreaders more dispersed. We evaluate the proposed methods against six baseline methods on 10 synthetic networks and five real networks based on the classic susceptible-infected-recovered (SIR) model. The experimental results show that our proposed methods can identify nodes that are more influential. This suggests that taking into account the distances between nodes may aid in the identification of multiple influential spreaders.


INTRODUCTION
Many real-world problems involve the identification of multiple influential nodes in complex networks, such as finding a few individuals who are critical to the spread of information on the internet, or who may speed up the transmission process of pestilence in crowds once infected [1]. The problem of identifying multiple influential nodes differs from that of discovering the most influential nodes. The latter refers to finding the k (k > 1) most influential spreaders, which is commonly addressed by ranking the influence of individual nodes. The former involves the identification of a set of k nodes with the maximum influence as a whole. That is, identifying multiple influential nodes should take into account the different roles that nodes play in the propagation process rather than just evaluating their individual influence [2].
Methods to identify multiple influential spreaders fall into three categories. The first regards this as an influence maximization (IM) problem. Some well-known methods include the greedy [3], new greedy [4], community-based greedy [5], k-medoid [6], two-phase influence maximization [7], and collective influence [8] algorithms. However, as the IM problem is NP-hard, these algorithms are challenged by increasing network sizes, and thus are not applicable to huge real networks.
Methods in the second category attempt to identify multiple influential nodes by ranking their influences, which are calculated according to various topology-based centrality measures: 1) classic topological centrality metrics, such as degree centrality [9], betweenness centrality [10], and closeness centrality [10]; 2) centrality measures that take into account multiple (global or local) network features, such as KED centrality [11], efficiency centrality (EC) [12], composite centrality based on analytic hierarchy process [13], and classified neighbors centrality [14]; and 3) local-information-based iterative algorithms such as PageRank [15], LeaderRank [16], and VoteRank [17]. However, the ranking approach may not always find a set of nodes with the maximum influence [18], possibly because they separately measure the influence of each node, and thus omit overlapping effects of topologically adjacent top-ranked nodes.
Algorithms in the third category consider the distance between nodes when evaluating node importance. For instance, the local index rank (LIR) algorithm [19] is based on the local index (LI) value of a node, which represents the number of neighbors whose degree exceeds that of the focal node. Spreaders are selected from nodes whose LI values are 0 (i.e., 0-LI nodes). However, the LIR method cannot avoid selecting some adjacent 0-LI nodes, and sometimes there are not enough 0-LI nodes to be selected as spreaders. Another example is the independent set (IS) algorithm [20], which divides nodes into independent sets by the Welsh-Powell coloring algorithm and selects spreaders from the largest independent set to ensure that selected nodes are non-adjacent. However, the largest independent set may not contain enough spreaders, and directly selecting the remaining spreaders from the subsequent independent sets may diminish the advantages of the independent set approach.
We propose three methods with different degrees of dispersion to identify multiple spreaders. The first is the LIR-2 method, which extends the concept of the local index to second-order neighbors and does not restrict the selection of spreaders to 0-LI nodes. By doing so, this method enlarges the distance between the 0-LI nodes and guarantees that enough spreaders can be selected. The second is the IS-LIR method, which combines LIR and IS to ensure that nodes in the same independent set are non-adjacent. The third is the IS-LIR-2 method, which combines the improved second-order LIR method with the IS method so that the selected spreaders are more dispersed. Comparing the three proposed methods with traditional methods for multiple spreader identification on 10 synthetic networks and five real networks based on the SIR propagation model, we find that our methods are more effective in maximizing the spreading coverage, and that a higher dispersion of the selected spreaders helps to amplify the spreading.
The rest of this paper is organized as follows. Sec. 2 introduces work relating to the identification of multiple influential spreaders. Sec. 3 formalizes the research problem and proposes our methods. Sec. 4 describes our experiments, including baseline methods, the SIR propagation model, evaluation metrics, parameter settings for experiments, and datasets. Sec. 5 provides the experimental results and discusses why diversity should be considered when we select a set of influential spreaders. We summarize our work in Sec. 6.

RELATED WORK
Identifying a set of influential nodes in a network is important for designing network immunization [21] and system control strategies [22], and for improving network robustness [23,24]. Work on multiple spreader identification falls into three categories. The first regards it as an influence maximization problem [3], and thus utilizes optimization algorithms to directly identify a set of spreaders. The greedy algorithm [3] is a classic example. These algorithms are accurate but time-consuming, and thus do not suit large-scale networks. Some researchers employ information about network structures to reduce the time complexity while maintaining the high accuracy of classic optimization algorithms. The NewGreedy algorithm [4] removes edges that do not contribute to propagation, so as to speed up the simulation process. The community-based greedy algorithm (CGA) [5] mines the top-k spreaders from detected communities so as to reduce the running time. Another algorithm [6] constructs an information transfer probability matrix and uses the k-medoid clustering algorithm to find the most centrally located nodes in clusters as spreaders. Two-phase influence maximization (TIM) [7] includes the phases of parameter estimation and node selection to reduce time complexity.
Methods in the second category select the top-ranked spreaders, whose influence is calculated based on network topological information. Classic indicators such as degree centrality [9], betweenness centrality [10], closeness centrality [10], and coreness centrality [25], have been utilized to estimate the influence of spreaders. Some researchers take into account multiple (global or local) network features when measuring the importance of spreaders [26]. For instance, KED centrality [11] combines the number and diversity of paths. Composite centrality based on the analytic hierarchy process (AHP) [13] combines degree, betweenness, and closeness centrality. Classified neighbors centrality (CNC) [14] classifies the neighbors of a focal node into four groups according to the removal order in the process of k-shell decomposition, weights each class differentially, and sums them to characterize the spreading capacity of the node. PageRank [15], LeaderRank [16], and VoteRank [17] all consider the importance of a node itself and its connections with other nodes to identify influential nodes. These rank-based algorithms often have simple forms and low time complexity and can effectively mine a single important node. However, they may not efficiently find multiple important spreaders because they seldom consider interactions between spreaders, i.e., they ignore the overlapping effects of top-ranked nodes if they are topologically adjacent.
Algorithms in the third category attempt to minimize the overlapping effects of spreaders during selection. The SuperNode algorithm [27] uses the Blondel community detection algorithm to obtain the community division of the network, and selects important nodes from the communities according to size so that the selected nodes are some distance apart. An independent set (IS)-based partitioned ranking algorithm [20] divides nodes into independent sets by the Welsh-Powell coloring algorithm, then selects the top-ranked nodes in the largest independent set based on certain centrality indicators. The local index rank (LIR) algorithm [19] selects spreaders from 0-LI nodes, i.e., those with no direct neighbor of higher degree than themselves. However, there may not be enough 0-LI nodes to be selected as spreaders in some cases, and the selection of adjacent nodes cannot be avoided. We seek to overcome the above deficiencies by extending the LIR method to second-order neighbors and integrating it with the IS method.

METHODS
We formalize the problem of multiple influential spreader identification and propose the LIR-2, IS-LIR, and IS-LIR-2 algorithms, which consider the diversity of nodes to different degrees.

Formulation of Research Problem
Consider a graph $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_N\}$ denotes the node set of size $N$, and $E = \{e_1, e_2, \ldots, e_M\}$ denotes the edge set of size $M$. A method to address the problem of multiple influential node identification can be regarded as a function $f(\cdot)$ that selects a node subset $S \subseteq V$ of a given size $k$ ($1 < k < N$) with the maximum influence on graph $G$, i.e., $S^* = \arg\max_S f(S, G)$.

LIR-2 Method
LIR-2 improves on LIR [19], where the local index (LI) of node $v_i$ is the number of its first-order neighbors of greater degree, i.e., $LI(v_i) = \sum_{v_j \in N(v_i)} Q(d_j - d_i)$, where $d_i$ denotes the degree of node $v_i$, $N(v_i) = \{v_j \mid (v_i, v_j) \in E\}$ contains the neighbors of $v_i$, and $Q(x) = 1$ when $x > 0$, and otherwise $Q(x) = 0$. Nodes with LI values of zero (i.e., 0-LI nodes) are ranked by degree, and the top-ranked nodes are selected as spreaders.
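As an illustration of the local index described above, the following is a minimal Python sketch. It assumes an undirected graph stored as a dictionary of neighbor sets; the function names `local_index` and `lir_select`, the tie-breaking by node label, and the five-node toy graph are hypothetical choices made for this example and do not come from the original LIR implementation.

```python
def local_index(adj):
    """Map each node to its LI: the number of neighbors with strictly greater degree."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    return {v: sum(1 for u in nbrs if degree[u] > degree[v])
            for v, nbrs in adj.items()}


def lir_select(adj, k):
    """Return up to k 0-LI nodes, ranked by descending degree (ties by node label)."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    li = local_index(adj)
    zero_li = [v for v in adj if li[v] == 0]
    zero_li.sort(key=lambda v: (-degree[v], v))
    return zero_li[:k]


# Hypothetical toy graph (not the 20-node network of Figure 1).
toy = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1, 5}, 5: {4}}
print(local_index(toy))
print(lir_select(toy, 2))
```

On this toy graph only node 1 has an LI of zero, so a request for two spreaders returns a single node. This is the shortage of 0-LI nodes that motivates LIR-2.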
LIR-2 extends the neighbors of node $v_i$ from first to second order. The second-order local index $LI_2$ of node $v_i$ is defined as $LI_2(v_i) = \sum_{v_j \in N(v_i) \cup N(N(v_i))} Q(d_j - d_i)$, where $d_i$ denotes the degree of node $v_i$, $N(v_i)$ and $N(N(v_i))$ denote the first- and second-order neighbors, respectively, of node $v_i$, and $Q(x) = 1$ when $x > 0$, and otherwise $Q(x) = 0$. According to the definition, the $LI_2$ value of node $v_i$ is the number of its first- and second-order neighbors of greater degree.
The LIR-2 method sorts nodes by $LI_2$ values within degrees, and selects those of top rank as spreaders, as described in Algorithm 1.
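The second-order index and the resulting ranking can be sketched as follows. This is only an illustrative sketch under stated assumptions: the graph is a dictionary of neighbor sets, nodes are ranked by ascending second-order local index with ties broken by descending degree and then by node label (one possible reading of the sorting rule, since Algorithm 1 is not reproduced here), and the names `second_order_local_index` and `lir2_select` as well as the toy graph are hypothetical.

```python
def second_order_local_index(adj):
    """Count first- and second-order neighbors with strictly greater degree."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    li2 = {}
    for v, nbrs in adj.items():
        reach = set(nbrs)            # first-order neighbors
        for u in nbrs:
            reach |= adj[u]          # add neighbors of neighbors
        reach.discard(v)             # the node itself is not its own neighbor
        li2[v] = sum(1 for u in reach if degree[u] > degree[v])
    return li2


def lir2_select(adj, k):
    """Assumed ranking: ascending LI_2, then descending degree, then node label."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    li2 = second_order_local_index(adj)
    ranked = sorted(adj, key=lambda v: (li2[v], -degree[v], v))
    return ranked[:k]


# Hypothetical toy graph (not the 20-node network of Figure 1).
toy = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1, 5}, 5: {4}}
print(second_order_local_index(toy))
print(lir2_select(toy, 3))
```

Because the selection is not restricted to nodes whose index is zero, the function returns k nodes whenever the graph has at least k nodes.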

Algorithm 1. LIR-2
Figures 1A,B illustrate LIR and LIR-2, respectively, on a toy network with 20 nodes and 41 edges. Figure 1C shows a single 0-LI node (node 20). Therefore, 0-LI nodes are insufficient for the selection of multiple spreaders. As LIR-2 is not limited to selecting top-ranked spreaders from nodes with $LI_2$ values of 0, it can select as many spreaders as required.

IS-LIR Method
The LIR method cannot avoid the selection of adjacent nodes. We combine LIR with the IS method to ensure that nodes in the same independent set are non-adjacent. The proposed IS-LIR method uses the Welsh-Powell algorithm to divide nodes into different independent sets, then calculates LI for nodes in the independent sets, which are ranked in descending order of size. Nodes are selected from the ranked independent sets, one by one, based on the LIR method. The IS-LIR algorithm is outlined in Algorithm 2. Figure 1D illustrates the IS-LIR method on a toy network as an example, first using the Welsh-Powell algorithm to color all nodes in four colors (blue, green, yellow, and pink). Nodes of the same color constitute an independent set. The independent sets are sorted by the number of nodes they contain, and nodes are sorted by degree within each independent set. We now have a node list, whose top members are selected as the influential spreaders. For instance, using the IS-LIR method, if we seek three effective spreaders on the toy network, we will select nodes 20, 8, and 1 in the blue set.
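The IS-LIR procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' Algorithm 2: it assumes a dictionary-of-sets graph, a standard greedy Welsh-Powell coloring, independent sets ordered by descending size, and nodes within each set ordered by ascending LI and then descending degree. The function names, the tie-breaking by node label, and the toy graph are hypothetical.

```python
def welsh_powell(adj):
    """Greedy Welsh-Powell coloring; returns one list of nodes per color class."""
    degree = {v: len(adj[v]) for v in adj}
    uncolored = sorted(adj, key=lambda v: (-degree[v], v))
    classes = []
    while uncolored:
        current, remaining = [], []
        for v in uncolored:
            # v joins the current color only if it touches no node already in it
            if all(u not in adj[v] for u in current):
                current.append(v)
            else:
                remaining.append(v)
        classes.append(current)
        uncolored = remaining
    return classes


def is_lir_select(adj, k):
    """Concatenate the ranked independent sets and return the first k nodes."""
    degree = {v: len(adj[v]) for v in adj}
    li = {v: sum(1 for u in adj[v] if degree[u] > degree[v]) for v in adj}
    sets = sorted(welsh_powell(adj), key=len, reverse=True)  # largest set first
    ranked = []
    for nodes in sets:
        ranked.extend(sorted(nodes, key=lambda v: (li[v], -degree[v], v)))
    return ranked[:k]


# Hypothetical toy graph (not the 20-node network of Figure 1).
toy = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1, 5}, 5: {4}}
print(welsh_powell(toy))
print(is_lir_select(toy, 3))
```

Swapping the LI values in `is_lir_select` for second-order local index values would give a corresponding sketch of IS-LIR-2.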

IS-LIR-2 Method
IS-LIR-2 combines IS and LIR-2 to select spreaders from more dispersed candidates. Its process, as shown in Algorithm 3, is similar to that of IS-LIR, but nodes in each independent set are ranked based on $LI_2$ values.
Frontiers in Physics | www.frontiersin.org January 2022 | Volume 9 | Article 766615
Algorithm 3. IS-LIR-2
Figure 1E illustrates how IS-LIR-2 runs on the toy network. As with the IS-LIR method (Figure 1D), nodes are colored with four colors. Three spreaders, nodes 20, 8, and 4, are selected according to their $LI_2$ values, as shown in Figure 1F.

EXPERIMENT SETTINGS
We introduce the classic SIR model, which is used to simulate epidemic spreading, and present two evaluation metrics to compare the performance of the proposed methods with six baseline methods: degree centrality ranking (DC) [9], LIR [19], degree centrality ranking based independent set (IS-DC) [20], eigenvector centrality ranking based independent set (IS-EV) [20], neighborhood centrality ranking based independent set (IS-ND) [20], and VoteRank [17]. We describe the synthetic and real networks used in our experiments, and discuss parameter settings.

SIR Model
The SIR model classifies each node in a propagation process into one of three states: susceptible, infected, or recovered. All nodes are initially susceptible, except for a few infected nodes. In our simulations, the infected nodes at time step $t = 0$ are those identified as influential by our proposed methods or by the baseline methods used for comparison. At each time step, every node that was infected at the end of the previous time step randomly selects one neighbor, which, if susceptible, becomes infected with probability μ. Each infected node recovers with probability β. Recovered nodes cannot be infected again and cannot infect susceptible neighbors. A simulation ends when no infected nodes remain in the network.
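The update rule described above can be sketched as a short simulation. This is a minimal sketch, assuming a dictionary-of-sets graph with sortable node labels, synchronous updates, and recovery attempts only for nodes that were already infected at the start of the step (the text does not specify whether newly infected nodes may recover in the same step). The function name `sir_outbreak` and the star-graph example are hypothetical.

```python
import random


def sir_outbreak(adj, seeds, mu, beta, rng):
    """Run one SIR simulation and return the final fraction of recovered nodes."""
    if not 0 < beta <= 1:
        raise ValueError("beta must be in (0, 1] so that the simulation terminates")
    infected, recovered = set(seeds), set()
    while infected:
        newly_infected = set()
        for v in sorted(infected):              # sorted for reproducible runs
            neighbors = sorted(adj[v])
            if not neighbors:
                continue
            target = rng.choice(neighbors)      # contact exactly one random neighbor
            susceptible = target not in infected and target not in recovered
            if susceptible and rng.random() < mu:
                newly_infected.add(target)
        # Assumption: only nodes infected before this step may recover now.
        newly_recovered = {v for v in sorted(infected) if rng.random() < beta}
        recovered |= newly_recovered
        infected = (infected - newly_recovered) | newly_infected
    return len(recovered) / len(adj)


# Hypothetical star graph: center 0 with three leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(sir_outbreak(star, [0], mu=1.0, beta=1.0, rng=random.Random(0)))
```

With μ = 1 and β = 1 on this star, the center infects exactly one leaf and then recovers, so half of the four nodes are eventually recovered regardless of the random seed. In practice, the result would be averaged over many independent runs.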

Evaluation Metrics
We use two measures to evaluate the performance of our methods in identifying effective influential spreaders. The outbreak size proportion [28] at time step $T$ is
$F(T) = \frac{n_{R(T)} + n_{I(T)}}{N}$,
where $n_{R(T)}$ and $n_{I(T)}$ are the numbers of recovered and infected nodes, respectively, at the end of time step $T$, and $N$ is the total number of nodes.
The average shortest path length of the identified spreaders represents the dispersion among them [28], and is defined as
$L = \frac{1}{|S|(|S| - 1)} \sum_{u, v \in S, u \neq v} l(u, v)$,
where $l(u, v)$ is the shortest path length between nodes $u$ and $v$; when $|S| = 1$, $L = 0$. A larger $L$ indicates a smaller overlapping neighbor area between nodes in the spreader set.
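Both measures are simple to compute; a minimal Python sketch is given here. It assumes a connected, unweighted graph stored as a dictionary of neighbor sets, and it assumes that the average shortest path length is normalized over all ordered pairs of distinct spreaders, i.e., divided by |S|(|S| - 1). The function names and the path-graph example are hypothetical.

```python
from collections import deque


def outbreak_proportion(n_recovered, n_infected, n_nodes):
    """Fraction of nodes that are recovered or infected at a given time step."""
    return (n_recovered + n_infected) / n_nodes


def bfs_distances(adj, source):
    """Shortest path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def average_spreader_distance(adj, spreaders):
    """Mean shortest path length over ordered pairs of distinct spreaders."""
    s = list(spreaders)
    if len(s) < 2:
        return 0.0                      # L = 0 for a single spreader
    total = 0
    for u in s:
        dist = bfs_distances(adj, u)    # assumes every spreader is reachable
        total += sum(dist[v] for v in s if v != u)
    return total / (len(s) * (len(s) - 1))


# Hypothetical path graph 0-1-2-3-4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(average_spreader_distance(path, [0, 2, 4]))
print(average_spreader_distance(path, [0, 1, 2]))
```

On the path graph, the spread-out set {0, 2, 4} yields a larger value than the clustered set {0, 1, 2}, which is the sense in which a larger L indicates more dispersed spreaders.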

Synthetic and Real Networks
To evaluate the effectiveness of our proposed methods in identifying influential spreaders on networks with different topological structures, we compare them with benchmark algorithms on 10 synthetic networks and five real networks. The synthetic networks include three small-world networks generated based on the Watts-Strogatz (WS) small-world network model [29], four scale-free networks generated based on the Barabási-Albert (BA) scale-free network model [30], and three networks with community structures generated by the LFR community network model [31]. Table 1 presents key parameter settings for the 10 synthetic networks, and Table 2 summarizes their basic topological features.
The five real networks used in this study include a football network [32], a collaboration network [33] between jazz musicians (referred to as the jazz network), a contact network between high school students (referred to as the high-school network) [34], an email network [35], and a power network [29]. The football network includes United States college Division I football games in 2000, where nodes represent teams, and edges are regular-season games between two connected teams [32]. The jazz network describes collaborations between jazz musicians, where each node represents a jazz musician, and an edge denotes that two musicians have played together in a band. The high-school network shows contacts between high school students in specific classes (called "classes préparatoires" in Lycée Thiers, France). The email network presents email communications at the University Rovira i Virgili in Tarragona, Spain, in 2003. Nodes are users, and each edge represents that at least one email was sent. The power network is a topological representation of the Western States Power Grid in the United States, where an edge denotes a power supply line and a node can be a generator, transformer, or substation. Table 3 summarizes the basic topological features of the five real networks.

Parameter Settings
Our experiments based on the SIR model explored the final outbreak size proportion $F(t_{end})$ with respect to the effective infected probability λ and the proportion of spreaders $p$. We set the recovery probability to $\beta = 1/\langle k \rangle$, as used by He et al. [27].
We also carried out a sensitivity analysis on the size of the spreader set p, varying it between 0.01 and 0.15, i.e., p ∈ [0.01, 0.15], with a step of 0.01, and the effective infected probability λ was fixed at 2.0.
In addition, we explored the final outbreak size proportion $F(t_{end})$ while varying the effective infected probability λ, where λ ∈ [1.5, 2.5] with a step of 0.1, and fixed the proportion of spreaders at $p = 0.08$. Results were averaged over 1,000 independent runs.

RESULTS AND DISCUSSION
We present the experimental results evaluating the proposed methods, and determine whether the identified spreaders remain effective while varying the infection probability λ. We also show the relationships between the dispersion and the effectiveness of influential spreaders identified by our proposed methods and the baseline methods.
TABLE 1 | Key parameter settings in generating synthetic networks. N is the number of nodes; p is a random reconnection probability; ⟨k⟩ is the average degree; m is the number of new edges in every iteration; τ1 is the exponent of the degree sequence; τ2 is the exponent of the community size distribution; μ is a mixing parameter that is the average ratio of the external and total degrees; MD is the maximum degree of the network.
TABLES 2 and 3 | Basic topological features of the synthetic and real networks. M is the number of edges; ⟨k⟩ is the average degree; L is the average shortest path length; D is the network diameter; C is the average clustering coefficient.
Figure 2 displays the final outbreak size $F(t_{end})$ for different numbers of spreaders (denoted by the proportion of selected spreaders p) identified by different methods based on SIR simulations, and shows that our proposed methods generally outperform the baseline methods on synthetic and real networks. On WS networks, IS-LIR-2 has the largest final outbreak size on WS1. IS-LIR-2, IS-LIR, and VoteRank perform similarly to or better than other algorithms on WS2 and WS3. On BA networks, the performance of IS-LIR-2 and LIR-2 is superior to the other methods, especially on BA2, BA3, and BA4. On LFR networks, IS-LIR performs better than the other methods, and IS-LIR-2 performs better on LFR2 but not so well on LFR3. In experiments on real networks, IS-LIR and IS-LIR-2 identified more influential spreaders in most cases on almost all five real networks. However, LIR-2 was not significantly superior on real networks except the email network.
LIR-2, IS-LIR, and IS-LIR-2 had obvious advantages in selecting multiple spreaders in most cases. This implies that taking into account the dispersion of selected nodes can improve performance.
As the infected probability in the SIR model is a key parameter that may affect the final outbreak size of infections, we explored the performance (represented by the final outbreak size proportion $F(t_{end})$) of our proposed methods with different values of the infected probability λ. As shown in Figure 3, whether λ is small or large, IS-LIR and IS-LIR-2 have significant advantages over baseline methods on most of the synthetic and real networks. Specifically, on WS networks, as the infected rate increases, the performance of IS-LIR-2 increases significantly on WS1 and WS2, and IS-LIR performs best on WS2. IS-LIR-2 and LIR-2 are consistently superior to other algorithms on most BA networks. On LFR networks, IS-LIR is always better than the baseline methods except on the LFR2 network, where IS-LIR-2 performs better. On real networks, IS-LIR and IS-LIR-2 maintain their advantages whether λ is small or large.
Figure 4 presents the structural characteristics of influential nodes identified by LIR-2, IS-LIR, and IS-LIR-2 and the baseline methods, and shows that spreaders identified by IS-LIR-2, IS-LIR, and LIR-2 have the largest harmonic mean of the average shortest path length L between any two nodes in most cases, except the LFR3 and football networks. On LFR3, multiple spreaders identified by IS-LIR, IS-ND, and IS-EV have the top three average shortest path lengths (as shown in Figure 4J). On the football network, VoteRank and IS-DC identified spreaders with larger average shortest path lengths than our proposed methods in a few cases (as shown in Figure 4K). These results may explain why our proposed methods outperform the baseline methods in identifying multiple influential spreaders (as shown in Figure 2): if the identified spreaders have a larger mean shortest path length, they may produce a larger outbreak.
This implies that taking into account the dispersion of nodes can help find the most influential spreaders.

CONCLUSION
Effectively identifying a set of influential spreaders is important in infectious disease prevention and information dissemination. To address this problem, inspired by the LIR method [19] and IS method [20], we proposed the LIR-2, IS-LIR, and IS-LIR-2 algorithms, which take into account the dispersion of selected spreaders in different ways. In evaluation experiments on 10 synthetic networks and five real networks, our proposed methods, especially IS-LIR and IS-LIR-2, were more effective than six baseline methods at identifying influential spreaders. One potential reason is that the spreaders found by our methods have a larger average shortest path length, i.e., the selected spreaders are more dispersed, which reduces the chance that they infect the same nodes in the propagation process. IS-LIR, LIR-2, and IS-LIR-2 achieved a good balance between expanding the final spreading range of the spreaders on the SIR model and increasing the topological distance between them. However, we studied only static, undirected, and unweighted networks. How to extend our methods to other types of networks, and how to investigate their sensitivity to specific network characteristics, are two interesting questions to be addressed in future work.

DATA AVAILABILITY STATEMENT
The data used in the study are all available via the cited references; further inquiries can be directed to the corresponding author.