Assessing the Structural Vulnerability of Online Social Networks in Empirical Data

Assessing the structural vulnerability of online social networks has recently become an active research topic, because it is essential for preserving network connectivity and facilitating information flow. However, most existing vulnerability assessment measures and the corresponding solutions fail to accurately reveal the global damage done to the network. To measure the vulnerability of networks accurately, the present study proposes an invulnerability index based on the concept of improved tenacity. Compared with existing measurements, the new method does not measure a single property, such as the giant component size or the number of components after destruction, but pays special attention to the potential equilibrium between the removal cost and the removal effect. Extensive experiments on real-world social networks demonstrate the accuracy and effectiveness of the proposed method. Moreover, by comparing the results of attacks based on different centrality indices, we found that an individual node's prominence in a network is inherently related to the structural properties of the network. In high centralized networks, the nodes with higher eigenvector centrality are more important than the others in maintaining stability and connectivity, whereas in low centralized networks, the nodes with higher betweenness are more powerful than the others. In addition, the experimental results indicate that low centralized networks can tolerate stronger intentional attacks and adapt better to attacks than high centralized networks.


INTRODUCTION
With the revolution in WWW technology, Web 2.0, characterized by social collaborative technologies, is emerging and growing fast. People are increasingly inclined to cultivate their virtual social relations and virtual life on prevalent online social networks [1], such as Facebook, Blogger, Wiki, and Digg. These online social networks provide favorable platforms for people to exchange opinions and information with one another [2]. Specifically, online social networks create ties with a very wide range of people: they not only sustain relationships with acquaintances and close relationships with friends, schoolmates, and family members, but also give rise to new relationships in an online virtual world.
To defend against potential disruptions and facilitate information flow, assessing the vulnerability of online social networks has become one of the most engaging topics [3,4]. The concept of vulnerability is generally used to find and characterize a lack of robustness and resilience in a complex system [5]. The vulnerability of a network structure was first analyzed by Albert et al. [6] and was regarded as a previously overlooked "Achilles' heel." Initially, vulnerability assessment focused on simple and generic models such as the Erdős-Rényi (ER) random model and the Barabási-Albert (BA) scale-free model [6,7]. Over the years, some scholars have found that the inherent preferential attachment mechanism and the structural properties of a network may be responsible for its vulnerability [8][9][10]. In particular, a series of numerical simulations were introduced to study tolerance to random removals and intentional attacks in complex networks [11][12][13]. Most experimental studies have shown that the BA network and other similar heterogeneous networks are very robust to random removals but very fragile against intentional attacks based on degree or betweenness [6,14]. For homogeneous networks such as regular networks and random networks, the effect of random removals is equivalent to that of intentional attacks [6], while for small-world networks, attacks on long-range links can cause their collapse directly [13]. Some progress has been made on typical network models, but how dynamical processes, such as resilience to damage or tolerance to attacks, are influenced by the specific topological structure of a network remains unknown.
In recent years, much effort has been directed at developing methods for vulnerability assessment [15][16][17]. The main results concern two aspects: critical node identification and removal effect evaluation. The former reflects a node's prominence in maintaining network connectivity or facilitating information flow, while the latter concerns how to quantify the effect caused by the removal of a finite number of nodes. Indeed, the identification of critical individuals is an influence maximization problem [18], which aims to select a minimal node set that generates a maximal outcome in a given network. The quantification methods can be roughly classified into three categories: centrality-based algorithms, random-walk algorithms, and greedy-based algorithms. Structural connectivity has become the primary test criterion for vulnerability assessment [19]. In most instances, evaluation metrics such as the characteristic path length [6] and the network efficiency [20,21] are relatively straightforward and clearly characterize the changes in the connectivity of the target network before and after some nodes are removed. However, they only provide a useful topological snapshot for connected networks and are not suitable for assessing network vulnerability in terms of disconnectivity [15]. In addition, the existing measurements struggle to balance the removal cost against the removal effect.
The primary purpose of this article is to fill this gap by exploring a new method to effectively quantify the vulnerability of the network structure. The new method focuses on how to identify the importance, or status, of a node in the network, and on how to use available resources to disrupt network operation efficiently; it comprehensively takes into account both the cost of disrupting a network and the attack effect. The contributions of this study can be summarized as follows:
1) An invulnerability index based on the concept of improved tenacity is proposed to measure the adaptability to attacks.
2) Low centralized networks can have a better adaptability to attacks than high centralized networks.
3) The experimental results verify the superior performance of the proposed method.
The rest of the article is organized as follows. In Methods, in order to assess the vulnerability of networks more properly, we present an invulnerability index based on the concept of improved tenacity to examine network adaptability to attacks. Generally, a network with a higher invulnerability index performs better under intentional attacks. In Network Data, we examine the static properties of real online social networks empirically, in order to summarize the general differences in the topological structure of various online social networks. In Discussion, we show the threshold behavior of the aforementioned networks observed in experiments, and further compare the efficiency of node removal under different centrality indices to assess the vulnerability of online social networks.

The Attack Strategies
Inspired by the well-known percolation theory in statistical physics, the robustness and resilience of a network is usually defined as the network structural degradation caused by the removal of some critical nodes [19]. Tolerance to random removals and intentional attacks is understood as the ability of the network to maintain operations and connectivity under the loss of some nodes or links [8]. In order to ensure the efficiency of attacks, it is necessary to identify the most vulnerable nodes in a network. Indeed, the identification of critical nodes is an influence maximization problem [18]. The quantification algorithms can be roughly classified into three categories: centrality-based algorithms, random-walk algorithms, and greedy-based algorithms.
Centrality-based algorithms perform a fundamental quantification by considering geodesics between nodes to evaluate nodal importance. Up to now, many centrality-based algorithms have been proposed, such as degree centrality [22,23], betweenness centrality [22,24], closeness centrality [22,25], eigenvector centrality [26], and other improved centrality-based algorithms [27][28][29]. The random-walk algorithms include the well-known PageRank [30] and other improved algorithms [31]. Random-walk algorithms work well only for directed networks. Greedy-based algorithms formulate the influence measurement as a discrete optimization problem, and their elementary strategy is to select, one by one, the spreaders that contribute the largest incremental influence according to a specified influence cascade model [32]. In terms of algorithm construction, although greedy-based algorithms can achieve excellent results, they also have very high computational complexity and are not suitable for large-scale social networks. Previous studies have shown that the adaptability of networks differs across attack strategies [20,33]. Thus, in this article, we study tolerance to various attacks in real online social networks and further seek a minimal set of nodes whose removal triggers the collapse of the network. We consider four straightforward and efficient centrality indices as attack strategies to identify the importance of nodes.
Frontiers in Physics | www.frontiersin.org October 2021 | Volume 9 | Article 733224
1) Degree centrality: The algorithm measures a node's influence according to the number of edges attached to it, which reflects the ability of a node to connect directly with other nodes.
2) Betweenness centrality: The algorithm measures a node's influence through the fraction of shortest paths between pairs of nodes that pass through the node, which takes the global structure of a given graph into account.
3) Closeness centrality: The basic idea behind closeness centrality is that a node is central if it is "close" to many other nodes [34]. Thus, the closeness centrality score of node i is defined as the reciprocal of the sum of geodesic distances to all other nodes.
4) Eigenvector centrality: The algorithm is based on the principle that a node should be viewed as important if it is linked to other nodes which are important themselves. Thus, the eigenvector centrality of node i is defined as proportional to the sum of the eigenvector centralities of the nodes it is connected to [28].
Because the removal of nodes under intentional attacks changes the balance of structure and leads to a global redistribution over all the networks, we recalculate the degree centrality, the betweenness centrality, the closeness centrality, and the eigenvector centrality, every time a small fraction of nodes is removed.
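The recalculation scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the graph representation (a dict of neighbor sets), the function names, the toy graph, and the choice to remove exactly one node per step are all assumptions made for illustration. Only degree and betweenness are shown; betweenness is computed with Brandes' algorithm for unweighted graphs.

```python
from collections import deque


def degree_centrality(adj):
    """Unnormalized degree: number of edges attached to each node."""
    return {u: len(nbrs) for u, nbrs in adj.items()}


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:                    # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                    # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice


def giant_component_size(adj):
    """Size of the largest connected component."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def attack(adj, centrality, steps):
    """Remove the top-ranked node `steps` times, recomputing scores each time."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    history = [giant_component_size(adj)]
    for _ in range(steps):
        if not adj:
            break
        scores = centrality(adj)                     # recalculated every step
        target = max(sorted(adj), key=scores.get)    # ties: smallest label
        for v in adj.pop(target):
            adj[v].discard(target)
        history.append(giant_component_size(adj))
    return history


# Hypothetical toy graph: two triangles joined through bridge node 3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4},
         4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(attack(graph, degree_centrality, 2))       # [7, 4, 2]
print(attack(graph, betweenness_centrality, 2))  # [7, 3, 3]
```

On this toy graph the betweenness-based attack removes the bridge node first, whereas the degree-based attack starts from one of the two hubs, so the two strategies fragment the graph in a different order.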

Evaluation Metrics
Numerous empirical results on real networks have revealed that a heterogeneous topology fits most real networks [35][36][37], where the degree distribution significantly deviates from a Poisson distribution and low-degree nodes are far more abundant than high-degree nodes. Due to the inhomogeneity of general networks, removing some critical nodes will decrease the network connectivity and lead to the loss of the global information-carrying ability of the network [6,20]. Generally, when assessing the vulnerability of a network under intentional attacks, three performance criteria should be considered in the framework of graph theory [38]:
1) The number of elements (nodes) that are removed.
2) The number of disconnected subgroups after intentional attacks.
3) The size of the largest remaining group within which mutual communication can still occur.
Most online social networks can be abstracted as a non-complete connected graph $G(V, E)$, where individual members and personal relationships are represented by a set of nodes $V$ and a set of edges $E$, respectively. In general, a good social network should have short distances between nodes (a small average distance) and high connectivity. In fact, much effort has been directed at developing methods for evaluating network adaptability to attacks. In most instances, the characteristic path length and the network efficiency as evaluation metrics can only provide a useful topological snapshot for connected networks [39]. For disconnected networks, however, the geodesic distance between any two nodes belonging to two disconnected subgroups is identically zero or infinity, which directly affects the accuracy of evaluation results. In graph theory, some helpful indicators for evaluating network vulnerability have been proposed. These metrics relate to network topology and attributes, such as toughness [40], integrity [41], tenacity [42], and scattering number [43]. The detailed description of each indicator is shown in Table 1.
As one of the basic concepts of graph theory, connectivity plays a vital role in network performance and is fundamental to vulnerability measures. The connectivity $K(G)$ of $G$ is defined as

$$K(G) = \min\{|S| : S \subset V(G) \text{ is a cutset of } G\}, \tag{1}$$

where $S$ is a cutset of $V(G)$ and $|S|$ is its size. In Eq. 1, the connectivity $K(G)$ asks for the minimum number of nodes whose removal renders the graph $G$ disconnected. As one of the graph theoretical concepts, connectivity deals with the criterion (1). Toughness and integrity are two other graph concepts used in the vulnerability assessment. The notion of the toughness $T(G)$ of $G$, originally introduced by Chvátal [40], is defined as follows:

$$T(G) = \min\left\{\frac{|S|}{w(G - S)} : S \subset V(G),\ w(G - S) > 1\right\}, \tag{2}$$

where $w(G - S)$ stands for the number of components of $G - S$. Unlike the connectivity $K(G)$, the toughness $T(G)$ incorporates the relationship between the size of the cutset and the number of components after destruction and takes into account the criteria (1) and (2). But the toughness $T(G)$ is still insufficient to measure the network vulnerability. Consider the graphs $G_1$ and $G_2$ (see Figure 1): the two graphs have the same connectivity and toughness, where $K(G_1) = K(G_2) = 1$ and $T(G_1) = T(G_2) = 1/3$, but they really differ in vulnerability. For instance, after the minimum cutset $\{u_1\}$ is removed, $G_1$ is divided into three small components, while the vast majority of nodes of $G_2$ are retained in the largest connected component $\{u_2, u_6, u_5\}$, within which mutual communication among the remaining nodes can still occur. This implies that the connectivity $K(G)$ and toughness $T(G)$ cannot accurately reveal the global damage done to the network.
The notion of integrity, introduced as another vulnerability parameter of graphs [41], focuses on the criteria (1) and (3). For a non-complete connected graph $G$, its integrity $I(G)$ is defined as follows:

$$I(G) = \min\{|S| + m(G - S) : S \subset V(G)\}, \tag{3}$$

where $m(G - S)$ denotes the giant component size after destruction. Obviously, the disruption is more successful if the disconnected network contains more components and is much more successful if, in addition, the components are small. Unfortunately, connectivity and toughness give the minimum cost to disrupt a network but fail to indicate accurately what remains after the disruption. Although Barefoot's integrity takes the size of the largest remaining component after destruction into account, it cannot indicate the extent of the damage.

TABLE 1 (surviving row) | Scattering number $S(G)$: focuses on the relationship between the removal cost and the number of components after destruction; Hendry [43].

FIGURE 1 | Non-complete connected graphs $G_1$ (A) and $G_2$ (B).

The notion of tenacity was originally proposed in Ref. [42], where the authors introduced the mix-tenacity to measure the vulnerability of Harary graphs. The tenacity is defined as follows:

$$R(G) = \min\left\{\frac{|S| + m(G - S)}{w(G - S)} : S \subset V(G),\ w(G - S) > 1\right\}. \tag{4}$$

The tenacity $R(G)$ directly integrates all three criteria, namely the cost of network breakage, the number of components, and the giant component size, and is considered to be a reasonable measure for the vulnerability of graphs. As shown in Figures 2A,B, comparing $R(G_1)$ with $R(G_2)$ indicates that the adaptability to attacks of $G_1$ is better than that of $G_2$.
In general, the disruption is more successful if the network is left with more disconnected subgroups and a smaller largest connected component after destruction. As shown in Figures 3A,B, $G_3$ and $G_4$ have the same number of nodes and edges. After $\{u_1, u_2\}$ and $\{u_2, u_1, u_4\}$ are removed, respectively, the minimum toughness and the minimum tenacity are obtained. The cutsets $\{u_1, u_2\}$ and $\{u_2, u_1, u_4\}$ are the minimum removal costs of the graph $G_3$ and the graph $G_4$, respectively. But the attack efficacy differs between $G_3$ and $G_4$: for the graph $G_3$, the minimum removal cost is 2 and the giant component size is also 2, while for the graph $G_4$, the minimum removal cost is 3 and the giant component size is 1. As this example shows, the tenacity $R(G)$ is still an imperfect criterion to assess network vulnerability.
In Ref. [43], Hendry used the concept of scattering number to measure the vulnerability of extremal non-Hamiltonian graphs and found that it was more efficient for measuring the degree of global destruction. The scattering number $S(G)$ of $G$ is defined as follows:

$$S(G) = \max\{w(G - S) - |S| : S \subset V(G),\ w(G - S) > 1\}. \tag{5}$$

So we think that it is necessary to add this criterion to reveal the global damage done to networks under attacks, and to keep its priority in assessing the vulnerability of networks. Therefore, we propose an improved tenacity based on the concepts of scattering number and tenacity, which is named $R_{sca}(G)$ and given by definition (6). As shown in Figures 3A,B, comparing $R_{sca}(G_3)$ with $R_{sca}(G_4)$ indicates that the anti-interference capability of $G_3$ is better than that of $G_4$.
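To make the classical graph-theoretic measures concrete, the following minimal sketch evaluates connectivity, toughness, integrity, tenacity, and scattering number by brute force over all proper node subsets, using their standard textbook definitions. It is an illustration only: the function names, the dict-of-sets graph representation, and the star-graph example are assumptions, the enumeration is exponential in the number of nodes (so it only suits tiny graphs), and the improved tenacity $R_{sca}(G)$ is deliberately not implemented here.

```python
from fractions import Fraction
from itertools import combinations


def components(adj, removed):
    """Connected components of the graph after deleting the nodes in `removed`."""
    seen, comps = set(removed), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def vulnerability_measures(adj):
    """Brute-force K, T, I, R and S over all proper node subsets (tiny graphs only)."""
    nodes = list(adj)
    K = T = I = R = S_num = None
    for size in range(len(nodes)):
        for S in combinations(nodes, size):
            comps = components(adj, S)
            w = len(comps)                   # number of components w(G - S)
            m = max(len(c) for c in comps)   # largest component size m(G - S)
            I = size + m if I is None else min(I, size + m)
            if w > 1:                        # S disconnects the graph
                K = size if K is None else min(K, size)
                t = Fraction(size, w)
                T = t if T is None else min(T, t)
                r = Fraction(size + m, w)
                R = r if R is None else min(R, r)
                s = w - size
                S_num = s if S_num is None else max(S_num, s)
    return {"connectivity": K, "toughness": T, "integrity": I,
            "tenacity": R, "scattering": S_num}


# Hypothetical example: a star with hub 0 and leaves 1, 2, 3.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
measures = vulnerability_measures(star)
print(measures)
```

For the star, removing the hub alone yields three isolated leaves, which gives connectivity 1, toughness 1/3, integrity 2, tenacity 2/3, and scattering number 2.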
Compared with existing evaluation metrics [6,20,21,[39][40][41][42], our notion of improved tenacity does not measure a single property such as the giant component size or the number of components after destruction. Considering only whether the network is disconnected, or how fragmented it becomes, is insufficient to evaluate network vulnerability; instead, the improved tenacity pays special attention to the potential equilibrium between the giant component size and the number of components under intentional attacks. As shown in definition (6), the number of components $w(G - S)$ and the size of the largest connected component $m(G - S)$ are recalculated after each iteration, so the overall computational complexity of the proposed method is $O(n^2)$. The result demonstrates that the proposed algorithm has a relatively low computational burden.

NETWORK DATA

Data Description
Because online social networks serve as many social functions as other kinds of social interaction, including e-mail exchange, text messaging, instant messaging, digital video sharing, and so on, each edge representing a connection in an online social network is complex and changeable, whereas each node representing an individual is constant.
To measure the vulnerability of networks more properly, our method pays more attention to the importance of nodes than to that of edges. Our data set is composed of six undirected and unweighted networks, that is, networks of a binary nature, where an edge between two nodes is either present or absent and has no direction. Table 2 lists the six real social networks: OClinks is a representative online community network, whose users are from the University of California, Irvine, and which is taken from Panzarasa et al. [44]; Twitter, Wiki-Vote, and Facebook can be downloaded from the Stanford network dataset (http://snap.stanford.edu/data/index.html); and Lilac and RenRen were collected from the online social networks by us.
In Table 2, we show the main topological properties of a series of online social networks belonging to two different types: the online community service based on group-centered service and the social network service based on individual-centered service. The first three networks, Lilac, OClinks, and Wiki-Vote, are examples of online community services, where the users have a common interest and purpose and can exchange information or seek help. In these networks, the nodes are the registered users, and a link indicates that two users have exchanged messages. In the latter part of Table 2, we show three examples of social network services. Twitter is a social news website, where users may mention or follow other people in their posts, so a link indicates that two users have communicated through a mention or comment. RenRen is a real-name social networking platform in China, where users can connect and communicate with each other or enjoy a wide range of other features and services, so the links reflect different kinds of social relationships between the users. Facebook is an ego network consisting of friends lists from Facebook.

Structure of Networks
The most basic topological characterization of networks can be obtained in terms of the degree distribution $P(k)$, defined as the probability that a node chosen uniformly at random has degree $k$ or, equivalently, as the fraction of nodes in the graph having degree $k$ [19]. In recent years, scientists have studied real networks from the available databases and found that most of them have an inhomogeneous structure, where the connections among the nodes of highest degree are rather sparse, and a large number of nodes have just a few connections. Moreover, most real networks exhibit a power law-shaped degree distribution $P(k) \sim k^{-c}$, with exponents varying in the range $2 < c < 3$ [35,36], and a few of them follow a shifted power law distribution, a stretched exponential distribution, or more complicated distributions [37].
The empirical results demonstrate that the degree distributions $P(k)$ of the aforementioned networks fall into two types: the segmented power law distribution (Lilac, OClinks, and Wiki-Vote) and the stretched exponential distribution (Twitter, RenRen, and Facebook). The insets in Figures 4A-C show that the degree distributions $P(k)$ of Lilac, OClinks, and Wiki-Vote are heavy-tailed and approximately follow the power law distribution, with $c = 1.72$, $c = 0.99$, and $c = 1.31$, respectively. However, when the data set is small, the data may have rather strong intrinsic noise due to the finiteness of the sampling. To avoid these statistical fluctuations, a better option is to measure the cumulative degree distribution $P(x \geq k)$. The cumulative degree distributions $P(x \geq k)$ of Lilac, OClinks, and Wiki-Vote show two different scaling regions, a slowly decaying region and a rapidly decaying region, and are well approximated by the segmented power law distribution. The crossover takes place between $k = 10$ and $k = 100$; the fitted curve for the Lilac network is shown in Figure 4A. A similar trend is shown in Figure 4B for the OClinks network and in Figure 4C for the Wiki-Vote network. In contrast, the insets in Figures 5A-C show that Twitter, RenRen, and Facebook follow the stretched exponential distribution, $P(x \geq k) = \exp[-(k/k_0)^{c}]$, where the graph of $\log P(x \geq k)$ versus $k/k_0$ is characteristically stretched, and the stretching exponent $c$ takes a value between 0 and 1. The stretching exponent $c$ can be obtained from $P(x \geq k)$ through the slope of $\log P(x \geq k)$ in a log-log plot. As shown in Figure 5, the distribution functions of Twitter, RenRen, and Facebook can each be fitted by a stretched exponential.
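As a small illustration of the two empirical quantities discussed here, the sketch below computes the degree distribution $P(k)$ and the cumulative degree distribution $P(x \geq k)$ from a list of node degrees. The function names and the degree sequence are hypothetical; exact fractions are used only to keep the toy output easy to read.

```python
from collections import Counter
from fractions import Fraction


def degree_distribution(degrees):
    """P(k): fraction of nodes whose degree equals k."""
    n = len(degrees)
    counts = Counter(degrees)
    return {k: Fraction(c, n) for k, c in sorted(counts.items())}


def cumulative_degree_distribution(degrees):
    """P(x >= k): fraction of nodes whose degree is at least k."""
    pk = degree_distribution(degrees)
    ccdf, tail = {}, Fraction(0)
    for k in sorted(pk, reverse=True):   # accumulate from the largest degree down
        tail += pk[k]
        ccdf[k] = tail
    return dict(sorted(ccdf.items()))


degrees = [1, 1, 2, 3, 3, 3]   # hypothetical degree sequence
pk = degree_distribution(degrees)
ccdf = cumulative_degree_distribution(degrees)
print(pk)    # P(1) = 1/3, P(2) = 1/6, P(3) = 1/2
print(ccdf)  # P(x>=1) = 1, P(x>=2) = 2/3, P(x>=3) = 1/2
```

Because each value of the cumulative distribution sums over the whole tail, it fluctuates less than $P(k)$ at large $k$, where only a handful of nodes contribute to each individual degree value.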
The stretched exponential distribution is obtained by inserting a fractional power law into the exponential function: for $c = 1$, the usual exponential function is recovered; as $c \to 0$, the distribution approaches the power law distribution. In general, the power law distribution is characterized by a probability tail that decays more slowly than exponentially. In contrast with Twitter, RenRen, and Facebook, where $c = 0.79$, $c = 0.89$, and $c = 0.94$ approach $c = 1$, extreme (very high-degree) nodes in Lilac, OClinks, or Wiki-Vote occur with a markedly higher, non-negligible probability.
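One common way to estimate a stretching exponent can be sketched as follows, assuming the cumulative distribution has the form $P(x \geq k) = \exp[-(k/k_0)^{c}]$. Under that assumption, $\ln(-\ln P(x \geq k)) = c \ln k - c \ln k_0$, so an ordinary least-squares line through the points $(\ln k, \ln(-\ln P))$ has slope $c$. This is not necessarily the fitting procedure used by the authors; the function names and the synthetic parameters ($k_0 = 20$, $c = 0.8$) are hypothetical.

```python
import math


def stretched_exponential_ccdf(k, k0, c):
    """Assumed form of the cumulative distribution: exp(-(k / k0) ** c)."""
    return math.exp(-((k / k0) ** c))


def fit_stretching_exponent(ks, ccdf_values):
    """Least-squares line through (ln k, ln(-ln P)); the slope estimates c.

    Requires k > 0 and 0 < P < 1 for every data point.
    """
    xs = [math.log(k) for k in ks]
    ys = [math.log(-math.log(p)) for p in ccdf_values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    k0 = math.exp(-intercept / slope)   # since intercept = -c * ln(k0)
    return slope, k0


# Synthetic, noise-free data generated from the assumed form.
ks = list(range(1, 51))
values = [stretched_exponential_ccdf(k, k0=20.0, c=0.8) for k in ks]
c_est, k0_est = fit_stretching_exponent(ks, values)
print(round(c_est, 6), round(k0_est, 6))
```

On noise-free synthetic data the fit recovers the generating parameters up to floating-point error; on empirical data the tail points, which rest on few nodes, would typically need more careful treatment.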
In this section, we have examined the static properties of a variety of online social networks empirically and found that the two types of networks, the online community service and the social network service, have completely different structural properties. As shown in Figures 4A-C, the Lilac, OClinks, and Wiki-Vote networks exhibit a highly inhomogeneous degree distribution, with a few nodes linked to many other nodes coexisting with a large number of poorly connected nodes. In Figures 5A-C, by contrast, the degree distribution curves show that the nodes in the Twitter, RenRen, and Facebook networks are more evenly distributed than those in the Lilac, OClinks, and Wiki-Vote networks, with extreme values remaining at a relatively low level. As can be noticed, Twitter, RenRen, and Facebook as social network services are mainly based on an individual-centered online platform for organizing

DISCUSSION
When real networks, especially online social networks, are fragmented into relatively tiny isolated components, they lose the capacity to transmit between individual components, indicating that the collapse of the network is approaching. So, we only need to find an optimal threshold that can trigger the collapse of the network. In general, the problem can be treated analytically by using percolation theory, where one defines a critical probability $f_c$ below which the network percolates, and a set of critical exponents characterizes the phase transition. In Ref. [6], Albert et al. studied how the properties of some networks with given order and size change when a fraction $f$ of the nodes is removed, where the average characteristic path length, used as an order parameter, displays a threshold-like behavior for both errors and attacks.
In this section, we first study the changes in the improved tenacity $R_{sca}(G)$ when a small fraction $f$ of the nodes is removed gradually, and use the new criterion characterizing the phase transition to obtain the critical probability $f_c$. Then, we compare four attack strategies, that is, degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality (the algorithms defined in The Attack Strategies), to estimate which solution is the most effective and to determine the most important nodes for online social networks. As shown by the curves in Figure 6, the improved tenacity $R_{sca}(G)$ in the aforementioned networks displays a threshold-like behavior. First, $R_{sca}(G)$ drops as the fraction of removed nodes $f$ increases, indicating that the ability of the network to maintain its connectivity properties is sensitive to intentional attacks, and that the size of fragments is increasing significantly. In particular, the insets of Figure 6 show that the giant component size $m(G - S)$ decreases rapidly as $f$ increases. Second, we find that the curve of the improved tenacity $R_{sca}(G)$ decays abruptly at the critical value $f_c$, following the same trend as the giant component size $m(G - S)$ (see the insets of Figure 6, where the slope of the curve turns sharply downward), implying that $f_c$ is precisely the threshold triggering the collapse of the network. In addition, it is worth noticing that the critical value $f_c$ can be obtained quickly by using definition (6). In the cases of OClinks and Wiki-Vote, the attacks based on the eigenvector centrality actually perform better than the attacks based on the other indices, as reflected in the critical fractions $f_c^{e} < f_c^{d} < f_c^{b} < f_c^{c}$.

FIGURE 6 | (A-F) Performance of the improved tenacity $R_{sca}(G)$ in six online social networks (from Lilac to Facebook) when a fraction $f$ of the nodes is removed. ○, preferential removal of the most connected nodes; △, preferential removal of the nodes with maximum betweenness; *, preferential removal of the nodes with maximum eigenvector centrality; and □, preferential removal of the nodes with minimum closeness. Degree, betweenness, eigenvector, and closeness are recomputed each time nodes are removed. Insets show the performance of the giant component size.
Although in Lilac $f_c^{e} = f_c^{d}$, the giant component size $m(G - S)$ under the attacks based on the eigenvector centrality is significantly smaller than that under the attacks based on the degree centrality. This indicates that the nodes with high eigenvector centrality play a more important role than the others in maintaining the connectivity of the network. In the diffusion of information, especially in online social networks, a user with high eigenvector centrality has connections to many other users that are themselves highly connected and central within the network, which multiplies his or her capability to maintain communication across the network. But in the cases of Twitter, RenRen, and Facebook, the attacks based on the betweenness centrality cause a greater amount of damage than the others, where $f_c^{b} < f_c^{c} < f_c^{e} \leq f_c^{d}$. The main reason for these differences between networks may be closely related to their organizing features and technical features, which are characterized by their topological structures.
Another important conclusion that can be drawn from the results is that the performance of the improved tenacity $R_{sca}(G)$ in Twitter, RenRen, or Facebook is better than that in Lilac, OClinks, or Wiki-Vote, which implies that the former have a higher anti-interference capability than the latter. In general, a low centralized network can improve its resilience by reorganizing itself to increase local control and the execution of a service. Analogously, lacking obvious centralization and a strictly inhomogeneous topology, social network services like Twitter, RenRen, or Facebook can tolerate stronger intentional attacks on critical individuals than high centralized networks. In addition, as shown in Table 3, although the critical probability $f_c$ of Twitter or RenRen is much larger than the threshold $f_c$ of Lilac, OClinks, or Wiki-Vote, the corresponding proportion of the giant component size of the former is always lower than that of the latter. Therefore, it is difficult to judge which network adapts better to intentional attacks from the attack cost or the giant component size alone. Compared with a single criterion, such as the removal cost, the giant component size, or the number of components, our proposed method considers the attack effect and the attack cost together.

CONCLUSION
In this article, we jointly take into account the cost with which one can disrupt a network and the effect of the disruption, and propose a new evaluation method based on the concepts of scattering number and tenacity. Compared with existing evaluation metrics, our method focuses on the potential equilibrium between the attack effect and the attack cost. For this purpose, we first examined empirically the static properties of six online social networks, including three online community services and three social network services, that is, Lilac, OClinks, Wiki-Vote, Twitter, RenRen, and Facebook, and found that there are wide differences in the topological structure of these networks.
Second, we studied the changes in the improved tenacity $R_{sca}(G)$ when a small fraction $f$ of the nodes is removed gradually and found that the curve of the improved tenacity displays a threshold-like behavior when the minimum of tenacity approaches zero. Then, we compared four intentional attack strategies based on different indices, that is, degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality, and found that an individual node's prominence in a network is inherently related to structural properties: the nodes with higher eigenvector centrality are more important than the others in maintaining the stability and connectivity of high centralized networks, such as Lilac, OClinks, and Wiki-Vote, whereas the nodes with higher betweenness are more powerful than the others in low centralized networks, such as Twitter, RenRen, and Facebook. Moreover, the empirical study revealed that low centralized networks can tolerate stronger intentional attacks and have higher anti-interference capabilities than high centralized networks.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.

AUTHOR CONTRIBUTIONS
Conceptualization, DZ and CG; methodology, DZ and ZZ; validation, DZ and GL; formal analysis, DZ, ZZ, and CG; resources, DZ and GL; data curation, DZ; original draft preparation, DZ; revise and editing, DZ, ZZ, and CG; project administration, CG; funding acquisition, DZ and ZZ. All authors have read and agreed to the published version of the manuscript.