Reconstruction of Unfolding Sub-Events From Social Media Posts

Event detection plays a crucial role in social media analysis and usually includes sub-event detection and correlation. In this article, we present a method for reconstructing the relations among unfolding sub-events with the help of external expert knowledge. First, a Single Pass Clustering method is used to summarize massive social media posts. Second, a Label Propagation Algorithm is introduced to detect sub-events according to expert labeling. Third, the Word Mover's Distance is used to measure the correlation between relevant sub-events. Finally, a Markov Chain Monte Carlo simulation is used to regenerate the popularity of social media posts. The experimental results show that the popularity dynamics of the empirical social media sub-events are consistent with the data generated by the proposed method. In the case of the "Shanghai Tesla self-ignition incident," the evaluation of the unfolding model is 50.52% to 88% higher than that of the random null model. This work helps in understanding the popularity mechanism of unfolding events on online social media.


INTRODUCTION
Unfolding the sub-events of a social media event can reveal the storyline of public opinion as the event develops [1]. Whenever a large-scale incident occurs, it is accompanied by a large volume of discussion and diverse opinions around its theme. Because the topic of public opinion evolves as the event develops, a sub-event is a component of a complex event. When individuals, celebrities, enterprises, or governments encounter a public relations (PR) crisis, it is difficult to grasp the direction of public opinion from the uncontrolled interpretations of thousands of people. It is therefore vital for PR managers to clarify the trend of public opinion from the sub-events of the incident.
PR crisis events share characteristics with emergency or epidemic events, such as natural disasters [2,3], epidemic spreading [4,5], and sports competitions [6,7]. Information related to disaster events can be uploaded and reported by users, which contributes to disaster reporting [8]. On social media, events and their related sub-events can be discussed and explored through public online posts.
Sub-event identification faces two challenges of distinguishability. The first is whether similar expressions can be effectively distinguished, because online posts contain a massive number of re-posts and near-identical user expressions. The second is whether related expressions can be effectively distinguished. Discussions and expressions form different topics, which reflect the sub-events from the perspective of user-generated content, and each post must be assigned to the sub-event it belongs to. A clear division of sub-events provides effective support for correlation and evolution analyses.
Inspired by this idea, we present a model to detect and correlate sub-events, which aims to unfold a complex event into correlated sub-events and predict the popularity dynamics of social media events. The modeling process addresses two issues: the ambiguity of sub-event classification (the first two steps of Figure 1) and the correlation between sub-events (the last two steps of Figure 1). As shown in Figure 1A, after the social media posts are collected, a fast clustering method groups similar posts. This step reduces the redundancy among replicated posts, and each cluster stands for a summarized post. To unfold the sub-events in line with the knowledge of PR managers, expert labels are provided and used to predict the labels of the unlabeled summarized posts (Figure 1B). Each label represents a topic of concern to PR managers and is defined as a sub-event. The topic correlation is measured by the number of paired posts between sub-events (Figure 1C). Finally, using a Markov Chain Monte Carlo simulation, the development trend of each sub-event is depicted and compared with the real-world topic evolution (Figure 1D). This procedure regenerates the sub-event popularity curves, which are verified against a null model with random labels.

Unfolding Events From Public Information
In order to correctly filter the results observed from public information, a classic model considers the impact of sharing such information on the analytical foundations of reliable sensing [9]. The observations can be obtained from the text, image, video, and voice messages provided by social media users [10]. Based on these observations, several unfolding methods have been developed. CrisisTracker's clustering system [11] includes event detection, content ranking, and summarization while retaining drill-down functionality to the raw reports. Security information and event management systems can also connect events by pattern matching [12]. An ontology method systematizes the available solutions under a modular and platform-independent conceptual framework [13]. An iterative expectation-maximization algorithm has been proposed to find the truth of events in social sensing with information flows. Among these studies, the verification of events or sub-events is based on supervised learning with specific labels, whereas a PR crisis usually has no labels for identification.
Although some research has examined the use of social media for mitigating crises and emergencies [14][15][16], specialized detection methods [17] for resolving the ambiguity of classification are still lacking. The main challenge is to find the popularity mechanism of social media events. In this article, we use public observations to sort out the sub-events by incorporating expert knowledge, and we correlate these sub-events into a topic tree and popularity trends that form the event storyline.

Sub-Event Detection
An event usually contains cause and result stages, and a sub-event refers to one of the stages of an event [18]. Sub-event detection can be achieved by many classic unsupervised methods: 1) Burst-topic detection identifies important moments, arguing that a sharp increase in the number of status updates corresponds to the occurrence of an important moment in the event [19]. 2) Event summarization usually relies on machine learning techniques such as hidden Markov models [20], hierarchical Dirichlet processes [21], and graph optimization formulations [7]. 3) Clustering approaches include word co-occurrence [22], hierarchical clustering [23], K-nearest neighbor clustering [24], artificial neural networks [10], and support vector machines [25]. 4) Spatial and temporal distribution methods are also widely used [3,26,27].
One major theoretical issue that has dominated the unsupervised detection field for many years concerns the ambiguity of classification for a sub-event. Semi-supervised approaches have also been explored for this task, especially for crisis events [28,29]. However, due to a lack of expert knowledge, the resulting classification may deviate from the common sense of PR management. In this article, we propose a simple procedure to summarize the sub-events by combining the clustering-based single pass algorithm with the graph-based label propagation algorithm, which introduces the expert knowledge. Single Pass Clustering (SPC) simply merges similar posts. The Label Propagation Algorithm (LPA) resolves the ambiguity and gives a clear classification based on expert knowledge.

Sub-Event Correlation
The correlation approach captures a causality or correlation pattern among sub-events. Two kinds of methods can reveal how an unfolding event evolves. The first kind is graph-based methods, which concern the correlation pattern of sub-events. A maximum-weighted bipartite graph matching is created to correlate events [30]. The recurrent sequence model [31,32] uses an LSTM recurrent neural network for script learning to predict the probability of the next event. An event-oriented similarity graph is designed to represent the relationships among sub-events [18]. A subgraph similarity is used to measure event relationships and generate an evolution correlation [33]. The second kind is causal inference methods, which concern the causality patterns of sub-events. The generalization of redefining mining aims to find the correlation between disjoint sets of related objects [1]. An event-level attention mechanism is utilized to represent the relations between subsequent events [34]. A logical correlation is proposed for common-sense inference about a given event [35]. An event ontology knowledge model is built to construct evolution patterns [36]. These methods take a network or sequential perspective. However, if sub-event correlation refers to topic-level correlation, a multiple-pair problem arises: one sub-event contains several posts about a topic, and so do the other sub-events, so the correlation of sub-events happens between the topic posts. PR managers are sensitive to the posts that change as the topic evolves [37], but few studies have addressed topic-level correlation. Although LDA-based models can extract topics [2,38], the correlation between the posts inside topics is still an open question. In this article, the Word Mover's Distance (WMD) method is applied to calculate the correlation of the posts in different topics (sub-events). Then, the Markov Chain Monte Carlo (MCMC) simulation method is introduced to predict the evolutionary trends of the topics.

Single Pass Clustering
The SPC method is a classical method for clustering streaming data. For data streams arriving in sequence, the method processes the items one at a time in the order of input. It is an incremental algorithm with high time efficiency. Its shortcoming is that it depends on the input order: if the data arrive in a different order, a different clustering result may appear.
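Given the Weibo post document set $d = \{d_1, d_2, \ldots, d_m\}$, each document $d_i$ contains a variable-length sequence of words $w_i^1, w_i^2, \ldots, w_i^{T_i}$. We use Doc2VecC to vectorize each post and the words in it. The Doc2VecC method defines the probability of observing a target word $w^t$ as

$$P(w^t \mid c^t, x) = \frac{\exp\left(v_{w^t}^{\top}\left(U c^t + \frac{1}{T} U x\right)\right)}{\sum_{w' \in V} \exp\left(v_{w'}^{\top}\left(U c^t + \frac{1}{T} U x\right)\right)} \tag{1}$$

where $w^t$ is the target word, $c^t$ is the word's local context, $x$ is the global context, $v$ is a trainable parameter, $V$ is the vocabulary used in the training corpus, $U$ is the learned matrix in which each row represents a vector for one word, and $T$ is the length of the document.
The loss function is

$$\ell = -\sum_{i=1}^{m} \sum_{t=1}^{T_i} \log P\left(w_i^t \mid c_i^t, x_i\right) \tag{2}$$

Using the trained model, each document can be represented as the average of the embeddings of its words:

$$\mathbf{d}_i = \frac{1}{T_i} \sum_{w \in d_i} \mathbf{u}_w \tag{3}$$

where $\mathbf{d}_i$ is the vector for document $d_i$ and $\mathbf{u}_w$ is the row of $U$ that embeds word $w$.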
The similarity of two post document vectors $\mathbf{d}_i$ and $\mathbf{d}_j$ is measured by the cosine metric:

$$S(\mathbf{d}_i, \mathbf{d}_j) = \frac{\mathbf{d}_i \cdot \mathbf{d}_j}{\|\mathbf{d}_i\| \, \|\mathbf{d}_j\|} \tag{4}$$

June 2022 | Volume 10 | Article 918663

The SPC method clusters the posts roughly, since it processes the post documents only once. The algorithm is as follows:
Step 1: Assign the first document $d_1$ as the representative of cluster $D_1$.
Step 2: For the next document $d_i$, calculate the similarity $S$ between $d_i$ and the representative of each existing cluster.
Step 3: If the maximum similarity $S_{max}$ is greater than a threshold value $S_T$, add the document to the corresponding cluster and recalculate the cluster representative; otherwise, use $d_i$ to initiate a new cluster.
Step 4: If any document $d_i$ remains to be clustered, return to Step 2.
The representative is the mean vector of a cluster. After the SPC process, we denote the vector of document $i \in [1, m]$ in cluster $j \in [1, n]$ as $\mathbf{d}_{i,j}$, and the corresponding document as $d_{i,j}$. The clustering set is expressed as $D = \{D_1, D_2, \ldots, D_n\}$.
The number of clusters $n$ is much smaller than the number of posts $m$. Micro-blog posts are highly redundant because a large proportion of them are users' re-posts. The SPC method largely reduces this redundancy.
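In order to summarize the words of each cluster, we define the summarized document $D_j$ as the collection of the words of all documents assigned to cluster $j$:

$$D_j = \bigcup_{i} d_{i,j} \tag{5}$$

Then, the vector of the summarized document $D_j$ can also be calculated by Eq 3. After we get the summarized posts, the next task is to label these data.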
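The four SPC steps above can be illustrated with a minimal Python sketch. It assumes that the posts have already been turned into vectors (here toy 2-D vectors rather than Doc2VecC embeddings), and the function names `cosine` and `single_pass_cluster` are hypothetical, chosen only for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def single_pass_cluster(vectors, threshold=0.75):
    """Assign each vector, in arrival order, to its most similar cluster.

    Returns (assignments, centroids). The representative of a cluster is
    the mean of its member vectors, updated incrementally.
    """
    centroids, sizes, assignments = [], [], []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else -1
        if best >= 0 and sims[best] > threshold:
            n = sizes[best]
            # Step 3: join the cluster and recompute its mean representative.
            centroids[best] = [(c * n + v) / (n + 1)
                               for c, v in zip(centroids[best], vec)]
            sizes[best] = n + 1
            assignments.append(best)
        else:
            # Step 1 / Step 3: the document starts a new cluster.
            centroids.append(list(vec))
            sizes.append(1)
            assignments.append(len(centroids) - 1)
    return assignments, centroids

# Toy vectors standing in for document embeddings (assumed data).
posts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, reps = single_pass_cluster(posts, threshold=0.75)
print(labels)
```

Because each document is compared only with the current cluster representatives, feeding the same vectors in a different order can produce a different partition, which is the order dependence noted above.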

Label Propagation Algorithm
The expert knowledge is introduced to label the summarized posts. Experts need to label a small part of the summarized posts to feed the LPA. The LPA considers that the label of each node should be similar to most of its neighbors, and the label is "propagated" to form the same "label" within the same "community" based on the network perspective.
We are given annotated data $(D_1, y_1), \ldots, (D_l, y_l)$ with the label set $Y_l = \{y_1, \ldots, y_l\}$, $y_i \in \{1, \ldots, C\}$, where the $C$ categories are given by the experts and are all present in the labeled data. The unlabeled data are $(D_{l+1}, y_{l+1}), \ldots, (D_{l+u}, y_{l+u})$, and $Y_u = \{y_{l+1}, \ldots, y_{l+u}\}$ is the label set to predict, where $l + u = n$ and $l \ll u$. The Label Propagation Algorithm (LPA) is used to predict $Y_u$ from $Y_l$ and $X = X_l \cup X_u = \{D_1, \ldots, D_{l+u}\}$.

Algorithm 2. Label Propagation Algorithm (LPA)
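A fully connected graph is created in which each sample point (labeled or unlabeled) is treated as a node. The weight of the edge between two points $i$ and $j$ is set as

$$w_{ij} = \exp\left(-\frac{\|\mathbf{D}_i - \mathbf{D}_j\|^2}{\sigma^2}\right) \tag{6}$$

where the parameter $\sigma$ is adjustable. Then, the probabilistic transition matrix $T \in \mathbb{R}^{(l+u) \times (l+u)}$ is defined as

$$T_{ij} = P(j \to i) = \frac{w_{ij}}{\sum_{k=1}^{l+u} w_{kj}} \tag{7}$$

The element $T_{ij}$ is the probability of the label of node $j$ propagating to node $i$. By probability propagation, the probability distribution is concentrated in a given class, and the node labels are passed through the weights of the edges. We can express the random walks as

$$y_i[c] = \sum_{j \in X_l} T^{t}_{ij} \, y_j[c], \quad t \to \infty \tag{8}$$

where $y_i[c]$ is the probability of node $D_i \in X_u$ having label $c$, and $T^{t}_{ij}$ is the probability of jumping from node $D_j$ and ending up in node $D_i$ in $t$ steps. The number of steps is a large number (infinity). The probabilistic transition matrix $T$ can be written as a block matrix:

$$T = \begin{pmatrix} T_{ll} & T_{lu} \\ T_{ul} & T_{uu} \end{pmatrix} \tag{9}$$

In matrix form, Eq 8 can be written as follows:

$$\hat{Y} = \begin{pmatrix} \hat{Y}_l \\ \hat{Y}_u \end{pmatrix} = \begin{pmatrix} Y_l \\ (I - T_{uu})^{-1} T_{ul} Y_l \end{pmatrix} \tag{10}$$

where the label vectors of the labeled nodes are $\hat{Y}_l = Y_l$ and the label vectors of the unlabeled nodes are $\hat{Y}_u = (I - T_{uu})^{-1} T_{ul} Y_l$. Finally, one can get the label of each unlabeled node $D_i$ as the class with the highest probability:

$$y_i = \arg\max_{c} \hat{y}_i[c] \tag{11}$$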
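As an illustration of the propagation step, the following Python sketch iterates the update $Y_u \leftarrow T_{uu} Y_u + T_{ul} Y_l$ while clamping the expert labels, so that its fixed point is $(I - T_{uu})^{-1} T_{ul} Y_l$. It is a sketch under stated assumptions, not the authors' implementation: the Gaussian weights on squared Euclidean distance, the column normalization, the fixed iteration count, and the function name `label_propagation` are all choices made for this example.

```python
import math

def label_propagation(points, labels, n_classes, sigma=1.0, n_iter=200):
    """Propagate labels from labeled to unlabeled points.

    points: list of feature vectors.
    labels: class index for labeled points, None for unlabeled ones.
    Returns the predicted class index of every point.
    """
    n = len(points)
    # Gaussian edge weights on the fully connected graph (assumed kernel).
    w = [[math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                   / sigma ** 2)
          for j in range(n)] for i in range(n)]
    # Column-normalized transition matrix: t[i][j] = w[i][j] / sum_k w[k][j].
    col = [sum(w[k][j] for k in range(n)) for j in range(n)]
    t = [[w[i][j] / col[j] for j in range(n)] for i in range(n)]
    # One-hot rows for labeled points, all-zero rows for unlabeled ones.
    y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    for _ in range(n_iter):
        new_y = [[sum(t[i][j] * y[j][c] for j in range(n))
                  for c in range(n_classes)] for i in range(n)]
        # Clamp labeled points back to their expert labels.
        y = [y[i] if labels[i] is not None else new_y[i] for i in range(n)]
    return [max(range(n_classes), key=row.__getitem__) for row in y]

# Toy 1-D "summarized posts": two labeled points, four unlabeled (assumed data).
points = [[0.0], [0.1], [0.2], [1.0], [0.9], [0.8]]
expert = [0, None, None, 1, None, None]
predicted = label_propagation(points, expert, n_classes=2, sigma=0.3)
print(predicted)
```

The iterative form avoids a matrix inverse, which keeps the sketch dependency-free; for large graphs a linear-algebra library would be the more practical choice.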

Word Mover's Distance
In order to correlate the posts between sub-events, the WMD method is introduced. According to the LPA results, each label represents a sub-event and includes several summarized posts. The post documents labeled $c$ are added to the set $C_c = \{d_i[c]\}$, $i \in \{1, \ldots, n\}$, $c \in \{1, \ldots, C\}$, which represents sub-event $c$ as a set of summarized documents. The WMD is used to find the pairs between the summarized posts of different sub-events by measuring the semantic distance between two documents, where each document is a summarized post.
WMD is a distance between two text documents $x$ and $y$. Let $|x|$ and $|y|$ be the number of distinct words in $x$ and $y$. The normalized frequency vectors of the words in $x$ and $y$ are expressed as $f_x \in \mathbb{R}^{|x|}$ and $f_y \in \mathbb{R}^{|y|}$, respectively (so $f_x^{\top} \mathbf{1} = f_y^{\top} \mathbf{1} = 1$). Then, the WMD is defined as

$$\mathrm{WMD}(x, y) = \min_{F \in \mathbb{R}_{+}^{|x| \times |y|}} \langle S, F \rangle, \quad \text{s.t. } F \mathbf{1} = f_x, \; F^{\top} \mathbf{1} = f_y \tag{12}$$

where $F$ is the transportation flow matrix, with $F_{ij}$ denoting the amount of flow traveling from word $i$ in $x$ to word $j$ in $y$, and $S$ is the transportation cost, with $S_{ij} = S(w_i, w_j)$ being the distance between two words measured by Doc2VecC.

Algorithm 3. Word Mover's Distance (WMD)
According to the WMD method, one can establish the relationships between sub-events from the similarity between a post $d_i$ in sub-event class $C_k$ and a post $d_j$ in sub-event class $C_l$. We denote the set of paired posts between the two classes as

$$P_{kl} = \{(d_i, d_j) \mid d_i \in C_k, \; d_j \in C_l, \; \mathrm{WMD}(d_i, d_j) \le \Theta\} \tag{13}$$

where $\Theta$ is a threshold value.
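The pairing step can be sketched in Python as follows. Solving the exact WMD requires a linear-programming or optimal-transport solver, so this dependency-free sketch substitutes the relaxed WMD, a lower bound obtained by dropping one marginal constraint at a time so that each word sends all its mass to the nearest word of the other document. That substitution, the toy 2-D word embeddings, the Euclidean word distance, and the names `relaxed_wmd` and `paired_posts` are assumptions made for illustration only.

```python
import math
from collections import Counter

def nbow(tokens):
    """Normalized bag-of-words frequencies of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def euclid(a, b):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relaxed_wmd(doc_x, doc_y, emb):
    """Lower bound on WMD: the larger of the two one-sided relaxations."""
    fx, fy = nbow(doc_x), nbow(doc_y)
    x_to_y = sum(m * min(euclid(emb[w], emb[v]) for v in fy)
                 for w, m in fx.items())
    y_to_x = sum(m * min(euclid(emb[w], emb[v]) for v in fx)
                 for w, m in fy.items())
    return max(x_to_y, y_to_x)

def paired_posts(sub_event_k, sub_event_l, emb, theta):
    """Index pairs (i, j) whose relaxed WMD is at most theta."""
    return [(i, j)
            for i, dx in enumerate(sub_event_k)
            for j, dy in enumerate(sub_event_l)
            if relaxed_wmd(dx, dy, emb) <= theta]

# Toy embeddings and keyword lists (assumed data, not from the dataset).
emb = {"tesla": [1.0, 0.0], "fire": [0.0, 0.0], "smoke": [0.1, 0.0],
       "company": [5.0, 6.0], "reply": [5.0, 5.0]}
c_k = [["tesla", "fire"], ["company", "reply"]]
c_l = [["tesla", "smoke"], ["company", "reply"]]
pairs = paired_posts(c_k, c_l, emb, theta=0.5)
print(pairs)
```

Because the relaxed distance never exceeds the exact WMD, thresholding it can admit pairs that the exact distance would reject; an exact solver should be used where that matters.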

Markov Chain Monte Carlo
The WMD method gives the pairs between different sub-events. The core task of our method is to acquire the prior probability and the evolution probability, so that the correlation and evolutionary trends can be built. The prior probability of each sub-event is calculated as the statistical frequency

$$P(C_i) = \frac{|C_i|}{\sum_{j=1}^{C} |C_j|} \tag{14}$$

where $|C_i|$ is the number of summarized documents for sub-event $i$. The evolution probability between sub-event pairs is calculated as the conditional probability

$$P(C_l \mid C_k) = \frac{|P_{kl}|}{\sum_{l'} |P_{kl'}|} \tag{15}$$

where $|P_{kl}|$ is the number of paired posts between sub-events $C_k$ and $C_l$. According to the Metropolis rejection defined by Hastings, the acceptance probability is

$$\alpha(k, l) = \min\left(1, \frac{P(C_l) \, P(C_k \mid C_l)}{P(C_k) \, P(C_l \mid C_k)}\right) \tag{16}$$

The Metropolis-Hastings update makes one proposal $l$, which becomes the new state with probability $\alpha(k, l)$; otherwise, the new state is the same as the old state $k$. Using the Metropolis-Hastings algorithm, one obtains a sample collection whose elements are sub-event types. Given the length $T$ of the sample collection and the number of time slices, each time step $t$ includes $\Delta n$ samples. The probability of a sub-event $C_k$ in time step $t$ is defined as

$$P_t(C_k) = \frac{n_k(t)}{\Delta n} \tag{17}$$

where $n_k(t)$ is the number of samples of type $C_k$ in time step $t$.

Algorithm 4. MCMC: Metropolis-Hastings algorithm
At the end of the modeling process, the regenerated popularity curve of every sub-event is obtained.
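A minimal Python sketch of a Metropolis-Hastings sampler over discrete sub-event states is given below. It assumes that the prior probabilities serve as the target distribution and that the evolution probabilities serve as the proposal distribution; the numeric values, the fixed seed, and the names `metropolis_hastings` and `popularity_per_slice` are hypothetical and only serve the illustration.

```python
import random
from collections import Counter

def metropolis_hastings(prior, transition, n_samples, seed=0):
    """Sample sub-event indices with target `prior` and proposal `transition`.

    prior[k] is the target probability of sub-event k; transition[k][l] is
    the probability of proposing l from k (each row sums to 1).
    """
    rng = random.Random(seed)
    states = range(len(prior))
    k = rng.choices(states, weights=prior)[0]
    samples = []
    for _ in range(n_samples):
        l = rng.choices(states, weights=transition[k])[0]
        forward, backward = transition[k][l], transition[l][k]
        ratio = (prior[l] * backward) / (prior[k] * forward)
        if rng.random() < min(1.0, ratio):
            k = l  # accept the proposal
        samples.append(k)  # on rejection the old state is repeated
    return samples

def popularity_per_slice(samples, n_states, slice_len):
    """Share of each sub-event within consecutive slices of slice_len samples."""
    curves = []
    for start in range(0, len(samples) - slice_len + 1, slice_len):
        counts = Counter(samples[start:start + slice_len])
        curves.append([counts[s] / slice_len for s in range(n_states)])
    return curves

# Hypothetical prior and evolution probabilities for three sub-events.
prior = [0.5, 0.3, 0.2]
transition = [[0.2, 0.5, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]]
samples = metropolis_hastings(prior, transition, n_samples=20000)
curves = popularity_per_slice(samples, n_states=3, slice_len=2000)
print(len(curves))
```

Each row of `curves` plays the role of one time step, and each column traces the regenerated popularity of one sub-event across time steps.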

Model Evaluation
The regenerated popularity curves are evaluated by comparing them with the real popularity dynamics, with a random model serving as a reference.

The Real Popularity Dynamic
The real evolution of the "Shanghai Tesla self-ignition incident" is measured by

$$\tilde{P}_t(C_k) = \frac{|C_k(t)|}{\Delta n} \tag{18}$$

where each time step $t$ covers the $\Delta n$ documents posted within 2 days and $|C_k(t)|$ is the number of real documents of sub-event $C_k$ in that time step.
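The windowed counting behind the real popularity curve can be sketched in Python as below. The input layout (a list of `(date, sub_event_index)` tuples), the toy records, and the function name `real_popularity` are assumptions for illustration; the 2-day window follows the text.

```python
from collections import Counter
from datetime import date

def real_popularity(posts, start, window_days, n_states):
    """Share of each sub-event per time window.

    posts: non-empty iterable of (date, sub_event_index) pairs.
    Returns one list of shares per window, ordered by time.
    """
    buckets = {}
    for day, label in posts:
        t = (day - start).days // window_days
        buckets.setdefault(t, Counter())[label] += 1
    curves = []
    for t in range(max(buckets) + 1):
        counts = buckets.get(t, Counter())
        total = sum(counts.values())
        curves.append([counts[k] / total if total else 0.0
                       for k in range(n_states)])
    return curves

# Toy labeled posts (assumed data): three in the first window, one in the second.
posts = [(date(2019, 4, 21), 0), (date(2019, 4, 22), 0),
         (date(2019, 4, 22), 1), (date(2019, 4, 23), 1)]
real_curves = real_popularity(posts, start=date(2019, 4, 21),
                              window_days=2, n_states=2)
print(real_curves)
```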

Jensen-Shannon Divergence
The Jensen-Shannon Divergence (JSD) [39] is introduced to measure the similarity between the real distribution $p_1$ and the MCMC distribution $p_2$ and is defined as

$$\mathrm{JSD}(p_1 \,\|\, p_2) = H\left(\frac{p_1 + p_2}{2}\right) - \frac{H(p_1) + H(p_2)}{2} \tag{19}$$

where $p_1$ and $p_2$ are the two distributions to be compared and $H(p)$ represents the Shannon entropy. The lower bound JSD = 0 is attained only when the two distributions are identical. The smaller the JSD value is, the more similar the two distributions are.
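The entropy form of the JSD can be computed with a few lines of Python. The sketch assumes base-2 logarithms (the text does not state the base), which bounds the value between 0 and 1, and it expects two probability vectors of equal length; the function names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def jsd(p1, p2):
    """Jensen-Shannon divergence: entropy of the mixture minus mean entropy."""
    mix = [(a + b) / 2 for a, b in zip(p1, p2)]
    return entropy(mix) - (entropy(p1) + entropy(p2)) / 2

same = jsd([0.5, 0.5], [0.5, 0.5])      # identical distributions
disjoint = jsd([1.0, 0.0], [0.0, 1.0])  # non-overlapping distributions
print(same, disjoint)
```

With base-2 logarithms, identical distributions give 0 and distributions with disjoint support give the maximum value of 1.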

Null Model
Then, a null model is built as a reference. Keeping the other steps of the proposed method unchanged, the null model replaces the LPA process with random labels. The evaluation still compares the simulated popularity curve with the real evolutionary curve of each sub-event. The improvement rate is the difference between the JSD of the null model and the JSD of the proposed model, divided by the JSD of the null model.
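Written out, the improvement rate described above takes the following form. The symbols $\mathrm{JSD}_{\text{null}}$ and $\mathrm{JSD}_{\text{model}}$ are introduced here only for illustration, and the numbers in the worked line are hypothetical, not values from Table 3.

```latex
\[
  \text{Improvement} =
  \frac{\mathrm{JSD}_{\text{null}} - \mathrm{JSD}_{\text{model}}}
       {\mathrm{JSD}_{\text{null}}}
\]
% Hypothetical example: JSD_null = 0.5 and JSD_model = 0.1
\[
  \frac{0.5 - 0.1}{0.5} = 0.8 = 80\%
\]
```

A positive rate means the proposed model is closer to the real popularity curve than the random-label baseline, and a rate of 100% would correspond to a perfect match (JSD of 0).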

EXPERIMENTAL RESULTS
The experimental dataset comes from the WRD Big Data competition and consists of Weibo data about the "Shanghai Tesla self-ignition incident," with 61,688 blog posts from 21 April 2019 to 5 May 2019. The incident involved a Tesla car that suddenly began smoking and self-ignited, which caused heated public debate on safety and the enterprise's responsibility. Pre-processing deletes the data labeled as robots, retweets without the user's own comment, and microblog texts shorter than 10 words. For the remaining 40,119 blog posts, stop-words, emojis, special characters, HTML tags, and URLs of various hyperlinks are removed. After word segmentation, the TextRank algorithm is used to extract the keywords from the set of blog posts, and each blog post contains 10 keywords. The unfolding model is conducted as follows.
Example post texts from Table 1:
"The owner responded: It was not charging at the time of the incident, and it has just finished supercharging a few hours ago."
"The car owner said that he parked the car 1 h before the fire without charging. In fact, the car finished the supercharging only a few hours before the fire, which increased its cruising range to another 350 kms"
As shown in Table 1, two typical kinds of posts can be summarized according to the similarity threshold, which we set to 0.75 in SPC. The first kind is posts talking about the same content; for example, records 1 and 2 can be seen as one. The second kind is simple re-posts of the same content; for example, records 3 and 4 are also summarized as one. When the similarity between posts is smaller than the threshold, the records are not merged, so records 5 and 6 remain two separate posts. In the last two columns, experts label the summarized posts according to the keywords of the events. Three experts agreed on eight labels, namely Event Happen, Corporate Respond, Client Respond, Media Report, Fire Control, Weibo Discuss, Event Processing, and Expert Opinion, which are assigned to the first 600 summarized posts.
The second step is to extract the sub-events. The results take the form of labels, as shown in Table 2, which gives the standards of expert labeling together with the number and prior probability of each label after the LPA process. The labeling standards are defined by the experts when the first 600 summarized posts are labeled. The frequency of each sub-event $C$ is counted over both the expert labels and the LPA labels. The prior probability of each label is calculated by dividing its count by the total number of summarized posts.
The third step is to correlate the sub-events. Through the WMD method, the numbers of pairs between sub-events are used to calculate the evolution probability. The results are shown in Figure 3 as a topic-changing tree. Based on prior probability and evolution probability, the MCMC simulation gives the probability distribution of each sub-event.
Finally, the fourth step is to verify the development of the sub-events. The regenerated sub-event curves are compared with the real popularity curves, as shown in Figure 4. The JSD values are 0.0950, 0.0841, 0.0635, 0.06804, 0.2304, 0.2135, 0.3727, and 0.1377 for Event Happen $C_1$, Corporate Respond $C_2$, Client Respond $C_3$, Media Report $C_4$, Fire Control $C_5$, Weibo Discuss $C_6$, Event Processing $C_7$, and Expert Opinion $C_8$, respectively. These results are 87.03%, 88%, 86.87%, 57.37%, 75.48%, 65.33%, 50.52%, and 80.54% better than those of the null model (see Table 3).
In this article, we use Single Pass Clustering (SPC) to summarize the massive posts. This step reduces the redundancy among similar posts and forms summarized posts. Then, the Label Propagation Algorithm (LPA) is introduced so that the small-scale expert labels can spread to the whole dataset. Each label is a topic of concern to PR managers and represents a sub-event. The SPC and LPA processes complete the sub-event detection. Among the summarized posts of different sub-events, we use the Word Mover's Distance (WMD) to pair the correlated documents. Markov Chain Monte Carlo (MCMC) simulation is finally used to correlate the sub-events and predict the evolution of each sub-event. The WMD and MCMC processes complete the sub-event correlation. The results show that the procedure is 50.52% to 88% better than the random null model in the case of the "Shanghai Tesla self-ignition incident".
The reconstruction method can help to intuitively understand different sides of the events and the hotspot shift of public opinion. But there are several limitations of this article. First, external knowledge deserves further study to enhance the comprehensibility and accuracy of sub-events. Second, similarity measurements are essential for the results of classification [40], and which measurement is stable for Weibo post classification is an open question. Third, time-line correlation should be introduced into topic-level sub-event development trends [41]. Lastly, the approach of network reconstruction [42,43,44] can be integrated into content reconstruction.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.