Information Cascades Prediction With Graph Attention

Cascade prediction aims to predict the future diffusion path of information based on cascades observed in a social network. Recent deep learning studies have achieved remarkable results, which indicates their great potential for the cascade prediction task. However, most prior work considers either cascade features or the user relationship network alone, which limits performance because the potential relationship between the two is not modeled in a unified way. To that end, we propose a recurrent neural network model with a graph attention mechanism, which builds a seq2seq framework to learn spatial-temporal cascade features. Specifically, for the spatial feature, we learn the potential relationships among users from the social network through a graph attention network. Then, for the temporal feature, a recurrent neural network learns the structural context over several time intervals derived from timestamps, weighted by a time-decay attention. Finally, we predict the next user from the resulting cascade representation. Experimental results on two real-world datasets show that our model outperforms the baselines on both evaluation metrics, HITS and mean average precision.


INTRODUCTION
Online social media platforms such as Twitter, Sina Weibo, and WeChat have greatly accelerated the spread of information on the Internet, with an increasingly important impact on people's daily lives. A cascade [1] consists of a series of user behaviors on a social network, such as sharing, commenting, and liking, and is regarded as a temporal sequence, as shown in Figure 1. Users who participate in the diffusion are called infected users. Cascades are usually considered the basis of information diffusion on social networks, and modeling and predicting them helps to understand and quantify user influence. Cascade prediction aims to predict the future process of information diffusion based on observed cascades, which is of great significance to decision-making on social networks, such as viral marketing [2] and support for the Internet of Things [3].
Existing work on cascade prediction mainly learns cascade features from the diffusion path or the user relationship graph, and can be summarized into two categories. 1) Analysis methods based on traditional topological structure and characteristics. Early methods usually modeled the cascade diffusion process based on the network topology [4] and the network propagation mechanism [5,6,7]. To further improve predictive performance, a series of feature-driven methods [8], such as learning user influence and susceptibility [9], were proposed. 2) Methods based on deep learning. Following the successful application of end-to-end learning models such as DeepCas [10], various neural network models applied to cascade prediction have further improved performance, and deep learning has gradually become the main approach to the cascade diffusion prediction task. In recent work, Topo-LSTM [11] and DeepInf [12] modeled the network topology and predicted propagation through user-level representation learning. Other researchers used neural networks to learn the temporal features of the cascade, leading to DeepHawkes [13], the RNN-based CRPP method [14], and other models.
However, traditional methods based on propagation mechanisms [5] and features [9] often rely on manual definitions derived from extensive study and observation. Moreover, complex manual features [8] often limit generalization and robustness. As for deep learning methods, existing work studies either the relationships among users in the network topology [11] or the temporal features of cascades [13], but not both. In a real social network, however, information diffusion is affected by spatial and temporal features together, so it is of great significance to study the potential relationship between these two aspects.
To solve this problem, we propose DLIC (Deep Learning Information Cascades), an end-to-end neural network framework that combines social network topology and temporal cascade information. The framework first learns a feature representation of each user through graph attention over the topology of the social network. A recurrent neural network is then built to learn cascade representations in different time intervals, and a time-decay attention mechanism assigns different weights to them. The final representation, which encodes the preceding cascade features, is fed into the decoder layer to predict the next infected user.
Extensive experiments on two public cascade diffusion datasets, namely Twitter [15] and Douban [16], validate the performance of our model against several baseline methods. The results show improvements in HITS@100 of 2.2% and 2% on the Twitter and Douban datasets, respectively. Our model also performs best on all other metrics on both datasets.
The paper is organized as follows: Section 2 introduces related work on the cascade prediction task, and Section 3 describes our DLIC model. Section 4 presents our experimental results and discusses the effect of different features. Finally, Section 5 summarizes our main findings and future work.

BACKGROUND
The goal of cascade prediction is to model the regularities of diffusion in a social network, so as to describe the propagation mechanism of information and predict the future diffusion path. Research on cascade prediction can be divided into two categories: methods based on traditional machine learning and methods based on deep learning.
In terms of cascade feature learning, most work based on traditional machine learning learns the diffusion probability among users from observed cascade information. [8] modeled cascade information through a marked Hawkes self-exciting point process and made predictions using content virality, memory decay, and user influence. [17] learned embedded feature representations of social network users in a latent space through the independent cascade model. [18] proposed the IEDP model based on information-dependent embedding, which maps users into a latent embedding space according to the observed time sequence of the cascade diffusion process and makes predictions according to distances between embedding representations. [19] proposed EIC, an opinion leader mining model based on the extended independent cascade model, which integrates network structure characteristics, individual attributes, and behavior characteristics. [20] designed a route decision model using a data-driven method. [21] constructed interaction rules based on multi-dimensional features such as user influence, sentiment, and age to simulate the process of information diffusion in social networks. [22] used a survival analysis model to learn the susceptibility and influence of users, which were used to calculate the diffusion probability among users. [23] proposed a method for extracting features from user behavior in urban big data. [24] argued that the spread of rumors is driven by multiple factors and proposed a multi-featured spread model. [25] computed the epidemic risk of COVID-19 by combining the number of infected persons and the way they pass through the station.
In recent years, with the rise of representation learning, more and more deep learning models such as LSTM, RNN, and GCN have been applied to cascade prediction. The DeepHawkes model proposed by [13] used end-to-end deep learning to simulate the explainable factors of the Hawkes process and modeled cascade information. [26] proposed an attention-based RNN to capture the cross-dependence in the cascade and a coverage strategy to overcome the misallocation of attention caused by the memorylessness of the traditional attention mechanism. [10] also proposed an end-to-end model to learn the cascade graph, which automatically learns the representation of a single cascade from the global network structure without manual features. [11] introduced a new data model named diffusion topologies and proposed a novel topological recurrent neural network, Topo-LSTM. DeepInf, proposed by [12], took the local network among users as input and learned their potential influence in the social network through graph convolutional networks. [27] proposed a sequential information diffusion neural network with structure attention, which considers the process of information diffusion and the structural characteristics of the user graph through a recurrent neural network. [28] also proposed an attention network for the diffusion prediction problem, which can effectively explore the implicit diffusion dependence among the users of an information cascade. [14] proposed a competing recurrent point process on an RNN, which models both the diffusion process and the competition process. [29] proposed a multi-scale diffusion prediction model based on reinforcement learning, which integrates macroscopic information into the RNN-based microscopic spread model for predicting infected users. [30] proposed a multi-task joint learning framework to understand user relationships and predict cascades with graph attention networks and recurrent neural networks. [31] estimated traffic time from taxi trajectories in different fine-grained time intervals based on deep learning. [32] designed an RNN model with a multi-relational structure, which captures not only the traditional time dependence but also the explicit multi-relational topological dependence through a hierarchical attention mechanism.
FIGURE 1 | A toy example of cascade prediction, which contains several users who participated in information diffusion. Given the observed cascade A → B → C in the social network, we predict who the next infected user is, D or E.
Frontiers in Physics | www.frontiersin.org August 2021 | Volume 9 | Article 739202
In particular, spatial-temporal feature learning methods that combine graph networks and RNNs, which also belong to deep learning methods, have achieved remarkable results. [33] and [34] aimed to predict object trajectories: they constructed a graph from spatial coordinates and learned subsequent positional information with RNNs, using soft attention and self-attention, respectively, to enhance representation learning. [35] proposed a social recommendation method based on dynamic graphs: they encoded the long- and short-term preferences of users in a session with an RNN, and then learned dynamic graph features for a user and his friends through graph attention, which were used for recommendation. Building on the above work, we propose to learn user spatial features with graph attention and then encode cascade temporal features with an RNN.
In summary, deep learning methods, which avoid the drawbacks of feature engineering, have gradually become the major technique in the cascade prediction task, but most previous research focuses only on the representation of the cascade. The lack of unified modeling of user structure and temporal features remains a key problem to be solved.

METHODS
In this section, we start with formalization of the cascade diffusion prediction problem. Then we introduce the framework of our model, which learns the structural context among users through graph attention and then integrates the temporal feature into cascade representation by time decay effect. Finally, we present the overall algorithm and details.

Problem Formalization
A cascade is the adoption of a piece of information by people in a social network. To formalize our problem, we first introduce some terminology. A user who shares information in the social network is called an infected user. We are given a user set $U = \{u_1, u_2, \ldots, u_N\}$ and a cascade set $C = \{c_1, c_2, \ldots, c_M\}$, where $N$ and $M$ are the numbers of users and cascades, respectively.
Each cascade $c_i = \{(u_1, t_1), (u_2, t_2), \ldots, (u_{|c_i|}, t_{|c_i|})\}$ is a sequence of infected users and timestamps in a diffusion process, where $|c_i|$ is the number of infected users and $t_i$ is the infection timestamp of $u_i$. The relationships among users are represented by $G = \{U, E\}$, where $E = [e_{ij}] \in \mathbb{R}^{|U| \times |U|}$ is the adjacency matrix of the social graph; $e_{ij} = 1$ means that there is an edge between users $u_i$ and $u_j$.
Cascade prediction can be divided into the macro level and the micro level. Macroscopic diffusion prediction aims to predict the final cascade size. This paper addresses microscopic diffusion prediction, whose aim is to predict the next infected user $u_{i+1}$ based on the social graph $G$ and the cascade set $C$ observed before time $t_i$.
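As a minimal sketch of the notation above, a cascade can be held as a list of (user, timestamp) pairs and the social graph as adjacency sets; microscopic prediction then turns each cascade into (observed prefix, next infected user) pairs. The toy users, timestamps, and the helper name `next_user_examples` are illustrative assumptions, not part of the original work.

```python
from typing import Dict, List, Set, Tuple

Cascade = List[Tuple[str, float]]  # [(user, infection timestamp), ...]


def next_user_examples(cascade: Cascade) -> List[Tuple[List[str], str]]:
    """Turn one cascade into (observed prefix, next infected user) pairs."""
    ordered = sorted(cascade, key=lambda pair: pair[1])  # order by timestamp
    users = [user for user, _ in ordered]
    return [(users[:i], users[i]) for i in range(1, len(users))]


# Hypothetical toy data: social graph G as adjacency sets and one cascade.
graph: Dict[str, Set[str]] = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "E"},
    "D": {"A"},
    "E": {"C"},
}
cascade: Cascade = [("B", 2.0), ("A", 1.0), ("C", 3.5)]

examples = next_user_examples(cascade)
for prefix, target in examples:
    print(prefix, "->", target)
```

Each pair corresponds to one prediction step: given the users infected so far, the model is asked for the user infected next.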

Overview of Technical Framework
To illustrate how to capture the potential spatial-temporal information in a cascade, we introduce the proposed DLIC model. As shown in Figure 2, DLIC takes the social network and a cascade as input and outputs the next infected users one by one.
The DLIC model consists of four main components: 1) User embedding layer: learns user relationships from the social graph through graph attention to obtain user representations that reflect the different influence of users. 2) Cascade encoding layer: feeds the embedding representations, in the order of the observed cascade, into a recurrent neural network to encode the cascade path. 3) Time-decay attention: further refines the cascade representation by assigning different attention weights based on timestamp slices. 4) Decoding and output layer: the last hidden state of the encoding layer is the representation of the cascade; it is taken as input to the decoding layer, which predicts the next infected users and outputs them one by one. Next, we give a detailed introduction to these components.

User Embedding
A social network is the relationship graph among users. Behaviors such as following, liking, replying, and forwarding form the topology of the social network, and this structure affects and promotes information diffusion. It therefore makes sense to learn the features of users in the social network. However, the user graph of a social network is usually very large and complicated. Learning features from all users directly would not only consume a lot of hardware resources and time, but might also degrade model performance because of noisy nodes such as paid posters. We therefore propose a user sampling method. As Figure 3 shows, for the K observed users in a cascade, we randomly select several of their neighbor nodes, shown as yellow nodes; nodes with higher degree are more likely to be selected. The remaining nodes, shown in white, are discarded. We obtain a subgraph for each cascade and finally integrate the subgraphs into the new user graph G.
For the subgraph obtained in this way, we feed the adjacency matrix into a multi-layer, multi-head graph attention network [36] to learn the user representations. Specifically, we take the vector set $h = \{h_1, h_2, \ldots, h_N\}$, $h_i \in \mathbb{R}^{F}$, as the features of all users, where $N$ is the number of users and $F$ is the feature dimension. We then apply a linear transformation to $h$, as Eq. 1 shows:
$$v_i = W h_i \tag{1}$$
where $W \in \mathbb{R}^{F' \times F}$ is an independent trainable weight matrix. For a pair of neighbor nodes $i$ and $j$, i.e., $e_{ij} = 1$, we learn the attention weight $z_{ij}$ between them. First, for each neighbor $j$ of user $i$, we concatenate their feature vectors $v_i$ and $v_j$ and apply another linear transformation to obtain the attention coefficient $c_{ij} = a(Wv_i \,\|\, Wv_j)$, where $\|$ is the concatenation operation and $a \in \mathbb{R}^{(F' \times F')}$ is a trainable matrix. $c_{ij}$ represents the importance of node $j$ relative to node $i$.
The coefficient is then activated with the LeakyReLU function, and the attention weight of each neighbor is obtained with the softmax function, as Eq. 2 shows:
$$z_{ij} = \frac{\exp(\mathrm{LeakyReLU}(c_{ij}))}{\sum_{k \in N(i)} \exp(\mathrm{LeakyReLU}(c_{ik}))} \tag{2}$$
where $N(i)$ is the neighbor set of user $i$. The user feature representation is then updated according to the attention weights of its neighbors. Specifically, we obtain the new hidden representation $h_i$ through a weighted sum based on the above attention weights, as Eq. 3 shows:
$$h_i = \sigma\Big(\sum_{j \in N(i)} z_{ij} v_j\Big) \tag{3}$$
where $\sigma$ is a nonlinear activation function, i.e., $\mathrm{ReLU}(\cdot)$. Finally, we adopt multi-head attention to stabilize the process of user feature learning. Each attention head executes the transformation of Eq. 3 independently, and the head outputs are concatenated to obtain a feature representation that contains different views of the user's neighborhood. The result $h_i^{concat}$ is the input to the next GAT layer and is computed by Eq. 4:
$$h_i^{concat} = \big\Vert_{e=1}^{E} h_i^{e} \tag{4}$$
where $\Vert$ is the concatenation operation, $E$ is the number of heads, and $h_i^{e}$ is the output of each attention head. In the final GAT layer, concatenation is no longer appropriate. Therefore, the output user embedding representation is calculated by average pooling over all attention heads, as Eq. 5 shows, which also saves memory:
$$h_i^{user} = \frac{1}{E} \sum_{e=1}^{E} h_i^{e} \tag{5}$$
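The attention steps described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the attention parameter `a` is taken to be a vector of length 2F' applied to the concatenation of the two transformed features (the standard GAT choice, whereas the text describes a trainable matrix), and the function names `gat_head` and `merge_heads` as well as the toy features are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def matvec(W: List[Vector], x: Vector) -> Vector:
    """Dense matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def gat_head(h: Dict[str, Vector], neighbors: Dict[str, List[str]],
             W: List[Vector], a: Vector) -> Dict[str, Vector]:
    """One attention head: transform, score neighbors, softmax, weighted sum, ReLU."""
    v = {u: matvec(W, feat) for u, feat in h.items()}  # v_i = W h_i
    out: Dict[str, Vector] = {}
    for i, nbrs in neighbors.items():
        if not nbrs:  # assumption: an isolated node keeps its own transformed features
            out[i] = [max(0.0, x) for x in v[i]]
            continue
        # Score of neighbor j: LeakyReLU(a . (v_i || v_j)); list "+" concatenates.
        scores = [leaky_relu(sum(w * x for w, x in zip(a, v[i] + v[j]))) for j in nbrs]
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        z = [e / total for e in exps]  # attention weights over N(i)
        agg = [sum(z_ij * v[j][d] for z_ij, j in zip(z, nbrs)) for d in range(len(v[i]))]
        out[i] = [max(0.0, x) for x in agg]  # ReLU activation
    return out


def merge_heads(heads: List[Dict[str, Vector]], final_layer: bool) -> Dict[str, Vector]:
    """Concatenate head outputs in hidden layers, average them in the final layer."""
    merged: Dict[str, Vector] = {}
    for u in heads[0]:
        if final_layer:
            dim = len(heads[0][u])
            merged[u] = [sum(hd[u][d] for hd in heads) / len(heads) for d in range(dim)]
        else:
            merged[u] = [x for hd in heads for x in hd[u]]
    return merged


# Hypothetical toy input: three users with 2-dimensional features.
features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
adjacency = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
head = gat_head(features, adjacency, identity, a=[0.0, 0.0, 0.0, 0.0])
print(head["A"])
```

With a zero attention vector every neighbor receives the same weight, so the output for A is simply the mean of the features of B and C, which makes the toy case easy to check by hand.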

Cascade Encoding Layer
A user who participates in cascade diffusion is affected not only by the latest infected user but also by earlier users. As shown in Figure 4, we construct a cascade path A → B → C → D ordered by the timestamps of the infected users. B is affected by A and C is affected by B, so the relationships among them look like a chain. Although C is the latest infected user, A may have greater influence, so that D still receives the message from A and becomes the next infected user. This means that each user in a cascade may affect users throughout the whole diffusion process. However, a cascade does not record the source from which each user forwarded the message, so the long-distance dependence of cascade features is a problem that needs to be solved.
RNNs have shown their effectiveness in many fields, which supports their use for learning cascade sequences. Given a cascade sequence $c = \{(u_1, t_1), (u_2, t_2), \ldots, (u_{|c_i|}, t_{|c_i|})\}$ composed of users $u_i$ and timestamps $t_i$, we encode the two separately for later prediction. We adopt a Gated Recurrent Unit (GRU) to learn the sequential information of the cascade from its users. Following the order of the cascade, the embedded representation $h_i^{user}$ of the $i$th user is fed into the GRU cell to obtain the hidden state $h_i^{s} = \mathrm{GRU}(h_i^{s-1}, h_i^{user})$ step by step, where $h_i^{s-1}$ is the hidden state of the previous user and $s = 1, 2, \ldots, |c_i|$ is the step of the recurrent neural network. The GRU is mainly composed of a reset gate and an update gate, which are calculated as follows. The reset gate $r_i$ is given by Eq. 6, where $\sigma(\cdot)$ is the sigmoid activation function and $W_r \in \mathbb{R}^{H \times F}$, $U_r \in \mathbb{R}^{H \times H}$, and $b_r \in \mathbb{R}^{H}$ are independent trainable parameters.
The update gate $v_i$ is given by Eq. 7, where, similarly, $W_v \in \mathbb{R}^{H \times F}$, $U_v \in \mathbb{R}^{H \times H}$, and $b_v \in \mathbb{R}^{H}$ are trainable parameters.
The GRU then uses the reset gate $r_i$ to compute the candidate hidden state $\hat{h}_i$, as given by Eq. 8, where $\odot$ denotes the Hadamard product, $W_h \in \mathbb{R}^{H \times F}$, $U_h \in \mathbb{R}^{H \times H}$, and $b_h \in \mathbb{R}^{H}$. Finally, the update gate $v_i$ produces the actual hidden state, as given by Eq. 9.
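For concreteness, one GRU step with the parameter shapes given above can be sketched as follows. This follows one common GRU formulation and rests on assumptions the text does not settle: the reset gate is applied to the previous hidden state before multiplication by $U_h$, the new state is $(1 - v) \odot h_{prev} + v \odot \hat{h}$, and the initial hidden state is zero. The helper `zero_params` is hypothetical and exists only to make the example easy to verify.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W: Matrix, x: Vector, U: Matrix, h: Vector, b: Vector) -> Vector:
    """Compute W x + U h + b row by row."""
    return [
        sum(w * xi for w, xi in zip(w_row, x)) + sum(u * hi for u, hi in zip(u_row, h)) + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def gru_step(x: Vector, h_prev: Vector, p: Dict[str, list]) -> Vector:
    """One GRU step: x is a user embedding (size F), h_prev the previous state (size H)."""
    r = [sigmoid(s) for s in affine(p["W_r"], x, p["U_r"], h_prev, p["b_r"])]  # reset gate
    v = [sigmoid(s) for s in affine(p["W_v"], x, p["U_v"], h_prev, p["b_v"])]  # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # r (Hadamard) h_prev
    h_hat = [math.tanh(s) for s in affine(p["W_h"], x, p["U_h"], gated, p["b_h"])]
    # Assumed convention: the update gate interpolates between old and candidate state.
    return [(1.0 - vi) * hp + vi * hc for vi, hp, hc in zip(v, h_prev, h_hat)]


def zero_params(F: int, H: int) -> Dict[str, list]:
    """Hypothetical helper: all-zero parameters with the shapes stated in the text."""
    p: Dict[str, list] = {}
    for gate in ("r", "v", "h"):
        p["W_" + gate] = [[0.0] * F for _ in range(H)]
        p["U_" + gate] = [[0.0] * H for _ in range(H)]
        p["b_" + gate] = [0.0] * H
    return p


def encode_cascade(embeddings: List[Vector], p: Dict[str, list], hidden_size: int) -> List[Vector]:
    """Feed user embeddings in cascade order and return every hidden state."""
    h = [0.0] * hidden_size
    states = []
    for x in embeddings:
        h = gru_step(x, h, p)
        states.append(h)
    return states


params = zero_params(F=2, H=1)
print(gru_step([1.0, -1.0], [2.0], params))
```

With all-zero parameters both gates equal 0.5 and the candidate state is 0, so the step simply halves the previous hidden state.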

Time Decay Attention
In a social network, the influence of earlier messages usually decays over time because information is time-sensitive. Users are usually more sensitive to the latest messages: information closer to a user's infection time usually has a greater impact on that user. However, the hand-crafted time functions of traditional methods generally cannot describe this effect exactly, and it is hard to decide which one to use. To learn the influence of time on cascades, we employ the following time-decay attention mechanism, which learns the weight of the previously infected users with respect to the current user. First, we divide the maximum observed time $t_{|c_i|}$ into $k$ equal-sized intervals $[t_0 = 0, t_1), [t_1, t_2), \ldots, [t_{|c_i|-1}, t_{|c_i|})$. The mapping function from continuous time to an interval is given by Eq. 10, where $t_i$ is the infection time of user $i$ in cascade $c$. We define a parameter $\lambda_{f(T-t)}$ for each interval as the time-decay weight. The final hidden state $h_i'$, which is the representation of cascade $c$, is assembled by a weighted average pooling mechanism, as given by Eq. 11.
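A minimal sketch of the time-decay pooling described above is given below. Several details are assumptions made for illustration: the observation time T is taken to be the last infection timestamp, interval indices are 0-based, the weights are normalized by their sum, and at least one selected weight is positive. In the model the per-interval weights would be learned; here they are passed in as a plain list named `lambdas`, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def interval_index(elapsed: float, t_max: float, k: int) -> int:
    """Map an elapsed time in [0, t_max] to one of k equal-sized intervals (0-based)."""
    if t_max <= 0:
        return 0
    idx = int(elapsed / t_max * k)
    return min(max(idx, 0), k - 1)  # clamp so that elapsed == t_max falls in the last interval


def time_decay_pool(states: List[Vector], timestamps: List[float], lambdas: List[float]) -> Vector:
    """Weighted average of hidden states, weighted by the interval of T - t_i."""
    t_max = timestamps[-1]  # assumption: T is the last observed infection time
    k = len(lambdas)
    weights = [lambdas[interval_index(t_max - t, t_max, k)] for t in timestamps]
    total = sum(weights)  # assumed to be positive
    dim = len(states[0])
    return [sum(w * s[d] for w, s in zip(weights, states)) / total for d in range(dim)]


# Hypothetical toy cascade: three hidden states infected at times 0, 5, and 10.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
infection_times = [0.0, 5.0, 10.0]
print(time_decay_pool(hidden_states, infection_times, lambdas=[1.0, 0.0]))
```

With `lambdas=[1.0, 0.0]` only the most recent interval contributes, so the pooled representation equals the last hidden state; equal weights reduce to plain mean pooling.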

Decoding and Output Layer
To predict the subsequent infected users, we feed the representation $h'_{|c_i|}$ learned by the encoding layer into a GRU cell in the decoding layer. The output of the GRU cell is the predicted infected user, which is fed into the next GRU cell as input to continue the prediction. To identify when to stop predicting, we append an <EOS> tag to the end of each cascade during training. The decoding layer continues to output predicted infected users until the output is <EOS>, which means that the cascade has stopped propagating. The infection probability of each user is calculated as in Eq. 12, where $p_i \in \mathbb{R}^{|U|}$ is the infection probability of user $i$ in the next propagation, and $W_p$ and $b_p$ are the trainable weight matrix and bias, respectively. The training objective, which maximizes the log-likelihood of all cascades, is defined in Eq. 13, where $p_i^c[j]$ is the infection probability from user $i$ to user $j$ in cascade $c$ and $\Theta$ denotes all trainable parameters of the model. The whole calculation process of our model is shown in Algorithm 1.
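The decoding loop and the likelihood objective can be sketched as follows. The sketch assumes that the infection probabilities are a softmax over an affine map of the decoder state and that decoding is greedy; `step` is a hypothetical callable standing in for the decoder GRU cell plus output layer, and `max_len` is an added safety bound, none of which are specified in the text.

```python
import math
from typing import Callable, List, Optional, Tuple

EOS = "<EOS>"


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def infection_probs(h: List[float], W_p: List[List[float]], b_p: List[float]) -> List[float]:
    """Assumed output layer: softmax(W_p h + b_p), one entry per candidate user."""
    logits = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W_p, b_p)]
    return softmax(logits)


def cascade_log_likelihood(step_probs: List[List[float]], targets: List[int]) -> float:
    """Sum of log-probabilities assigned to the true next user at every step."""
    return sum(math.log(p[j]) for p, j in zip(step_probs, targets))


def greedy_decode(h0, step: Callable[[object, Optional[str]], Tuple[List[float], object]],
                  vocab: List[str], max_len: int = 50) -> List[str]:
    """Emit the most probable user at each step until <EOS> is produced."""
    h, prev, predicted = h0, None, []
    for _ in range(max_len):
        probs, h = step(h, prev)  # the previous prediction is fed back as input
        prev = vocab[max(range(len(probs)), key=probs.__getitem__)]
        if prev == EOS:
            break
        predicted.append(prev)
    return predicted


print(infection_probs([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
```

Training would maximize `cascade_log_likelihood` summed over all cascades, which is equivalent to minimizing the usual cross-entropy loss.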

EXPERIMENTS
In this section, we compare the prediction performance of the proposed DLIC model with baselines and present the empirical evaluations to demonstrate the effectiveness of our model. Moreover, we perform detailed analysis to understand the role of each component in DLIC.

Datasets
In this paper, we verify the performance of our model on two public datasets. The datasets are split into training, validation, and test sets with 80%, 10%, and 10% of the data, respectively. Table 1 shows the statistics of the datasets.
The Twitter [15] dataset records the retweeting of URLs among users on Twitter during October 2010. Each cascade consists of all the users who retweeted a URL, sorted by time. The dataset contains 3,442 cascades involving 12,627 users.
On Douban [16], users can comment on books they have read. A user who comments on a book at a certain time is regarded as infected, and the diffusion process of a book is regarded as a cascade. The dataset contains 10,602 cascades involving 23,123 users.

Evaluation Metric
The purpose of cascade prediction is to predict the next infected user based on the given observed users. To simplify the task and make it easy to evaluate, we regard it as a retrieval task that detects k infected users among the remaining users. We therefore first rank the uninfected nodes according to the predicted infection probability, and then evaluate the top-k infected users for k = 10, 50, 100, respectively. The evaluation metrics are mean average precision (MAP) and HITS.
MAP@k: Mean average precision for a set of cascade predictions is the mean of the average precision scores of the cascades. We assume that there are $M$ infected users among the top-$k$ users, so we obtain a set of recall values $R = (1/M, 2/M, \ldots, M/M)$. Then, for each $r \in R$, we take the maximum precision $\max_{r' > r} P(r')$ to obtain the average precision AP. Finally, mean average precision is the average of AP over the set $C$ of predicted cascades, as Eq. 14 shows:
$$\mathrm{MAP} = \frac{\sum_{c \in C} AP(c)}{|C|} \tag{14}$$
Algorithm 1 | The algorithm description of DLIC.
HITS@k: The rate at which the top-k ranked nodes contain the next infected node, as Eq. 15 shows:
$$\mathrm{HITS@}k = \frac{\sum_{c \in C} p(c)}{|C|} \tag{15}$$
where $p(\cdot)$ is an indicator function: $p = 1$ if the prediction result for cascade $c$ contains an actual infected user, and $p = 0$ otherwise.
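The two metrics can be sketched as follows. The sketch makes some assumptions: each prediction is a ranked list of candidate users paired with the set of truly infected users, AP is averaged over the hits found in the top k, and the interpolation takes the best precision at the same or any higher recall level. The function names are illustrative.

```python
from typing import List, Set, Tuple

Prediction = Tuple[List[str], Set[str]]  # (ranked candidates, truly infected users)


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Interpolated average precision over the hits found in the top k."""
    precisions = []
    hits = 0
    for rank, user in enumerate(ranked[:k], start=1):
        if user in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall level
    if not precisions:
        return 0.0
    best = 0.0
    interpolated = []
    for p in reversed(precisions):  # best precision at this or any higher recall
        best = max(best, p)
        interpolated.append(best)
    return sum(interpolated) / len(interpolated)


def map_at_k(predictions: List[Prediction], k: int) -> float:
    """Mean of AP@k over all predicted cascades."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in predictions) / len(predictions)


def hits_at_k(predictions: List[Prediction], k: int) -> float:
    """Fraction of cascades whose top k contains at least one truly infected user."""
    return sum(1 for r, rel in predictions if any(u in rel for u in r[:k])) / len(predictions)


# Hypothetical toy predictions for two cascades.
toy = [(["a", "b", "c", "d"], {"b", "c"}), (["a", "b", "c", "d"], {"z"})]
print(map_at_k(toy, 4), hits_at_k(toy, 4))
```

When a cascade has a single true next user, AP@k reduces to the reciprocal of that user's rank if it lies within the top k, and to 0 otherwise.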

Baselines
In order to evaluate the performance of the DLIC model, the following baselines are applied to the same dataset to compare with the proposed model.
Topo-LSTM [11] is an LSTM-based model that extracts a directed acyclic graph from the social graph and integrates its features into the hidden state, which is used to predict the next node and its network structure.
SNIDSA [19] calculates the pairwise similarity of all user pairs, captures the structural dependency among users, and designs a gating mechanism to merge temporal and structural information into RNN.
FOREST [29] uses RNN to encode microscopic cascade information, which is used to learn the structural characteristics of the cascade. Also, it improved the performance through a reinforcement learning framework from macroscopic level to predict the infected nodes.

Experimental Settings
The DLIC model adopts seq2seq as its framework, and a graph attention network with 8 heads is used for the target task. In the experiments, we set the number of observed users to K = 10, randomly sample 20 neighbors of each node, and train with half-precision (fp16); the objective function is optimized with the Adam algorithm. The specific hyperparameter settings are shown in Table 2.

Experimental Results
To verify the effectiveness of the proposed model, we compared it with state-of-the-art cascade prediction methods on two datasets, evaluating the prediction of future infected users with the HITS and MAP metrics. The results, averaged over five experiments, are shown in Table 3.
The experimental results show that the DLIC model improves all metrics on both datasets. MAP@k and HITS@k improve by more than 1% and 1.6%, respectively, on Twitter, and by more than 0.4% and 0.9%, respectively, on Douban, which shows that DLIC outperforms the other state-of-the-art baselines on the cascade prediction task.
The overall superiority of DLIC over the baselines mainly comes from two facts: 1) In terms of user network structure, the improved encoding of structural context yields a better user embedding representation. Previous work only considered the influence of neighboring user nodes, while DLIC learns global user features through a graph attention network. 2) In terms of cascade features, the improvements mostly come from the latent influence of user activation time. Previous work only regarded timestamps as the activation order of users or as simple learning parameters, while DLIC learns the weights of different time periods by introducing a time-decay attention mechanism. Overall, the proposed DLIC model, which combines user relationships and cascade features, achieves better performance. This shows that unified modeling is an effective way to predict cascade diffusion. We will consider it in our future work.

Ablation Study
Our model uses a graph attention on the embedding layer to learn the user topology structure and takes the time decay attention to learn temporal feature of cascades. In order to explore the influence of each component in our proposed model, we remove them separately for experiments. For this purpose, we present two simplified versions of DLIC, denoted as w/o GAT and w/o time decay.
The results of the ablation experiments on the two datasets verify the effectiveness of the components mentioned above. When we remove either component, all metrics decrease on both datasets, which shows that both are effective and reflects their different impacts on our model. As Figure 5 shows, when we remove GAT, MAP@50 and HITS@50 decrease by 0.9% and 1.8%, respectively, on Twitter, and by 0.7% and 2.2%, respectively, on Douban. When we remove the time-decay attention, MAP@50 and HITS@50 decrease by 0.2% and 0.7%, respectively, on Twitter, and by 0.3% and 0.6%, respectively, on Douban. In summary, GAT has a great impact on our model, because learning the user topology integrates more context information into the user embedding representation. Removing the time-decay attention alone has only a small effect on the model; however, it improves model performance when combined with the graph attention network.

Analysis of User Neighbor Sampling
To effectively utilize the user relationships in the social network, we construct a subgraph by sampling several neighbors of each user and learn the user representation with graph attention. To study this choice, we run experiments with different numbers of sampled neighbors.
As Figure 6 shows, as the number of sampled neighbors increases, the MAP@k performance of our model first decreases, then increases to its peak, and finally declines again. The result on HITS@k increases first and then decreases. Our model achieves the best performance on both metrics when we sample 20 neighbors. This shows that a user representation that includes relationship structure is conducive to cascade prediction, mainly because the position of a user in the social network reflects that user's influence to a certain extent. However, too many neighbors who lack influence may also introduce noise, which lowers prediction performance. Based on this analysis, we sample 20 neighbors per user in our experiments.
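The neighbor sampling studied here can be sketched as follows. The text only states that higher-degree nodes are selected more easily; drawing neighbors without replacement with probability proportional to degree plus one is an assumption made for this illustration, and `sample_neighbors` is a hypothetical helper name.

```python
import random
from typing import Dict, List, Set


def sample_neighbors(graph: Dict[str, Set[str]], observed: List[str],
                     n_samples: int, rng: random.Random) -> Dict[str, Set[str]]:
    """For each observed user, draw up to n_samples distinct neighbors, favoring high degree."""
    subgraph: Dict[str, Set[str]] = {}
    for user in observed:
        pool = sorted(graph.get(user, ()))  # sorted for reproducibility with a seeded rng
        chosen: Set[str] = set()
        while pool and len(chosen) < n_samples:
            # Weight = degree + 1, so zero-degree neighbors remain selectable.
            weights = [len(graph.get(v, ())) + 1 for v in pool]
            pick = rng.choices(pool, weights=weights, k=1)[0]
            pool.remove(pick)  # sample without replacement
            chosen.add(pick)
        subgraph[user] = chosen
    return subgraph


# Hypothetical toy graph: A has three neighbors with different degrees.
toy_graph = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
sampled = sample_neighbors(toy_graph, ["A"], n_samples=2, rng=random.Random(0))
print(sampled)
```

The per-cascade subgraphs produced this way would then be merged into the user graph that is fed to the graph attention layers.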

CONCLUSION
In this paper, we proposed a cascade prediction method based on a graph attention recurrent neural network. The main contribution is that our model learns spatial and temporal features jointly, based on GAT and time-decay attention, respectively. Experiments on two public real-world datasets verify the effectiveness of our model on the cascade prediction task and analyze the contribution of its components. In the future, we plan to explore more attention mechanisms to further mine the structural information between cascades.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.