Burst and Memory-aware Transformer: capturing temporal heterogeneity

Burst patterns, characterized by their temporal heterogeneity, have been observed across a wide range of domains, encompassing event sequences from neuronal firing to various facets of human activities. Recent research on predicting event sequences leveraged a Transformer based on the Hawkes process, incorporating a self-attention mechanism to capture long-term temporal dependencies. To effectively handle bursty temporal patterns, we propose a Burst and Memory-aware Transformer (BMT) model, designed to explicitly address temporal heterogeneity. The BMT model embeds the burstiness and memory coefficient into the self-attention module, enhancing the learning process with insights derived from the bursty patterns. Furthermore, we employed a novel loss function designed to optimize the burstiness and memory coefficient values, as well as their corresponding discretized one-hot vectors, both individually and jointly. Numerical experiments conducted on diverse synthetic and real-world datasets demonstrated the outstanding performance of the BMT model in terms of accurately predicting event times and intensity functions compared to existing models and control groups. In particular, the BMT model exhibits remarkable performance for temporally heterogeneous data, such as those with power-law inter-event time distributions. Our findings suggest that the incorporation of burst-related parameters assists the Transformer in comprehending heterogeneous event sequences, leading to an enhanced predictive performance.


Introduction
Temporal heterogeneity is frequently referred to as burst within the context of complex systems.Numerous natural and social phenomena exhibit bursty temporal patterns such as single-neuron firing (Kemuriyama et al., 2010;Chan et al., 2016;Metzen et al., 2016;Zeldenrust et al., 2018), earthquakes (Corral, 2004;de Arcangelis et al., 2006), solar flares (Wheatland et al., 1998), and human activity (Barabasi, 2005;Karsai et al., 2018).The term temporal heterogeneity rigorously implies that the distribution of inter-event times, which is the time intervals between two consecutive events, exhibits a heavy-tailed distribution such as a power-law distribution.Moreover, when the system is generally temporally heterogeneous, it implies the presence of temporal correlations among inter-event times.For example, the inter-spike interval distribution display temporally heterogeneous patterns, which cannot be simply interpreted as a random or regular process.Numerous studies have addressed temporal correlations between bursty spikes using approaches such as the non-renewal process (Shahi et al., 2016), intensity functions with voltage-dependent terms Lee et al.
Figure 1 illustrates the distinction between temporally heterogeneous inter-event times and those that tend toward homogeneity.The event sequences in Figures 1A, D, F serve as examples of temporal heterogeneity with a power-law inter-event time distribution.The event sequences in Figures 1B, C, E, G present instances that exhibit a more homogeneous random characteristic with an exponential inter-event time distribution.Evidently, the bursty event sequence exhibits clustered events within burst trains, in contrast to the non-burst sequence.Such uneven event occurrences can affect the prediction of event sequences.Without properly accounting for the complicated correlation structure and heterogeneity therein, naive models may struggle to effectively discern hidden patterns.
Event sequence data encompass the temporal occurrences of events spanning various domains, ranging from natural phenomena to social activities.Unlike time series data, event sequence data are defined by sequentially ordered timestamps that signify the timing of individual event occurrences.Numerous studies have focused on predicting the timing of subsequent events have been conducted using temporal point processes (TPPs) (Daley and Vere-Jones, 2008).One of the most widely employed TPP is the Hawkes process (Hawkes, 1971).This process embodies a self-exciting mechanism, wherein preceding events stimulate the occurrence of subsequent events.In contrast to the Hawkes process, the self-correcting process provides a feasible method for establishing regular point patterns (Isham and Westcott, 1979).
The Poisson point process can be employed to generate entirely random and memory-less events (Kingman, 1992).In the FIGURE (A, D, F) Heterogeneous event sequences with a power-law inter-event time distribution.These event sequences exhibit a high burstiness parameter with significant temporal heterogeneity.(B, C, E, G) Event sequences with an exponential inter-event time distribution.These event sequences have burstiness parameters close to and memory coe cients clustered around .
Poisson process, the inter-event time (IET) follows an exponential distribution.The Cox process is a generalized Poisson process in which the intensity function varies with the stochastic process (Cox, 1955); thus, it is also referred to as a doubly stochastic Poisson process.Cox processes are frequently employed to model and predict the arrival of insurance claims, enabling insurers to assess risk and manage resources effectively (Rolski et al., 2009).If the intensity function is not entirely random, as in the Cox process, but given as a deterministic time-varying function, it is referred to as an inhomogeneous Poisson process.
Our research was primarily motivated by the idea that incorporating temporal heterogeneous characteristics into event sequence predictions yields a superior performance in forecasting events.We propose a Burst and Memory-aware Transformer (BMT) model, signifying its capability to train the Transformer by leveraging insights derived from burstiness and memory coefficient, both of which are associated with temporal heterogeneity.Notably, these two metrics were incorporated as embedding inputs for the Transformer architecture.Moreover, a loss function related to these metrics was formulated and employed, thereby enabling the model to naturally capture temporal heterogeneity.The overall schematic diagram of the BMT model is depicted in Figure 2.
The main contributions of this paper is summarized as follows: • The BMT model was developed to integrate insights from the complex systems theory into the Transformer-based temporal point process model, enhancing the capability to incorporate temporal heterogeneity.This study offers a preliminary approach to connect these two distinct disciplines.• The BMT model surpasses state-of-the-art models by effectively integrating burstiness and memory coefficient into both the embedding procedure and associated loss functions.This is confirmed through extensive numerical experiments across a range of scenarios, including those with and without burstiness and memory coefficient embedding and related loss functions, using real-world datasets and synthetic datasets generated via a copula-based algorithm.• Our investigation revealed that the BMT model offers particular advantages when dealing with temporally heterogeneous data, such as datasets characterized by a power-law inter-event time distribution, commonly observed in bursty event sequences.• Our research indicates that excluding either burstiness and memory coefficient embedding or their corresponding loss functions leads to a noticeable reduction in performance.This emphasizes the imperative nature of integrating both elements to achieve optimal performance.
In cases where the inter-event time distribution of the target event sequence exhibits a heavy-tailed distribution, such as a power-law distribution, or where the values of burstiness and memory coefficient significantly deviate from zero, the BMT model ensures superior performance compared to basic Transformerbased models.
The structure of the paper is outlined as follows: Section 2 introduces the background pertaining to the temporal point process, temporal heterogeneity, and generating method for synthetic datasets; Section 3 introduces our Burst and Memory-aware Transformer model; Section 4 presents numerical experiments on synthetic and real-world datasets; Section 5 presents the performance evaluation results; and Section 6 presents the conclusion.

Background . Temporal point process
A temporal point process (TPP) is a stochastic process involving the occurrence of multiple events as time progresses.The foundational data employed to construct the TPP model consists of event sequence data, encompassing event times {t i } n i=1 along with optional marks {κ i } n i=1 .For example, spike train sequences of neurons are composed of timings of occurrences, along with action potential as associated marks.
In this study, we examine the unmarked case to specifically investigate the effects of burst and memory phenomena, while excluding the influence of correlations with marks that do not align with the research direction.For the prediction of the marked TPP model, one approach involves the independent modeling of the target's marks by thresholding.Alternatively, based on contextual analysis (Jo et al., 2013), interactions with multiple neighbors within an egocentric network can be considered as marks and subsequently modeled.
TPP encompasses the modeling of the conditional intensity function λ(t|H t ) given the history of event times H t ≡ (t 1 , ...t n ).The notation for the history of event times, H t will be omitted for convenience.The intensity function characterizes the instantaneous event rate at any given time by considering past event occurrences.The probability density function P(t) and cumulative distribution function F(t) can be derived based on the intensity function, as follows (Rasmussen, 2018): (1) .

. Hawkes process
The Hawkes process, also known as the self-exciting point process, is for a situation where a preceding event excites the occurrence of a subsequent event (Hawkes, 1971).The intensity function λ(t) of the Hawkes process is defined as where the base intensity ζ and η are positive parameters.When a new event occurs during this process, the intensity increases with η and immediately decays exponentially.The probability of the next event occurring is highest immediately following the incidence of the previous event, and it gradually decreases as time elapses.As a result, this process causes events to cluster together.This includes events that happen quickly in a short time and then long times when nothing happens.The generalized Hawkes process is defined as follows: where ζ ≥ 0, η > 0, and γ (t) is a density on (0, ∞).

. . Self-correcting process
In contrast to the Hawkes process, the self-correcting process generates regular inter-event time sequences with randomness (Isham and Westcott, 1979).The intensity function λ(t) for the self-correcting process is defined as follows: where ζ and η are positive parameters.

. . Neural Hawkes process
A limitation of the Hawkes process is that the preceding event cannot inhibit the occurrence of a subsequent event.To overcome these limitations, the neural Hawkes process, which considers the nonlinear relationship with past events using recurrent neural networks, was introduced (Mei and Eisner, 2017).The intensity function λ(t) for the neural Hawkes process is defined as follows: where f (x) = β log 1 + exp(x/β) is the softplus function with parameter β which guarantees a positive intensity, and h(t)s are hidden representations of the event sequence from a continuous-time LSTM model.Here, the intensity we refer to is not the marked intensity λ k ; Instead, our focus is on the inherent temporal heterogeneity structure, excluding any interference from correlations between event types and times.

. . Transformer-based Hawkes process
The Transformer is a deep learning architecture for sequence processing such as natural language processing, with a multihead self-attention module that captures long-range dependencies within sequences (Vaswani et al., 2017).The Transformer is used not only in language models but also in computer vision, audio processing, and time series forecasting (Lim et al., 2021;Wen et al., 2022;Ma et al., 2023).Recently, the Transformer architecture has also been applied to modeling temporal point processes.The Transformer Hawkes Process (Zuo et al., 2020) and the Self-Attentive Hawkes Process (Zhang et al., 2020) were introduced to model the Hawkes process with a self-attention mechanism to capture the long-range correlations underlying both event times and types.
THP and SAHP differ in two aspects: their use of positional encoding and the form of the intensity function.SAHP employs time-shifted positional encoding to address the limitations of conventional methods, which solely account for the sequence order and neglect inter-event times.The intensity function of the THP model is the softplus function of the weighted sum of three terms: ratio of elapsed time from the previous event, hidden representation vector from the encoder, and base.Conversely, the intensity function of the SAHP model is formulated as a softplus of the Hawkes process terms, each of which is derived from the scalar transformation and nonlinear activation function applied to the hidden representation vector from the encoder.
For both the THP and SAHP models, across synthetic and real-world datasets, their performances in event type prediction and event time prediction surpassed that of the baseline model: Hawkes Process as described in Equation (3), Fully Neural Network model (Omi et al., 2019), Log-normal Mixture model (Shchur et al., 2019), Time Series Event Sequence (TSES) (Xiao et al., 2017), Recurrent Marked Temporal Point Processes (Du et al., 2016), and Continuous Time LSTM (Mei and Eisner, 2017).Given the superior performance of THP over the remaining baseline models, this study refrains from direct performance comparison with the SAHP and baseline models (Zuo et al., 2020), opting to concentrate exclusively on performance comparison with the THP model.

. Temporal heterogeneity
Temporal heterogeneity or burst is characterized by various metrics.The most fundamental quantity is the probability density function of the inter-event times.The inter-event time is defined as the time interval between two consecutive events, that is, τ i ≡ t i+1 − t i , where t i is i-th event time of the event sequence.
When the inter-event time distribution is heavy-tailed, the corresponding event sequence exhibits temporal heterogeneity.Specifically, the power-law inter-event distribution found in diverse natural and social phenomena is as follows: where a is a constant and α is a power-law exponent.

. . Burstiness parameter
Several metrics characterize the properties of temporal heterogeneity.Burstiness B measures the burst phenomenon (Goh and Barabási, 2008), and is defined as follows: where r ≡ σ/ τ is the coefficient of variation (CV) of the interevent time and σ and τ is the standard deviation and average of τ s, respectively.Here, B = −1 for regular event sequences, B = 0 for Poissonian random cases, and B = 1 for extremely bursty cases.
When the number of events is sufficiently small, the burstiness parameter causes errors.The fixed burstiness parameter considering the finite-size effect is as follows (Kim and Jo, 2016): We employed the fixed burstiness parameter (9) to handle short-length event sequences throughout this study.

. . Memory coe cient
The memory coefficient M quantifies the correlations between consecutive inter-event times within a sequence consisting of n inter-event times, that is, {τ i } i=1,...,n , as follows (Goh and Barabási, 2008): where τ 1 ( τ 2 ) and σ 1 (σ 2 ) are the average and standard deviation of the inter-event times τ 1 , τ 2 , ..., τ n−1 (τ 2 , τ 3 , ..., τ n ), respectively.This is the Pearson correlation coefficient between consecutive inter-event times.Here, M = 0 indicates no correlation, and M > 0 indicates a positive correlation, which means that a large inter-event time follows after a large inter-event time and vice versa for small inter-event time.M < 0 indicates a negative correlation, which means small inter-event time follows after the large inter-event time and vice versa for a large inter-event time. .

. Applications of B and M to BMT model
When plotting M on the x-axis and B on the y-axis for datasets with various inter-event time distributions, it can be observed that event sequences with similar inter-event time distributions tend to cluster at similar positions (Goh and Barabási, 2008).Essentially, if the ranges of B and M values are known, a rough estimate of the inter-event time distribution can be anticipated.Building on this insight, we devised a BMT model to facilitate learning by designing a method in which the values of B and M were combined and fed into the encoder as inputs.Specifically, when the values of B and M exhibit temporal heterogeneity in their ranges, the encoder of the Transformer can produce inter-event time prediction values with a heavy-tailed inter-event time distribution.
Moreover, B and M are not independent: they are intertwined and move in conjunction.For instance, even when attempting to alter only M by shuffling the inter-event times, B can also change.This serves as evidence that embedding both B and M concurrently yields superior performance compared with embedding either one of them individually.
. Copula-based algorithm for generating sequence of inter-event times To comprehend the impact of burstiness and memory coefficient on the model, we generated synthetic datasets using a copula-based algorithm (Jo et al., 2019).The content of the copula-based algorithm in this study was obtained from Jo et al. (2019).For convenience, we provide a brief overview of the relevant content.The copula-based algorithm models the joint probability distribution of two consecutive inter-event times, that is, P(τ i , τ i+1 ), by adopting the Farlie-Gumbel-Morgenstern (FGM) copula (Nelsen, 2006).The joint probability distribution according to the FGM copula is formulated as follows: where Parameter r is used to control the correlation between τ i and τ i+1 and is in the range of −1 ≤ r ≤ 1. F(τ ) is the cumulative distribution function (CDF) of P(τ ).After applying the transformation method (Clauset and Shalizi, 2009), the next inter-event time τ i+1 can be obtained as Jo et al. (2019) where F −1 is the inverse of F(τ ), c i ≡ rf (τ i ), and x is a random number sampled from a uniform distribution within interval 0 ≤ x < 1.The copula-based algorithm has the advantage of generating event sequences with independent control of the inter-event time distribution and memory coefficient.

Burst and Memory-aware Transformer . Discretization of B and M
Given that the burstiness parameter and memory coefficient are real numbers within the range of [−1, 1], it is necessary to discretize them for embedding within the Transformer.We adopt the uniform discretization transform; the range [−1, 1] is divided into segments of fixed length by the number of bins b, respectively, and subsequently mapped to a single natural number.The continuous values of the burstiness parameter B and memory coefficient M are discretized into natural numbers d B and d M , respectively.For example, when the number of bins is b = 4, then When the number of discretization bins is b, the number of d B,M is b 2 , corresponding to the vocabulary size of the Transformer.Then, we can obtain the one-hot vector of the discretized . Embedding event times, and B and M The event sequence S = {t i } n i=1 of n events and discretized and one-hot B & M, d B,M are fed into the self-attention module after proper encoding.First, the event times are transformed using the positional encoding method (Vaswani et al., 2017) to embed the temporal order information into an event sequence.The j-th element of sinusoidal positional encoding for the i-th event time t i is calculated as: where ω k = 1/10, 000 2k/d , the embedding index k is the quotient when dividing j by 2, and z t (t i ) ∈ R d , where d is the encoding dimension.By multiplying ω k with the event time t i , it is converted into an angle, which is then mapped to sine and cosine functions, providing different positional information for each event time.
For the given event times {t i } n i=1 , the inter-event times are τ i ≡ t i+1 − t i for i = 1, ..., n − 1.The burstiness parameter (9) and memory coefficient (10) were calculated for all partial sequences.This essentially implies that the input to the encoder is fed sequentially from t 1 , ..., t i ,..., t n , and for each of these instances, the B & M embedding incorporates the calculated B and M values up to t 1 (i.e., B 1 and M 1 ), ..., up to t i (i.e., B i and M i ), ..., and up to t n (B n = B and M n = M for the entire sequence).Note that, during the actual operation of the Transformer, computations are performed in parallel; thus, the sliding B & M embedding vectors form a lower triangular matrix.
The B & M embedding vector z e (B i , M i ) for the one-hot vector of the discretized B i and M i , d B i ,M i , is calculated using a linear embedding layer as follows: where W E ∈ R d×b 2 denotes an embedding matrix.Then for the i-th event, the event time embedding vector z t (t i ) ∈ R d and the B & M embedding vector z e (B i , M i ) ∈ R d are summed together to acquire the hidden representation of the i-th event z i ∈ R d as: Then the embedding matrix for a whole single event sequence is given by: where Z ∈ R n×d and n is the length of the event sequence, that is, the number of events in a single sequence. .

Self-attention module
After acquiring the embedding matrix Z for each event sequence according to Equation ( 18), we propagated Z into the input of the self-attention module.The resulting attention output S is defined as follows: where Here, Q, K, and V represent the query, key, and value matrices, respectively, obtained by applying distinct transformations to Z.
The transformation parameters are and W V ∈ R d×d V , respectively.In contrast to conventional RNN models, the self-attention mechanism enables an equitable comparison of not only recent values but also the significance of distant past values of the sequence.Consequently, this facilitates the learning of long-term dependencies.
The BMT model employs multi-head attention, similar to other Transformers.Multi-head attention enables the model to manage diverse patterns and contexts of the input sequence.The multihead attention output S is given by S = [S 1 , ..., S i , ..., S m ]W O , where S i ∈ R n×d V /m is the attention output for the i-th multi-head and After the multi-head attention, the resulting attention output S is subsequently passed into a position-wise feed-forward network, yielding hidden representations h(t) for the event sequence as: where , and b 2 ∈ R d are the parameters of each neural network.The i-th event of the event sequence corresponds to the i-th row of the hidden representation matrix H, that is, h(t i ) = H(i, :).Furthermore, masks are employed to prevent the model from learning about the future in advance.
The hidden representation H ∈ R n×d encapsulates insights regarding burstiness and memory coefficient for each event within the sequence, acquired through the self-attention mechanism.We further enhanced the incorporation of sequential information by applying LSTM to the hidden representation. .

Training and loss function
The BMT model employs five types of loss functions: (1) squared error of the event time, (2) event log-likelihood loss as described in Equation ( 22), (3) cross entropy of discretized B & M, (4) squared error of B, and (5) squared error of M.

. . Event time loss
The most crucial loss function within the model is how accurately it predicts the next event times.The next event time prediction is ti+1 = W t h(t i ), where W t ∈ R 1×d is the parameter of the event time predictor.To address this, the squared error loss function of the event times for the event sequence is defined as: where ti is the predicted event time. .

. Event log-likelihood
The typical approach for optimizing the parameters of the Hawkes process involves utilizing Maximum Likelihood Estimation (MLE).There are two constraints: (1) no events before time 0, and (2) unobserved event time must appear after the observed time interval.When the observed event sequences are t 1 , ..., t n ∈ [0, T), then likelihood of an event sequence is given by L ′ = P(t 1 ) • • • P(t n−1 )(1−F(T)), where F(•) is the cumulative distribution function, and the last term is for the second constraint.Using (1) and ( 2), and applying the logarithm function, we obtain the following log-likelihood: The first term denotes the sum of the log-intensity functions for the past n events, and the second term represents the non-event log-likelihood.
Here, the intensity function λ(t) is defined in the interval t ∈ [t i , t i+1 ] according to the following expression: where β is the softness parameter, w λ ∈ R d×1 is a parameter that converts the term inside the exponential function into a scalar, and h is the hidden representation derived from the encoder.The essence of this intensity function aligns with that of the Neural Hawkes Process, as shown in Equation ( 6).The softplus function formulation was employed to guarantee non-negativity of the intensity.

. . Discretized B and M loss
The model predicts the discretized Bi & Mi , dB i ,M i , based on the hidden representations h(t i−1 ) as: where W B,M ∈ R b 2 ×d is the parameter of the discretized B i & M i predictor, and pi (d ′ ) is the d ′ -th element of pi .To measure the accuracy of B i & M i embedding, the following cross-entropy between the ground truth discretized B i & M i , d B i ,M i , and the predicted discretized Bi & Mi , dB i ,M i , is calculated: where is the ground truth one-hot encoding vector.

. . B loss and M loss
Additionally, the model utilizes the squared errors of the burstiness parameter directly as: where B i and Bi are the ground truth and predicted burstiness parameters, respectively.The squared errors of the memory coefficient value can be defined in a similar manner.
where M i and Mi is ground truth and predicted memory coefficient, respectively. .

. Overall loss
By aggregating the aforementioned loss functions ( 21), ( 22), and ( 26)-( 28), the overall loss function of the model is defined as follows: where α 1 to α 4 are the hyperparameters that balance each loss function determined using the validation datasets.The overall framework of the BMT model is illustrated in Figure 3.

Experiments . Synthetic datasets
We generated synthetic data using the copula-based algorithm for two different inter-event time distributions.The model was tested for the exponential and power-law inter-event time distribution, which also have a different range of memory coefficient and burstiness, to directly understand the impact of temporal heterogeneity on the BMT model and other models.Along with the two synthetic datasets below, we tested the regular event sequences generated by the self-correcting process, as in Equation ( 5).The statistics of the datasets are displayed in Table 1.

. . Power-law inter-event time distribution
The power-law inter-event time distribution with a power-law exponent α is defined as , where θ (•) represents the Heaviside step function with a lower bound of 1.After substituting the inter-event time distribution into Equation ( 13), we obtain the next inter-event time τ i+1 from a given previous inter-event time τ i and random number x in 0 ≤ x < 1 as where ) (Jo et al., 2019).
A total of 1,000 sequences with a power-law inter-event time distribution were generated with different parameters according to Equation ( 30).The power-law exponent α, memory coefficient M, and the number of events for each event sequence are randomly and independently drawn from 2.1 ≤ α ≤ 2.9, −1/3 ≤ M ≤ 1/3, and 50 ≤ n ≤ 500, respectively.The initial inter-event time was randomly drawn from 1 to 2. Depending on the power-law exponent and memory coefficient, the burstiness ranged from 0.297 to 0.962.
As depicted in Figure 4, the power-law inter-event time datasets exhibit pronounced dispersion toward the region of larger burstiness and memory coefficients (B and M scatter plots).Moreover, these datasets show a power-law inter-event time distribution with exponent values α = 2.4 close to the average within the range of exponents 2.1 < α < 2.9.

. . Exponential inter-event time distribution
The exponential inter-event time distribution with mean µ is defined as P(τ ) = µ −1 e −τ/µ and the corresponding cumulative distribution function is F(τ ) = 1 − e −τ/µ and the relationship between the parameter and memory coefficient is r = 4M.After substituting the inter-event time distribution into Equation ( 13), we obtain the next inter-event time τ i+1 from a given previous inter-event time τ i and random number x in 0 ≤ x < 1 as follows: where (Jo et al., 2019).A total of 1,000 sequences with an exponential inter-event time distribution were generated using different parameters, according to Equation ( 31).The mean inter-event time µ, memory coefficient M, and the number of events n for each event sequence were randomly and independently drawn from 1 ≤ µ ≤ 100, −1/3 ≤ M ≤ 1/3, and 50 ≤ n ≤ 500, respectively.The initial inter-event time was set to µ for each event sequence.
As illustrated in Figure 4, the B and M scatter plots of the exponential inter-event time datasets show that B values are concentrated in the lower range, whereas M values exhibit a broader distribution spread both above and below.This contrasts with the self-correcting process datasets, where the B and M scatter plots show that both B and M clustered at ∼0.Although both datasets have an exponential inter-event time distribution, their heterogeneity differs owing to variations in the relationship between B and M.Even with an exponential interevent time distribution, appropriately shuffling inter-event times can generate event sequences with temporal heterogeneity (i.e., burst) characteristics.We examine this difference further later, as it plays a role in generating variations in performance.

. Real-world datasets
We adopted four real-world datasets to evaluate the models: the Retweets, StackOverflow, Financial Transaction, and 911 Calls  The dataset covers a five-year period, which is a relatively extensive time frame for prediction purposes.Therefore, we partitioned the data into monthly intervals.Additionally, to ensure statistical significance, we included only The dataset is available on https://www.kaggle.com/datasets/mchirico/montcoalert/.
Frontiers in Computational Neuroscience frontiersin.orgthose locations where the number of events exceeded 50 in the data.
Although there are other commonly used datasets, the burst and memory-aware characteristics assumed by the BMT model are applicable when the sequence length is sufficiently long.Furthermore, we sampled event sequences in quantities comparable to synthetic data while concurrently excluding sequences with short lengths.The time units for each dataset are as follows: Retweet and StackOverflow datasets are in days, Financial Transaction datasets are in milliseconds, and 911 Calls datasets are in minutes.The statistics of the datasets are displayed in Table 1.
As shown in Figure 5, when comparing the Retweets datasets (or Financial Transaction datasets) to the StackOverflow datasets (or 911 Calls datasets), it is evident that the Retweets datasets and Financial Transaction datasets are more temporally heterogeneous.In the B and M scatter plots, the Retweets datasets (or Financial Transaction datasets) are concentrated in regions with larger values for both B and M, whereas the StackOverflow datasets (or 911 Calls datasets) are centered around values near 0 for both B and M.However, when compared to the self-correcting process datasets, the StackOverflow (or 911 Calls datasets) datasets exhibit greater dispersion.Additionally, the inter-event time distribution reveals that the Retweets datasets and Financial Transaction datasets follow a power-law distribution (exponent of 1.36 and 1.70, respectively), whereas the StackOverflow datasets and 911 Calls datasets follow an exponential distribution.

. Impact of B and M embedding and losses
While altering the combination of loss functions during the experimental process, there were five control groups.
1. BMT-NoE&NoL (BMT without B & M embedding and without corresponding losses).The simplest scenario occurs when α 2 = α 3 = α 4 = 0, utilizing only time and event losses.In this case, only event time and intensity were considered.2. BMT-NoE&L (BMT without embedding for B & M, but with losses for either B or M).To incorporate the effects of the B & M losses, we also consider the case of α 2 = 0, α 3 > 0, and α 4 > 0 with time and event losses.Note that the case for α 2 > 0 relates to predicting the discretized on-hot B & M, and hence it is not applicable in this scenario.3. BMT-E&NoL (BMT without losses related to B & M, but with embedding for B & M).The control group examines the impact of loss for B & M; the representation vector remains consistent with the BMT model, as shown in Equation ( 18), but without L B,M , L B , and L M , that is, α 2 = α 3 = α 4 = 0. 4. BMT-B (BMT with B embedding only and the corresponding loss).In the case where only B is embedded and the model is trained, the loss is also computed exclusively based on B as α 2 = 0, α 3 > 0, and α 4 = 0 with time and event losses.5. BMT-M (BMT with M embedding only and corresponding loss).In the case where only M is embedded and the model is trained, the loss is also computed exclusively based on M as α 2 = α 3 = 0, and α 4 > 0 with time and event losses.

Results and discussion
We tested several hyperparameters for both the BMT and THP models and chose the configuration that yielded the best validation performance.The hyperparameters are as follows: the number of bins for discretization (b) is set to 40, mini-batch size is 16, dropout rate is 0.1, embedding dimensions (d and d H ) are both 128, selfattention dimensions (d K and d V ) are 32, with eight layers in the encoder and 8 heads.For the loss function, hyperparameters were fine-tuned, mainly as follows: α 1 = 1e3, α 2 = 4e3, α 3 = α 4 = 1e4.We employed the ADAM (adaptive moment estimation) optimizer with hyperparameters β set to (0.9, 0.999).Regarding the learning rate, we utilized PyTorch StepLR, initializing it at 1e-4 and reducing the learning rate by a factor of 0.9 every 15 steps.
The performance evaluation results for different models across diverse datasets are presented in Table 2.The results indicate that BMT achieves superior performance compared to THP and other control models in terms of the root mean squared error (RMSE) of the event times and log-likelihood.The main metric, RMSE, is a unit-adjusted value obtained by taking the square root of Equation ( 21).It measures how much predicted event times of the model differ from the actual event times.However, RMSE has a drawback, especially in the case of heterogeneous data, where it can perform well by accurately predicting large values while potentially struggling with smaller ones.To address this limitation, we introduce the event log-likelihood, defined in Equation ( 22), as a second metric.This metric arises when probabilistically modeling event sequences using the intensity function λ derived from Equation (1).A higher likelihood of the intensity function calculated with predicted event times of the model indicates that the model better probabilistically mimics the actual event sequence.Consequently, larger values of this metric correspond to better performance.Additionally, when considering the B and M losses in Equation ( 26), they represent how well the model captures discretized burstiness and discretized memory coefficients.Smaller values of these losses indicate better performance in replicating these aspects.
In particular, as the data became more heterogeneous, performance improvement became more pronounced.In synthetic datasets, the performance enhancement of the BMT model was greater for power-law inter-event time data than for self-correcting data, which is a less heterogeneous exponential inter-event time distribution (see Figure 4).Similarly, in real-world datasets, the overall performance of the BMT model was superior in the Retweets dataset, which exhibited a more power-law inter-event time distribution, compared to the StackOverflow datasets with a less heterogeneous exponential inter-event time distribution (see Figure 5).
When compared to the BMT-NoE&L model with respect to the RMSE of the event times, the BMT model shows that superior performance across all datasets except StackOverflow.This suggests that the inclusion of B & M embedding processes aids in augmenting the performance of the model by enabling the encoder to grasp the burst structure of event sequences.Compared with the BMT-E&NoL model, the BMT model demonstrates enhanced performance across all datasets, indicating that the integration of B & M losses into the overall loss function contributes to the   Summarizing the aforementioned findings, it is evident that both B & M embedding and B & M losses contribute to performance enhancement.Excluding either of these components would likely impede the attainment of a substantial performance improvement, comparable to that observed with the BMT model.If either of the B embedding or M embedding is omitted, a significant performance improvement comparable to that of the BMT model cannot be expected.This was substantiated by comparing the BMT model with the BMT-B and BMT-M models, which revealed the superior performance of the BMT model across all datasets.These results can also be observed in the training curves shown in Figure 6.
We also conducted experiments on mixed synthetic datasets, the results of which are presented in Table 3.The mixed synthetic datasets comprised a combination of three individual datasets: power-law, exponential, and self-correcting datasets.However, when separately examining the RMSE of event time and loglikelihood, the performance of the original BMT model appeared slightly inferior compared to some of the control BMT models, demonstrating an overall superior performance when considering both metrics together.
In summary, the BMT model demonstrates improved performance on heterogeneous data owing to its capability to capture heterogeneous characteristics through the embedding of B & M, combined with the inclusion of corresponding loss functions.
The BMT model has two limitations.First, in cases where the event sequence length is short, the incorporation of B and M into the BMT model may result in reduced effectiveness.This aspect originates from the statistical characteristics of B and M, because their meaningful representation is hindered by fluctuations and noise, particularly when the number of events is small.In the BMT model, during the calculation of sliding Frontiers in Computational Neuroscience frontiersin.orgThe bold value indicates the metric of the model with the best performance for each individual dataset.
B and M values, masking was applied to exclude the first three events.However, considering that temporal heterogeneity becomes a meaningful characteristic only when the length of the event sequence is sufficiently long, this limitation can be viewed as unavoidable.The second limitation is the inability to consider event types, which will be addressed in future studies.To account for event types, it is necessary to reflect the correlation structure between inter-event event types to generate synthetic data and subsequently test the model using these data.In the context of performance enhancement, the improvement of the BMT model over the THP model can also be attributed to the fact that the BMT model does not embed event types.This allows the model to focus more on predicting the event times.Because the BMT-NoE&NoL model is analogous to a version of the THP model that does not consider event types, comparing the performance of the BMT-NoE&NoL model with the BMT model would provide a more equitable assessment.However, upon comparing the BMT-NoE&NoL model with the BMT model, it becomes evident that the BMT model exhibits superior performance across all datasets, except for StackOverflow.

Conclusion
Our study addresses the challenges presented by bursty temporal patterns in event sequences across various domains.By leveraging recent advancements in predicting event sequences using Transformer models based on the Hawkes process with selfattention mechanisms, we introduced a Burst and Memory-aware Transformer (BMT) model.This model effectively captures the nuances of burst patterns by embedding burstiness and memory coefficient within its self-attention module.The incorporation of a specialized loss function tailored for burstiness and memory coefficient further refines the model's predictive capabilities.
Through comprehensive numerical experiments conducted on a diverse array of synthetic and real-world datasets encompassing various scenarios, we validated the outstanding performance of the BMT model by comparing it with the existing models and control groups.This is particularly evident in scenarios involving heterogeneous data, such as power-law inter-event time distributions.Hence, the explicit consideration of burstrelated parameters within the Transformer contributes to a deeper comprehension of complex event sequences, ultimately leading to an enhanced predictive performance.In future work, we will focus on integrating a multitude of insights from complex systems into the development of deep neural network models for temporal data.

FIGURE
FIGURE Schematic diagram of the Burst and Memory-aware Transformer model.Leveraging information from the preceding events, including burstiness B and memory coe cient M, the model predicts the timing of the next event through B & M embedding and the corresponding B & M loss.

FIGURE
FIGUREArchitecture of the Burst and Memory-aware Transformer model.IET; inter-event time; FF, feed-forward neural network; B, burstiness; M, memory coe cient.

FIGURE
FIGURERelationship between burstiness and memory coe cient (left) and inter-event time distribution (right) across three synthetic datasets: (A, B) power-law inter-event time, (C, D) exponential inter-event time, and (E, F) self-correcting process.For calculating the inter-event time distribution, logarithmic binning was employed.

FIGUREFrontiers
FIGURERelationship between burstiness and memory coe cient (left) and inter-event time distribution (right) for four real-world datasets: (A, B) Retweets, (C, D) StackOverflow, (E, F) Financial Transaction, and (G, H) Calls.For calculating the inter-event time distribution, logarithmic binning was employed.

FIGURE
FIGURETraining curves of RMSE for event times fitted on Financial Transaction datasets are presented for various BMT model scenarios: BMT-NoE&NoL, BMT-NoE&L, BMT-E&NoL, BMT-B, BMT-M, and the standard BMT model.
Then, one can obtain the discretized pairs of B and M as (d B , d M ), where d B and To map the pair into a unique natural number, the Cantor pairing function was employed.The Cantor pairing function maps discretized d B and d M into a unique natural number d B,M as d M are ranging from 1 to b.
TABLE Datasets statistics.
IET, inter-event time; B, burstiness; M, memory coefficient; S.D., standard deviation.Frontiers in Computational Neuroscience frontiersin.org TABLE Performance evaluation results across diverse datasets for di erent models.

TABLE (
improved performance of the model.Even in the prediction of one-hot discretized B & M, it can be observed that including B & M losses contributes to a reduction in cross entropy.No significant differences in performance were observed between the BMT-NoE&NoL and BMT-NoE&L models.This suggests that the incorporation of B & M losses is less significant in the absence of B & M embedding.
TABLE Performance evaluation results for the mixed synthetic datasets: power-law, exponential, and self-correcting datasets.