Financial Forecasting With α-RNNs: A Time Series Modeling Approach

The era of modern financial data modeling seeks machine learning techniques that are suitable for noisy, non-stationary big data. We demonstrate how a general class of exponentially smoothed recurrent neural networks (α-RNNs) is well suited to modeling dynamical systems arising in big data applications such as high frequency and algorithmic trading. Application of exponentially smoothed RNNs to minute-level Bitcoin prices and CME futures tick data highlights the efficacy of exponential smoothing for multi-step time series forecasting. Our α-RNNs are also compared with more complex, "black-box" architectures such as GRUs and LSTMs and shown to provide comparable performance, but with far fewer model parameters and less network complexity.


INTRODUCTION
Recurrent neural networks (RNNs) are the building blocks of modern sequential learning. RNNs use recurrent layers to capture non-linear temporal dependencies with a relatively small number of parameters (Graves, 2013). They learn temporal dynamics by mapping an input sequence to a hidden state sequence and outputs, via a recurrent layer and a feedforward layer.
There have been exhaustive empirical studies on the application of recurrent neural networks to prediction from financial time series data such as historical limit order book and price history (Borovykh et al., 2017; Dixon, 2018; Borovkova and Tsiamas, 2019; Chen and Ge, 2019; Mäkinen et al., 2019; Sirignano and Cont, 2019). Sirignano and Cont (2019) find evidence that stacking networks leads to superior performance on intra-day stock data combined with technical indicators, whereas Bao et al. (2017) combine wavelet transforms and stacked autoencoders with LSTMs on OHLC bars and technical indicators. Borovykh et al. (2017) find evidence that dilated convolutional networks out-perform LSTMs on various indices. Dixon (2018) demonstrates that RNNs outperform feed-forward networks with lagged features on limit order book data.
There appears to be a chasm between the statistical modeling literature (see, e.g., Box and Jenkins 1976; Kirchgässner and Wolters 2007; Hamilton 1994) and the machine learning literature (see, e.g., Hochreiter and Schmidhuber 1997; Pascanu et al. 2012; Bayer 2015). One of the main contributions of this paper is to demonstrate how RNNs, and specifically a class of novel exponentially smoothed RNNs (α-RNNs) proposed in Dixon (2021), can be used in a financial time series modeling framework. In this framework, we rely on statistical diagnostics in combination with cross-validation to identify the best choice of architecture. These statistical tests characterize stationarity and memory cut-off length, and provide insight into whether the data is suitable for longer-term forecasting and whether the model must be non-stationary.
In contrast to state-of-the-art RNNs such as LSTMs and Gated Recurrent Units (GRUs) (Chung et al., 2014), which were designed primarily for speech transcription, the proposed class of α-RNNs is designed for time series forecasting using numeric data. α-RNNs not only alleviate the gradient problem but are designed to i) require fewer parameters, fewer recurrent units, and considerably fewer samples to attain the same prediction accuracy 1; ii) support both stationary and non-stationary time series 2; and iii) be mathematically accessible and characterized in terms of well known concepts in classical time series modeling, rather than appealing to logic and circuit diagrams.
As a result, through simple analysis of the time series properties of α-RNNs, we show how the value of the smoothing parameter, α, directly characterizes the model's dynamical behavior, yielding a model that is more intuitive for time series modeling than GRUs and LSTMs while performing comparably. We argue that for time series modeling problems in finance, some of the more complicated components present in GRUs and LSTMs but absent in α-RNNs, such as reset gates and cell memory, may be redundant for our data. We exploit these properties in two ways: i) we use a statistical test for stationarity to determine whether to deploy a static or dynamic α-RNN model; and ii) we reduce the training time and the memory required to store the model, and in general expect α-RNNs to be more accurate for shorter time series, as they require less training data and are less prone to over-fitting. The latter is a point of practicality, as many applications in finance are not necessarily big data problems, and the restricted amount of data favors an architecture with fewer parameters to avoid over-fitting.
The remainder of this paper is outlined as follows. Section 2 introduces the static α-RNN. Section 3 bridges the time series modeling approach with RNNs to provide insight on the network properties. Section 4 introduces a dynamic version of the model and illustrates the dynamical behavior of α. Details of the training, implementation and experiments using financial data together with the results are presented in Section 5. Finally, Section 6 concludes with directions for future research.

α-RNNS
Given auto-correlated observations of covariates or predictors, x_t, and continuous responses, y_t, at times t = 1, ..., N, in the time series data D := {(x_t, y_t)}_{t=1}^N, our goal is to construct an m-step (m > 0) ahead time series predictor, ŷ_{t+m} = F(x̲_t), of an observed target, y_{t+m} ∈ R^n, from a p-length input sequence x̲_t = [x_{t−p+1}, ..., x_t], where x_{t−j} := L^j[x_t] is the j-th lagged observation of x_t ∈ R^d, for j = 0, ..., p − 1, and u_t is the homoscedastic model error at time t. We introduce the α-RNN model (as shown in Figure 1):

ŷ_{t+m} = W_y h_t + b_y,

where the input weight matrix W_h ∈ R^{H×d}, the recurrence weight matrix U_h ∈ R^{H×H}, the output weight matrix W_y ∈ R^{n×H}, and H is the number of hidden units. The hidden and output bias vectors are given by b_h and b_y. For each index in a sequence, s = t − p + 2, ..., t, forward passes repeatedly update a hidden internal state h_s ∈ R^H, using the following model:

h_s = σ(U_h h̃_{s−1} + W_h x_s + b_h),
h̃_s = α h_s + (1 − α) h̃_{s−1},

where σ(·) := tanh(·) is the activation function and h̃_s ∈ R^H is an exponentially smoothed version of the hidden state h_s, with the starting condition in each sequence h_{t−p+1} = σ(W_h x_{t−p+1}).
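The forward pass above can be sketched as a short NumPy function. This is an illustrative reconstruction under our own parameter naming, not the authors' reference implementation; with α = 1 it collapses to a plain RNN, which we can use as a sanity check.

```python
import numpy as np

def alpha_rnn_forward(x_seq, W_h, U_h, b_h, W_y, b_y, alpha):
    """Forward pass of a static alpha-RNN over one input sequence.

    x_seq : (p, d) array of lagged inputs [x_{t-p+1}, ..., x_t]
    alpha : scalar smoothing parameter in [0, 1]
    Returns the forecast y_hat = W_y h_t + b_y.
    """
    # Starting condition: h_{t-p+1} = tanh(W_h x_{t-p+1})
    h = np.tanh(W_h @ x_seq[0])
    h_tilde = h  # initialise the smoothed state with the first hidden state
    for s in range(1, len(x_seq)):
        # Hidden state update uses the *smoothed* state from the previous step
        h = np.tanh(U_h @ h_tilde + W_h @ x_seq[s] + b_h)
        # Exponential smoothing of the hidden state
        h_tilde = alpha * h + (1.0 - alpha) * h_tilde
    return W_y @ h + b_y
```

Setting `alpha=1.0` makes `h_tilde` track `h` exactly, recovering the plain-RNN recursion.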

UNIVARIATE TIMES SERIES MODELING WITH ENDOGENOUS FEATURES
This section bridges the time series modeling literature (Box and Jenkins, 1976;Kirchgässner and Wolters, 2007;Li and Zhu, 2020) and the machine learning literature. More precisely, we show the conditions under which plain RNNs are identical to autoregressive time series models and thus how RNNs generalize autoregressive models. Then we build on this result by applying time series analysis to characterize the behavior of static α-RNNs.
We shall assume here for ease of exposition that the time series data is univariate and the predictor is endogenous 3, so that the data is D := {y_t}_{t=1}^N. We find it instructive to show that plain RNNs are non-linear AR(p) models. For ease of exposition, consider the simplest case of a RNN with one hidden unit, H = 1. Without loss of generality, we set U_h = W_h = ϕ, W_y = 1, b_h = 0 and b_y = μ. Under backward substitution, a plain RNN, F_{W,b}(x̲_t), with sequence length p is a non-linear auto-regressive model, NAR(p), of order p:

ŷ_{t+1} = μ + σ(ϕ y_t + ϕ σ(ϕ y_{t−1} + ⋯ + ϕ σ(ϕ y_{t−p+1}))).

When the activation is the identity function, σ := Id, we recover the AR(p) model

ŷ_{t+1} = μ + Σ_{j=0}^{p−1} ϕ^{j+1} y_{t−j},

with geometrically decaying autoregressive coefficients when |ϕ| < 1. The α-RNN(p) is almost identical to a plain RNN, but with an additional scalar smoothing parameter, α, which provides the recurrent network with "long memory" 4. To see this, let us consider a one-step ahead univariate α-RNN(p) in which the smoothing parameter is fixed and H = 1.
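The equivalence between the unactivated plain-RNN recursion and the AR(p) form with geometrically decaying coefficients can be checked numerically. The values of ϕ, μ and the toy series below are arbitrary choices for illustration.

```python
import numpy as np

# With identity activation and shared weight phi, backward substitution of the
# plain-RNN recursion h_s = phi*y_s + phi*h_{s-1}, with h_{t-p+1} = phi*y_{t-p+1},
# yields an AR(p) model with geometrically decaying coefficients phi^{j+1}.
phi, mu, p = 0.8, 0.1, 5
y = np.array([0.3, -0.2, 0.5, 0.1, -0.4])  # y_{t-p+1}, ..., y_t

# Unactivated RNN recursion under backward substitution
h = phi * y[0]
for s in range(1, p):
    h = phi * y[s] + phi * h
rnn_forecast = mu + h

# Direct AR(p) form: mu + sum_j phi^{j+1} y_{t-j}
ar_forecast = mu + sum(phi ** (j + 1) * y[p - 1 - j] for j in range(p))

assert np.isclose(rnn_forecast, ar_forecast)
```

The coefficients ϕ, ϕ², ..., ϕ^p shrink geometrically, so distant lags contribute progressively less when |ϕ| < 1.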
This model augments the plain RNN by replacing h_{s−1} in the hidden layer with an exponentially smoothed hidden state h̃_{s−1}. The effect of the smoothing is to provide infinite memory when α ≠ 1. For the special case when α = 1, we recover the plain RNN with short memory of length p ≪ N.
We can easily verify this informally by simplifying the parameterization and considering the unactivated case. Setting b_y = b_h = 0, U_h = W_h = ϕ ∈ R and W_y = 1,

ŷ_{t+1} = h_t = ϕ y_t + ϕ h̃_{t−1},
h̃_{t−1} = α h_{t−1} + (1 − α) h̃_{t−2},

with the starting condition in each sequence h_{t−p+1} = ϕ y_{t−p+1}. Without loss of generality, consider p = 2 lags in the model, so that h_{t−1} = ϕ y_{t−1}. Then

h̃_{t−1} = αϕ y_{t−1} + (1 − α) h̃_{t−2},

and the model can be written in the simpler form

ŷ_{t+1} = ϕ_1 y_t + ϕ_2 y_{t−1} + ϕ(1 − α) h̃_{t−2},

with auto-regressive weights ϕ_1 := ϕ and ϕ_2 := αϕ². We now see that there is a third term on the RHS of Eq. 8 which vanishes when α = 1 but provides infinite memory to the model, since h̃_{t−2} depends on y_1, the first observation in the whole time series, not just the first observation in the sequence. To see this, we unroll the recursion relation in the exponential smoother:

h̃_t = α Σ_{s=0}^{t−2} (1 − α)^s h_{t−s} + (1 − α)^{t−1} h̃_1,

where we used the property that h̃_1 = y_1. It is often convenient to characterize exponential smoothing by the half-life 5. To gain further insight into the memory of the network, Dixon (2021) studies the partial auto-correlations of the process y_{t+m} + u_t to characterize the memory and derives various properties and constraints needed for network stability and sequence length selection.
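The unrolled smoother and the half-life formula from footnote 5 can both be verified numerically. The value of α and the sample values below are arbitrary illustrations.

```python
import numpy as np

# Unrolling h~_t = alpha*h_t + (1-alpha)*h~_{t-1}, with h~_1 = h_1, shows that
# each past hidden state enters with weight alpha*(1-alpha)^s, so the memory
# is infinite for 0 < alpha < 1.  The half-life is s = -1/log2(1-alpha).
alpha = 0.3
h = np.array([1.0, 0.4, -0.2, 0.7, 0.1])  # h_1, ..., h_t

# Recursive smoothing
h_tilde = h[0]
for v in h[1:]:
    h_tilde = alpha * v + (1.0 - alpha) * h_tilde

# Unrolled weighted sum: alpha*(1-alpha)^s weights, plus the initial-state term
t = len(h)
unrolled = sum(alpha * (1 - alpha) ** s * h[t - 1 - s] for s in range(t - 1))
unrolled += (1 - alpha) ** (t - 1) * h[0]
assert np.isclose(h_tilde, unrolled)

# Half-life: number of lags until the weight (1-alpha)^s decays to one half
half_life = -1.0 / np.log2(1.0 - alpha)
assert np.isclose((1 - alpha) ** half_life, 0.5)
```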
FIGURE 1 | An illustrative example of an α-RNN with an alternating hidden recurrent layer (blue nodes) and a smoothing layer (white block), "unfolded" over a sequence of six time steps. Each lagged feature x_{t−i} in the sequence x̲_t is denoted by a yellow node. The hidden recurrent layer contains H units (blue nodes) and the i-th output, after smoothing, at time step t is denoted by h̃_t^{(i)}. At the last time step t, the hidden units connect to a single unactivated output unit to give ŷ_{t+m} (red node).
4 Long memory refers to autoregressive memory beyond the sequence length. This is also sometimes referred to as "stateful". For the avoidance of doubt, we are not suggesting that the α-RNN has an additional cellular memory, as in LSTMs. 5 The half-life is the number of lags, s, needed for the coefficient (1 − α)^s to equal one half, which is s = −1/log₂(1 − α).

MULTIVARIATE DYNAMIC α-RNNS
We now return to the more general multivariate setting of Section 2. The extension of RNNs to dynamical time series models, suitable for non-stationary time series data, relies on dynamic exponential smoothing. This is a time-dependent, convex combination of the smoothed output, h̃_{t−1}, and the hidden state h_t:

h̃_t = α_t ∘ h_t + (1 − α_t) ∘ h̃_{t−1},

where ∘ denotes the Hadamard product between vectors and α_t ∈ [0, 1]^H denotes the dynamic smoothing factor, which can be equivalently written in the one-step-ahead forecast error correction form

h̃_t = h̃_{t−1} + α_t ∘ (h_t − h̃_{t−1}).

Hence the smoothing can be viewed as a dynamic form of latent forecast error correction. When (α_t)_i = 0, the i-th component of the latent forecast error is ignored and the smoothing merely repeats the i-th component of the previous smoothed state, (h̃_{t−1})_i, which excludes the i-th component of the current hidden state, (h_t)_i, from the memory. When (α_t)_i = 1, the latent forecast error overwrites the i-th component of the current smoothed state. The smoothing can also be viewed as a weighted sum of the lagged hidden states, with lower or equal weights, plus an initial-state term g(α) := ∘_{r=0}^{t−1} (1 − α_{t−r}) ∘ h̃_1. Note that for any (α_{t−r+1})_i = 1, the i-th component of the smoothed hidden state (h̃_{t+1})_i will have no dependency on the lagged i-th components of the hidden states {(h_{t−s})_i}_{s≥r}. The model simply forgets the i-th component of the hidden states at or beyond the r-th lag.

Neural Network Exponential Smoothing
While the class of α_t-RNN models under consideration is free to define how α_t is updated (including changing the frequency of the update) based on the hidden state and input, a convenient choice is to use a recurrent layer. Remaining in the more general setup with a hidden state vector h_t ∈ R^H, let us model the smoothing as

h̃_t = α_t ∘ h_t + (1 − α_t) ∘ h̃_{t−1}. (12)

This smoothing is a vectorized form of the classical setting above, only here we note that when (α_t)_i = 1, the i-th component of the hidden variable is unmodified and the past filtered hidden variable is forgotten. On the other hand, when (α_t)_i = 0, the i-th component of the hidden variable is obsolete, and the current filtered hidden variable is instead set to its past value. The smoothing in Eq. 12 can then be viewed as updating long-term memory, maintaining a smoothed hidden state variable as the memory through a convex combination of the current hidden variable and the previous smoothed hidden variable.
The hidden variable is given by the semi-affine transformation

h_t = σ(U_h h̃_{t−1} + W_h x_t + b_h), (13)

which in turn depends on the previous smoothed hidden variable. Substituting Eq. 13 into Eq. 12 gives a function of h̃_{t−1} and x_t:

h̃_t = α_t ∘ σ(U_h h̃_{t−1} + W_h x_t + b_h) + (1 − α_t) ∘ h̃_{t−1}.

We see that when (α_t)_i = 0, the i-th component of the smoothed hidden variable (h̃_t)_i is not updated by the input x_t. Conversely, when (α_t)_i = 1, the i-th hidden variable locally behaves like a non-linear autoregressive series. Thus the smoothing parameter can be viewed as the sensitivity of the smoothed hidden state to the input x_t.
The challenge becomes how to determine dynamically how much error correction is needed. As in GRUs and LSTMs, we can address this problem by learning α_t = F_{(W_α, U_α, b_α)}(x̲_t) from the input variables, with the recurrent layer parameterized by the weights and biases (W_α, U_α, b_α). The one-step ahead forecast of the smoothed hidden state, h̃_t, is then the filtered output of another plain RNN with weights and biases (W_h, U_h, b_h).
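One step of the resulting dynamic α_t-RNN can be sketched as follows. This is an illustrative reconstruction with our own parameter names; the sigmoid squashing of α_t into [0, 1]^H is an assumption consistent with gated architectures, not a detail stated above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def alpha_t_rnn_step(x_t, h_tilde_prev, params):
    """One step of a dynamic alpha_t-RNN (a sketch, parameter names are ours).

    The smoothing vector alpha_t in [0,1]^H is produced by its own recurrent
    layer with weights (W_a, U_a, b_a), then used to error-correct the
    smoothed hidden state.
    """
    W_h, U_h, b_h, W_a, U_a, b_a = params
    # Recurrent layer for the smoothing factor, squashed into [0, 1]^H
    alpha_t = sigmoid(U_a @ h_tilde_prev + W_a @ x_t + b_a)
    # Hidden state from the semi-affine transformation
    h_t = np.tanh(U_h @ h_tilde_prev + W_h @ x_t + b_h)
    # Dynamic exponential smoothing: componentwise convex combination
    h_tilde = alpha_t * h_t + (1.0 - alpha_t) * h_tilde_prev
    return h_tilde, alpha_t
```

Driving the α layer's bias strongly positive forces α_t toward 1, so the smoothed state tracks the new hidden state, as discussed above.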

RESULTS
This section describes numerical experiments using financial time series data to evaluate the various RNN models. All models are implemented in v1.15.0 of TensorFlow (Abadi et al., 2016). Time series cross-validation is performed using separate training, validation and test sets. To preserve the time structure of the data and avoid look-ahead bias, each set represents a contiguous sampling period, with the test set containing the most recent observations. To prepare the training, validation and testing sets for m-step ahead prediction, we set the target variables (responses) to the observation at t + m, y_{t+m}, and use the lags from t − p + 1, ..., t for each input sequence. This is repeated by incrementing t until the end of each set. In our experiments, each element in the input sequence is either a scalar or a vector and the target variables are scalars. We use the SimpleRNN Keras method with the default settings to implement a fully connected RNN. Tanh activation functions are used for the hidden layer, with the number of units found by time series cross-validation with five folds over H ∈ {5, 10, 20} and L1 regularization over λ_1 ∈ {0, 10⁻³, 10⁻²}. The Glorot and Bengio uniform method (Glorot and Bengio, 2010) is used to initialize the non-recurrent weight matrices, and the recurrence weights are initialized as a random orthogonal matrix. Keras's GRU method is implemented using the default variant, based on version 3 of the original paper (arXiv:1406.1078), which applies the reset gate to the hidden state before the matrix multiplication. See Appendix 1.1 for a definition of the reset gate. Similarly, the LSTM method in Keras is used, with tanh activation functions for the recurrence layer. Each architecture is trained for up to 2000 epochs with the Adam optimization algorithm with default parameter values, using a mini-batch size of 1,000 drawn from the training set.
Early stopping is implemented using a Keras callback with a patience of 50 to 100 and a minimum loss delta between 10⁻⁸ and 10⁻⁶. So, for example, if the patience is set to 50 and the minimum loss delta is 10⁻⁸, then fifty consecutive loss evaluations on mini-batch updates must each lie within 10⁻⁸ of each other before the training terminates. In practice, the actual number of epochs required varies between trainings, due to the randomization of the weights and biases, and across different architectures, and is typically between 200 and 1,500. The 2000 epoch limit is chosen as it provides an upper limit which is rarely encountered. No random permutations are used in the mini-batch sampling, in order to preserve the ordering of the time series data. To evaluate the forecasting accuracy, we set the forecast horizon to up to ten steps ahead, instead of the usual one-step-ahead forecasts often presented in the machine learning literature. Longer forecasting horizons are often more relevant due to operational constraints in industry applications, and are more challenging when the data is non-stationary, since the fixed partial auto-correlation of the process y_{t+m} + u_t will not adequately capture the observed changing partial auto-correlation structure of the data. In the experiments below, we use m = 4 and m = 10 steps ahead. The reason we use m = 4 rather than m = 10 in the first experiment is that we find there is little memory in the data beyond four lags, and hence it is of little value to predict beyond four time steps.
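The m-step-ahead data preparation described above (p lags as the input sequence, y_{t+m} as the target) can be sketched as a small helper. This is our own illustrative implementation for the univariate case, not the authors' code.

```python
import numpy as np

def make_windows(y, p, m):
    """Build (input sequence, target) pairs for m-step-ahead forecasting.

    Each input is the p lags y_{t-p+1}, ..., y_t and the target is y_{t+m},
    incrementing t until the end of the set, as described above.
    """
    X, targets = [], []
    for t in range(p - 1, len(y) - m):
        X.append(y[t - p + 1 : t + 1])   # lags t-p+1, ..., t
        targets.append(y[t + m])         # response m steps ahead
    return np.array(X), np.array(targets)
```

Applied per contiguous split (training, validation, test) separately, this avoids leaking future observations across set boundaries.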

Bitcoin Forecasting
One minute snapshots of USD denominated Bitcoin mid-prices are captured from Coinbase over the period from January 1 to November 10, 2018. We demonstrate how the different networks forecast Bitcoin prices using lagged observations of prices. The predictor in the training and the test set is normalized using the moments of the training data only, so as to avoid look-ahead bias or introducing a bias into the test data. We fail to reject the Null hypothesis of the augmented Dickey-Fuller test, even at the 90% confidence level. The data is therefore non-stationary (contains at least one unit root). The test statistic is −2.094 and the p-value is 0.237 (the critical values are 1%: −3.431, 5%: −2.862, and 10%: −2.567). While the partial autocorrelation structure is expected to be time dependent, we observe a short memory of only four lags by estimating the PACF over the entire history (see Figure 2). We choose a sequence length of p = 4 based on the PACF and perform a four-step ahead forecast. We comment in passing that there is little, if any, merit in forecasting beyond this time horizon, given the largest significant lag indicated by the PACF. Figure 3 compares the performance of the various forecasting networks and shows that stationary models such as the plain RNN and the α-RNN least capture the price dynamics; this is expected because the partial autocorrelation is non-stationary.
Viewing the results of time series cross-validation, using the first 30,000 observations, in Table 1, we observe minor differences in the out-of-sample performance of the LSTM and GRU vs. the α_t-RNN, suggesting that the reset gate and the extra cellular memory in the LSTM provide negligible benefit for this dataset. In this case, we observe very marginal additional benefit from the LSTM, yet its complexity is approximately 50× that of the α_t-RNN. Furthermore, we observe evidence of stronger over-fitting in the GRU and LSTM vs. the α_t-RNN: the ratios of training to test errors are respectively 0.596 and 0.603 vs. 0.783, and the ratios of training to validation errors are 0.863 and 0.862 vs. 0.898.

High Frequency Trading Data
Our dataset consists of N = 1,033,468 observations of tick-by-tick Volume Weighted Average Prices (VWAPs) of CME listed ESU6 level II data over the month of August 2016 (Dixon, 2018; Dixon et al., 2019).
We reject the Null hypothesis of the augmented Dickey-Fuller test at the 99% confidence level in favor of the alternative hypothesis that the data is stationary (contains no unit roots; see, for example, Tsay (2010) for a definition of unit roots and details of the Dickey-Fuller test). The test statistic is −5.243 and the p-value is 7.16 × 10⁻⁶ (the critical values are 1%: −3.431, 5%: −2.862, and 10%: −2.567).
The PACF in Figure 4 is observed to exhibit a cut-off at approximately 23 lags. We therefore choose a sequence length of p = 23 and perform a ten-step ahead forecast. Note that the time-stamps of the tick data are not uniform, and hence a step refers to a tick. Figure 5 compares the performance of the various networks and shows that the plain RNN performs poorly, whereas the α_t-RNN better captures the VWAP dynamics. From Table 2, we further observe relatively minor differences in the performance of the GRU vs. the α_t-RNN, again suggesting that the reset gate and the extra cellular memory in the LSTM provide no benefit. In this case, we find that the GRU has 10× the number of parameters of the α_t-RNN, with very marginal benefit. Furthermore, we observe evidence of stronger over-fitting in the GRU and LSTM vs. the α_t-RNN; overall, we observe stronger over-fitting on this dataset than on the Bitcoin dataset. The ratios of training to test errors are respectively 0.159 and 0.187 vs. 0.278, and the ratios of training to validation errors are 0.240 and 0.226 vs. 0.368.

CONCLUSION
Financial time series modeling has entered an era of unprecedented growth in the size and complexity of data, which requires new modeling methodologies. This paper demonstrates a general class of exponentially smoothed recurrent neural networks (RNNs) which are well suited to modeling non-stationary dynamical systems arising in industrial applications such as algorithmic and high frequency trading. Application of exponentially smoothed RNNs to minute-level Bitcoin prices and CME futures tick data demonstrates the efficacy of exponential smoothing for multi-step time series forecasting. These examples show that exponentially smoothed RNNs are well suited to forecasting, requiring fewer layers and parameters than more complex architectures such as GRUs and LSTMs, yet retaining the most important properties needed for forecasting non-stationary series. These methods scale to large numbers of covariates and complex data. The experimental design and architectural parameters, such as the predictive horizon and model parameters, can be determined by simple statistical tests and diagnostics, without the need for extensive hyper-parameter optimization. Moreover, unlike traditional time series methods such as ARIMA models, these methods are shown to be unconditionally stable without the need to pre-process the data.

DATA AVAILABILITY STATEMENT
The datasets and Python codes for this study can be found at https://github.com/mfrdixon/alpha-RNN.

AUTHOR CONTRIBUTIONS
MD contributed the methodology and results, and JL contributed to the results section.

GRUs
A GRU is given by:

h̃_t = α_t ∘ h_t + (1 − α_t) ∘ h̃_{t−1},
h_t = σ(U_h (r_t ∘ h̃_{t−1}) + W_h x_t + b_h),
α_t = σ(U_α h̃_{t−1} + W_α x_t + b_α),
r_t = σ(U_r h̃_{t−1} + W_r x_t + b_r).

When viewed as an extension of our α_t-RNN model, we see that it has an additional reset (or switch) gate, r_t, which forgets the dependence of h_t on the smoothed hidden state. Effectively, it turns the update for h_t from that of a plain RNN into that of an FFN, entirely neglecting the recurrence. The recurrence in the update of h_t is thus dynamic. It may appear that the combination of a reset and adaptive smoothing is redundant. But remember that α_t affects the level of error correction in the update of the smoothed hidden state, h̃_t, whereas r_t adjusts the level of recurrence in the unsmoothed hidden state h_t. Put differently, α_t by itself cannot disable the memory in the smoothed hidden state (the internal memory), whereas r_t in combination with α_t can. More precisely, when α_t = 1 and r_t = 0, h̃_t = h_t = σ(W_h x_t + b_h), which is reset to the latest input, x_t, and the GRU is just an FFN. Also, when α_t = 1 and r_t > 0, a GRU acts like a plain RNN. Thus a GRU can be seen as a more general architecture which is capable of being an FFN or a plain RNN under certain parameter values. These additional gates enable a GRU to learn extremely complex long-term temporal dynamics that a vanilla RNN is not capable of. Lastly, we comment in passing that in the GRU, as in a RNN, there is a final feedforward layer to transform the (smoothed) hidden state to a response:

ŷ_{t+m} = W_y h̃_t + b_y.
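A single GRU step, written in the paper's α_t-RNN terminology, can be sketched as below. This is our own illustrative reconstruction (parameter names are ours, not the Keras implementation); it follows the convention of applying the reset gate to the hidden state before the matrix multiplication.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_tilde_prev, params):
    """One GRU step: r_t gates the recurrence inside h_t, while alpha_t
    (the usual "update gate") performs the exponential-smoothing update."""
    W_h, U_h, b_h, W_r, U_r, b_r, W_a, U_a, b_a = params
    r_t = sigmoid(U_r @ h_tilde_prev + W_r @ x_t + b_r)      # reset gate
    alpha_t = sigmoid(U_a @ h_tilde_prev + W_a @ x_t + b_a)  # smoothing gate
    # Reset gate applied to the hidden state before matrix multiplication
    h_t = np.tanh(U_h @ (r_t * h_tilde_prev) + W_h @ x_t + b_h)
    # Exponential smoothing update
    return alpha_t * h_t + (1.0 - alpha_t) * h_tilde_prev
```

Driving r_t toward 0 and α_t toward 1 reproduces the FFN limit discussed above, where the state is reset to a function of the latest input only.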

LSTMs
LSTMs are similar to GRUs but have a separate (cell) memory, c_t, in addition to a hidden state h_t. LSTMs also do not require that the memory updates be a convex combination; hence they are more general than exponential smoothing. The mathematical description of LSTMs is rarely given in an intuitive form, but the model can be found in, for example, Hochreiter and Schmidhuber (1997). The cell memory is updated by the following expression involving a forget gate, α_t, an input gate, z_t, and a cell memory gate, c̃_t:

c_t = α_t ∘ c_{t−1} + z_t ∘ c̃_t. (A2)

In the terminology of LSTMs, the triple (α_t, r_t, z_t) are respectively referred to as the forget gate, output gate, and input gate. Our change of terminology is deliberate and designed to provide more intuition and continuity with RNNs and the statistics literature. We note that in the special case when z_t = 1 − α_t, we obtain a similar exponential smoothing expression to that used in our α_t-RNN. Beyond that, the role of the input gate appears superfluous and difficult to reason about using time series analysis. When the forget gate α_t = 0, the cell memory depends solely on the cell memory gate update c̃_t. Through the term α_t ∘ c_{t−1}, the cell memory has long-term memory which is only forgotten beyond lag s if α_{t−s} = 0. Thus the cell memory has an adaptive autoregressive structure.
The extra "memory", treated as a hidden state and separate from the cell memory, is nothing more than a Hadamard product:

h_t = r_t ∘ tanh(c_t),

which is reset if r_t = 0. If r_t = 1, then the cell memory directly determines the hidden state. Thus the reset gate can entirely override the effect of the cell memory's autoregressive structure, without erasing it. In contrast, the α_t-RNN and the GRU have one memory, which serves as the hidden state, and it is directly affected by the reset gate.
The reset, forget, input and cell memory gates are updated by plain RNNs, all depending on the previous hidden state h_{t−1}:
Reset gate: r_t = σ(U_r h_{t−1} + W_r x_t + b_r),
Forget gate: α_t = σ(U_α h_{t−1} + W_α x_t + b_α),
Input gate: z_t = σ(U_z h_{t−1} + W_z x_t + b_z),
Cell memory gate: c̃_t = tanh(U_c h_{t−1} + W_c x_t + b_c).
The LSTM separates out the long memory, stored in the cellular memory, but uses a copy of it, which may additionally be reset. Strictly speaking, the cellular memory has long-short autoregressive memory structure, so it would be misleading in the context of time series analysis to strictly discern the two memories as long and short (as the nomenclature suggests). The latter can be thought of as a truncated version of the former.
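The full LSTM step, using the paper's gate names (forget gate α_t, output gate r_t, input gate z_t, cell memory gate c̃_t), can be sketched as follows. This is an illustrative reconstruction with our own parameter names, not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step in the paper's terminology: forget gate a_t, output
    ("reset") gate r_t, input gate z_t and cell memory gate c_tilde."""
    (W_r, U_r, b_r), (W_a, U_a, b_a), (W_z, U_z, b_z), (W_c, U_c, b_c) = params
    r_t = sigmoid(U_r @ h_prev + W_r @ x_t + b_r)      # output ("reset") gate
    a_t = sigmoid(U_a @ h_prev + W_a @ x_t + b_a)      # forget gate
    z_t = sigmoid(U_z @ h_prev + W_z @ x_t + b_z)      # input gate
    c_tilde = np.tanh(U_c @ h_prev + W_c @ x_t + b_c)  # cell memory gate
    c_t = a_t * c_prev + z_t * c_tilde                 # cell update (Eq. A2)
    h_t = r_t * np.tanh(c_t)                           # hidden state
    return h_t, c_t
```

Note that the cell update is not constrained to be a convex combination: only when z_t = 1 − a_t does it reduce to the exponential smoothing used by the α_t-RNN, as observed above.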