Factor-Based Framework for Multivariate and Multi-step-ahead Forecasting of Large Scale Time Series

State-of-the-art multivariate forecasting methods are restricted to low dimensional tasks, linear dependencies and short horizons. The technological advances (notably the Big data revolution) are instead shifting the focus to problems characterized by a large number of variables, non-linear dependencies and long forecasting horizons. In the last few years, the majority of the best performing techniques for multivariate forecasting have been based on deep-learning models. However, such models are characterized by high requirements in terms of data availability and computational resources and suffer from a lack of interpretability. To cope with the limitations of these methods, we propose an extension to the DFML framework, a hybrid forecasting technique inspired by the Dynamic Factor Model (DFM) approach, a successful forecasting methodology in econometrics. This extension improves the capabilities of the DFM approach, by implementing and assessing both linear and non-linear factor estimation techniques as well as model-driven and data-driven factor forecasting techniques. We assess several method integrations within the DFML, and we show that the proposed technique provides competitive results both in terms of forecasting accuracy and computational efficiency on multiple very large-scale (>102 variables and > 103 samples) real forecasting tasks.


INTRODUCTION
The pervasiveness of interconnected devices (IoT) and the consequent big data revolution are shifting the focus of forecasting to problems characterized by very large dimensionality (n > 100, . . . , 1,000), non-linear cross-series dependencies and long forecasting horizons. However, most multivariate forecasting methods in the literature are restricted to low dimension (n < 10) vector time series, linear forecasting techniques and short horizons.
The most common approaches to multivariate forecasting are model-driven and data-driven (Januschowski et al., 2020). Model-driven approaches include vector regressions (VAR, VARMA, VARIMA, VARMAX) (Lütkepohl, 2005) and kernel-based regression (Exterkate et al., 2016). Vector AutoRegressive (VAR) models showed a good capability in capturing linear dependencies in applied domains (e.g. wind farm) (Cavalcante et al., 2017). The authors of (Zhao et al., 2018) proposed a correlation constrained and sparsity controlled VAR to reduce the effective number of parameters in model training. However, the main VAR-based model drawback is the parameter size growth at the increase of the lag sample and dimension of the task. Data-driven approaches (notably machine learning) proposed feature-based and representation-based techniques to deal with large-variate settings. Feature-based techniques are based on a multivariate extension of well known univariate forecasting techniques such as k-nearest neighbors (Talavera-Llames et al., 2019) or Support Vector . Such techniques tend to augment the dimensionality of the original data by adding expert-driven combinations of the original features. However, the time required to identify such features and the overhead introduced by the presence of additional features hinders the applicability of such techniques, especially for large dimensions. For this reason, representation based Deep-Learning (DL) methods have been more and more adopted because of their success in modeling non-linear dynamic cross dependencies between variables. In particular, Recurrent (Agarwal et al., 2020;Smyl, 2020;Hewamalage et al., 2021) and Convolutional Neural Networks (Sen et al., 2019) are the most promising models for predicting multivariate time series. Unfortunately, the training process of such models is characterized by a heavy computational load and a need for specialized hardware making them unsuitable for largescale settings. Moreover, the lack of interpretability of the model and the automatically determined features hinders the extraction of useful information concerning the relevant variables for forecasting.
To cope with the limitations of the existing approaches, we propose an extension of DFML (Bontempi et al., 2017;De Stefani et al., 2018), a hybrid forecasting technique inspired by the Dynamic Factor Model (DFM) approach, a successful forecasting methodology in econometrics (Forni et al., 2005). This paper is an extension and generalization of the original DFML work and addresses three main aspects: factor estimation, factor forecasting and factor recombination. Factor estimation returns a limited number of latent components (factors) from the original series, on which the forecast is performed. Then, the forecasts of the factors are transformed back to the original dimension to obtain the original forecast. The rationale of this approach is to reduce a high-dimensional multivariate problem to a small set of independent univariate problems, thus simplifying the forecasting task. To the best of our knowledge, this paper is the first systematic comparison of data-driven and model-driven strategies for factor estimation and factor forecasting for multivariate multi-step-ahead forecasting in a very large scale setting (>10 2 variables and > 10 3 samples).
In particular, the main contributions of this manuscript are: • A novel modular and extensible framework for multivariate and multi-step-ahead forecasting combining data-driven techniques and model-driven techniques in a Dynamic Factor fashion. • The assessment of the impact on the DFML accuracy of several methods for factor estimation, including both traditional and Deep Learning based techniques on real data. • The assessment of the impact on the DFML accuracy of several methods for factor forecasting, including state-ofthe-art techniques from both the statistical and machine learning field, on real data.
This work is organized as follows: Section 2 discusses the theoretical framework of time series forecasting. Section 3 introduces the extensions to the DFML framework and its components. Section 4 presents the benchmark setup while Sections 5 and Section 6 summarize and discuss the main experimental results, respectively. Section 7 concludes the paper outlining some future perspectives.

Mathematical Notation
A univariate time series is represented by a vector y of size N where N is the number of samples and y t is the time series value at time t 1, . . . , N. A multivariate time series is a collection of historical observations of n variables sharing the same time index, and represented by a matrix Y, with N rows and n columns. The jth column of Y, denoted with Y [j], is the univariate time series associated to the jth variable, j 1, . . . , n. In the following we will denote matrices in upper-case letters (e.g. Y) and scalars either in lower-case (e.g. y t ) or with the index notation (e.g. Y t [j]) (cf. Table 1).

Time Series Forecasting
Time series forecasting deals with the prediction of the future values of a given quantity of interest (a certain time series y), given a set of N historical observations. A comprehensive overview of the different time series forecasting tasks is presented in Figure 1.
In the univariate one-step-ahead form, this problem is formulated as the estimation of a Single-Input, Single-Output (SISO) auto-regressive mapping f : where e is the noise term, d ≥ 0 is the delay and m > 0 is called the embedding lag. An embedding procedure represents a time series y as a set of input-output pairs (y (I) , y t+1 ) with y (I) being the mdimensional [y t−d , . . . , yt−d−m+1 ] input vector. This formulation is general since it can be employed for estimating both a linear (AR) and a nonlinear mapping (NAR) and enables the adoption of supervised machine learning algorithms (Bontempi et al., 2013). In what follows, for the sake of simplicity, we will assume d 0.

Multi-step-ahead Univariate Forecasting
A multi-step-ahead univariate forecasting consists of predicting the next H > 1 values of a time series. Strategies for predicting univariate time series multi-step ahead have been extensively discussed in (Ben Taieb et al., 2010;Ben Taieb et al., 2012;Bontempi et al., 2013) and can be summarised into two main classes: single output and multiple output strategies.
Instances of the first class are the Iterated and the Direct strategy. The Iterated (or Recursive) strategy (Weigend and Gershenfeld, 1994;Cheng et al., 2006;Sorjamaa et al., 2007) learns a one-step-ahead model f REC : R m 1R y t+1 f REC (y t , . . . , y t−m+1 ) + e t+1 (2) and then uses it recursively H times to return a multi-step-ahead prediction. Though the iterated method is highly sensitive to the estimation error, it has been often used to forecast real-world time series (McNames, 1998;Saad et al., 1998;Bontempi et al., 1999). The Direct strategy (Weigend and Gershenfeld, 1994;Cheng et al., 2006;Sorjamaa et al., 2007) learns independently H models and returns a multi-step-ahead forecast by concatenating the H predictions. Since the Direct strategy does not use any estimated value as input, it is not prone to the accumulation of one-stepahead errors. Notwithstanding, no statistical dependencies between the predictions (Kline, 2004;Bontempi, 2008;Bontempi and Taieb, 2011) is considered and these methods often require higher functional complexity (Tong, 1983) than iterated ones in order to model the dependency between two distant instants (Guo et al., 1999). The Multi-Input Multi-Output (MIMO) strategy (Bontempi, 2008;Bontempi and Taieb, 2011) (also known as Joint strategy (Kline, 2004)) avoids the simplistic assumption of conditional independence between future values made by the Direct strategy (Bontempi, 2008;Bontempi and Taieb, 2011) by learning a single multiple-output model where F J : R m 1R H is a vector-valued function (Micchelli and Pontil, 2005), and E is a noise vector whose covariance is not necessarily diagonal (Matías, 2005). The MIMO strategy avoids the conditional independence assumption made by the Direct strategy as well as the accumulation of errors of the Iterated strategy. So far, this strategy has been successfully applied to several real-world multi-step time series forecasting tasks (Bontempi, 2008;Ben Taieb et al., 2009;Ben Taieb et al., 2010;Bontempi and Taieb, 2011).

Multivariate Forecasting
Here we extend the notions of the previous section to multivariate forecasting, taking into account possible cross dependencies among the time series. According to (Januschowski et al., 2020) there are three main approaches to deal with a multivariate forecasting problem: local modeling, global modeling and hybrid modeling. In local modeling, the multivariate forecasting task is decomposed into a set of n SISO or MISO tasks. In the case of SISO tasks, each of the n forecasting tasks is treated as an independent problem, thus ignoring the cross dependencies with the other series. In the case of MISO tasks, multiple series can be used as input covariates to forecast a single time series.
where f i : R m×n 1R, i 1, . . . , n. Although the choice of ignoring cross dependencies (partially, in the case of MISO task or totally, in the case of SISO) might seem disadvantageous at a first glance, it allows to greatly reduce the model complexity, thus reducing the variance of the model and its computational learning time.
Moreover, the local models could potentially be trained in parallel, thus improving even more the efficiency of the training. Due to this reduced computational complexity, local models are often used as benchmarks, and potentially outperforming more complex techniques [such as in the M4 Competition (Makridakis et al., 2020a)].
In global modeling, the multivariate problem is tackled as a single MIMO problem, where the model F: R n×m 1R H×n takes the embedding vectors of the n time series as input and produces the H-step ahead forecasts Analogously to the univariate case, we may perform the Iterated one-step-ahead strategy or the Direct h-step-ahead strategy: where F I : R n×m 1R n represents the single Iterated model, while F h : R n×m 1R n indicates the hth Directed model. Finally, the MIMO strategy can be extended to the multivariate case as well: with F JM : R n×m 1R H×n being the model jointly providing H-step ahead forecasts for all the n time series. This model category allows to properly model the cross-dependencies between the different time series, by increasing the complexity of the functional mappings that have to be estimated (cf. Eqs. 7-9). The number of parameters to be estimated usually grows quadratically (O (n 2 )) with respect to the number n of input time series, increasing the computational complexity of the estimation process, and limiting their application as the number of time series increases. In order to exploit the advantages, and limit the drawbacks of both categories, hybrid approaches have been developed, where both the global and the local approach coexist in different forms. For example in hierarchical forecasting models (Athanasopoulos et al., 2017;Taieb et al., 2017;Wickramasuriya et al., 2015), independent local forecasts are first generated and then brought together in a reconciliation process, in order to return coherent global forecasts. Additional approaches adopt kernel-based methods (Hwang et al., 2016) based on the composition of local models, as well as neural models integrating both local components and global components to perform the global forecast (Sen et al., 2019), or where the output of local models is used as input for the global models (Smyl, 2020).
Finally, a well-known hybrid model category is constituted by dynamic factor models (Stock and Watson, 2010), where the global forecasting problem is reduced to a set of local forecasting problem on a reduced number of components, via dimensionality reduction, in a way that the set of components should account for the variability of the original multivariate time series.

Dynamic Factor Models
The Generalized Dynamic Factor Model (DFM) is a technique for multivariate forecasting originating in econometrics (Forni et al., 2005) [for a detailed review see (Stock and Watson, 2010)]. The basic idea of DFM is that a small number of series (the factors) can account for the time behavior of a much larger number of variables. Such factors are latent, i.e., not directly observable and have to be estimated from the original data. Once estimated, they can be can be forecast instead of the original series, reducing the complexity of the multivariate forecasting process. In more formal terms: where Z t is the vector of unobserved factors of size q (q < <n), A i are q × q coefficient matrices, W is the matrix (n × q) of dynamic factor loadings and the disturbances terms E Y , E Z (also called idiosyncratic disturbances) are assumed to be uncorrelated. The latent factors follow a vector autoregressive time series process and usually do not have a direct interpretation with respect to the original time series. Note that though the seminal work on DFM adopted a frequency domain approach, we will limit to consider here the time domain only. The practical implementation of DFMs demands to address two main issues: the estimation of the factors (including their number) and the forecasting of their evolution. According to Stock and Watson (2010) there are three main ways to estimate dynamic factors in literature: the first employs parametric estimation (e.g. maximum likelihood), the second makes use of non-parametric methods (e.g. PCA) and the third relies on Bayesian estimation. In this paper, we will restrict to consider non-parametric methods, both linear (PCA, for which consistency was proved (Stock and Watson, 2010)) and nonlinear. In this case, the estimation of the number of components typically relies on visual diagnostics (e.g. scree plots) or information criteria. As far as forecasting is concerned, both one-step-ahead and multi-step ahead forecasting based on VAR have been proposed in the econometric literature. Multi-stepahead typically adopts either iterated or direct linear strategies (Section 2.2.1). For an extended study on the use of DFM and PCA for the forecasting of 149 monthly macroeconomic variables we refer the reader to Stock and Watson (2002).

THE DYNAMIC FACTOR MACHINE LEARNER FRAMEWORK
The DFM rationale is that, if a forecaster is able to obtain accurate estimates of factors, then the task of forecasting could be simplified substantially by using the estimated dynamic factors for forecasting, instead of using all n original series themselves. Moreover, when n > 10 2 the dimensionality reduction process allows the forecasting process to become tractable with standard statistical tools, often inapplicable for larger dimensions. The final forecasting performance depends mainly on two aspects: the factor estimation algorithm and the accuracy of the factor forecasting.
In (Bontempi et al., 2017;De Stefani et al., 2018) the authors proposed a machine learning extension of the DFM (called DFML-Dynamic Factor Machine Learner) which 1) relies on a linear (PCA) or nonlinear (autoencoder) technique for dimensionality reduction and 2) forecasts each factor independently using a nonlinear model and a univariate multistep-ahead forecasting strategy by using out-of-sample assessment. Here, we propose a further extension of the DFML framework, in order to include state-of-the-art components in both the factor estimation and the factor forecasting component. The factor estimation component is augmented by including additional non-linear techniques, namely deep feed-forward neural networks and recurrent neural networks encoder-decoder architectures based on LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014) units.
The factor forecasting component is improved by the addition of well-known statistical forecasting techniques, such as those employed as benchmarks for the M4 competition (Makridakis et al., 2020a), namely Exponential Smoothing (Holt, 2004), Theta method (Assimakopoulos and Nikolopoulos, 2000), and a statistical ensemble technique of these benchmarks, as well as machine learning based techniques, such as MIMO lazy-learning technique (Bontempi and Taieb, 2011) and gradient boosting based techniques [such as LightGBM (Ke et al., 2017), among the top performers in the M5 competition (Makridakis et al., 2020b]. Overall, many compositions of linear/non-linear factor estimation and linear/non-linear forecasting are implemented and assessed. High variate multi-step forecasting is indeed one of the most challenging tasks in data science and requires an extremely careful management of the bias/variance trade-off by exploring several alternatives in series encoding and forecasting. For instance a non-linear recurrent factor estimation technique could reduce bias (yet increasing variance) in case of nonlinear low noise dynamics while more conventional statistical techniques may be effective in guaranteeing a lower variance (at the cost of a bias increase) in noisy settings with small number of samples.
The architecture of the DFML, and the comparison with the other models is depicted in Figure 2. It is worth noting that the factor estimation and the factor forecasting modules 1) follow an encoder-decoder like structure (Sutskever et al., 2014) and 2) the two components are decoupled from one another, easily allowing to further extend the architecture by plugging in new components and 3) the complexity of the forecasting step is made independent of n.

Factor Estimation
The problem of factor estimation involves the determination of a number of factors q, smaller than the original number of time series n, such that these factors give a good approximation of the dynamics of the original data. A common approach to produce an estimation of the factor is through dimensionality reduction procedures. A dimensionality reduction procedure assumes that the original multivariate N × n time series Y can be represented in a q < n dimensional space while retaining as much informative content as possible about the original dynamics. The lower dimension data will then be represented by the Z matrix, having dimensions N × q. Multiple techniques have been developed throughout the years, concerning dimensionality reduction, making both linear assumption about the structure of the lower dimension subspace [e.g. PCA (Hotelling, 1933)], as well as non-linear assumptions [e.g kernel PCA (Schölkopf et al., 1998), autoencoders (DeMers andCottrell, 1993)] [for a detailed review see Van Der Maaten et al. (2009)]. In this paper we implement three families of dimensionality reduction methods: techniques that do not take into account temporal dependencies, both linear (Section 3.1.1) and nonlinear (Section 3.1.2), and techniques that take into account temporal dependencies (Section 3.1.3).
are defined as weighted sums of the elements of Z with maximal variance, under the constraints that the weights are normalized and the principal components are uncorrelated with each other. It is well-known from basic linear algebra that the solution to the PCA problem is given in terms of the unit-length eigenvectors of the correlation matrix of Y. Let us order the eigenvalues λ 1 ≥ λ 2 ≥/ ≥ λ n and the corresponding eigenvector in the matrix W of size n × q. Given the multivariate time series matrix Y, Z [p] YW [, p] represents the projection of the series on the pth principal component. Though PCA has been developed originally for independent Gaussian observations, it has been found to be useful also in time series, the most common type of non-independent data. (Jolliffe, 2002). shows that, when PCA is not employed to perform statistical inference (as it is the case of forecasting), nonindependence of data should not limit the usage of PCA. The main difference, with non-independent data, is that, while in conventional PCA covariance is computed between variables measured at the same time, in time series it is possible to compute also covariances to model dependencies between variables at different times. (Tsay, 2014). provides an example of application of PCA directly on multivariate time series as well as to the residuals of fitted VAR models, showing its capabilities to find some stable relationships between variables. (Papadimitriou et al., 2005), on the other hand provides an example of summarizing multivariate time series in a streaming setting.

Feed-Forward Autoencoders
Feed-forward autoencoders are a specific category of neural networks trained to learn identity mapping from inputs to outputs (Vincent et al., 2010). Their architecture is characterized by having an input and output layer with the same number of nodes (corresponding to the number of original dimensions n) and by the composition of two symmetrical sub-networks: an encoder that transforms n-dimensional inputs Y t into some latent (encoded) q-dimensional representation Z t , and a decoder that reconstructs an n-dimensional approximationŶ t of the input Y t on the basis of the latent q-dimensional feature Z t . The two sub-networks are composed solely of feed-forward connections among the layers and might be composed of one or more hidden layers. The networks are usually trained as a single joint network (i.e. the output of the encoder is used as input of the decoder) using gradient descent techniques such as backpropagation, with the objective of minimizing the meansquared error between the input and the output (Vincent et al., 2010). In their simplest form, the mappings f θ and g θ′ are linear functions of the inputs and the encoded features Z t closely related to the PCA principal components (Bourlard and Kamp, 1988). If the hidden layers are non-linear, autoencoders behave very differently from PCA, with the ability to capture multi-modal aspects of the input distribution (Bengio, 2009;Vincent et al., 2010). In this paper we will consider two types of autoencoder, the base version having only one hidden layer in both the encoder and the decoder (henceforth base), and a version having two hidden layers in both the networks (called deep).

Recurrent Autoencoders
Recurrent Neural Networks (RNN) is a state-of-the-art neural network approach (see (Hewamalage et al., 2021) for a detailed review) where the presence of recurrent connections (i.e. allowing loops in the connection graphs between nodes) allow the modeling of dynamic temporal dependencies. In their simplest form (Graves, 2012), the recurrent connections come from a hidden state H t , which is also used for predicting future values Y t : The matrices W HY , W HH , W YH represent respectively, the connections between hidden layer H and output layer Y and the recurrent connections on the hidden layer, while B H and B Y indicate the biases of the two layers, respectively. Weights and biases are the learnable parameters of the network, typically by gradient descent algorithms such as backpropagation through time. A nonlinear activation function σ (generally a sigmoid function σ s (x) 1 1+e −x or an hyperbolic tangent σ t (x) e 2x −1 e 2x +1 ) allows the modeling of nonlinear dependencies, while the recurrent connections allow the modeling of long-term temporal dependencies. For more details concerning backpropagation through time (BPTT) and the internal structure of recurrent cells, we refer the interested reader to Graves (2012). Without loss of generality, the encoder-decoder architecture presented in Section 3.1.2, can be applied with recurrent neural networks: Where the encoder (Eq. 15 and Eq. 17) and decoder network (Eq. 18 and Eq. 19) will have independent weight and biases matrices. This encoder-decoder architecture is often referred as a sequenceto-sequence (S2S) (Sutskever et al., 2014) model in the literature (Hewamalage et al., 2021). Variations of this architecture (with multiple hidden layers and specific attention mechanisms) have been effectively used in the framework of time series forecasting (Du et al., 2020) and (Bianchi et al., 2017). Additionally, a theoretical study of the sequence-to-sequence framework for time series forecasting, allowing to determine theoretical bounds have been performed by (Kuznetsov and Mariet, 2018). Last but not least, recurrent encoder-decoder architectures have been effectively employed for dimensionality reduction in the signal processing field , (Susik, 2020). For these reasons, we assess in this paper two recurrent autoencoders based on LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014) units, respectively. The choice is motivated by the fact that in the extensive study (Bianchi et al., 2017) concluded that gated units (such as LSTM and GRU) outperform other recurrent methods when the temporal dependencies can be non-linear and abrupt, and that there is no clear outperformance of LSTM over GRU, or vice versa. Both the LSTM and the GRU autoencoder are implemented as a three layer network: input, hidden and output layers. Both the input and output layer have a number of neurons equal to the number of input time series, a sigmoid activation function and fully connected to the hidden layer. The hidden layer is constituted of a number recurrent cells (LSTM or GRU) equal to the number of factors to estimate.

Factor Forecasting
Once the factors estimated, a forecasting of Z t is required in order to produce the forecasts for the original seriesŶ. It should be noted that, though some dimensionality reduction methods (cf. Section 3.1.1) produce decorrelated factors, effectively transforming the original MIMO task into q SISO forecasting problems; this does not apply to all the factor estimation methods. For this reason, in this paper, we considered both univariate and multivariate factor forecasting techniques, considering state-ofthe-art techniques from both the statistical and the machine learning domain (Januschowski et al., 2020).

Statistical Techniques
Statistical techniques [also called model-driven techniques (Januschowski et al., 2020)] usually define a series of assumptions on the available data, in order to provide a closed-form formulation of the model of the dependency between input and output. We consider here Exponential Smoothing, Theta and Combined method, Single Input -Single Output (SISO) techniques for one-step-ahead forecasting as well as VAR, a Multi Input -Multi Output technique (MIMO) for one-step-ahead forecasting. All those models can be adapted for multi-step-ahead forecasting by implementeing a recursive strategy (Section 2.2). The rationale for considering statistical techniques is that in several forecasting competitions on real-world data (Hyndman, 2020) simple forecasting techniques tend to outperform more complex methods.

Exponential Smoothing (ES)
Exponential smoothing is a family (Hyndman and Athanasopoulos, 2018) of SISO forecasting methods, originally introduced in (Holt, 2004), in which the forecasts are computed as a weighted average of the past values and weights decay exponentially with time. In simple exponential smoothing (SES), the forecasts are produced by.
where z t and z t are the value and the forecast of the factor z, respectively. The basic method has been extended in to include the historical trend of the time series as well as the presence of seasonality, in both an additive and multiplicative form (Gardner, 1985;Gardner, 2006), leading to the Holt-Winters and the Holt-Winters Damped techniques.

Theta
The Theta method (Assimakopoulos and Nikolopoulos, 2000) is based on the combination of multiple one-step ahead SISO individual forecasters, called Theta-lines. Each Theta-line y″ t,ϑ is constructed by taking a second-order approximation of the original time series, with a specific coefficient ϑ. The final forecast is returned by averaging the forecasts produced by the different theta lines.
The Theta ensemble composed by two lines, with the ϑ i coefficients being respectively equal to 0 and 2, despite its simplicity, outperformed all the competitors in the M3 realworld forecasting competition (Hyndman, 2020) and has been selected as benchmark method for the M4 forecasting competition.ẑ t 1 2 (z″ t,0 + z″ t,2 ) In (Hyndman and Billah, 2003) the author demonstrated the equivalence of the Theta method to a specific form of the exponential smoothing format, in which the drift parameter is half the slope of a linear regression fitted to the data.

Combined
Combined (also called Comb) is an ensemble method based on the combination of three SISO one-step-ahead exponential techniques (Gardner, 1985): Single Exponential Smoothing for capturing the level, Holt to linearly extrapolate, and Damped to dampen the linear trend. The combination of the models is obtained by averaging the three outputŝ For a detailed description of the methods, we refer the interested reader to the relevant reviews (Gardner, 1985;Gardner, 2006). Unlike Theta, Combined implements an ensemble of heterogeneous forecasting methods, each capturing a different characteristic of the original time series. Like Theta, Combined has been used as benchmark during the different M competitions (Makridakis et al., 2020a).

Vector Autoregressive
The Vector AutoRegressive model (VAR) is a one-step ahead MIMO model which expresses Z t as a linear combination of the past values Z t−i of the series with coefficient matrices A i plus an error vector term W t . The number m of considered autoregressive dependencies is also called order of the VAR model. In a compact form the model is: where A t−k+1 , k 1, . . . , m is a time-invariant coefficient matrix of size q × q and E [W (j)] 0, j 1, . . ., n. VAR and state space models have been shown to be equivalent in (Gilbert, 1993). VAR rely on the assumption of stationarity and are typically not suitable for large variate settings (e.g. n > 20) because of the high number of parameters to estimate (mn 2 , corresponding to m A k matrices). For this reasons, we considered the VAR technique as factor forecasting technique inside the DFML, with the number of factors q being small, but not as benchmark for the original problem, where the number n of variables is too large to make the problem computationally tractable. Like the previous methods, multi-step ahead forecasts are produced from this one-step-ahead method through the Recursive strategy.

Machine Learning Based Techniques
Machine learning techniques (or data-driven techniques in (Januschowski et al., 2020)) do not make any parametric assumptions on the data distribution. In this category, we will consider lazy learning (i.e. a single model technique) (Ben Taieb et al., 2012) and gradient boosting (an ensemble technique) (Ke et al., 2017). Those models can be used for multi-step-ahead forecasting both via a Recursive and a Iterative strategy (cf. Section 2.2). Additionally, we consider a SIMO implementation of the lazy learning model Joint (Bontempi and Taieb, 2011), in which all the h steps to be forecast are returned by a single model.

Lazy Learning
A lazy learning technique (Aha, 1997) delays the learning phase until the prediction time. In other words, these techniques perform a fit of the model only when a prediction is required by using a computationally efficient technique (e.g. linear). This entails a considerable reduction in the computational cost of the model, while still preserving a good accuracy. In order to apply local learning tomulti-step time series forecasting, the time series Z t is embedded into a dataset D N , made up of pairs (X , is a temporal pattern of length m including the samples [Z t , . . ., Z t−m−1 ], and the vector Y H [t] , is the consecutive temporal pattern of length H, i.e. the vector [Z t+H , . . ., Z t+1 ]. In our case, the predictive model is a local weighted regression algorithm , where the h step ahead prediction of a given sampleŶ h k is computed as the average of the k most similar samples to the considered sample: Where Y H [j] is the output vector of the jth closest neighbor. The similarity is defined in terms of a distance metric (e.g. a euclidean distance). The number of neighbors k used for the prediction is selected by minimizing the mean squared error with respect to the available values.
In (Bontempi and Taieb, 2011) the authors proposes a MIMO version of lazy learning for forecasting where the number k of neighbours is selected through minimization of an estimation of the h step forecasting error over a leave-one-out cross-validation procedure. The adoption of a k-nearest neighbors lazy learning technique in this study is motivated by two reasons: the reduced computational cost and the capability of the model to exploit local patterns in the data.

Gradient Boosting
Boosting aims to create an accurate forecaster by combining several "weak learners" models [i.e. models characterized by a high bias, and a low variance (Schapire, 1990) The weights associated with each sample are adapted in a way to increase the weights to those values that have been wrongly predicted. The process is repeated for the desired number of iterations (J in this case), and then the final predictionẑ t is computed as a weighted sum of the different learners.
The gradient aspect of a gradient boosting method is related to the fact that the sample weight update procedure is performed via a minimization procedure of a given error metric, performed via gradient descent. Moreover, when the chosen error metric is the mean squared error (also called quadratic loss), the gradient boosting procedure is equivalent to training each subsequent model in the ensemble on the residuals of the previous model. For more details concerning the inner workings for the model we refer the interested reader to (Taieb, 2014). In this paper we include LightGBM (Ke et al., 2017), a gradient-boosted based algorithm specifically optimized to deal with a large number of data instances and a large number of features respectively, implemented with both a Direct and Recursive strategy.

EXPERIMENTAL SETUP
The experimental study assesses and compares several implementations of the DFML, composing the different factor estimation techniques and factor forecasting techniques discussed in the article. Note, that for the sake of a robust assessment, we set the lag to m 3 and the number of latent factors to q 3 for all the considered methods. In order to improve the readability of the results, we employ the following naming convention: the prefix DF is used to indicate a dynamic factor based model, whereas the UNI prefix is used to indicate the benchmark, univariate methods used for comparison. In both cases, the prefix is followed by the name of the employed forecasting technique.

Benchmarks
The majority of benchmark techniques used in this article is based on a univariate decomposition of the original n-dimensional MIMO task into n SISO forecasting tasks. The motivation of this choice is twofold. On one hand, several forecasting competitions based on real data clearly showed the competitiveness of this approach, despite their simplicity (Hyndman, 2020 a random walk model returning as prediction the latest observation. Despite its trivial nature, in real-world tasks the Naive method is known to outperform more complex learning strategies, especially in presence of continuous sequences of constant values: for that reason it is considered as a baseline to normalize all our accuracy results in Section 4. The methods above are implemented with the code provided for the M4 competition (Center, 2020). The other multivariate benchmarks are the original Dynamic Factor Model (Forni et al., 2005) (DFM, here DF-PCA-VAR) and the original DFML (Bontempi et al., 2017;De Stefani et al., 2018) (DFML PC , here DF-PCA-Lazy-DIR and DFML A , here DF-Base-Lazy-DIR).

Dynamic Factor Machine Learner Framework
We test five different factor estimations techniques and nine different factor forecasting techniques, for a total of 45 different models. The factor estimation techniques are listed below together with the software used for the experiments.
• PCA: the implementation uses the basic R functions cov and eigen. • base: the base autoencoder is implemented by the rstudio/ keras library. The architecture is symmetric with a single hidden layer of size q and a ReLU and sigmoid activation functions are used for the hidden and output layer, respectively. • Deep: the deep autoencoder is implemented by the rstudio/ keras library. The architecture is symmetric with three hidden layers [with sizes (10, 5, q)], a ReLU activation function for the hidden layer and a sigmoid for the output layer.
• LSTM: The LSTM autoencoder is implemented by the rstudio/keras library. The architecture is symmetric with a single hidden layer (q LSTM cells) and a ReLU and a sigmoid activation functions for the hidden and output layer, respectively. • GRU: The GRU autoencoder is implemented by the rstudio/ keras library. The architecture is symmetric with a single hidden layer (q GRU cells) and a ReLU and a sigmoid activation function for the hidden and the output layer, respectively.
For all the neural-based techniques, the maximum number of epochs used for the training is set to 50. The factor forecasting techniques are listed below together with the software used for the experiments.
• Comb, ES, Naive, Theta: we use the implementations made available by the M4 competition (Center, 2020). The multistep-ahead forecast is obtained with a Recursive strategy. • VAR: the implementation (Eq. 26) is provided by the vars R library and a Recursive strategy returns the multi-stepahead forecast. • Lazy-DIR, Lazy-REC, MIMO: these methods denote the lazy learning (Section 3.3.1) with a Direct, Recursive and Joint multi-step-ahead forecasting strategy, respectively. The implementation is made available in the gbonte/gbcode github library by the multisteapAhead function with methods lazydir, lazyiter, mimo respectively. • LightGBM-DIR and LightGBM-Rec: the implementation is provided by the lightgbm R library. The Direct and Recursive strategies for multi-step-ahead forecasting have been implemented by the authors.
Unless specified otherwise, we employed the default values for the forecasting techniques hyperparameters in the experiments. The entire code used to run the experiments is available in .

Datasets
We consider three public datasets related to multivariate forecasting problems with high dimensionality (>10 2 variables and > 10 3 samples).
• Electricity consumption This dataset contains 26,304 samples of 321 variables. Each variable represents the hourly electricity consumption in KWh of 321 clients between 2012 and 2014 (Lai et al., 2018). This dataset has been obtained by preprocessing the original dataset (NREL, 2021)

Results Presentation
We consider a rolling window approach (Tashman, 2000) using a window size of w tr multivariate samples for training, and H ∈ {4, 6, 12, 24} multivariate samples for validation. The window size w tr is set to 2000 samples for the Electricity and Traffic datasets, while for the Mobility dataset, the window size is 400 in order to ensure the feasibility of the rolling approach. It should be noted that this evaluation technique, proposed in (Tashman, 2000) consists of an extension of the well-known crossvalidation principle for time-dependent data. All the time series are preprocessed via a z-score normalization (using the scale R function) and first-order differentiation to de-trend the data. For each window, a multivariate error measure, the Naive Normalized Mean Squared Error (NMMSE) is computed as follows: where NNMSE averages the univariate NNMSE [j] terms.
The statistical significance of the results is assessed via a Friedman statistical test (with post-hoc Nemenyi test). For each time series in the multivariate dataset, the considered forecasting techniques are ranked according to their values of NNMSE. Then, the average rank across time series is computed for every forecasting technique and employed as input for the Friedman test (Demšar, 2006). Finally, the post-hoc Nemenyi test is employed to assess the statistical significance of the results of the Friedman test, by determining the value of the critical difference (CD). Two forecasting techniques are considered to not have a statistically significant difference if the difference between their average ranks is smaller than the critical difference. In the results visualization, the methods are ordered according to their performance from left to right (the leftmost the best), while the black bar connects methods that are not significantly different (at p 0.05).
We present the results in two formats: 1) a CD plot highlighting the statistical significance over all horizons and 2) a tabular format, containing the NNMSE values for different horizons and grouping the methods according to three categories: DF-Stat denoting DFM approaches with statistical forecasting, DF-ML denoting DFM approaches with machine learning forecasting and UNI-Stat denoting univariate statistical baselines. Figure 3 shows the overall ranking and the corresponding critical distance according to a Friedman-Nemenyi test (Demšar, 2006). Tables 2, 3 report the average NNMSE for different horizons and groups of methods.

Mobility
From the analysis of the results we can derive the following considerations: • Statistical techniques for factor forecasting (DF-Stat methods) appear among the top 10 methods (Figure 3). • Across all the horizons, the non-linear autoencoders (base, Deep, LSTM, GRU) consistently outperform linear factor estimation techniques (PCA). Also, apart from the top two methods, the differences are rarely statistically significant (Figure 3). • DFML strategies consistently outperform the Naive baseline for different horizons (Tables 2,3). • The superiority of DFML over UNI-STAT methods (UNI-ES, UNI-Theta, UNI-Comb) is less clear-cut. Taking into considerations all horizons DFML is significantly better than UNI-STAT: nevertheless for large horizons, the accuracy of UNI-STAT and DFML techniques (both DF-Stat and DF-ML) tend to converge.

Electricity
Tables 4,5 report the NNMSE, averaged over all the tests sets. Figure 4 shows the ranking and the corresponding critical distance according to a Friedman-Nemenyi test (Demšar, 2006). On the basis of the results we can make the following considerations: • Across all the horizons, DFML with non-linear autoencoders (base, Deep, LSTM, GRU) is generally among the best methods (cf. Figure 4). However, some specific factor estimation/ forecasting pairs are consistently among the top performers [DF-PCA-{LAZY-DIR,LAZY-REC,MIMO} (Bontempi et al., 2017;De Stefani et al., 2018)]. • Apart from two combinations based on direct gradient boosting, the top 20s only includes lazy techniques (in the top 10s) and VAR (in the bottom 10s) (Figure 4). • DFML techniques generally outperform the Naive technique and the other univariate benchmarks (UNI-ES, UNI-Theta, UNI-Comb), also for longer horizons. The only exception is represented by the integration of recurrent autoencoders (LSTM, GRU) with statistical techniques (DF-Stat) which performs worse than the UNI-Stat benchmarks (Tables 4 and 5).

Traffic
Tables 6,7 report the NNMSE, averaged over all the tests sets. Figure 5 shows the ranking and the corresponding critical distance according to a Friedman-Nemenyi test (Demšar, 2006). From the analysis of the results we can make the following considerations: Frontiers in Big Data | www.frontiersin.org September 2021 | Volume 4 | Article 690267 • The main characteristic among the techniques outperforming the univariate benchmarks is the use of a lazy learning technique (either with the Direct (LAZY-DIR) or the Joint (MIMO) strategy) ( Figure 5). • Another recurring forecasting technique in the top 20s is the VAR, in combination with both linear and nonlinear dimensionality reduction techniques ( Figure 5).
• The combination of lazy learning with a recursive technique and recurrent autoencoder tends to produce abnormal values, probably due to error propagation or vanishing/ exploding gradients problems ( Table 7).

Computational Time
The total computational time (in seconds) of the different DFML techniques is represented in Figures 6,7, representing respectively, the computational time of the shortest and longest horizon for the dataset having the largest scale (i.e. Traffic). The total computational time FIGURE 3 | Mobility-Graphical representation according to (Demšar, 2006) of the results of Friedman statistical test (with post-hoc Nemenyi test) comparing the NNMSE of the best 20 methods against each other, aggregated across all horizons h. The methods are ordered according to their performance from left to right (the leftmost the best), while the black bar connects methods that are not significantly different (at p 0.05). includes the time required to train the factor estimation technique and the factor forecasting technique, as well as the time required to generate the forecasts. It should be noted that, while the time required to estimate the factors varies according to the selected techniques, the time allocated to factor forecasting (for a given method) is constant across the different factor estimation techniques, as they employ the same number of components, and therefore the same amount of data. As we can observe in the figure, the majority of the computational time is allocated to the factor estimation technique, with the differences between factor forecasting techniques being negligible (smaller than 1s). The only exception is represented by the LightGBM-DIR technique, where the increase in computational time is justified by the number of models to be trained which is proportional to the forecasting horizon H. The fastest technique in terms of computational time is the PCA, while the recurrent basedautoencoder (GRU and LSTM) are the slowest ones. It should be noted that, with the selected number of epochs, the upper bound for all the variants of the DFML is around 75s (Figure 7).

DISCUSSION
The idea of employing neural components in the framework of a dynamic factor model has already been tested by (Nakagawa et al., 2019) for a MISO one-step-ahead prediction of the returns in the Japanese stock market and by (Kim et al.,  2019) in the framework of generative modeling for image reconstruction. However, at the time of writing, to the best of our knowledge, we are not aware of any study implementing neural components in the framework of dynamic factor model for multivariate and multistep ahead forecasting. An additional contribution is constituted by the extensive study, on multiple real datasets, of the different compositions of linear and nonlinear factor estimation techniques as well as model-driven and data-driven factor forecasting techniques. We can summarize the findings from our experiments with the following considerations: • About the choice of a factor estimation technique in DFML: linear techniques seem to be the most promising ones, both in terms of forecasting accuracy and computational cost (Figures 4-7). Non-linear techniques both with and without recurrent components are comparable in terms of accuracy: nevertheless, the trade-off between the increase in accuracy and the overhead in terms of computational cost (and the consequent energetic overhead) needs to be carefully taken into account. • About the choice of a factor forecasting technique in DFML: there is no clear winner between model-driven and data-driven techniques. However, two forecasting techniques, VAR (model-driven) and lazy learning (datadriven) appear to be consistently in the top performers across different datasets (Figures 4,5). • About the choice of a multi-step-ahead forecasting strategy in DFML: the Direct (DIR) and Joint (MIMO) strategies consistently outperform the recurrent strategies, confirming the findings of (Bontempi and Taieb, 2011) and (Taieb, 2014) (Figures 4, 5).
• In the majority of the experiments, the DFML is significantly more accurate than the classical DFM (DF-PCA-VAR)), the Naive baseline and the univariate benchmarks (Figures 3,4). Note that outperforming a Naive baseline is not absolutely obvious in multivariate multi-step forecasting as discussed in publications like (Paldino et al., 2021). • Last but not least, depending on the type of forecasting problem (cf. Section 5.3), univariate factor forecasting techniques still represent a competitive alternative to more complex models (Tables 2,3,6,7).  Further experiments are foreseen to understand the impact of hyperparameters like the number of components q or the embedding order of the machine learning models m. In addition, the choice of problem specific neural network architecture, fine-tuning of the parameters, as well as longer training times could further improve the performances of the neural-based techniques, if the problem setting allows it. FIGURE 5 | Traffic-Graphical representation according to (Demšar, 2006) of the results of Friedman statistical test (with post-hoc Nemenyi test) comparing the NNMSE of the best 20 methods against each other, aggregated across all horizons h. The methods are ordered according to their performance from left to right (the leftmost the best), while the black bar connects methods that are not significantly different (at p 0.05).
Frontiers in Big Data | www.frontiersin.org September 2021 | Volume 4 | Article 690267 Multivariate time series forecasting is a major challenge due to the large dimensionality of the available data. In recent years, there has been a remarkable development of multivariate techniques, especially deep learning based ones. The majority of these techniques rely on models of considerable complexity, requiring longer computation times, and often lacking interpretability of the fitted model. This paper proposes an effective and reliable forecasting methodology based on a combination of model-driven (statistical) and (data machine learning) techniques. The obtained results are promising in terms of scalability and effectiveness. This study supports the idea that factor-based models can be a promising alternative to representation learning strategies, and that the combination of statistical and machine learning techniques (often considered in opposition rather than in synergy) could improve the forecasting performances. Additionally, simpler forecasting methods (e.g., univariate) often neglected, can still provide competitive results (Paldino et al., 2021), even in the case of high dimensional (n > 100) multivariate forecasting.
Further studies will focus on an online implementation of the factor-based framework, where both the factor estimation and the factor forecasting component could be incrementally updated as new data samples will be made available, as well as a more scalable implementation of the framework, in order to be able to tackle multivariate series with larger dimensionality. Further attention should be drawn also to the automatic selection of the most relevant parameters characterizing the framework (namely the number of factors for factor estimation and model order for factor forecasting), e.g. extending the preliminary work in De Stefani et al. (2018); Bontempi et al. (2017).

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. The Electricity and Traffic datasets can be found on the Github page associated to the article (Lai et al., 2018) available at https://github.com/laiguokun/multivariate-time-series-data. The Mobility data analyzed for the study can be found on the corresponding Kaggle page (https://www.kaggle.com/giobbu/ belgium-obu) and in . The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.