Dynamic Time Warping as a New Evaluation for Dst Forecast with Machine Learning

Models based on neural networks and machine learning are seeing a rise in popularity in space physics. In particular, the forecasting of geomagnetic indices with neural network models is becoming a popular field of study. These models are evaluated with metrics such as the root-mean-square error (RMSE) and Pearson correlation coefficient. However, these classical metrics sometimes fail to capture crucial behavior. To show where the classical metrics are lacking, we trained a neural network, using a long short-term memory network, to make a forecast of the disturbance storm time index at origin time $t$ with a forecasting horizon of 1 up to 6 hours, trained on OMNIWeb data. Inspection of the model's results with the correlation coefficient and RMSE indicated a performance comparable to the latest publications. However, visual inspection showed that the predictions made by the neural network were behaving similarly to the persistence model. In this work, a new method is proposed to measure whether two time series are shifted in time with respect to each other, such as the persistence model output versus the observation. The new measure, based on Dynamical Time Warping, is capable of identifying results made by the persistence model and shows promising results in confirming the visual observations of the neural network's output. Finally, different methodologies for training the neural network are explored in order to remove the persistence behavior from the results.


Introduction
The disturbance storm time (Dst) index is calculated from the measurements of four ground-based magnetometer stations, located close to the equator and spread evenly across the Earth [45]. The index, introduced by Sugiura [44], is the average of the magnetic disturbance of the Earth's magnetic field measured at these stations. (Code: https://github.com/brechtlaperre/DTW_measure) Early studies already used neural networks driven by solar wind data to provide a forecast of the Dst index up to 6h in advance. Similarly, many others used an Elman recurrent network and input from the solar wind to forecast the Dst index [see e.g. 3,32,35,49].
More recently, Gruet et al. [14] used a long short-term memory (LSTM) neural network [18] instead, combining it with Gaussian processes to provide both a regressive and a probabilistic forecast of the Dst for 1h to 6h in advance.
However, some authors detected problems with the forecast of the Dst. Stepanova and Pérez [43] used a feed-forward neural network and previous Dst values to provide a forecast up to 3h in advance. Forecasts further in advance showed a time shift between the observed and predicted Dst, forecasting geomagnetic storms too late. This effect was also detected by Wintoft and Wik [50], who evaluated forecasts of the Kp and Dst with their neural network up to 3h in advance. Their Kp forecast had a time shift between forecast and observation at 2h and 3h in advance, while the Dst forecast of the main phase of geomagnetic storms showed time shifts already at 1h in advance.
This paper aims to highlight this time shift problem. Section 2 sets up an experiment where a recurrent neural network is trained to forecast the Dst 6h in advance. It compares our model with those in the literature, and concludes by highlighting the time shift observed in geomagnetic storm forecasts. Then, in Section 3 a new measure is introduced that is capable of accurately measuring this time shift between observation and prediction. Finally, Section 4 looks further into why this time shift behavior is observed, and potential solutions.

The experiment
This section concerns the details of the experiment, together with the discussion of the results that led to the discovery of the problems. The initial goal was to train a neural network to forecast the values of the Dst index at times t + l, l ≥ 1, while having information up to time t.
The layout is the following: first the data used to train the neural network model is explained. Second, the processing of the data is discussed. Then the method of evaluation is described. Afterwards, the neural network model is described in detail. Finally, the results of the model are analyzed and discussed.
The following terminology will be used throughout this section. Box et al. [5] define Dst(t + l) as a forecast at origin t with a lead time l. We will deviate from this terminology, instead calling Dst(t + l) a forecast at forecasting horizon t + l.

The data
The data used to train and test the neural network model were obtained from the NASA/GSFC's OMNI database [26]. From the database, the hourly averages of the solar wind velocity $V_{sw}$ and density $\rho_{sw}$, the IMF z-component $B_z$ and magnitude $|B|$, together with the geomagnetic Dst index were extracted. These physical quantities will be referred to as 'features' throughout the paper. In particular, these features were measured from 00:00, 14 January 2001 to 23:00, 31 December 2016, and the full extracted data set contains a total of 139,944 entries.

Preprocessing the data
The preprocessing of the data is done in the following steps. The data are first split into training, test and validation sets, to prevent information bias through validation leakage. Next, each set is scaled and normalized using scaling parameters measured from the training set. Finally, invalid measurements are removed from each of the sets by using a sliding window. Each step of this procedure is discussed in more detail below.
Before assigning entries to one of the data sets, the data are split into monthly samples. The first sample corresponds to January 2001, the final sample to December 2016, equating to 181 samples. The reason for this split is the high temporal correlation of the data: consecutive hourly entries are highly correlated, which causes the model to artificially perform better on the test set [8].
The data sets are constructed as follows. The test set consists of the months April, August and December of each year, corresponding to 25% of the total data. These months have been chosen arbitrarily, and mainly ensure a good spread of the test data over the given time period. We expect to see the same kind of results if the test set were chosen differently. An experiment to determine the effect of the chosen test set on the performance is done in Section 2.6. From the remaining data, 60% of the months are distributed randomly into the training set. The remaining 15% of the data are placed in the validation set. This corresponds to 77,544, 20,520, and 33,120 entries in the training, validation and test set, respectively.
The choice of training, validation and test set plays a huge role in the performance of the model. A difference in these sets makes direct comparison of forecasting models difficult, as stated by Lazzús et al. [28]. In order to measure the variance caused by this choice, 10-fold cross validation is performed and the results are reported in Section 2.6.
After this step, the sets are scaled and normalized. This ensures that every feature lies within the same range of values, ensuring comparability of the different features and faster convergence of the machine learning algorithm [see e.g. 21]. The full transformation process is done in two steps: first the scaling constants are determined from the training set, then the transformation is applied on the training, validation, and test set. The features are transformed by removing the mean and scaling to unit variance:

$X' = \frac{X - \mu_{train}}{\sigma_{train}},$

where $X$ is the data set, $\mu_{train}$ the mean of the training data set and $\sigma_{train}$ the standard deviation of the training set.
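As a sketch, the two-step transformation (fit the constants on the training set only, then apply them to every set) could look as follows; the function names are illustrative, not the paper's code.

```python
import numpy as np

def fit_scaler(train):
    """Determine the scaling constants from the training set only."""
    return train.mean(axis=0), train.std(axis=0)

def transform(X, mu_train, sigma_train):
    """Standardize any set with the training-set statistics."""
    return (X - mu_train) / sigma_train

# Example with synthetic feature matrices (rows = hourly entries, cols = features)
rng = np.random.default_rng(0)
train = rng.normal(5.0, 2.0, (100, 4))
val = rng.normal(5.0, 2.0, (20, 4))
test = rng.normal(5.0, 2.0, (30, 4))

mu, sigma = fit_scaler(train)
train_s, val_s, test_s = (transform(s, mu, sigma) for s in (train, val, test))
```

Note that the validation and test sets are transformed with the *training* constants, so their means are only approximately zero; this is exactly what prevents validation leakage.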

Evaluation of the model
Throughout the paper, evaluation of the model's forecast is done through three distinct methods. The first method compares the neural network model to a naive forecasting model, a so-called baseline model. The second method is evaluation by use of a set of metrics. The third method is k-fold cross-validation (k-fold CV), a statistical method to test the model's predictive capabilities. All of these methods are explained in this section.

Baseline model
The baseline model is a simplified empirical law that can generate a zero order forecast. The most simple type of time-series forecast is done with the persistence model. This model assumes that the value at the current time step does not change, so the next time step is predicted as

$Dst(t + 1) = Dst(t).$

This model is easily extended to forecast multiple hours into the future:

$Dst(t + p) = Dst(t), \quad p \geq 1.$

The work by Owens et al. [34] has shown that the persistence model can be a reliable predictor for some solar wind parameters, comparable to numerical models when evaluated in a point-by-point assessment method. In particular, geomagnetic activity and solar wind speed show good results when evaluated with a 27-day persistence model, and can be used as a benchmark for more sophisticated models.
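A minimal sketch of this baseline: the forecast for horizon p is simply the observed series shifted by p hours.

```python
import numpy as np

def persistence_forecast(dst, horizon):
    """Persistence model: Dst(t + horizon) is forecast as Dst(t).
    The returned array aligns with the observations dst[horizon:]."""
    return dst[:-horizon]

dst = np.array([-10., -12., -30., -80., -60., -40.])
pred = persistence_forecast(dst, 2)  # compare against dst[2:]
```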

Metrics
Now the set of metrics used for the model evaluation are defined. The root mean square error (RMSE) and the Pearson linear correlation coefficient (R) are often used for the evaluation of time series. In addition, the set of fit performance metrics recommended by Liemohn et al. [31] will also be used. These are the linear fit parameters, the mean absolute error (MAE), the mean error (ME) and the prediction efficiency (PE). All these metrics and their definitions are summed up in this section. Define M i as the forecast of the model and O i the corresponding real observational value, with i = 1, . . . , N , and N the number of samples.
• The RMSE is defined as

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(M_i - O_i)^2}.$

This metric puts emphasis on outliers, which in the case of the Dst index correspond to geomagnetic storms. A low RMSE thus corresponds to good accuracy of the forecast of geomagnetic storms.
• The Pearson linear correlation coefficient (R) is given by

$R = \frac{\sum_{i=1}^{N}(M_i - \bar{M})(O_i - \bar{O})}{\sqrt{\sum_{i=1}^{N}(M_i - \bar{M})^2}\sqrt{\sum_{i=1}^{N}(O_i - \bar{O})^2}},$

where $\bar{M}$ and $\bar{O}$ are the means of the forecasts and observations.
• The linear fit parameters A and B are obtained from a linear fit $M = A + B\,O$ between forecast and observation. In the case of a perfect prediction, B is 1 and A is 0.
• The MAE emphasizes the "usual" state of the index, which in the case of the Dst index corresponds to quiet time, where no geomagnetic storms are happening. The MAE indicates how well the model predicts these values. It is defined as

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}|M_i - O_i|.$

• The ME indicates whether the model systematically over- or under-predicts the observations, based on its positive or negative sign. If the mean error is zero, under- and over-predictions are balanced:

$\mathrm{ME} = \frac{1}{N}\sum_{i=1}^{N}(M_i - O_i).$

• Finally, the PE is used to quantify the model's ability at reproducing the time variation of the Dst index:

$\mathrm{PE} = 1 - \frac{\sum_{i=1}^{N}(M_i - O_i)^2}{\sum_{i=1}^{N}(O_i - \bar{O})^2},$

where $\bar{O}$ is the average of the observational values. The maximum value of the PE is 1, corresponding to a perfect prediction at all times. A prediction efficiency equal to or less than zero indicates that the model is incapable of forecasting the time variation seen in the observations.
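For reference, the full set of metrics can be computed in a few lines of NumPy; this is an illustrative implementation (with the linear fit obtained via `np.polyfit`), not the paper's released code.

```python
import numpy as np

def metrics(M, O):
    """Fit-performance metrics of forecast M against observation O."""
    err = M - O
    rmse = np.sqrt(np.mean(err ** 2))
    r = np.corrcoef(M, O)[0, 1]                    # Pearson correlation
    mae = np.mean(np.abs(err))                     # quiet-time accuracy
    me = np.mean(err)                              # systematic bias
    pe = 1.0 - np.sum(err ** 2) / np.sum((O - O.mean()) ** 2)  # prediction efficiency
    B, A = np.polyfit(O, M, 1)                     # linear fit M ≈ A + B*O
    return dict(RMSE=rmse, R=r, MAE=mae, ME=me, PE=pe, A=A, B=B)
```

For a perfect prediction (M equal to O) this returns RMSE, MAE and ME of 0, with R, PE and B equal to 1 and A equal to 0, matching the definitions above.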

K-fold cross-validation
K-fold CV is a technique to estimate the model's predictive capabilities without having to make use of the test set, thereby preventing an information bias. When applying CV, the training and validation sets defined in Subsection 2.1 are used, unless stated otherwise.
The technique first randomly splits the (monthly) data into k equal-sized partitions (or folds). One of these folds is picked as the validation set, and the remaining folds are used as the training set. The model is trained on the training set and evaluated on the validation set, and the evaluation is stored. This is repeated until every fold has been used exactly once as the validation set. By taking the average of the results of each run, an estimate of the predictive performance is obtained. This allows us to evaluate the model with different parameters, both hyperparameters and features, while preventing any optimization on the test set, which would otherwise invalidate our results.
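The fold construction can be sketched as follows, with the monthly samples represented by their indices; `kfold_indices` is an illustrative helper, not the paper's code.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Randomly split sample indices (e.g. monthly blocks) into k folds,
    yielding (train_idx, val_idx) pairs; every fold validates exactly once."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, val

splits = list(kfold_indices(10, 5))
```

Averaging the validation score over the k runs then gives the performance estimate described above.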

The model
A neural network model is constructed and trained to forecast the Dst index for forecasting horizons t + 1h up to t + 6h. As input, the model receives a multidimensional time series $X_t$, containing the data described in Subsection 2.1, ranging from time t − 6h to time t:

$X_t = \{x_{t-6h}, x_{t-5h}, \dots, x_t\}. \quad (10)$

The output then consists of a 6-dimensional vector $Y_t$, corresponding to the forecast Dst values for forecasting horizons t + 1h up to t + 6h.
The programming language Python, and in particular the package PyTorch [36], was used to implement and train the neural network. A link to the source code can be found in Appendix A.

Model description and training
The model architecture of the neural network model consists of a LSTM, combined with a dense neural network, as shown in Figure 1.
LSTMs are a type of recurrent neural network, where the output of the previous iteration is used as an additional input. When multiple past events are used as inputs, classic recurrent neural networks lose training information in a phenomenon called vanishing gradients [17]. LSTMs are designed to work better for long time series, by incorporating two internal memory states: the hidden state, $h_t$, and the memory state, $c_t$. A detailed explanation of how these memory states retain information and how they work is presented in Hochreiter and Schmidhuber [18]. We refer the reader to this publication for more details on the exact internal workings of the LSTM.
The LSTM memory machinery is encapsulated in the LSTM cell. Figure 1 shows how the cell combines the input at each time step with the previous hidden state, using an internal neural network.
The final hidden state vector, h 6 , is then fed to a classical, fully connected feed-forward neural network, where the input is transformed to a vector of the size of the target vector Y t . Throughout the rest of the paper, we will refer to this model as the 'LSTM-NN model'.
Figure 1: The architecture of the LSTM-NN model. The time series input is first given to an LSTM, which iteratively uses each time step of the input $x_t$, together with the previous hidden state $h_{t-1}$, to update the memory state $c_t$ and construct a new hidden state $h_t$. The LSTM cell requires as input the concatenated vector of $x_t$ and $h_{t-1}$, together with the output of the LSTM's internal dense NN taking that same vector as input. The final hidden state of the LSTM is given to a dense neural network, which transforms it to a six-dimensional output vector $Y_t = \{y_{t+1h}, y_{t+2h}, \dots, y_{t+6h}\}$. The output $Y_t$ represents the forecast Dst, as displayed in Equation (10).
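As an illustration, a minimal PyTorch sketch of such an architecture is given below. The layer sizes (five input features from Subsection 2.1, a hidden state of 50 neurons, six output horizons, and a 7-step input window from t − 6h to t) follow the text; the class name and exact wiring are our own, not the authors' released code.

```python
import torch
import torch.nn as nn

class LSTMNN(nn.Module):
    """Sketch of the LSTM-NN: an LSTM encodes the input window, and a
    dense layer maps the final hidden state to the six forecast horizons."""
    def __init__(self, n_features=5, hidden=50, horizons=6):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, horizons)

    def forward(self, x):             # x: (batch, 7, n_features), times t-6h .. t
        _, (h_n, _) = self.lstm(x)    # h_n: (1, batch, hidden), final hidden state
        return self.head(h_n[-1])     # (batch, 6): forecasts for t+1h .. t+6h

model = LSTMNN()
y = model(torch.randn(32, 7, 5))      # a batch of 32 input windows
```

Only the final hidden state $h_6$ is passed to the dense head, mirroring the description above.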
The model is trained using the RMSprop method, an unpublished adaptive learning rate method proposed by Geoffrey Hinton (see http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf). As with any other neural network based technique, LSTMs "learn" by iteratively adjusting the weights of the inter-neuron connections. The iterative process is similar to the Newton-Raphson method, adjusting the free parameters of the model by gradual changes based on the derivative of the error between the real output and the forecast output [16,29]. LSTMs are composed of a chain of neural networks that learn together, which requires an optimization method that correctly propagates the error backwards through the whole chain of neural networks composing the LSTM; RMSprop fulfills this role. The method has been used successfully for training LSTMs in generating complex sequences [13].
The PyTorch library provides an implementation of this method. The error criterion of the model was the mean squared error loss:

$L = \frac{1}{N}\sum_{i=1}^{N}(M_i - O_i)^2.$

When the model is trained on a training set, the error on this set shrinks with every iteration (or epoch). To prevent over-fitting, the process where the model starts memorizing the training set instead of learning from it, the performance of the model on the validation set is checked after every epoch. When the model's performance stops improving on the validation set, this is a good indicator that the model is starting to over-fit on the training set, and the training can be stopped. This is the classic early stopping method.
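The early stopping logic can be sketched independently of any particular framework; the `EarlyStopping` helper and its `patience` parameter below are illustrative names, not part of the paper's code.

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved
    for `patience` consecutive epochs."""
    def __init__(self, patience=10):
        self.patience, self.best, self.wait = patience, float("inf"), 0

    def step(self, val_loss):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
        return self.wait >= self.patience

# With patience=3 and validation losses 1.0, 0.9, 0.95, 0.93, 0.94,
# training stops after the third epoch without improvement over 0.9.
stopper = EarlyStopping(patience=3)
flags = [stopper.step(v) for v in [1.0, 0.9, 0.95, 0.93, 0.94]]
```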

Parameters of the model
When training a neural network, there are many parameters that can be chosen that have an impact on the performance. However, finding the optimal value for these parameters is problem-dependent, the so-called No Free Lunch theorem of Wolpert and Macready [51], and finding the optimal values is a computationally exhaustive task. In our model, there are three sets of hyperparameters: the first set is intrinsic to the architecture of the model itself, the second to the learning method, and the third to the training method. The hyperparameters were obtained by manually tweaking their values over the course of 15 to 20 runs and evaluating their performance using 7-fold CV. For our particular case, we observed no significant changes in the accuracy of the model caused by small variations of these values.
In the model itself, the LSTM has a few hyperparameters that impact its performance. The first is the number of neurons in the hidden layer of the LSTM. This number must be large enough to ensure it can encode the process behind the data, but not too large, to prevent over-fitting and excessive computational cost. This number has been determined by performing CV, and we found the best performance to be around 50 neurons. Next there is the number of layers in the LSTM. Multiple LSTMs can be stacked on top of each other, where the first LSTM gives its intermediate hidden states (see Figure 1) as input to the second LSTM, and so on. This increases the complexity and computational cost of the model. In our search we tested multiple layers of LSTMs, but did not find any significant performance increase, and set the number of layers to 1. Finally, there is the option to make the LSTM bidirectional, where the model has access to both future and past states. This is unnecessary in the context of the Dst forecast, so this option has been set to false.
The RMSprop method has many tweakable parameters, but we will focus on the two most important: the learning rate and the momentum. The learning rate is the most important parameter, and controls how strongly the model weights are changed by the error gradient. Too large a learning rate might cause unstable learning, with the performance swinging wildly during training and preventing convergence. Too small a learning rate might cause the model to barely change, converging too slowly or not at all. The momentum parameter affects the learning rate throughout the training process, and accelerates training. Most often, the momentum parameter is chosen close to 1. We found that setting the learning rate to 0.0003 and the momentum to 0.8 gave the best performance, with stable convergence of the training error.
Finally, when training the model, a few more parameters affect the performance. Two are important: the number of epochs and the batch size. The number of epochs decides how many times we loop over the full training set during training. This number must be large enough that the model has time to converge to the ideal solution before the training is stopped by the classic early stopping method. The batch size determines how many samples of the training data are given to the model before the error is computed and backpropagation is applied. Setting the batch size to one corresponds to so-called online learning, where the model is trained separately on every sample. The opposite is offline learning, i.e. setting the batch size to the size of the training set, so that the model is optimized on the accumulated error over the complete training set. Offline learning is almost never used, as it fails to learn the more outlying cases, while online learning is more prone to over-fitting. Using a small batch size is typically recommended. We found that setting the batch size to 64 gave fast convergence, and that this choice did not have a large impact on the performance.
Finally, we sum up the parameters of the model:
• hidden neurons in the LSTM: 50
• number of LSTM layers: 1
• bidirectional: false
• learning rate: 0.0003
• momentum: 0.8
• batch size: 64

Results and discussion
The LSTM-NN model is now evaluated using the defined metrics and baseline model of Subsection 2.3. The evaluation is discussed and a comparison of the model to some of the latest forecasting models is made. Finally, the forecast is visually observed.
The first analysis examines whether the LSTM-NN model performs better than the persistence model defined in Section 2.3.1. Table 1 summarizes the performance of both models. From Table 1, we conclude that using the more complicated LSTM-NN model results in better forecasts. In particular, forecasts made at forecasting horizons t + 3h to t + 6h show a significant improvement in accuracy compared to the persistence model.
Next we compare the LSTM-NN model to the models reported in the work of Gruet et al. [14] and Lazzús et al. [28]. Both publications present a neural network trained on OMNIWeb data used to forecast the Dst index. The model by Gruet et al. [14] also makes use of LSTM modules, while the model of Lazzús et al. [28] consists of a feed-forward neural network instead of an RNN, trained with a particle-swarm optimization method. The performance of their models was evaluated with the RMSE and Pearson correlation coefficient, and is summarized in Table 2 together with that of the LSTM-NN model. The confidence intervals were obtained from the cross-validation experiment, detailed in Section 2.6 and shown in Figure 2. Looking at the correlation coefficient, the model by Lazzús et al. [28] has a performance that is comparable to that of the LSTM-NN model for forecasting horizons t + 1h to t + 4h. However, for later forecasting horizons, the LSTM-NN performs significantly better. The model by Gruet et al. [14] is outperformed by the LSTM-NN for the forecasting horizon t + 1h, but performs equivalently for all the later forecasting horizons.
Looking at the RMSE, we see that the model by Lazzús has comparable performance for forecasting horizon t + 1h. However, for later forecasting horizons the LSTM-NN model is significantly better.
This seems to agree with the observation made by Gruet et al. [14], stating that the choice of an LSTM module in the model architecture improves the accuracy of the forecast. Looking at the performance of Gruet, the LSTM-NN shows a significant improvement for forecasting horizons t + 1h to t + 2h, but is equivalent in performance for later times.
In conclusion, our model seems to be comparable to that of Lazzús for forecasting horizons t + 1h and t + 2h, but outperforms it for later times. Our model shows some improvement over that of Gruet for forecasting horizons t + 1h and t + 2h, but is otherwise comparable to theirs.
Finally, a visual inspection of the forecast is presented. Figure 3 displays three geomagnetic storms contained in the test set, together with the forecast of the LSTM-NN model for forecasting horizons t + 1h, t + 3h and t + 5h. The first column displays the t + 1h forecast, which seems to be an almost perfect prediction of the storm. However, the forecast of the Dst index for forecasting horizons t + 3h and t + 5h, displayed in columns 2 and 3 of Figure 3, shows a distinct delay in the forecasting of the main phase. Take for example the prediction at forecasting horizon t + 5h: the sudden onset of the storm is predicted 5 hours too late.

Figure 2 (performance as a function of forecasting horizon, in hours): as in Table 1, the bars and error bars respectively display the mean and variance of the performance of the 10-fold cross-validation, and the square and cross display the performance of the models reported by Lazzús et al. [28] and Gruet et al. [14], respectively. All of these values are also summarized in Table 2.
This brings us to the main problem of this paper. The purpose of the experiment was to create an LSTM-NN model that forecasts the Dst index with the same accuracy and correlation as other published architectures. We managed to create such a model but, when visually inspecting the forecast, observed a distinct time shift between forecast and observation. If geomagnetic storms are forecast only when they start, the LSTM-NN model does not give us any more information than the persistence model. While it is not possible to say whether the models from Gruet et al. [14] and Lazzús et al. [28] also have this problem, we believe that one should pay close attention to it and ensure it does not happen.
An additional problem that most modern machine learning techniques have to face is that rare events cannot be properly forecast. Neural networks trained by gradient descent require that patterns occur frequently in the training data in order to be learned, which is not the case for extreme geomagnetic storms. A possible remedy is an ensemble approach, where multiple models specialize in the detection of different types of inputs and storm strengths.
These solutions are out of the scope of the present paper, but will be studied in future work.

To quantify the time shift described above, we propose a measure based on the Dynamic Time Warping (DTW) algorithm [see 4], a method that measures the relative similarity between two time series. At the very least, we expect the warping measure to be able to detect the forecast made by a persistence model. What follows first is a brief overview of the DTW algorithm, followed by the modifications we made to tailor the algorithm to our specific problem.

Dynamic Time Warping
The DTW algorithm is a method first developed for speech recognition and is now commonly used in the fields of economics, biology, and database analysis [see e.g. 42,48]. DTW is mainly used as a measure to investigate how much a sample time series matches or is contained in a target time series.
The strength of DTW is that it can compare two time series even though they might be shifted or stretched in time, a property that is essential to our goal. This section summarizes the algorithm developed by Berndt and Clifford [4]. A visualization of this algorithm is shown in Figure 4. Take two time series, Q and S, of length n and m, respectively.
The DTW algorithm first constructs a distance between these two time series by placing them in an n × m grid. Each grid point (i, j) then corresponds to an alignment of $q_i$ and $s_j$. An alignment is given a cost by a distance function $d(q_i, s_j)$. The distance function can be chosen freely; for our case the squared Euclidean distance $d(x, y) = (x - y)^2$ is used. The DTW algorithm then searches for a path (the so-called warping path) P in this grid that minimizes the sum of said distances. The warping path P can be defined as

$P = p_1, p_2, \dots, p_K,$

where each point $p_k$ corresponds to a grid point $(i, j)_k$. The path must then minimize the cost function,

$\mathrm{DTW}(Q, S) = \min_P \sum_{k=1}^{K} d(p_k),$

and must hold to the following conditions:
1. Boundary conditions: the path starts at $p_1 = (1, 1)$ and ends at $p_K = (n, m)$, so the beginning and end points of the two time series are matched.
2. Monotonicity: the indices $i_k$ and $j_k$ never decrease, so the path is ordered in time.
3. Continuity: the path advances one grid step at a time, so $i_{k+1} - i_k \leq 1$ and $j_{k+1} - j_k \leq 1$.
4. Warping window w: an optional constraint that sets the maximum distance in time between $i_k$ and $j_k$ to w: $|i_k - j_k| \leq w$.
In order to find this optimal path, the following dynamic programming technique can be used. Starting at point (1, 1), the cumulative distance ∆ at each grid point is computed by the recursive equation

$\Delta(i, j) = d(q_i, s_j) + \min\{\Delta(i-1, j), \Delta(i, j-1), \Delta(i-1, j-1)\}.$

Once all the cumulative distances are computed, the optimal warping path can be found by starting at the point (n, m) and tracing backwards through the grid, taking the smallest value each time. This is displayed in Figure 5.
A warping window constraint can be added to the algorithm. This window changes the warping cost and the warping path P. Let $w \in \mathbb{N}$ be the warping window; then only the grid points satisfying

$|i - j| \leq w \quad (17)$

are evaluated when computing the cumulative distances. Faster and better implementations of this algorithm exist [see e.g. 24,25,30,40], but they are outside the scope of this text.
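A compact sketch of the recursion and backtracking described above (squared distance, optional symmetric window w); `dtw_path` is our own illustrative implementation, not an optimized one.

```python
import numpy as np

def dtw_path(Q, S, w=None):
    """Cumulative-distance grid and warping path of the DTW algorithm,
    with squared distance and an optional warping window w."""
    n, m = len(Q), len(S)
    D = np.full((n + 1, m + 1), np.inf)   # row/col 0 are padding
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if w is not None and abs(i - j) > w:
                continue                  # outside the warping window
            cost = (Q[i - 1] - S[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # trace back from (n, m), taking the smallest cumulative distance each step
    path, (i, j) = [], (n, m)
    while (i, j) != (1, 1):
        path.append((i, j))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda p: D[p])
    path.append((1, 1))
    return D[1:, 1:], path[::-1]

# Two identical series align along the diagonal with zero cost
Q = np.array([0., 1., 2., 1., 0.])
D, path = dtw_path(Q, Q)
```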

The warping measure
It should be mentioned that the DTW algorithm does not satisfy the necessary properties to be a metric; for example, it is easy to see that it does not satisfy the triangle inequality. Consequently, this method will be called a measure, and not a metric.
This measure does not make use of the warping cost, and instead uses the information contained in the warping path P. The measure is then able to determine how exactly a forecast time series M of length m is shifted in time with respect to the observed time series O of length n. The DTW algorithm is applied to these time series, giving a cost matrix D of dimensions n × m.
The warping constraint defined in Equation (17) is applied, and w is set equal to the forecasting horizon time. However, an additional constraint is included: the warping window is restrained such that the algorithm only compares the prediction M at t + p with the observations O from time t to t + p, i.e. predictions are not compared to observations that are in the future. Applying this constraint can be done as a modification of the warping constraint defined in Equation (17):

$0 \leq j_k - i_k \leq w,$

where $i_k$ indexes the observation O and $j_k$ the prediction M. This is also illustrated in Figure 6. After computing the warping path, we take each step $p_k = (i_k, j_k)$ and compute what we define as the warp value:

$\Delta t_k = j_k - i_k.$

Finally, a histogram is taken from all the different values of ∆t. The percentages reflect how time series M is shifted compared to time series O. We now present the results of this measure applied to the persistence model prediction and the LSTM-NN model prediction.
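Under the convention above (index i for the observation O, index j for the prediction M), a sketch of the warp measure could look as follows; the implementation details, including the greedy backtracking, are our own, not the authors' released code.

```python
import numpy as np
from collections import Counter

def warp_measure(O, M, w):
    """Histogram (in percent) of warp values between observation O and
    prediction M, allowing M at t+p to align only with O from t to t+p,
    i.e. 0 <= j - i <= w for aligned pair (O_i, M_j)."""
    n, m = len(O), len(M)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if not (0 <= j - i <= w):
                continue                  # no alignment with future observations
            D[i, j] = (O[i - 1] - M[j - 1]) ** 2 + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, (i, j) = [], (n, m)
    while (i, j) != (1, 1):               # backtrack along smallest values
        path.append((i, j))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda p: D[p])
    path.append((1, 1))
    shifts = Counter(j - i for i, j in path)       # warp values Delta t
    total = sum(shifts.values())
    return {dt: 100.0 * c / total for dt, c in sorted(shifts.items())}

# A persistence-like prediction, shifted by 2 hours, should be detected
rng = np.random.default_rng(1)
O = rng.normal(size=40)
M = np.roll(O, 2)                         # M[t] = O[t-2] (apart from start-up)
hist = warp_measure(O, M, w=3)
```

Because the boundary condition forces the path endpoints to match, a few percent of the path sits at smaller shifts, as the text discusses for Table 3.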

DTW measure applied to the persistence model
The warping measure is first applied to the forecast of the persistence model. The persistence model can be seen as the textbook example for this algorithm. Assuming that the persistence model is set as

$M(t) = O(t - p),$

the algorithm should detect that almost all of the forecast values are shifted by a time p compared to the actual observation. The algorithm will not detect 100% of the values to be shifted by p, because of the boundary condition in the DTW algorithm that forces the beginning and end points of the two time series to match, as discussed in Section 3. The persistence model is applied to the test set defined in Section 2.1, and the resulting warp values are shown in Table 3. The results confirm our expectations: except for a few percent, all the values are detected to be shifted by the forecasting horizon.
One potential problem can arise when the time shift between the two compared time series is very large. First, the algorithm will take longer to run, as the window size w needs to be much larger. Second, the boundary condition will distort a larger fraction of the detected shifts, an effect already visible in Table 3, where around 2% of the shifted values are artificial due to the boundary constraint. Potential changes to the algorithm that could account for this problem are a topic for future work.

DTW applied to the LSTM-NN model
The normalized values of the DTW measure applied to the LSTM-NN model presented in Subsection 2.5 can be seen in Table 4. The highest percentages are located on the offset diagonal, identical to the results of the persistence model. As discussed before, this indicates that a shift in time exists between the observations and the model predictions, and confirms that the behavior observed in Section 2.6 occurs throughout the whole time series. We notice that the second highest percentage of each row is located on the diagonal, indicating that the model is actually capable of providing some accurate predictions of the Dst one hour into the future, similar to the observations of Wintoft and Wik [50] and Stepanova and Pérez [43].

Dst index analysis
What follows is a statistical analysis of the Dst index itself. The autocorrelation of the Dst is shown in Figure 8B. Notice the very high autocorrelation of the Dst index for delay times up to t + 7h. This can also be seen in the lag plot, shown in Figure 8A. This could explain why the persistence model has such high accuracy and correlation when evaluated with the metrics of Section 2.3.2, shown in Table 1. We believe that this also explains why the linear fit parameters of the persistence model are so high.
The partial autocorrelation is also an important quantity. The partial autocorrelation α(k), defined by Equation (23), behaves as the autocorrelation between $z_t$ and $z_{t+k}$, adjusted for the intermediate variables $z_{t+1}, z_{t+2}, \dots, z_{t+k-1}$ [5]:

$\alpha(k) = \mathrm{corr}\left(z_{t+k} - P_{t,k}(z_{t+k}),\ z_t - P_{t,k}(z_t)\right), \quad (23)$

where k is the lag between the two time series values $z_t$ and $z_{t+k}$, and $P_{t,k}(z)$ is the orthogonal projection of z onto the linear subspace of the Hilbert space spanned by $z_{t+1}, z_{t+2}, \dots, z_{t+k-1}$.
The partial autocorrelation of the Dst can be seen in Figure 8C. This shows what can actually be learned from the Dst index: after the correction applied by the autocorrelation, only the Dst at time step t + 2h still has some significant correlation to the Dst at time t. This would explain why the neural network model has difficulty accurately predicting values beyond t + 1h, and instead relies on behaving as a persistence model to predict the next values.
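Both quantities can be estimated directly from a series. The sketch below computes the lag-k autocorrelation and a regression-based partial autocorrelation (projecting out the intermediate values, as in Equation (23)), illustrated on a synthetic AR(1) series; this is our own example, not the Dst data.

```python
import numpy as np

def autocorr(z, k):
    """Sample autocorrelation of z at lag k."""
    return np.corrcoef(z[:-k], z[k:])[0, 1]

def partial_autocorr(z, k):
    """Partial autocorrelation at lag k: correlate z_t and z_{t+k} after
    regressing out the intermediate values z_{t+1}, ..., z_{t+k-1}."""
    if k == 1:
        return autocorr(z, 1)
    rows = np.array([z[i:len(z) - k + i] for i in range(k + 1)])  # z_t .. z_{t+k}
    X = np.column_stack([rows[1:k].T, np.ones(rows.shape[1])])    # intermediates + intercept
    resid = lambda y: y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.corrcoef(resid(rows[0]), resid(rows[k]))[0, 1]

# An AR(1) series z_t = 0.9 z_{t-1} + noise has high lag-1 autocorrelation
# but (theoretically) zero partial autocorrelation at lag 2
rng = np.random.default_rng(0)
z = np.zeros(5000)
for t in range(1, 5000):
    z[t] = 0.9 * z[t - 1] + rng.normal()
```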

Removing the autocorrelation
The autocorrelation properties of the Dst index are most likely the cause of the problem in the forecast. Direct workarounds consist of changing either the input or the output. A first solution is not to include the Dst index in the input vectors, as done by Wu and Lundstedt [52]. This gives a forecast based purely on the solar wind parameters.
Another solution is to de-trend the Dst time series and instead forecast the change in the Dst. Let us call ∆Dst the difference of the Dst between two consecutive time steps:

∆Dst(t) = Dst(t) − Dst(t − 1).

This parameter was also introduced by Wintoft and Wik [50]. However, they do not forecast the ∆Dst directly with their model, but use it as a parameter for data selection. A lag plot of ∆Dst shows that the correlation with the previous time step has almost completely vanished, as is visible in Figure 9A. Computing the autocorrelation confirms this, as seen in Figure 9B.
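The effect of differencing on a strongly autocorrelated series can be illustrated with a random walk, used here as a crude stand-in for a slowly varying index like the Dst (the series and helper below are illustrative, not OMNIWeb data):

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

rng = np.random.default_rng(42)
walk = np.cumsum(rng.standard_normal(10000))   # highly persistent series
delta = np.diff(walk)                          # differenced series

print(round(lag1_autocorr(walk), 3))   # close to 1
print(round(lag1_autocorr(delta), 3))  # close to 0
```

Differencing removes the trivial lag-1 predictability, so any remaining forecast skill on the differenced series must come from the other inputs, such as the solar wind parameters.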
The ∆Dst is a new parameter that we can use to train the LSTM-NN model. The experiment described in Section 2 is repeated, only this time the model forecasts the ∆Dst. As input we use, next to the parameters described in Section 2.1, also the previous values of the ∆Dst. Table 5 and Table 6 show the results of the LSTM-NN using this data. The forecasting of the ∆Dst works well for forecasting horizons of 1 to 2 hours. For later forecasting horizons, the correlation coefficient decreases sharply and the prediction efficiency becomes close to zero. The RMSE does not increase substantially as the forecasting horizon increases.

The results of the DTW measure are shown in Table 6. Notice the absence of a persistence effect, as most values are no longer on the offset diagonal. Only in the last row does there seem to be some delay, but this can be explained by the prediction efficiency of Table 5, which is close to 0 for forecasting horizon t + 6h. This means that the forecast most likely no longer resembles the observed time series, and the evaluation by the DTW algorithm loses its meaning. The forecast now accurately shows us that the predictive power of the LSTM is linked to the partial autocorrelation of the ∆Dst, and demonstrates the difficulty of providing an accurate forecast beyond t + 2h. The advantage of using the ∆Dst is that there is no longer a false sense of accuracy: the persistence effect gave the illusion of a strong forecast, while the ∆Dst does not. Using this as a basis, it will be much more transparent when a forecasting model provides an actually accurate forecast.

Finally, we discuss the possible causes of why forecasting the Dst is so difficult. We believe that there are two problems that we have not yet taken into account.
The first is the variation of the geo-effectiveness of the quiet solar wind, mainly caused by how the tilt of the Earth affects the interaction of the magnetosphere with the solar wind. Together with the inclination of the equatorial plane of the Sun, this causes a yearly variation that was not taken into account in this experiment. This is called the Russell-McPherron effect [39] and has been shown to affect the Dst index [41].

The second is that we believe it is misguided to forecast the Dst index using solar wind data measured at L1. These measurements are taken too close to the Earth, which places an intrinsic limit on how far into the future we can forecast. We believe that having measurements at L5 would greatly improve our ability to provide timely forecasts, as discussed by Hapgood [15]. The effects of using measurements at L5 could be explored in future research, where simulations such as EUHFORIA [37] are used to provide artificial measurements.

Conclusions
An LSTM-based neural network, called the LSTM-NN model, is trained to forecast the Dst index 1 to 6 hours in the future, using solar wind parameters and the Dst from 6 hours before the prediction as an input. While the evaluation scores have indicated that the LSTM-NN model is comparable to the latest publications, visual inspection shows that the model's forecast behavior is similar to that of a persistence model, using the last known input of the Dst as its output. Although the prediction performs better than the persistence model, showing that some information can be learned from the solar wind, the LSTM-NN model effectively fails in its objective.
In order to detect this new type of error, a new test based on the DTW algorithm is developed to measure the shift between observation and model prediction. DTW compares two time series within a time window, instead of comparing two values at the same timestamp as the RMSE and the correlation coefficient do, allowing the detection of temporal trends. From the output of the DTW algorithm, a least-distance mapping between the two time series is first obtained, which can then be used to compare the timestamps of the points mapped to each other. This gives us a measure of the time warp between the two time series, from which we can infer a potential persistence effect.
When this new measure was applied to the persistence model, the results were as expected, and completely captured the temporal behavior of the persistence model. When the measure was applied to the time series forecasting of the LSTM-NN model, it detected the temporal lag in the forecast, proving its usefulness.
Finally, the possible origin of this lag was discussed by observing the autocorrelation of the time series, together with possible alternative experiments that do not suffer from this temporal lag. It was shown that forecasting the differentiated Dst did not exhibit this temporal lag. The LSTM-NN model showed promising results for forecasting horizons of t+1h and t+2h, but later forecasts did not have a very high accuracy or correlation to the observations. Future studies focusing on forecasting the differentiated Dst could provide more transparent results. We believe that new research also has to explore the effect of the variability of the solar wind interacting with the magnetosphere as a function of the Earth's tilt and the inclination of the solar equatorial plane.
Finally, we believe that the observational data measured at L1 plays a big role in limiting the forecast horizon of the Dst index. Looking at the effects of having measurements at L5 should be further explored in future work, using simulations to provide the artificial measurements.
As a concluding remark, we would like to emphasize that researchers should be very prudent when reporting results of time series forecasting with the metrics defined in Section 2.3.2. These metrics fail to capture behaviors that are only seen when taking into account the temporal dimension of the forecast, and could therefore be misleading.
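As a concrete illustration of this pitfall (a toy example, not the data used in this work): a "forecast" that is nothing more than the observation delayed by one step can still score a near-perfect Pearson correlation and a small RMSE on a smooth series, despite having no predictive skill.

```python
import numpy as np

# Smooth "observed" series and a fake forecast that simply lags it by one step
t = np.linspace(0, 10, 500)
obs = np.sin(t)
persisted = obs[:-1]   # the "forecast" for step k+1 is just the value at step k
target = obs[1:]

r = np.corrcoef(target, persisted)[0, 1]
rmse = np.sqrt(np.mean((target - persisted) ** 2))
print(f"r = {r:.3f}, RMSE = {rmse:.3f}")  # deceptively good scores
```

A time-aware measure such as the DTW-based test proposed here flags this forecast immediately, while the pointwise metrics do not.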

A Additional information
Training and evaluating the model takes on average 10 minutes on a machine with the following specifications:
• OS: Windows 10
• Processor: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz, 2592 MHz, 6 Core(s), 12 Logical Processor(s)
• RAM: 16 GB
The source code and experiment can be found on the following webpage: https://github.com/brechtlaperre/DTW_measure.

When the time series is too small, the counts will be dominated by the green-colored block, while the actual truth will appear very small due to normalization.