Comparison of Deep Learning Techniques to Model Connections Between Solar Wind and Ground Magnetic Perturbations

Geomagnetically induced currents (GIC) can drive power outages and damage power grid components while also affecting pipelines and train systems. Developing the ability to predict local GICs is important for protecting infrastructure and limiting the impact of geomagnetic storms on public safety and the economy. While GIC data are not readily available, variations in the magnetic field, dB/dt, measured by ground magnetometers can be used as a proxy for GICs. We are developing a set of neural networks to predict the east and north components of the magnetic field, BE and BN, from which the horizontal component, BH, and its variation in time, dBH/dt, are calculated. We apply two techniques for time series analysis to study the connection of solar wind and interplanetary magnetic field properties obtained from the OMNI dataset to the ground magnetic field perturbations. The analysis techniques include a feed-forward artificial neural network (ANN) and a long short-term memory (LSTM) neural network. Here we present a comparison of both models' performance when predicting the BH component at the Ottawa (OTT) ground magnetometer for the years 2011 and 2015 and when attempting to reconstruct the time series of BH for two geomagnetic storms that occurred on 5 August 2011 and 17 March 2015.


INTRODUCTION
Geomagnetically induced currents (GICs) are one of the most significant space weather effects due to their potential to damage power grid components and cause widespread, long-term power outages. Thus, the ability to forecast GICs is of significant interest to the space weather community, industry partners, and national interests. The intensity of GICs is determined by the strength of the geoelectric field. However, neither measurements of GICs nor of the geoelectric field are readily available. The geoelectric field is driven by temporal changes in the magnetic field and by the local geology. Thus, measurements of dB/dt using ground magnetometers are used as a proxy for studying GICs. Ngwira et al. (2018) studied two storms during which intense dB/dt peaks occurred and indicated that substorms appear to be the driver of GICs, but state that it is not clear how the widespread features of substorms lead to localized peaks in dB/dt. They theorize that it could be due to "the mapping of magnetospheric currents to local ionospheric structures," but indicate that further study is needed. Physics-based models are used to determine magnetic field fluctuations, but high-resolution models are needed to obtain the spatially localized variations (Welling et al., 2019). Such models are computationally expensive and slow to run, posing challenges for their use as forecasting tools. Machine learning-based models have the potential to provide efficient, computationally inexpensive forecasts. Wintoft et al. (2015) developed models using Elman neural networks to predict the 30-min maximum of dBH/dt (the horizontal component of dB/dt) from ACE solar wind and magnetic field measurements. Their models generally predict the timing of GICs caused by sudden impulses well, even when trained using only ACE magnetic field measurements.
While many studies of GICs focus on high magnetic latitudes (>60°) that lie under the auroral oval, it has been shown that mid- (50°-60°) and low- (<50°) latitude regions are also at risk (Gaunt and Coetzee, 2007; Ngwira et al., 2008; Pulkkinen et al., 2010; Oliveira et al., 2018). Lotz and Cilliers (2015) developed a neural network-based model using solar wind and IMF inputs and dB/dt measurements at a Southern hemisphere mid-latitude station as outputs. They developed separate models for the north and east components of the geomagnetic field and found that fluctuations in the eastward component are more dependent on the interplanetary magnetic field (IMF) Bz. Similar to Wintoft et al. (2015), they found reasonable predictions of the timing of intense fluctuations, with less accuracy as the storm evolved.
More complex neural network architectures can be used to improve predictions for time-series data. For example, recurrent neural networks such as long short-term memory (LSTM) networks are designed to "remember" parameters from earlier times that have a strong influence on the output features. In this study we present a comparison of models using a feed-forward artificial neural network (ANN) with a built-in time dependence and an LSTM neural network to predict the ground magnetic field north and east components (and therefore the perturbations dBH/dt) at the mid-latitude ground magnetometer station located in Ottawa (OTT). We then discuss the performance of the models using two of the benchmark geomagnetic storms suggested by Pulkkinen et al. (2013) and Welling et al. (2018). Finally, we discuss several model variations that were implemented during the course of this study to determine possible improvements to the performance of the models.

DATA
For this study, we use solar wind and interplanetary magnetic field (IMF) data obtained from the OMNIWeb dataset, available through NASA's Space Physics Data Facility, from 1995 through 2010 for training and validation of the models, and from 2011 and 2015 for testing. These 2 years were selected for testing because they include storms from the Pulkkinen-Welling validation set for ground magnetic perturbations (Pulkkinen et al., 2013; Welling et al., 2018). Baseline-removed ground magnetometer data from OTT have been obtained from SuperMAG (Gjerloev, 2012). The Ottawa ground magnetometer is located at magnetic latitude 54.98°N and lags UT by 5 h (meaning local midnight occurs at 05:00 UT). The choice of using solar wind data from OMNI instead of the more traditional and continuously available geomagnetic indices is based on the long-term goal of forecasting variations in the ground magnetic field ahead of time: real-time solar wind parameters obtained at the L1 position provide a 30-40 min window to distribute a warning. However, the OMNI dataset has been mostly avoided in the past because approximately 20% of the plasma parameters and ∼8% of the IMF measurements are missing, distributed roughly evenly through the years. This was noted by Wintoft et al. (2015), leading them to compare models trained on just the magnetic field vs. combined magnetic field and plasma measurements. Our intention is to use both IMF and plasma parameters. Since a relatively continuous dataset is preferred for training, gaps of up to 10 min in all missing parameters have been filled by linear interpolation in the training/validation dataset, which reduces the missing values to ∼6% in both IMF and plasma measurements. OTT ground magnetic field measurements have less than 1% missing data during the period of study; those missing data points have been removed from the training set. For the testing periods of 2011 and 2015, a full linear interpolation has been performed on the solar wind data to achieve a completely continuous dataset.
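The short-gap interpolation described above can be sketched with pandas; the series values are hypothetical, and `limit=10` caps the number of consecutive minutes filled:

```python
import numpy as np
import pandas as pd

# Hypothetical 1-min cadence solar wind speed series with short gaps (NaN).
t = pd.date_range("2005-01-01", periods=8, freq="min")
speed = pd.Series([400.0, np.nan, np.nan, 430.0, np.nan, 450.0, np.nan, 470.0],
                  index=t)

# Linearly interpolate, filling at most 10 consecutive missing minutes;
# note that the leading values of longer gaps would also be filled, so in
# practice long gaps should be masked out of the training set afterwards.
filled = speed.interpolate(method="time", limit=10, limit_area="inside")
```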
Although dBH/dt is the best proxy measurement for GIC forecasting, it is also very noisy and therefore difficult to forecast directly with data-driven models. Tóth et al. (2014) found this to be true for first principles-based models as well. We therefore aim to first predict the northward and eastward components of the baseline-removed ground magnetic field, BN and BE, respectively, using two independent models, and then combine them to obtain the predicted horizontal component

BH = (BN^2 + BE^2)^(1/2).    (1)

For the purpose of comparison with the metrics defined by Pulkkinen et al. (2013) we also need to obtain dBH/dt, which is calculated as

dBH/dt = [(ΔBN/Δt)^2 + (ΔBE/Δt)^2]^(1/2),    (2)

where ΔBi = Bi(t) − Bi(t − Δt), Δt is determined by the ground data resolution (1 min), and i represents the components N or E. We emphasize that large variations are most likely to result in significant GIC events.
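Combining the two predicted components into BH and dBH/dt can be sketched as follows (NumPy, with toy values; array and function names are illustrative):

```python
import numpy as np

def horizontal_perturbation(bn, be, dt_min=1.0):
    """Combine B_N and B_E series (nT) into B_H and dB_H/dt (nT/min).

    dt_min is the ground data resolution in minutes (1 min here).
    """
    bh = np.sqrt(bn**2 + be**2)
    # Finite differences of each component over one time step:
    dbn = np.diff(bn) / dt_min
    dbe = np.diff(be) / dt_min
    dbh_dt = np.sqrt(dbn**2 + dbe**2)
    return bh, dbh_dt

# Toy series illustrating the calculation:
bn = np.array([10.0, 13.0, 13.0])
be = np.array([0.0, 4.0, 0.0])
bh, dbh = horizontal_perturbation(bn, be)
```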

Feed-Forward Artificial Neural Network
In our first attempt to forecast the BE and BN components we have chosen to train a fully-connected, feed-forward artificial neural network (ANN) developed using the TensorFlow-Keras environment (Abadi et al., 2015). The TensorFlow-Keras environment is highly modular and allows the easy integration of data and different types of networks. It also allows computation on GPUs instead of traditional CPUs, which reduces training times significantly (in our case, by up to a factor of 10). For the network architecture we have chosen a 4-layer deep network with hidden layers of 291-146-73-36 neurons plus a dropout layer (rate 0.1) between the first and second layers to avoid overfitting. The first layer matches the dimensionality of the feature vector, and each subsequent layer is roughly halved. We have selected mean square error (MSE) as our loss function, Adam as our optimizer, and a Rectified Linear Unit (ReLU) as our activation function. We track the loss in the validation dataset and stop the training after the MSE has not decreased for 25 epochs to avoid overfitting. To incorporate time dependence in the ANN, the input vector for time t includes features from previous time steps, e.g., t − 1, t − 2. We have chosen to include a 2-h time history of solar wind speed, proton density, dynamic pressure, temperature, and solar wind electric field, using a 1-min cadence for the first 12 preceding minutes (i.e., up to t − 12) plus 10-min averages over the entire interval (yielding 12 additional values). Additionally, ground magnetometer sin(MLT) and cos(MLT) values have been included to capture the cyclical dependence on the Earth's rotation, with the solar zenith angle serving as a proxy for both longitude and yearly seasonality. The resulting feature vector contains 291 features, which explains our choice of neurons in the first layer. OMNI data from 1995 to 2010 were split sequentially, 70% for training and 30% for validation. The model was then trained and tested on the data from 2011 and 2015.
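The architecture described above can be sketched in TensorFlow-Keras; the layer sizes, dropout rate, and early-stopping patience follow the text, while the training call itself is omitted:

```python
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 291  # 2-h time-history features plus sin(MLT)/cos(MLT)

# Feed-forward ANN: hidden layers of 291-146-73-36 neurons, dropout
# (rate 0.1) between the first and second hidden layers, ReLU activations.
model = keras.Sequential([
    layers.Input(shape=(N_FEATURES,)),
    layers.Dense(291, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(146, activation="relu"),
    layers.Dense(73, activation="relu"),
    layers.Dense(36, activation="relu"),
    layers.Dense(1),  # one model per output component (B_N or B_E)
])
model.compile(optimizer="adam", loss="mse")

# Stop training once the validation MSE has not decreased for 25 epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=25)
```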

Long Short-Term Memory
The long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) recurrent neural network was developed using TensorFlow-Keras (Abadi et al., 2015). The input features used in the LSTM model were the same as were used for the ANN as was the training-validation split of the 1995-2010 OMNI data.
The LSTM model requires an extra dimension in the input vector to build in the time history. To keep the models consistent with each other, the LSTM model also uses the preceding 12 min plus 10-min averages over the preceding 2-h time history. Memory limitations required the use of a custom data generator, which fed batches of 512 training samples sequentially into the model for training. The network consists of a single LSTM layer with 147 neurons and a single dense layer with one neuron. A ReLU is used as the activation function, the optimizer is Adam, and the loss function is MSE. As with the ANN, the validation loss was monitored and training was stopped after the MSE had not decreased in 25 epochs. Although recurrent neural networks, and in particular LSTMs, are capable of utilizing the time history of the target parameter (in this case, BN or BE) to improve the prediction, in this study we have chosen not to use it in order to obtain a closer comparison with the ANN model. Once the model was trained, it was tested on data from 2011 and 2015.
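A corresponding TensorFlow-Keras sketch of the LSTM model is below; the number of input features per time step is hypothetical, while the 147-neuron LSTM layer, single-neuron dense output, and batch size of 512 follow the text:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_STEPS = 24     # 12 one-minute lags plus 12 ten-minute averages
N_FEATURES = 12  # hypothetical number of solar wind/IMF inputs per step

# Single LSTM layer (147 neurons) followed by a single-neuron dense layer.
lstm_model = keras.Sequential([
    layers.Input(shape=(N_STEPS, N_FEATURES)),
    layers.LSTM(147, activation="relu"),
    layers.Dense(1),
])
lstm_model.compile(optimizer="adam", loss="mse")

# A data generator fed batches of 512 samples; one random batch
# illustrates the expected 3-D input shape (batch, time, features).
x = np.random.rand(512, N_STEPS, N_FEATURES).astype("float32")
y_pred = lstm_model.predict(x, verbose=0)
```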
The real vs. predicted values of BH for the validation and test datasets are shown as a density plot in Figure 1 (bottom row). The RMSE is 8.7 nT with 33% explained variance for the validation set, 9.7 nT with 35% explained variance for the 2011 dataset, and 16.8 nT with 36% explained variance for the 2015 dataset. The consistency of the correlation coefficients (shown in Figure 1) indicates that the model did not suffer from overfitting, which could have been a concern given the amount of input data. Considering these parameters, the LSTM model seems to under-perform the ANN. As in the ANN case, the error distribution for the predictions of the LSTM model further indicates that the model is not severely overfitting (see Supplementary Material).
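The summary statistics quoted above (RMSE and explained variance) can be computed as in this NumPy sketch (toy values for illustration):

```python
import numpy as np

def rmse_and_explained_variance(y_true, y_pred):
    """RMSE (same units as the field, nT) and explained variance,
    the fraction of the measured variance captured by the prediction."""
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err**2))
    ev = 1.0 - np.var(err) / np.var(y_true)
    return rmse, ev

# Toy check: a prediction offset by a constant has nonzero RMSE but
# still explains all of the variance.
y = np.array([1.0, 2.0, 3.0, 4.0])
rmse, ev = rmse_and_explained_variance(y, y + 2.0)
```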

RESULTS
The two types of models we trained are used to predict BN and BE during storms that occurred on 5 August 2011 and 17 March 2015. The selection was based on the recommendations from the Pulkkinen-Welling validation set for ground magnetic perturbations (Pulkkinen et al., 2013; Welling et al., 2018). These two storms were selected because they are outside of the time range used to train and validate the models and because they correspond to two very different years in terms of geomagnetic activity, 2011 being on the ascending part of the solar cycle following solar minimum, and 2015 corresponding to the solar cycle maximum. Figure 2 shows the Sym-H index, solar wind velocity, and IMF magnetic field z-component, along with the measured BH and dBH/dt at the Ottawa station, and the predicted BH and dBH/dt from the ANN and LSTM models. This is a strong storm with a minimum SYM-H value of -126 nT, driven by a combination of a coronal mass ejection and a high-speed stream, with the shock arriving at Earth leading to a sudden storm commencement around 18 UT on August 5. In this event, BH presents a clear response to the storm that shows as repeated dBH/dt variations of up to ∼50 nT/min occurring for several hours during the storm main phase, following the period of rapid increase in solar wind speed and mostly during the time in which the IMF Bz component is strongly and persistently southward.

August 5 2011 Storm
In terms of reconstructing the BH evolution, neither model does a good job of predicting the first two enhancements. The ANN does a decent job predicting the third and fourth enhancements. After ∼23:00 UT on August 5, the ANN predicts some increases but does not match the magnitude, likely because these occur later in the storm, when the magnetic field perturbations are controlled more by parameters within the magnetosphere or ionosphere, although this is also a period in which the actual fluctuations in BH are relatively small. The LSTM does not predict the enhancements during this storm well in general. Considering dBH/dt, the ANN comes close to predicting the timing of the biggest spike in dBH/dt at ∼20:00 UT on August 5, at about half the magnitude, while the LSTM misses it completely. Both models miss the initial spike at the sudden storm commencement just before 5:00 UT, and start to predict enhanced BH about 2-3 h later. Both models predict some enhancement near 8:00 UT, though matching the real enhancement is unlikely because this is an interval of linear interpolation of the input data (as seen in the straight line in the solar wind velocity and IMF Bz over 7:00-9:00 UT). This linear interpolation results in the ANN overpredicting the magnitude and the LSTM predicting significant spikes. The LSTM does a better job of predicting the enhancement just after 12:00 UT, even getting close to the overall magnitude of BH. Again, the linearly interpolated input data (∼15:00-17:00 UT) reduces the ability to accurately predict the second half of this enhancement. Only the ANN predicts the largest enhancement at ∼21:00 UT, but it matches neither the shape nor the timing of the peak.

March 17 2015 Storm
FIGURE 2 | August 5, 2011 storm measurements and predictions, including the Sym-H index, solar wind velocity, interplanetary magnetic field z-component, and the measured (blue solid) and predicted (red dashed) horizontal magnetic field and its time-dependent variation at the OTT ground station for the ANN and LSTM models.

It is important to note that in this particular storm there are some gaps in the solar wind measurements, particularly between 06:00 and 08:00 UT and again between 16:00 and 18:00 UT on March 17. In order to generate a prediction, the solar wind data have been linearly interpolated. Despite the missing data, a simple linear interpolation allows the models to recover part of the variability of the ground magnetic field fluctuations, suggesting that efforts on gap-filling, including empirical modeling of the solar wind data, could still yield positive results for the risk assessment of GICs.

Validation Metrics
The Pulkkinen-Welling recommendations include four metrics for validating predictions of ground magnetic perturbations (Pulkkinen et al., 2013; Welling et al., 2018) using binary event analysis. For a particular storm, the interval of interest is divided into non-overlapping 20-min windows, and for each window a dB*H/dt value is calculated as the maximum measured dBH/dt within the window. In addition to calculating dBH/dt using Equation (2), we use the power-law empirical fitting for the OTT station described by Tóth et al. (2014), and present both results for comparison. Pulkkinen-Welling propose four different thresholds of 0.3, 0.7, 1.1, and 1.5 nT/s to evaluate whether the model is able to predict a variation of that magnitude. Since the thresholds are in nT/s and our calculated dBH/dt is in nT/min, we multiply the thresholds by 60. Using the values of the predicted dB*H/dt, we determine whether the model accurately predicts the events by counting true positives, false positives, true negatives, and false negatives. Accuracy is determined by calculating the following metrics: probability of detection (POD), probability of false detection (PFD), proportion correct (PC), and Heidke Skill Score (HSS). Table 1 presents the resulting metrics. Using the empirical fitting improves the HSS for our models, as Tóth et al. (2014) found for a first principles-based model. In fact, the ANN has a similar HSS for the 0.3 nT/s threshold as they report for mid-latitude stations (0.583). However, our methods have a lot of potential for improvement. We had originally trained and optimized a single model of each type that predicted BH, rather than independent models for BN and BE. The single models of each type had better correlation coefficients, explained variance, and RMSE than the values discussed in section 3, but much lower validation metrics than those shown in Table 1.
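The binary event analysis above can be sketched as follows; the window handling and names are illustrative, and the HSS expression is the standard contingency-table form:

```python
import numpy as np

def binary_event_metrics(db_obs, db_pred, threshold, window=20):
    """POD, PFD, PC, and HSS from windowed maxima of dB_H/dt.

    db_obs, db_pred : 1-min dB_H/dt series (nT/min)
    threshold       : event threshold in nT/min (e.g., 0.3 nT/s * 60)
    window          : non-overlapping window length in minutes
    """
    n = (len(db_obs) // window) * window
    # An event occurs when the windowed maximum reaches the threshold.
    obs = db_obs[:n].reshape(-1, window).max(axis=1) >= threshold
    pred = db_pred[:n].reshape(-1, window).max(axis=1) >= threshold
    tp = np.sum(obs & pred)
    fn = np.sum(obs & ~pred)
    fp = np.sum(~obs & pred)
    tn = np.sum(~obs & ~pred)
    pod = tp / (tp + fn)                  # probability of detection
    pfd = fp / (fp + tn)                  # probability of false detection
    pc = (tp + tn) / (tp + fp + tn + fn)  # proportion correct
    hss = 2 * (tp * tn - fn * fp) / (
        (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))  # Heidke Skill Score
    return pod, pfd, pc, hss
```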

CONCLUSIONS
We have developed and compared two types of models that predict the north and east components of the ground magnetic field, BN and BE, at a single mid-latitude ground station. One model is a feed-forward artificial neural network that includes time dependence through its input features and the other is a long short-term memory neural network. The predictions from each model are compared to real measurements for 2 years, 2011 and 2015, including one storm during each year. Each model shows some ability to predict the timing of magnetic field perturbations, though neither model is consistently better across the storms, and neither is able to predict the magnitude of the enhancements or to predict enhancements later in the storm. Validation metrics indicate that the LSTM is barely more skilled than random or constant predictions, and that using an empirical fitting improves the HSS as it does for first principles-based models. Next steps to improve the models include adjustments of the input parameters, an increased time-history cadence, and comparison with additional time-series techniques. Another limitation of these models is the use of only one ground magnetometer station. We expect better predictions if we include more ground stations in the mid-latitude range to obtain more MLT coverage.

DATA AVAILABILITY STATEMENT
The solar wind, IMF, and Sym-H index data are available from OMNIWeb at https://omniweb.gsfc.nasa.gov and the ground magnetometer data are available from SuperMAG at http://supermag.jhuapl.edu.

AUTHOR CONTRIBUTIONS
AK was the primary author of the paper, provided guidance on model development and analysis, and participated in interpretation. VP conducted data preparation, model development, analysis, and interpretation and assisted with writing. MC conducted model development and analysis and assisted with writing. MM provided guidance on data preparation and model development. CL assisted with data preparation. HC assisted with overall project discussions and provided editorial comments. All authors contributed to the article and approved the submitted version.

FUNDING
This work was supported by NSF EPSCoR Award OIA-1920965.