The predictive power of magnetospheric models for estimating ground magnetic field variation in the United Kingdom

Space weather events can have damaging effects on ground-based infrastructure. Geomagnetically induced currents (GIC) caused by rapid magnetic field fluctuations during geomagnetic storms can negatively affect power networks, railways as well as navigation systems. To reduce such negative impacts, good forecasting capability is essential. In this study we assess the performance of contemporary magnetohydrodynamic (MHD) models in predicting the external-only ground magnetic field perturbations at three United Kingdom observatories during two severe space weather events: September 2017 and March 2015. Simulated magnetic data were acquired via Community Coordinated Modeling Center (CCMC), using the following models: Space Weather Modeling Framework (SWMF), Open Geospace General Circulation Model (Open GGCM) and Lyon–Fedder–Mobarry (LFM) combined with the Rice Convection Model (RCM). All simulations use spacecraft measurements at L1 as their solar wind input in calculating ground perturbations. Qualitative and quantitative comparison between measured and modelled values suggest that the performance of MHD models vary with latitude, the magnetic component and the characteristics of the storm analysed. Most models tend to exaggerate the magnitude of disturbances at lower latitudes but better capture the fluctuations at the highest latitude. For the two storms investigated, the addition of RCM tends to result in overestimation of the amplitude of ground perturbations. The observed data-model discrepancies most likely arise due to the many approximations required in MHD modelling, such as simplified solar wind input or shift in location of the electrojets in the simulated magnetospheric and ionospheric currents. It was found that no model performs consistently better than any other, implying that each simulation forecasts different aspects of ground perturbations with varying level of accuracy. Ultimately, the decision of which model is most suitable depends on specific needs of the potential end user.

fluctuations in the Earth's magnetic field (B) resulting in the induction of geoelectric fields in the subsurface. These give rise to geomagnetically induced currents (GICs) which negatively affect high voltage power networks, pipelines and railways (Boteler, 1994;2021;Boteler et al., 1998;Pirjola, 2000;Zhang et al., 2015;Rajput et al., 2021). Variations in the geomagnetic field can also affect navigation in directional drilling used in the oil and gas industry (Reay et al., 2005). The negative effects of extreme geomagnetic activity have been observed many times. The first recorded example is known as the Carrington event in 1859, which is believed to be the most severe magnetic storm ever observed (Chapman et al., 2020), causing significant interference with telegraph lines (Cliver and Dietrich, 2013;Hayakawa et al., 2019). Another example is the large storm from March 1989 which is considered one of the most significant magnetic storms of the 20th century, strongly affecting the northern regions of the United States and Europe. Geomagnetic disturbances caused transformer damage resulting in the collapse of the Hydro-Quebec power system, which left millions of people without power for several hours (Bolduc, 2002;Boteler, 2019). A more recent example is the moderate storm from February 2022, during which the majority of Starlink satellites launched into a very low orbit (200 km) were deorbited due to enhanced atmospheric drag. This event shows that even minor storms can have severe financial consequences (Dang et al., 2022). To prevent and mitigate such negative impacts, reliable space weather forecasting methods are essential for providing early warning to infrastructure operators.
Over the past decades, many space weather prediction models have been developed. They aim to simulate the complexity of the solar wind-magnetosphere-ionosphere coupling by solving the magnetohydrodynamic (MHD) equations. Many currently available models rely on solar wind parameters, obtained upstream of the Earth at the L1 Lagrange point by satellite, as their input which are then propagated further down into the Earth's magnetosphere and ionosphere. Using post-processing tools and techniques, the magnetic field perturbations at a given location on the ground can be then estimated from snapshots of the field-aligned and ionosphere height-integrated currents produced by the simulations (Rastätter et al., 2014). Such MHD models are increasingly being used for operational services for predicting values of ground magnetic fields and to provide hazard assessments and warnings up to several hours in advance of an anticipated space weather event. Hence it is important to understand their performance, uncertainties and limitations, particularly as secondary and tertiary services become reliant on their forecasts. In general, the MHD approach is believed to correctly describe the large-scale interaction of solar wind with magnetosphere and ionosphere theoretically by using constant solar wind parameters and interplanetary magnetic field as input. With the availability of more and more observational data from ground-based and space-borne observatories, researchers have tried to model the responses of the magnetosphere to real-time observations by using data-driven global MHD simulations with realistic observed parameters as input. Feng and Feng (2020) present an overview of such global MHD models of magnetosphere, with the focus on numerical implementations.
The performance of such models has been assessed in multiple validation challenges and studies in the past, focusing mainly on large high latitude regions in the northern hemisphere during geomagnetically disturbed periods (Yu and Ridley, 2008;Rastätter et al., 2011;Pulkkinen et al., 2013;Rastätter et al., 2013;Welling et al., 2017;Kwagala et al., 2020). These studies use a variety of metrics to evaluate the performance of chosen models in replicating ground field variations, as well as their rate of change with time measured at selected ground magnetic observatories. However, less attention has been paid to low and midlatitudes where the majority of the global population live and much of the potentially vulnerable infrastructure is located (Zhang et al., 2016;Oughton et al., 2019).
Large variations of the horizontal magnetic field component are the main driver of the resulting geoelectric field at the Earth's surface (Bolduc et al., 1998;Viljanen et al., 2001;Smith et al., 2021). Since the induced electric field also depends on the relative geometry and properties of the power network, the orientation of the magnetic field (North-South, B x , and East-West, B y ) should also be considered to provide more detailed and effective GIC forecast (Kelly et al., 2017). Therefore, in the work presented here, individual components B x and B y are the main focus, though the horizontal field (B H = √B 2 x + B 2 y ) is also considered. Accurate values of ground magnetic fields, particularly time-series, allow geoelectric fields and GICs to be computed with greater confidence through magnetotelluric transfer functions and models of the response of high voltage power grids to variation in ground electric field (Beggan et al., 2021).
The following section (Section 2) describes the validation setting, including the selected space weather events, location of the United Kingdom observatories and information about the MHD models investigated. Data-model comparison metrics selected for performance assessment are described in Section 3. The results and data-model comparison analysis can be found in Section 4, followed by a brief discussion and concluding remarks in Section 5.
2 Magnetohydrodynamic model validation setting

Selected geomagnetic storms
In this study, two of the largest geomagnetic storms from solar cycle 24 are investigated. The details and duration of each can be found in Table 1. Both selected events cover the most disturbed 24-h period of the respective storm. The severity of each event can be determined by various space weather indices. The planetary Kp index quantifies magnetic activity in the range from 0 to 9, where 0 describes quiet conditions and 5 or greater indicates a storm (Bartels et al., 1939). The Dst index describes the variation in the ring current intensity and is believed to reflect the global magnetic field reduction during storms. It is derived from the horizontal magnetic field component measured at four nearly equatorial magnetic observatories and the resulting Dst is calculated at 1 h intervals. Similarly to Dst, SYM-H index represents the symmetric part of the horizontal component but relies on data from six observatories instead, providing more even distribution in longitude. The resulting SYM-H is given in 1min resolution. Detailed description of the differences between the two indices can be found in Wanliss and Showalter (2006). The values of selected indices for each event presented in Table 1   Early September 2017 is considered one of the most geomagnetically active periods of solar cycle 24, where a significant number of sunspots caused several major flares and coronal mass ejections (CMEs). This period has been analysed in many studies, for example by Rodger et al. (2020) Simpson and Bahr (2021). Severe geomagnetic activity observed on the Earth was a consequence of multiple X-class and M-class solar flares that occurred on the 6th and 7th of September. In the following analysis we investigate the period associated with the sudden storm commencement caused by the solar shock arrival just after 23:00 UT on the 7th, when solar wind speed drastically increased.
The second event is another well-known storm, commonly referred to as the St. Patrick's Day event. It is considered the first super geomagnetic storm of solar cycle 24. This storm was associated with a CME that occurred on 15 March 2015, reaching Earth on March 17th. Just like the September 2017 storm, this event is also a popular research topic within the space weather research community (Zhang et al., 2015;Carter et al., 2016;Guerrero et al., 2017;Huba et al., 2017;Divett et al., 2018;Ngwira et al., 2018).

Ground magnetic observatories
We use data from three United Kingdom geomagnetic observatories operated by the British Geological Survey: Hartland (HAD) in Devon, England; Eskdalemuir (ESK) in the Southern Uplands of Scotland and Lerwick (LER) in the Shetland Islands. Detailed location information for each of them can be found in Table 2 and on the map in Figure 1.
Ground magnetic field measurements, at minute-mean cadence, were obtained from the INTERMAGNET website 2 . The quiet time average was removed from each data set by subtracting the mean value taken over the first 2 h of each time series to leave only the external field variations. Those data sets are then used as "ground truth" for comparison with simulated results.

FIGURE 1
Map of the permanent United Kingdom geomagnetic observatories.

Magnetohydrodynamic models 2.3.1 Space weather modeling framework
The Space Weather Modeling Framework (SWMF) was developed by the Center for Space Environment Modeling (CSEM) at the University of Michigan. It aims to simulate various domains of the Sun-Earth chain, and ultimately help predict space weather phenomena. This framework combines numerical models of various regions between the Sun and the Earth into a highperformance coupled model (Tóth et al., 2005). In this study, the global magnetosphere (GM) and its coupling to the inner magnetosphere (IM) are the main focus.
The model included in SWMF that describes the mentioned domains is the Block-Adaptive-Tree Solar-Wind Roe-Type Upwind Scheme (BATS-R-US) (Powell et al., 1999). The BATS-R-US code solves the ideal 3D MHD equations as well as semi-relativistic, Hall, multispecies and multifluid equations Glocer et al., 2009). The governing equations for an ideal, nonrelativistic, compressible plasma, written in primitive variables, represent a combination of the Euler formulas of gas dynamics and Maxwell equations of electromagnetism. BATS-R-US uses the blockbased adaptive mesh refinement (AMR) technique with a finite volume form using numerical methods related to Roe's Approximate Riemann Solver. A variety of different grids are available, as well as explicit, implicit and point-implicit time stepping options (Zeeuw et al., 2004;Glocer et al., 2009;Ridley et al., 2016). Detailed description of the code and the adaptive blocks technique can be found in papers by Zeeuw et al. (2000) and Stout et al. (1997).
The SWMF/GM uses solar wind plasma data (density, velocity, temperature and magnetic field) in Geocentric Solar Magnetospheric (GSM) coordinates as input, which are then propagated from the L1 Lagrange satellite position to the sunward boundary of the simulation domain. The Earth's magnetic field is represented by a dipole with either a fixed orientation or updated axis orientation and co-rotating inner magnetospheric plasma during the simulation run. The outputs provided by SWMF/GM include plasma parameters, magnetic field and electric currents. In this study the following SWMF configurations were used for both events: • SWMF v20180525, high-resolution grid with 9,623,552 cells, • SWMF v20180525 coupled with RCM, high-resolution grid with 9,623,552 cells.

Open geospace general circulation model
The Open Geospace General Circulation Model (OpenGGCM) is a global coupled model of Earth's magnetosphere, ionosphere and thermosphere. In the magnetospheric part, the model solves the resistive MHD equations as an initial-boundary-value problem. The ionosphere potential solver closes the magnetospheric field-aligned currents in the ionosphere, effectively combining magnetospheric convection with ionospheric convection. The magnetospheric module was later improved by coupling with the ionosphere-thermosphere system modelled by the NOAA CTIM [Coupled Thermosphere Ionosphere Model (Fuller-Rowell et al., 1996)]. This addition maps the field-aligned currents into the ionosphere, computes the ionospheric potential and electron precipitation parameters as well as Hall and Pedersen conductances (Raeder et al., 2008;Raeder et al., 2017;Cramer et al., 2017). Similarly to SWMF, OpenGGCM uses solar wind plasma and magnetic field parameters as input, which are propagated to the upstream simulation boundary. The simulated Earth's magnetic field is approximated by a dipole with fixed orientation during the entire simulation run. The following configurations of the OpenGGCM simulation will be analysed for both events: • OpenGGCM 5.0, overview grid with 7 M cells, • OpenGGCM 5.0 coupled with RCM, overview grid with 7 M cells.

Lyon-Fedder-Mobarry
The Lyon-Fedder-Mobarry (LFM) global magnetospheric simulation code was first developed around 1985 and improved over the years with higher resolving power transport, adapted grid, high Alfvén speed, low plasma β, and high magnetic field as well as an integral ionospheric model . The Magnetosphere-Ionosphere Coupler/Solver (MIX) code provides the LFM model with the ionospheric electrostatic potential solution and electric field inner boundary condition information (Merkin and Lyon, 2010). The thermosphere-ionosphere electrodynamics general circulation model (TIE-GCM) is also coupled with the LFM-MIX model combination to provide conductance information necessary in the electric potential calculation. Details of the TIE-GCM can be found in Richmond et al. (1992). All these models in combination form the coupled magnetosphere-ionosphere-thermosphere (CMIT) model Wiltberger et al., 2004) which is a part of the Sun-Earth system simulation package developed by the Center for Integrated Space Weather Modeling (CISM) .The results for both events presented in this paper rely on the following LFM configuration: • LFM-2_1_5 coupled with RCM, grid with 106 × 96 × 128 cells.

Rice convection model
In order to represent the development and behaviour of the equatorial ring current, each of the aforementioned models can be coupled with the Rice Convection Model (RCM), as described in studies by (Zeeuw et al., 2004;Zhang et al., 2007;Zaharia et al., 2010;Ridley et al., 2016). The RCM is a bounce averaged-drift kinetic model of the inner and middle magnetosphere that includes coupling to the ionosphere. It describes adiabatically drifting isotropic particle distributions in a self-consistently computed electric field and time varying magnetic field. The model represents the plasma population in terms of multiple fluids, typically about 100. The particle population in the inner magnetosphere is represented by three main species (H + , O + , and electrons). To accurately represent the inner magnetosphere and its coupling with the conducting ionosphere, RCM solves a special form of the collisionless Vlasov equation (Hénon, 1982) for the distribution function of the three species that includes transport due to curvature drifts. The RCM computes currents and associated electric fields self-consistently, assuming perfectly conducting field lines and employing pre-computed time-dependent magnetic field information and associated induction electric fields. More details about the mathematics and formulation of the RCM can be found in Toffoletto et al. (2003); Sazykin et al. (2002); Wolf et al., 2007, Wolf et al., 1977.

Performance assessment metrics
The accuracy of each MHD model can be evaluated via different metrics which help quantitatively compare modelled parameters against measurements. Metrics described below have been previously used in many data-model comparison studies, such as Kwagala et al. (2020) or . Since the goal of our study is similar to that of these authors, but focused at lower latitudes, the same approach will be followed in this research. The use of multiple metrics allows us to test a variety of specific aspects that describe the relationship between measurements and predictions, therefore providing additional insight and in-depth data-model comparison. All equations and definitions below are adapted from Liemohn et al. (2021); Kwagala et al. (2020) and .

Normalised root-mean-square error
The accuracy fit performance between the measured/observed and modelled values is assessed using the normalised Root-Mean-Square Error (nRMSE). The nRMSE is computed as follows: where ΔM represents the model output, while ΔO corresponds to observed values. In Eq. 1 the RMSE is normalised by the standard deviation of the observed data (σ O ), with <⋅ > indicating the mean of the squared differences. An nRMSE value below one indicates a relatively good agreement with measured data, whereas nRMSE greater than one shows rather poor agreement between measured values and predictions.

Pearson's correlation coefficient
The correlation coefficient (r) indicates the linear correlation between the measured and modelled data sets. The following formula will be used in this study: Frontiers in Astronomy and Space Sciences where m o is the mean value of the observed data, and m m is the mean value of modelled data. The resulting coefficients may vary between -1 and 1, where one represents perfect correlation and its corollary is complete anti-correlation.

Heidke skill score
Modelled values can also be compared with measurements based on a predetermined event state. In this case, the event is a predefined horizontal component (B H ) threshold crossing. That means that the comparison scatter plot between data and model can be reduced to a 2 × 2 matrix, referred to as a contingency table. The names of the four members included in this table vary between different studies. In this paper, the following will be used: hits (H), false alarms (F), missed crossings (M), and correct no crossings (N). Based on the values of a contingency table, various skill scores can be obtained. The most widely known, and one of the most common in space weather research, is the Heidke Skill Score (Hogan and Mason, 2012): HSS measures the fraction of correctly predicted threshold crossings after eliminating those predictions that would be correct purely by random chance. It can take any value ranging from −∞ to 1, with one being a perfect score. Negative values indicate that the model performs worse than a random guess.

Probability of detection
POD gives a measure of the fraction of observed threshold crossings that were predicted correctly. It is defined as: The values of POD range from 0 to 1. Again, one is the ideal score. Any value higher than 0.5 suggests that the model predicts more correct crossings than it misses (with low values of M).

FIGURE 4
Normalised root-mean-squared errors (nRMSE) (top) and Pearson's correlation coefficients (bottom) for B x at each station and both events. Pink and navy lines indicate the mean and median metric value per observatory of all models, respectively. Latitude increases from left to right.

Probability of false detection
POFD is essentially the opposite of POD. It gives a measure of the fraction of observed no threshold crossings that were incorrectly forecast as crossings. It is given by the equation: The range is the same as for POD, however the ideal value is in this case zero. Any value lower than 0.5 suggests that the model predicts more correct no crossings than false crossings.

Frequency bias
Frequency Bias, also known as the bias ratio, helps assess the symmetry of the contingency table (i.e., a frequency distribution table showing a correlation between two variables) and measures how well the modelled values correspond to the observations. It is defined as: There is no specific range of values for the frequency bias, however FB = 1 is the ideal scenario. Any value larger than one implies that the model is more often in the defined event state, predicting the events more frequently than observations. This results in more false alarms than misses. Similarly, FB smaller than one implies that the model predicts events less frequently, resulting in more misses than false alarms.

Results
All of the results presented were acquired via the Run-On-Request system available at the Community Coordinated Modeling Center (CCMC) 3 . Each requested model run uses the same OMNI solar wind data as input, per respective event. Simulated ground perturbations for the northward (B x ) and eastward (B y ) components in 1-min time resolution are then compared with corresponding external field measurements at HAD, ESK and LER. Geomagnetic variations at a specific ground location were obtained using the CalcDeltaB post-processing tool (Rastätter et al., 2014).
Based on visual comparison alone, it can be observed that there are significant differences between modelled and measured ground magnetic field perturbations during both March 2015 and September 2017 storms. Figure 2 shows the B x component and Figure 3 shows the B y component. The black lines show the measured magnetic field variations at the three United Kingdom observatories and the others (various colours) show the MHD model outputs. The majority of models strongly overestimate the variation across the entire duration of both geomagnetic storms, particularly at lower latitudes. The only exception is in both components at LER where the field is underestimated around 01:00 UT on the 8th September. Thus, in general, the phase and amplitude are poorly matched. Figure 5 show the normalised root-mean-square errors (top) and Pearson's correlation coefficients (bottom) for B x and B y components, respectively. In terms of the B x , similar trends can be seen during both events (Figure 4). The mean (pink line) and median (navy line) metrics are also plotted to show the average performance of all models per observatory as well as their accuracy after removing the extreme values. Those results indicate that the overall performance of all models tends to increase with increasing latitude, where nRMSE and correlation coefficients reach values closer to one in most cases. The accuracy of respective simulations in predicting B x vary between observatories. The SWMF and SWMF + RCM perform relatively well in terms of both the overall nRMSE and correlation values for both events. The OpenGGCM model shows similar patterns but with higher nRMSE and lower correlation in comparison to the SWMF results, particularly in terms of event 2. The addition of RCM makes the OpenGGCM the least accurate simulation during the two events investigated. The LFM + RCM shows about average performance at all three observatories, with rather poor correlation with measurements, especially during the March 2015 (event 2).

Figure 4 and
There is no distinct pattern in the simulation performance when predicting the B y component (Figure 5). The mean and median values show an increase in overall model error at ESK for event 1, and tend to increase with latitude for event 2, the opposite of what was observed for B x . Correlation coefficients for both events gradually decrease at higher latitudes. In this case, significant differences between SWMF and SWMF coupled with RCM can be seen, though the addition of RCM improves the model performance during both events. However, the SWMF gives overall poor results, especially for the September 2017 event. The OpenGGCM gives similar error values for all latitudes and events, with the exception of ESK for event 1. The resulting correlation coefficients also remain the same, fluctuating between 0 and 0.45, which indicates relatively little correlation. As seen in B x , the OpenGGCM + RCM ground magnetic field predictions result in large values of nRMSE for both events. However, datamodel correlation in this case is quite high in comparison with other models for event 1. Based on the two metrics, the LFM + RCM shows similar behaviour to the OpenGGCM simulation, staying relatively close to the mean/median values. Its performance during the September 2017 storm is one of the best amongst all simulations.

Threshold metrics
The following set of event detection metrics are based on the calculated horizontal magnetic field component (B H ). A threshold of >150 nT (indicating intense storm conditions) was selected for both storms for all three observatories. Figures 6, 7 show Probability of Detection (POD), Probability of False Detection (POFD), Heidke Skill Score (HSS) and Frequency Bias (FB). Similarly to nRMSE and correlation plots, the pink trace indicates the mean and the navy trace the median of all metrics at each station. The dashed lines on each plot represent 0.5 for both POD and POFD, 0 for HSS and 1 for FB. For a good model performance, the following results would be required: POD > 0.5, POFD < 0.5, HSS > 0 and FB ≈ 1.
During the September 2017 storm (Figure 6) a pattern can be observed in all four metrics. Mean and median POD, HSS and FB tend to have more acceptable values at higher latitudes. The resulting POFDs do not vary drastically at different stations, staying below 0.5. The SWMF horizontal field shows poor results in terms of POD, with values close to 0 at HAD and ESK, and below 0.7 at LER. The POFD gives a reasonable value at all stations. This implies that the model correctly predicts no threshold crossings but at the same time forecasts more false alarms (i.e., overpredicts). The skill score indicates rather poor model performance at lower latitudes, though reaches relatively high value at LER (>0.65). The FB at all latitudes remains below 1 which indicates that the model predicts events less frequently than reality. The B H component forecast from the SWMF + RCM follows a similar pattern to the SWMF, however with much better POD and HSS results and slightly worse POFD.
The SWMF + RCM also shows the best FB amongst all models, with values close to 1 at all stations, meaning that, in general, it predicts threshold crossings almost as frequently as observations. The OpenGGCM simulation performs better than SWMF in terms of POD, that is predicts more correct crossings than misses, but gives higher values of POFD. On average, the resulting skill scores remain below 0.5, indicating poor accuracy particularly at lower latitudes. The OpenGGCM seems to forecast events more often than they occur, with high values of FB. As observed with the nRMSE and correlation analysis, the addition of RCM decreases its overall predictive power. The LFM + RCM model combination results in quite high POD, low POFD and relatively good HSS and FB at all stations.
During the March 2015 storm (Figure 7) the performance of simulations varies more drastically between the observatories, particularly in terms of HSS. A similar pattern can still be observed, with better results at the highest latitude. However, in this case both POD and POFD increase with latitude, as shown by the mean and median traces. On average most models show POD <0.5 at HAD and POFD >0.5 at LER. This implies that models tend to predict just as many correct crossings as false alarms. Contrary to the September 2017 event, the skill scores do not exceed values of 0.3, indicating poor performance overall. In general, most simulations forecast threshold crossings more frequently than observations (FB > 1), with HAD being the only exception. The accuracy of SWMF in terms of all 4 metrics increases with latitude, following the general trend. Here, the addition of the RCM seems to improve its performance in correct threshold crossings (higher POD) and HSS. The results provided by the OpenGGCM simulation show rather average agreement with observations overall, and quite poor skill score (<0) at the highest latitude. The OpenGGCM + RCM shows similar trends to those seen for the first event. Lastly, the LFM + RCM simulation during event 2 follows the average pattern in terms of POD, POFD and FB, sitting relatively close to the median line. Its skill scores vary between observatories, reaching the highest value at ESK and lowest at LER.

Discussion and summary
Our study investigates the performance of various contemporary global MHD models in their ability to forecast ground magnetic field perturbations during two severe geomagnetic storms in 2015 and 2017. Such MHD models are increasingly being used to provide operational services for predicting values of ground geomagnetic field and to provide hazard warnings tens of minutes in advance of an anticipated space weather event based on solar wind measurements at L1. In theory, even longer warning lead times, up to days, could be provided based on solar wind propagation of a CME from initial detection.
The simulated northward B x , eastward B y and the resulting horizontal B H components were compared with ground magnetometer data from three United Kingdom observatories: Hartland, Eskdalemuir and Lerwick. In general, there is no systematic model ranking for any events or components examined and the overall performance of each simulation is highly dependent on the type of metric used in the data-model comparison, as previously seen by . There is no preferred or "best" model. However, the following trends were observed: • Models tend to exaggerate the magnitude of ground variations at lower latitudes but perform relatively better at higher latitudes (>60°N), with lowest nRMSE and highest correlation coefficients, especially during the September 2017 storm. • Overall, all models show rather poor correlation with measured data during both storms. Correlations barely exceed 0.75 and often reach negative values. • The B H component analysis indicates that, on average, models predict threshold crossings more accurately during the September 2017 (event 1), where the average POD and POFD remain above and below 0.5, respectively and HSS is above 0 at all latitudes, with significant increase at LER. • Frequency Bias values for both events suggest that all models have a tendency to predict threshold crossings (i.e. storm periods) more often than in reality but become closer to observations at higher latitudes. • The RCM in combination with the OpenGGCM decreases the model accuracy, resulting in large overestimates of magnetic field amplitude when compared with observations.
Currently-available global MHD models of the magnetosphere and ionosphere have several limitations that affect their overall performance in the magnetic field forecast. Simplified solar wind input parameters might result in timing errors, consequently causing the model output to capture ground disturbances before or after their actual occurrence. Previous research (Freeman et al., 2019;Dimmock et al., 2021) also notes that MHD models regularly fail to represent substorm characteristics, identified as more rapid and frequent disturbances observed mainly in high-latitude and auroral regions (Daglis et al., 2003;Tanskanen et al., 2011), and large changes in the B-field amplitude associated with those events. The results presented in this study also confirm those observations. Another often seen problem with accuracy is that models occasionally predict the opposite sign of modelled ground perturbation, most likely due to a shift in location of the electrojets in the simulated magnetospheric and ionospheric currents compared to their actual location.
An accurate representation of both the amplitude and timing (i.e., phase) of directly-driven field variation during geomagnetically active times is vital for actionable space weather hazard forecasts. The results here suggest that this goal has not yet been reached and much work remains to produce a sufficiently robust estimate of ground magnetic field variations experienced at mid-latitudes. It has to be noted however that our study was limited to only two storms and three observatories for a highly localised analysis. More stations and space weather events should be included in future investigations to provide more systematic and unbiased conclusions regarding the accuracy of MHD simulations.

Author contributions
EF-collecting data, comparison and analysis, coding, paper preparation CB-research idea, coding, proof reading and review KW-research idea, proof reading and review.

Funding
This work was carried out using the SWMF and BATS-R-US tools developed at the University of Michigan's Center for Space Environment Modeling (CSEM), the OpenGGCM developed at UCLA by Joachim Raeder, the LFM Model first introduced by John Lyon, Joel Fedder, and Clark Mobarry, as well as the Rice Convection Model developed by the Department of Physics and Astronomy at Rice University. Additional resources are provided by the OMNIWeb Plus data service at Goddard Space Flight Center.

Acknowledgments
The results presented in this paper rely on data collected at magnetic observatories: Hartland (HAD), Eskdalemuir (ESK) and Lerwick (LER). We thank the national institutes that support them and INTERMAGNET for promoting high standards of magnetic observatory practice (www.intermagnet.org). Simulation results have been provided by the Community Coordinated Modeling Center at Goddard Space Flight Center through their public Runs on Request system (http://ccmc.gsfc.nasa.gov).

Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's note All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.