A machine learning-based model for the next 3-day geomagnetic index (Kp) forecast

The 3-day Kp forecast product is important and necessary for space weather forecasts. There is some essential information that can be obtained from the 3-day Kp forecast product, such as the start time of the geomagnetic storm, the maximum storm level, and the storm duration. In this study, we aimed to predict the next 3-day Kp index based on the previous Kp time series and SDO/AIA 193 Å images. We prepared datasets from May 2010 to December 2019 for training and datasets from January 2020 to October 2022 for testing. The similarity parameters of the previous and current geomagnetic conditions between the samples are calculated and analyzed. We assumed that the paired samples with high-similarity parameters of the previous and current geomagnetic conditions would also have high-similarity parameters of the next 3-day geomagnetic conditions. Based on the assumption, we selected the three best similarity parameters through the feature selection process and adopted the scalable tree boosting system (XGBoost) to develop a prediction model. It took the similarity parameters of the previous and current geomagnetic conditions as input and provided the best match sample from the training subset as a forecast. For the next 3-day non-storm (maximum Kp < 5) prediction period, our model reached an F1-score of 0.96. For the next 3-day storm (maximum Kp ≥ 5) prediction period, our model reached an F1-score of 0.82, a recall of 0.70, and a precision of 0.98.


Introduction
A geomagnetic storm is the consequence of solar wind disturbances originating from the Sun and impacting on geospace (Gonzalez et al., 1994;Perreault and Akasofu, 1978). The solar and interplanetary sources of the geomagnetic storms are related to coronal mass ejections [CMEs; (Gosling et al., 1991;Webb et al., 2000;Chen, 2011;Webb and Howard, 2012);] and the coronal hole high-speed stream (Tsurutani et al., 1995;Gonzalez et al., 1999;Tsurutani et al., 2006;Zhang et al., 2007). Geomagnetic storms can last from hours to days and sometimes lead to space weather effects, for example, the sudden enhancement of the electric currents in the magnetosphere and ionosphere, the severe changes of the relativistic electron fluxes in the Van Allen radiation belts, and the density enhancement in the upper atmosphere (Ritter et al., 2010;Mansilla, 2011;Xiong et al., 2015;Zhang et al., 2019).
Magnetic activity indices were designed to describe variations in the geomagnetic field, including the Kp index, ap index, and Ap index (Mayaud, 1980;Menvielle and Berthelier, 1991;Bartels, 2013a;Bartels, 2013b;Zhang et al., 2019). The planetary 3-hour-range Kp index ranging from 0 to 9 is calculated from 13 geomagnetic observatories between 44°and 60°northern or southern geomagnetic latitude. The 3-hour-range ap index is the equivalent range of the Kp index. The 1-day-range Ap index is calculated from an 8-point Ap index per day. Furthermore, Kp is a good representative for geomagnetic activity and is used to classify the geomagnetic conditions into categories (for example, minor storm, moderate, and major storm) in the Space Weather Prediction Center (SWPC) facilitated at the National Oceanic and Atmospheric Administration (NOAA) and the Space Environment Prediction Center (SEPC) facilitated at the National Space Science Center and the Chinese Academy of Sciences.
The Kp index and Ap index are also important inputs for physical-based geospace models, such as magnetosphere and plasmasphere models, and thermosphere and ionosphere models (Matzka et al., 2021). Compared with the 1-day-range Ap index, the 3-hour-range Kp index can provide more refined information on the geomagnetic activity, such as the start time of the geomagnetic storm, the maximum storm level, and the duration of the storm. Therefore, Kp index prediction is very important and necessary for space weather forecasts.
There are many models developed to predict geomagnetic activity on multiple time scales of hours to years based on statistical and machine learning methods (Feynman and Gu, 1986;Bala and Reiff, 2012;Wang et al., 2015;Luo et al., 2017;Tan et al., 2018;Shprits et al., 2019;Zhelavskaya et al., 2019;Chakraborty and Morley, 2020). However, only a few have been applied to the operational space weather forecast. At present, SWPC routinely provides products of a 45-day Ap index forecast 1 , 27-day Ap index and the largest Kp index forecast 2 , and 3-day Kp index forecast 3 . SEPC routinely provides products of a 27-day Ap index forecast 4 and 3-h Kp index forecast in less than 3 h advance 5 . So far, there is no public algorithm designed for a 3-day Kp index forecast, and the conventional product of a 3-day Kp index forecast is only provided by SWPC.
The 3-day Kp index forecast is an essential and basic product in space weather forecasts. It contains 24 points in a 3-day period and describes the geomagnetic conditions with a 3-hour-range resolution. Considering the increasing demand for space weather forecasts, it is still important and necessary for the space weather community to develop prediction models that can provide the 3-day Kp index forecast product.
In this study, we aim to develop a model for the next 3day Kp index time-series prediction. Here, we introduce the data preparation in Section 2, then develop a classification model based In this study, we used two kinds of data from May 2010 to October 2022, including the 3-hour Kp index from the National Oceanic and Atmospheric Administration (NOAA) and the 193Å wavelength images measured at the SDO/AIA (O'Dwyer et al., 2010;Lemen et al., 2012;Pesnell et al., 2012). They were divided into the training subset (including data from May 2010 to December 2019) and the testing subset (including data from January 2020 to October 2022). A prediction model was trained by the training subset and tested by the independent testing subset.
It should be mentioned that we only focus on the background solar wind (including the coronal hole high-speed streams and corotating interaction regions) in this study. Therefore, we removed the time periods when the geomagnetic conditions are affected by interplanetary coronal mass ejections (ICMEs), according to the near-Earth ICMEs list (Cane and Richardson, 2003;Richardson and Cane, 2010). For each day, we selected two samples according to the timepoints at 0:00 UTC and 12:00 UTC. Thus, we obtained 6,042 samples from the training subset and 1853 samples from the testing subset.
As shown in the flow chart in Figure 1, we prepared the 54-day Kp index time series before the current timepoint (T), and the AIA 193Å image was measured at T as inputs and we took the 3-day Kp index time series after T as outputs. Figure 2 shows how to divide the Kp time series into the previous 54-day Kp input and the next 3-day Kp output for the current T sample. Take the sample at 00:00 UTC on 22 Oct 2019 as an example, we took the Kp time series (432 points) from 00:00 UTC on August 29 to 00:00 UTC on October 22 as the input and the Kp time series (24 points) from 00:00 UTC on October 22 to 00:00 UTC on October 25 as the output. Figure 3 shows how to prepare the AIA 193Å image measured at T as input for the current T sample. Take the sample at 00:00 UTC on 22 Oct 2019 as an example, the observed image is shown in the top left panel. Then, a slice covering the [S40, N40] and [E40, W40] areas was cut from the observed image and shown in the top right panel. A median value was calculated based on the previous monthly slices. After all values above the median value in the slice at current T were replaced by 0, we obtained a slice as shown in the bottom left panel referring to the properties of coronal holes. The slice in the bottom left panel was then resized into a smaller size of 32 × 32 pixels and used as one of the inputs.
To compare the difference between the training and testing subsets, we calculated the maximum value of the 3-day Kp output and took it as a representative for each sample. Then, we draw the sample distribution histogram in Figure 4. The blue and red bars represent the training and testing subsets, respectively. It was found that the maximum Kp of the samples in the two subsets had similar distribution properties. There were 1,614 (26.7%) and 390 (21.0%) storm samples (maximum Kp ≥ 5) in the training and testing subsets, respectively.

Similarity calculation and labeling of sample pairs
Two samples at timepoints T i and T j from the training subset were called a pair (where the timepoint T i was more than 3 days farther from T j ). For each sample at timepoint T i , we picked up more than 5,000 other samples from the training subset to form sample pairs with it. All the samples that were within a 2-month interval of the current timepoint were excluded. As a result, we obtained more than 25 million sample pairs from the training subset.
We assume that if the inputs of two samples in a pair have high similarity (that is, their previous and current geomagnetic conditions are similar), their geomagnetic conditions in the next 3 days should be similar too. In this case, the two samples in the pair are similar to each other so that one sample can be a proper representative of the other one. In this study, the similarity parameters of the pair inputs were calculated and used to develop a model for the 3-day Kp forecast based on this assumption. Figure 1 shows that based on the 54-day Kp time-series inputs, we derived 11 similarity parameters for each pair. The first two parameters were the mean absolute error (MAE Kp ) and the root mean square error (RMSE Kp ). They were calculated by the following formulas, where x and y represent the inputs of a pair at timepoints T i and T j , respectively: The next two parameters were the maximum absolute error (MaxDiff Kp ) and the sum of the absolute error for the storm (Kp ≥ 5) period (SDiff Kp ). Also, four spatial distances were calculated, namely, the Euclidean distance (D (Euclidean) Kp ), cosine distance (D (Cosine) Kp ), correlation (D (Correlation) Kp ), and Hamming distance (D (Hamming) Kp ), by the following formulas and were taken as similarity parameters. The SciPy package 6 was used to calculate these spatial distances.

FIGURE 4
Sample distribution histogram at the training and testing subsets. The x-axis represents the maximum value of the 3-day Kp output for each sample. The y-axis represents the sample number.
Scikit-learn package 7 . Through the KPCA algorithm, a Kp timeseries input can be represented by a number (the feature). So, for each pair, we obtained two numbers (Feature x, KPCA(kernel) and Feature y, KPCA(kernel) ) representing the Kp time-series pair inputs (x and y). Then, we calculated the absolute error of the two numbers and took it as the similarity parameter for the pair. In this way, we obtain three similarity parameters, namely, F (linear) Kp , F (rbf) Kp , and F (cosine) Kp , by the following formula: KPCA(kernel) , kernel ∈ {linear, rgb, cosine} .
(6) Figure 1 shows that based on the coronal holes image inputs, we derived another seven similarity parameters for each pair. The first two parameters were the mean absolute error (MAE CH ) and root mean square error (RMSE CH ), and were calculated by Formulas 1 and 2. The next two were the widely used image similarity parameters, such as structural index similarity (SSIM CH ) and peak signal-to-noise ratio (PSNR CH ), which were calculated using the Scikit-image package 8 . Then, in a similar way, we calculated the absolute error of the KPCA features derived by the image pair inputs using Formula 6 and took them as the similarity parameters, such as F (linear) CH , F (rbf) CH , and F (cosine) CH .
Then, we obtained 18 similarity parameters from the pair inputs in the training subsets. Except for the correlation (D (Correlation) Kp ), structural index similarity (SSIM CH ), and peak signal-to-noise ratio (PSNR CH ), the larger the other 15 7 https://scikit-learn.org 8 https://scikit-image.org/docs/stable/api/skimage.html parameters, the higher similarity the pair inputs reach. Hence, it is the opposite for the correlation (D (Correlation) Kp ), structural index similarity (SSIM CH ), and peak signal-to-noise ratio (PSNR CH ). Then, we standardized the 18 similarity parameters independently by computing the relevant statistics on the pairs in the training subsets using the Scikit-learn package.
As we established the pair inputs and calculated the similarity parameters, the pair outputs (3-day Kp time series) should be analyzed and labeled as a number (1 or 0). Here, we adopted a batch labeling method according to the following principles: 1) For each sample at timepoint T i , if the next 3-day geomagnetic conditions reached the storm level (maximum value of Kp ≥ 5), it is a storm sample. Otherwise, it is a non-storm sample. 2) For each sample at timepoint T i , we picked up more than 5,000 other samples and formed pairs with them (denoted by T i pairs).
If a pair contains one storm sample and one non-storm sample, we considered the pair as a bad match and labeled the pair output as 0. 3) For each sample at timepoint T i , we selected the top 100 pairs with the least mean absolute errors between their outputs (the next 3day Kp time series) from the remaining T i pairs. We considered those 100 pairs as good matches and labeled their pair outputs as 1. The remaining pairs were labeled as 0.
Finally, we prepared our dataset into a standard form consisting of inputs (18 similarity parameters) and output (binary labels) that is suitable for a binary classification task.

Feature selection of similarity parameters
Feature selection is a widely used approach in the machine learning community to select the best features from the dataset in which the model can perform better with less training time. In this study, we selected the best features by comparing the Pearson's correlation of the 18 similarity parameters with the binary labels of the pairs using the SelectKBest from the Scikit-learn package. The correlations were standardized to scores ranging from 0 to 1, as shown in Figure 5.
It was found that there were three best features that had a significantly higher score than others. The top three best features were the Euclidean distance of the 54-day Kp time series (D (Euclidean) Kp ), the Hamming distance of the 54-day Kp time series (D (Hamming) Kp ), and the differences between the KPCA features of the current coronal hole images for each pair (F (linear) CH ). They were selected from the dataset and used for model development, as shown in the flowchart in Figure 1.

Model development and result analysis 3.1 Development of a classification model for sample pairs
After the three best features had been selected by feature selection, we determined to develop a classification model for pairs in the training subset to predict whether a pair of two samples is a good match or not. If the pair is a good match, their geomagnetic conditions are similar to each other so that one sample can be considered as a forecast of the other.
The scalable tree boosting system (XGBoost) is a widely used machine learning algorithm for binary classification tasks. It implements gradient boosting which performs additive optimization in functional space and incorporates a regularized model to prevent over-fitting (Chen and Guestrin, 2016).
We applied the XGBoost algorithm to develop a classification model using the Scikit-learn package 9 to predict whether a pair of two samples is a good match or not. If the pair is a good match, their geomagnetic conditions are similar to each other so that one sample can be considered as a forecast of the other. Considering the "f1weighted" (it is a balanced score of a standard binary classification task) as metrics, the best hyper-parameters of the XGBoost model are n estimators = 0.01 and max depth = 20.
As shown in Figure 1, to predict the 3-day Kp time series at T i , we give a forecast of the 3-day Kp time series following the steps: 1) We pick up a sample at T j to establish a pair and calculate the similarity parameters of the pair. 2) For the pair at T i and T j , the three similarity parameters are fed into the classification model. If the model predicts 1, the pair is a good match, whereas if the model predicts 0, it is not a good match. 3) If we find a good match, we take the 3-day Kp time series after T j as forecast at T i . Otherwise, we repeat the aforementioned steps.
We developed a prediction model using the training subset and applied the model on the testing subset from January 2020 to October 2022 and evaluated its performance. Both the geomagnetic storm prediction (maximum Kp ≥ 5) and non-storm (maximum 9 https://scikit-learn.org Kp < 5) prediction are important in space weather forecasts. Thus, we will evaluate the model for the geomagnetic storm prediction (maximum Kp ≥ 5) as well as the non-storm (maximum Kp < 5) prediction tasks.

Evaluation metrics and model performance
For a binary classification task like the next 3-day geomagnetic storm (maximum Kp ≥ 5) prediction, the confusion matrix is shown in Table 1. The model is evaluated by its ability to predict the next 3-day geomagnetic storm, the first day (day 1) geomagnetic storm, the second day (day 2) geomagnetic storm, and the third day (day 3) geomagnetic storm. We also evaluate the model's ability to predict the next 3-day non-storm conditions, the first day (day 1) of nonstorm conditions, the second day (day 2) of non-storm conditions, and the third day (day 3) of non-storm conditions.
There are three metrics for binary classification used in the study, including precision, recall, and F1-score. They are calculated by the following formulas: where the true positive (TP), false positive (FP), false negative (FN), and true negative (TN) are calculated following the confusion matrix, as shown in Table 1 for storm prediction and Table 2 for non-storm prediction. For example, for storm prediction, the higher the recall, the more storm samples have been correctly predicted. The higher the precision, the fewer false alarms have been made. An F1-score is a balance metric for recall and precision. A larger F1-score shows a better ability to classify a sample into the correct category. The three metrics, recall, precision, and F1-score for the geomagnetic storm prediction task (considering the storm samples as positive samples) were listed as "For storm category (maximum Kp ≥ 5), " as shown in Table 3. The metrics for the non-storm prediction task (considering the non-storm samples as positive samples) were listed as "For non-storm category (maximum Kp < 5), " as shown in Table 3. We found that 1) Our model reaches an F1-score of 0.96, a recall of 0.99, and a precision of 0.93 for the next 3-day period of non-storm category prediction and an F1-score of 0.82, a recall of 0.70, and a precision of 0.98 for the next 3-day period of storm category prediction.
2) However, both for the non-storm category and storm category, the three metrics (F1-score, recall, and precision) by our model are lower at the first day, second day, and third day period prediction than that in the next 3-day period prediction. It indicates that it is difficult to accurately predict the storm periods in a shorter time period (less than 3 days).
We also conducted an error analysis of the model forecasts with the observations in the testing dataset. For each sample (containing  Actual non-storm (observation) The eight statistics of the 3-day period, namely, day 1, day 2, and day 3 predictions, for all samples in the testing subset by our model are shown in Table 3. It was found that 1) The average and standard deviation of the mean error for the next 3-day Kp prediction is 0.03 and 0.65, respectively. The average and standard deviation of the mean absolute error for the next 3-day Kp prediction is 1.06 and 0.32, respectively. 2) The average and standard deviation of MaxDiff (the maximum absolute error of the 3-day Kp time-series) is 2.87 and 0.78, respectively.
Moreover, we compared our model results with the daily product of the 3-day Kp index forecasts provided by SWPC from 30 Nov 2020 to 27 Oct 2022. After we had removed the time-periods when the geomagnetic conditions are affected by interplanetary coronal mass ejections (ICMEs), there were 50 storm samples and 548 non-storm samples. Evaluation metrics for SWPC's 3-day Kp index forecast products and results obtained by our model are shown in Table 4. It was found that 1) For the next 3-day prediction, our model provides a positive mean error average, while SWPC provides a negative mean error average. It indicates that statistically, our results are usually higher than the observations and SWPC's products are usually lower than the observations. 2) For the next 3-day prediction, our model provides a mean error average of 1.13. It is slightly higher than the mean error average of 1.03 calculated from SWPC's product. 3) Our model performs better than SWPC's product for the three metrics (recall, precision, and F1-score) in the 3day prediction. However, compared with SWPC's results, our model had less recall and higher precision for the storm category in the first-day, second-day, and third-day predictions.

Conclusion and discussion
In this study, we aimed to develop a model for the next 3-day Kp index time-series prediction based on the previous 54-day Kp time series and SDO/AIA 193 Å images. We prepared a dataset (6,042 samples) from May 2010 to December 2019 for training, and a dataset (1853 samples) from January 2020 to October 2022 for testing.
The similarity parameters of the previous and current geomagnetic conditions between the samples are calculated and analyzed, as well as the similarity parameters of the next 3-day geomagnetic conditions. We assumed that the paired samples with high similarity for the previous and current geomagnetic conditions would also have high similarity for the next 3-day geomagnetic conditions. Based on the assumption, we first selected the three best similarity parameters through the feature selection process and then adopted the XGBoost algorithm to develop a prediction model for the next 3-day Kp forecast. The model took the best three similarity parameters of the previous and current geomagnetic conditions as input and provided the best match sample from the training subset as a forecast for the next 3-day Kp time-series.
A prediction error analysis by our model was conducted. For the non-storm prediction, our model reached an F1-score of 0.96 for the next 3-day period and an F1-score over 0.92 for the first day, second day, and third day period. For the storm prediction, it reached an F1-score of 0.82, a recall of 0.70, and a precision of 0.98 for the next 3-day period.
We also compared our model results with the daily product of the 3-day Kp index forecasts provided by SWPC from 30 Nov 2020 to 27 Oct 2022. In statistics, our results were usually higher than the observations and SWPC's products were usually lower than the observations. Compared with SWPC's products for the next 3-day prediction, our model reached higher metrics (recall, precision, and F1-score). However, our model showed a higher mean error average in the next 3-day prediction, and less recall and higher precision for the storm category in the first-day, second-day, and third-day predictions.
This study established a prediction model that can be used to provide the 3-day Kp forecast product, which is important and necessary for space weather forecasts. There is some essential information that can be obtained from the 3-day Kp forecast product, such as the start time of the geomagnetic storm, the maximum storm level, and the duration of the storm. Therefore, it is a more refined product than the 3-day Ap forecast product. So far, the 3-day Kp forecast product is routinely provided by the Space Weather Prediction Center facilitated by the National Oceanic and Atmospheric Administration. Considering the increasing demand for space weather forecasts, more prediction models that can provide essential products should be developed.
However, the current model has limitations in accurately predicting the storm periods in a shorter time period (less than 3 days) which lead to lower evaluation metrics (recall, precision, and F1-score) in the first-day, second-day, and third-day predictions. In the future, we would like to improve the 3-day Kp forecast model by deriving more relevant similarity parameters of the geomagnetic conditions and adopting a complex machine learning algorithm such as a convolution neural network and long short-term memory.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.