Geoeffectiveness Prediction of CMEs

Coronal mass ejections (CMEs), key drivers of space weather, are continuously studied for their geomagnetic impact. We present an update of a logistic regression model that attempts to forecast whether a CME will arrive at the Earth and be associated with a geomagnetic storm, defined by a minimum Dst value smaller than −30 nT. The model is run for a selection of CMEs listed in the LASCO catalogue during solar cycle 24. It is trained on three fourths of these events and validated on the remaining one fourth. Based on five CME properties (the speed at 20 solar radii, the angular width, the acceleration, the measured position angle and the source position, a binary variable), the model successfully predicted 98% of the events from the training set and 98% of the events from the validation set.


INTRODUCTION
Forecasting whether a coronal mass ejection (CME) is geoeffective (i.e., capable of causing a geomagnetic disturbance) has been a subject of increasing interest over the last decade, because of the high impact these eruptive events may have on technological systems in orbit or on Earth. Every model relies on some approximations and, thus, no model can currently predict the impact of a CME with 100% accuracy.

Geoeffectiveness of CMEs
It is known that the CMEs reaching Earth's magnetosphere can produce large perturbations in the geomagnetic field known as geomagnetic storms. The first indication of a geomagnetic storm is a decrease of the Dst index, with storms being classified as small (−50 nT < Dst ≤ −30 nT), moderate (−50 nT ≥ Dst > −100 nT), and intense (Dst ≤ −100 nT) (Gonzalez et al., 1994).
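The storm classes quoted above can be written as a small lookup. The following Python sketch is illustrative only: the function name `classify_storm` and the label "none" for Dst values above −30 nT are assumptions introduced here, while the thresholds are the Gonzalez et al. (1994) classes cited in the text.

```python
def classify_storm(dst_min_nt: float) -> str:
    """Return the storm class for a minimum Dst value given in nT."""
    if dst_min_nt > -30:
        return "none"       # no geomagnetic storm by the definition used here
    if dst_min_nt > -50:
        return "small"      # -50 nT < Dst <= -30 nT
    if dst_min_nt > -100:
        return "moderate"   # -100 nT < Dst <= -50 nT
    return "intense"        # Dst <= -100 nT
```

With these boundaries, a value of exactly −30 nT counts as a small storm and a value of exactly −100 nT as an intense one.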
The geoeffective CMEs predominantly originate from sources near the central meridian, mostly from the western hemisphere (Srivastava and Venkatakrishnan, 2004; Zhang et al., 2007). The most geoeffective tend to be the energetic frontside halo CMEs, which are associated with strong soft X-ray flares.
The geoeffectiveness of the CME will also depend on its particular evolution, which is related to both internal CME properties (kinematic, geometric and magnetic), and (external) solar wind plasma properties (see e.g., the review by Manchester et al., 2017).
It was shown that interacting CMEs in the heliosphere amplify the geomagnetic response (Scolini et al., 2020). This amplification could be due to shock compression inside interplanetary CMEs (ICMEs) (see e.g., Shen et al., 2018;Xu et al., 2019), due to the generation of high energy protons (see e.g., Joshi et al., 2013) or due to the heights above the photosphere at which the shocks are formed (see e.g., Gopalswamy et al., 2013).

Solar Cycle Dependence
Correlations between CMEs, ICMEs and the sunspot number have been intensively studied (Gopalswamy et al., 2010;Webb and Howard, 2012;Lamy et al., 2019) to conclude that the CME rate usually follows the solar activity indices (Möstl et al., 2020). Chi et al. (2016) confirmed that the yearly ICME rate follows the sunspot number. This implies that during maximum of solar activity more CMEs arrive at Earth compared with the minimum solar activity period.
Comparing the last two solar cycles, the 23rd one has been more geoeffective than the 24th one (Bhatt and Chandra, 2020). Solar cycle 24 was characterized by low flare activity and the main contribution to geoeffective events was made by CMEs (Bruevich and Yakunina, 2020).
The present study covers solar cycle 24 and explores whether a sufficiently simple model, based only on CME parameters derived close to the Sun, could predict that a CME will reach the Earth and produce a geomagnetic storm.
The model is based on an updated logistic regression method (Srivastava, 2005;Besliu-Ionescu et al., 2019), that attempts to forecast if a CME will arrive at the Earth and if it will trigger a geomagnetic storm. The model is run for a selection of CMEs listed in the LASCO catalogue (Yashiro et al., 2004) from January 2008 to May 2020.
The model takes into consideration the full chain of events (CME, ICME, geomagnetic storm) and outputs the probability of a CME being geoeffective.
The paper is structured as follows: Section 2 presents the method used in this study, Section 3 describes the results of the non-linear logistic regression model, Section 4 discusses the main findings of this study and proposes future research.

Data Selection
In order to select our events we looked in the LASCO CME catalogue (https://cdaw.gsfc.nasa.gov/CME_list/) in the period between January 2008 and May 2020.
In this period there were approximately 17,000 CMEs detected. We excluded CMEs catalogued as "poor events" and "very poor events," which amounted to more than 12,500 CMEs, i.e., about 73% of the total CMEs observed by LASCO in the studied period. The classification of events is linked to the quality index (0-5) for the tracking feature (leading edge) of each CME: very poor, poor, fair, typical, good, and excellent. A very poor event (quality index 0) is a CME with an ill-defined leading edge, and a poor event (quality index 1) is a CME whose leading edge is not clear and sharp enough to be accurately tracked in different frames (see e.g., Yashiro et al., 2004). The poor events were excluded in order to have a consistent list in which the different characteristics of the CMEs (speed, angular width, etc.) can be measured accurately. This selection criterion left us with a database of 4,576 CMEs.
We further excluded the CMEs that have an angular width smaller than 60°, leaving us with 2,794 CMEs to study. This second selection criterion is justified since, in order for a CME to arrive at the Earth and produce a geomagnetic storm, it should have a large angular extent (e.g., Schwenn, 2006; Zhang et al., 2007).
In general, full halo (apparent angular width of 360°) and partial halo (apparent angular width larger than 120°) CMEs in LASCO images are considered as potential candidates to impact the Earth (if their source region is on the Earth-facing solar disk). A normal CME, seen above the limb with an angular width of around 60°, will appear as a halo CME or partial halo CME when oriented along the Sun-Earth line (either towards or away from Earth) or some 40° off that line, respectively (e.g., Schwenn, 2006).
However, it was also demonstrated that narrow CMEs (AW ≤ 20°) can arrive at Earth and exhibit clear in-situ signatures (e.g., Kilpua et al., 2014). Still, none of these narrow CMEs that arrived at Earth were detected by LASCO. Many studies consider eruptions below 10° angular width as being jets and not CMEs (e.g., Paraschiv et al., 2010; review of Raouafi et al., 2016). The average angular width of the CMEs wider than 30° is around 60° (e.g., the review of Webb and Howard, 2012), which also contributed to our selection criteria. We decided to re-include two CMEs that were poor events but reached interplanetary space, as listed in the ICME catalogue (Richardson and Cane, 2010). Thus, the database for the current study has 2,796 events.
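The two automatic cuts described in this section (quality index and angular width) can be sketched as a simple filter. This is a minimal illustration, not the authors' procedure: the `Cme` record and its field names are hypothetical, and the two poor events that were re-included by hand are deliberately not modelled.

```python
from dataclasses import dataclass

@dataclass
class Cme:
    quality_index: int         # 0 (very poor) to 5 (excellent)
    angular_width_deg: float   # apparent angular width in degrees

MIN_QUALITY = 2        # drops "very poor" (0) and "poor" (1) events
MIN_WIDTH_DEG = 60.0   # drops CMEs narrower than 60 degrees

def is_selected(cme: Cme) -> bool:
    """Apply the quality cut and the angular-width cut described in the text."""
    return (cme.quality_index >= MIN_QUALITY
            and cme.angular_width_deg >= MIN_WIDTH_DEG)

def select(cmes):
    """Return only the events that pass both cuts."""
    return [cme for cme in cmes if is_selected(cme)]
```

Under these assumptions a full halo CME flagged as "poor" is rejected, which is why the two ICME-associated poor events had to be added back separately.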
The association of our events with the interplanetary disturbances was extracted from the ICME Catalogue (Richardson and Cane, 2010), which is available online at http://www.srl.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm. There were 49 CMEs that reached the Earth during the selected period. The CME-ICME association method, described by Cane and Richardson (2003), is based on searching the solar wind proton temperature for periods of abnormally low values. The ratio of the observed to the expected proton temperature is then evaluated and the magnetic observations are added. An ICME interval can be inferred from reduced fluctuations and some degree of organization in the magnetic field, and is bounded by distinct magnetic field discontinuities which may be accompanied by abrupt changes in plasma parameters (Cane and Richardson, 2003).
Out of these 49 ICMEs, 16 did not produce any geomagnetic disturbances (i.e., Dst min was larger than −30 nT), four were associated with minor geomagnetic storms (Dst min between −30 and −50 nT) and 29 were followed by moderate or intense geomagnetic storms (Dst min ≤ −50 nT).
The Dst min is the minimum value of the Dst index recorded during the geomagnetic storm, marking the end of its main phase; this value is used when cataloguing the intensity of the storm.
The location of the CME on the solar disk was derived by checking each event individually. We looked for signatures like dimmings, waves and eruptive prominences. We looked at the combined EUV (SOHO/EIT or SDO/AIA) and white-light (LASCO) movies as given in the catalogue. If nothing was seen in running difference images, we checked EUV normal movies (e.g., sdoa193_c2rdf.html in Java Movie) to better see the dimmings and the waves. For dimmings we also checked the Solar Demon catalogue (Kraaikamp and Verbeeck, 2015): http://solardemon.oma.be/science/dimmings.php?days=0&dimming_threshold=0&dimming_location=1&science=1.

Method
Predictive models are used in almost every scientific field. Given a set of independent variables, such a model outputs the probability that the dependent variable will show a certain behavior for a given combination of the independent variables. Logistic regression is a class of regression that uses an independent variable or a set of independent variables to predict a dependent one. Therefore, besides the five independent variables (CME speed at 20 solar radii, its angular width, measured position angle, the acceleration and a binary variable for position), the model needs a dependent one. For this we have chosen a binary variable defined as 0 if the Dst min value was > −30 nT (i.e., no geomagnetic storm detected), and 1 for Dst min ≤ −30 nT (i.e., a storm was identified), identical to the binary one used by Besliu-Ionescu et al. (2019).
The solar wind sometimes completes accelerating before 20 solar radii (Nakagawa et al., 2006). Thus, the speed at 20 solar radii better represents the state of the CME after escaping the solar corona.
The model used in this study is a modified version of that of Srivastava (2005) and has been applied in Besliu-Ionescu et al. (2019).
The equation used in the model is:

$$\Pi_i = \frac{1}{1 + e^{-Z_i}} \qquad (1)$$

where Π_i represents the probability of the occurrence of a geomagnetic storm given the ith observation of the CME. Z_i is a linear function of the observations, estimated as the natural logarithm of the odds of the occurrence of the geomagnetic storm (Srivastava, 2005). x_i represents the CME observations (CME speed at 20 solar radii, CME angular width, measured position angle, acceleration and a binary variable for its position).
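Since Z_i is described as the natural logarithm of the odds and as a linear function of the five observations, the probability follows from the standard logistic function. The sketch below shows that evaluation in Python; the function name `storm_probability` and the argument layout (intercept first) are assumptions made for illustration, and no fitted coefficient values are reproduced here.

```python
import math

def storm_probability(coeffs, x):
    """Logistic probability for one CME.

    coeffs: [b0, b1, ..., b5], intercept first.
    x:      the five predictor values, in the same order as b1..b5.
    """
    if len(coeffs) != len(x) + 1:
        raise ValueError("expected one coefficient per predictor plus an intercept")
    # Linear predictor Z = b0 + b1*x1 + ... + b5*x5 (the log-odds).
    z = coeffs[0] + sum(b * xi for b, xi in zip(coeffs[1:], x))
    # Evaluate the logistic function in a form that avoids overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

A log-odds value of zero corresponds to a probability of exactly 0.5, which is the decision threshold used later in the paper.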
The initial Srivastava (2005) model used a database of 55 geoeffective events that were defined as full chains CME-ICME-geomagnetic storms (intense and super intense). They used a set of seven independent variables describing the CME (its width, speed, its association with a flare and its location) and the interplanetary conditions (the magnetic field intensity, the southern component of the interplanetary magnetic field and the ram pressure). The goal of the model was to predict the occurrence of a geomagnetic storm according to the properties of the selected events, used as independent variables, by defining a binary dependent variable with 0 for intense geomagnetic storms and 1 for super-intense ones. The dataset was divided into training (46 events) and validation (9 events) sets, and the success rates obtained for that model were 85% and 77.7% for the training and validation sets, respectively.
Besliu-Ionescu et al. (2019) took a slightly different approach from Srivastava (2005), as they used only CME solar parameters and excluded any ICME measurements. The parameters (independent variables in the model) used by Besliu-Ionescu et al. (2019) were the measured position angle of the CME, its angular width, linear speed, the acceleration, the latitude and longitude of its source, the association with a flare (a binary variable set to 1 for the events where there was a flare associated with the CME, and 0 otherwise), the flare importance index (Maris et al., 2002), the magnetic active region type (a scaled value between 0 and 1 as a function of the magnetic classification of the active region) and the orientation of the neutral line (a number describing the direction of the neutral line: NS, EW, NW-SW and NE-SW). The computed proportions of correctness (PC), i.e., the ratio of the number of correct forecasts to the total number of forecasts, were over 0.95.
In this study we use an approach similar to that of Besliu-Ionescu et al. (2019), applied to a different set of independent variables.
The software used was selected from the IMSL package of the Interactive Data Language (IDL). IMSL_nonlinregress is a function that fits a nonlinear regression model using least squares. All the details about its programming notes, usage and output can be found at https://www.l3harrisgeospatial.com/docs/IMSL_NONLINREGRESS.html. Figure 1 represents a schematic chart for the method flow as described above.

Selection of Independent Variables
We studied all the properties listed in the LASCO catalogue: linear speed, second order speed at final height, second order speed at 20 solar radii, the central and measured position angle, the angular width, the acceleration, the mass and energy of the CME.
We eliminated variables that were correlated with one another. The full correlation tables of all CME parameters can be found in Besliu-Ionescu et al. (2019). We selected parameters with small correlation coefficients so that the non-linear logistic regression is correctly applied. We decided to exclude the linear speed as there is supporting evidence (Verma et al., 2013) that the correlation between the linear speed of the CME and the Dst index is weak. There were two classes of correlated variables: the three types of CME velocities (V_lin, V_20R and V_2f) and the two angles, the measured and central position angles (MPA and CPA). We chose one variable from each class. The speed at 20 solar radii better represents the state of the CME after escaping the solar corona. Finally, we eliminated the mass and energy of the CME because of the large uncertainties due to poor measurements.
Hence, in this study the new set of independent variables consisted of the speed of the CME at 20 solar radii, its angular width, measured position angle, acceleration and the location of the source region. The binary location variable was set to 0 if the source was on the backside of the Sun and to 1 if the source was on the frontside, disregarding its exact latitude and longitude. Thus our dataset of 2,796 CMEs consists of 1,647 frontside CMEs and 1,149 backside ones.
The variables that were not defined as binary (speed at 20 solar radii, measured position angle and angular width) were normalized to unity in order to minimize possible numerical errors or discrepancies due to the different variable ranges.
We also used a set of standardized data computed by removing the mean and dividing by the standard deviation (e.g., Gelman, 2008) (denoted by *ST in Table 1).
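The two data-preparation schemes can be sketched as follows. This is an illustration under stated assumptions: "normalized to unity" is interpreted here as division by the largest absolute value, and the sample standard deviation is used for standardization; the text does not specify either detail, and the function names are introduced only for this example.

```python
import statistics

def normalize_to_unity(values):
    """Scale values so the largest absolute value becomes 1 (assumed convention)."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0.0 for _ in values]
    return [v / peak for v in values]

def standardize(values):
    """Remove the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / sd for v in values]
```

Normalization keeps the sign and relative size of each measurement, whereas standardization centres each variable on zero, which is what makes the fitted coefficients comparable across predictors.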

RESULTS
The output of the non-linear logistic regression model is the set of six coefficients, b_0, ..., b_5 (see Eqs 2-4), which are also displayed in Table 1.
Choosing standardized input data puts all predictors on a common scale (Gelman, 2008), allowing us to compare the resulting logistic regression coefficients. The ranking of the predictors by their coefficients is very similar for the two methods of preparing the data. The difference lies in the order of the first three predictors: the source position, the acceleration and the angular width.
With the normalization method, the CME angular width appears as the most important predictor, while with the standardization method the most important one is the CME source position.
The other predictors have the same importance in both methods. The residual sum of squares for both methods has the same value.
The presented set of independent variables was selected because it had the smallest residual sum of squares value. The residual sum of squares was calculated by IDL and stored into the SSE variable.

Training Set
As already mentioned, we divided the events into the two categories needed for running the model, training and validation, three fourths for the training one, and the remaining one fourth for the validation one.
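As a quick check of the set sizes, a three-quarters split of 2,796 events gives 2,097 training and 699 validation events. The sketch below simply takes the first three quarters of a list in whatever order it is supplied; how the events were actually ordered or sampled is not stated in the text, so this ordering is an assumption, as is the function name.

```python
def split_three_quarters(events):
    """Return (training, validation) with three fourths of the events in training."""
    n_train = (3 * len(events)) // 4
    return events[:n_train], events[n_train:]

training, validation = split_three_quarters(list(range(2796)))
print(len(training), len(validation))  # 2097 699
```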
Thus, the training set contained 2,097 events, including 27 positive events. By positive event we mean a CME that reached the Earth and was associated with a geomagnetic storm (i.e., a minimum Dst value ≤ −30 nT). In the full dataset, the vast majority of the events (2,763 out of 2,796) were negative events, meaning that the CMEs never reached the Earth or were not associated with a geomagnetic storm.
Using the coefficients displayed in Table 1 we have computed the probability that a geomagnetic storm is produced (Formula 1) for each considered event using the regression model.
Π is the probability of the occurrence of a geomagnetic storm (Dst ≤ −30 nT). If Π is larger than 50%, we consider that the model forecast a geomagnetic storm, and if Π is smaller than 50%, that it forecast no geomagnetic storm. A correct forecast therefore means a probability larger than 50% for a positive event and a probability smaller than 50% for a negative event. The success rate was computed as the number of correct forecasts divided by the total number of events. The general success rate (considering both positive and negative events) was 0.986 and 0.987 for the normalized and standardized sets, respectively.
During training, the model did not successfully predict any of the 27 positive events.
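Because positive events are rare, the overall success rate is dominated by the negative events. The sketch below computes the success rate as defined above, together with the fraction of positive events that were caught; `hit_rate` is a name introduced here, and a probability of exactly 0.5 is treated as "no storm", which is an assumption since the text leaves that case open. The example uses only the training-set sizes quoted in the text (2,097 events, 27 positive), not the actual model output.

```python
def proportion_correct(probabilities, labels, threshold=0.5):
    """Fraction of events whose thresholded forecast matches the label."""
    hits = sum((p > threshold) == bool(y) for p, y in zip(probabilities, labels))
    return hits / len(labels)

def hit_rate(probabilities, labels, threshold=0.5):
    """Fraction of positive events that were forecast as positive."""
    positives = [p for p, y in zip(probabilities, labels) if y]
    return sum(p > threshold for p in positives) / len(positives)

# 27 positive and 2,070 negative events, as in the training set.
labels = [1] * 27 + [0] * 2070
never_storm = [0.0] * len(labels)   # a forecast that never predicts a storm

print(round(proportion_correct(never_storm, labels), 3))  # 0.987
print(hit_rate(never_storm, labels))                      # 0.0
```

A forecast that never predicts a storm therefore already reaches a success rate of about 0.987 on a set of this composition, so the success rate alone does not show whether any storms were predicted.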

Validation Set
The validation set contained 699 events, including six positive events. For this set the success rate was 0.989 for both the normalized and the standardized set. For the validation set, the model did not correctly forecast any of the six CMEs that were associated with geomagnetic storms.

CME Activity During SC24
In order to study the geoeffectiveness of our 2,796 CMEs during SC24, we have attempted a statistical analysis of the CME evolution with the solar cycle. Figure 2 shows in the left panel the annual number of detected CMEs as blue bars and the yearly smoothed sunspot number as a black line. Every aspect of solar activity varies during the 11-year solar cycle. Taking the sunspot number as the most significant indicator of the cycle's activity, this would mean that coronal mass ejections will also vary with the sunspot number, either in correlation or anticorrelation. In another study, no significant correlation between the phases of the solar cycle and the yearly occurrence of intense and great storms has been found (Rathore and Parashar, 2011).
Generally, the yearly number of detected CMEs follows the yearly smoothed sunspot number as seen in Figure 2. It is clearly observable that fewer CMEs were detected during the descending phase of SC24 than during its ascending phase.
We have observed that 68% of the CMEs were detected during the maximum phase of the solar cycle and that the descending phase had the fewest events, only 10%. Similarly, high speed CMEs (with a speed at 20 solar radii exceeding 1,000 km/s) were significantly more numerous during the maximum phase of the cycle (129), while the descending phase had the smallest number (20).
Considering the MPA, more CMEs were measured in the northern hemisphere, about 8 percentage points more than in the southern one (54% vs. 46%). The explicit division by quadrant is 28% in the NE, 26% in the NW, 24% in the SE and 22% in the SW. This difference is considered too small to indicate a real preference.
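The quadrant assignment can be sketched as below. It assumes that the measured position angle is counted counterclockwise from solar north, so that 90° points to the east limb and 270° to the west limb; the function name and the treatment of the exact boundary angles are choices made for this illustration.

```python
def mpa_quadrant(mpa_deg: float) -> str:
    """Map a position angle (degrees, counterclockwise from solar north) to a quadrant."""
    angle = mpa_deg % 360.0
    if angle < 90:
        return "NE"
    if angle < 180:
        return "SE"
    if angle < 270:
        return "SW"
    return "NW"
```

With this convention the northern hemisphere corresponds to position angles below 90° or at and above 270°.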
The right panel of Figure 2 shows a histogram that represents in dark gray the total number of CMEs and in light gray the CMEs which arrived at the Earth (ICMEs) as a function of their measured position angle. This histogram is constructed in bins of 10 degrees for all CMEs studied here and shows a preference for the 80-120° and 260-300° latitudinal bands.
The slight preference for the northern hemisphere is not reproduced for the CMEs that were detected near Earth. There were 15 CMEs coming from regions near the poles (±30°) that reached the Earth, and only nine of them produced geomagnetic storms.
Nine out of these 15 ICMEs were detected during the maximum phase of SC24, in contrast to the fact that most of the ICMEs were detected during the descending phase (29 ICMEs out of the 49 included in our set). Twenty-one were followed by geomagnetic storms. Halo CMEs are most geoeffective between the maximum and descending phases of SC23 (Shrivastava). Zhang et al. (2008) found similar results for CIRs during the descending phase of SC23. Gopalswamy et al. (2020) analyzed 44 and 38 limb halo CMEs in cycles 23 and 24, respectively, in order to quantify the effect of the heliospheric influence on CME properties. Their study reveals the effect of the reduced total pressure in the heliosphere that allows cycle 24 CMEs to expand more and become halos sooner than those from cycle 23. They also found similar results regarding the CME activity during the solar cycle, more specifically, that the maximum number of detected CMEs coincides with the maximum value of the relative sunspot number, which can easily be confirmed from Figure 2.
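A histogram of position angles in 10-degree bins, as used for the right panel, can be built with a few lines of Python. This is a generic sketch, not the plotting code used for Figure 2; the function name is hypothetical and the bin width is assumed to divide 360 evenly.

```python
def bin_position_angles(angles_deg, bin_width_deg=10):
    """Count position angles in equal bins covering 0 to 360 degrees."""
    n_bins = 360 // bin_width_deg
    counts = [0] * n_bins
    for angle in angles_deg:
        index = int((angle % 360) // bin_width_deg)
        counts[min(index, n_bins - 1)] += 1   # guard against float round-up to 360
    return counts
```

Bin k then covers position angles from 10k up to, but not including, 10(k + 1) degrees.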
A better understanding of the linkage between CMEs and solar activity cycle should improve our understanding about their geoeffectiveness. Some studies (e.g., Wang et al., 2002;Echer et al., 2008;Rathore and Parashar, 2011;Verma et al., 2013) show that there are more geomagnetic storms related to eruptive phenomena during the descending phase of a solar cycle.
A classification of CMEs by their linear speed into three categories (v < 250, 250 ≤ v < 1,000, and v ≥ 1,000 km/s) led Miteva et al. (2017) to see the same coincidence between the number of CMEs and the sunspot number. They also confirm that SC24 was low in 25-50 MeV proton events, X-to-C class solar flares and CMEs faster than 1,000 km/s, all these phenomena being reduced by 30-45% with respect to SC23.
Our study has 1,352 CMEs coming from the western hemisphere and 1,444 from the eastern one. Of these, 23 and 26, respectively, were ICMEs. Cycle 24 lacks events driving extreme geomagnetic storms compared to past solar cycles. Out of the 49 ICMEs included in our study, 33 have been followed by geomagnetic storms.
For solar cycle 24 Hess and Zhang (2017) have identified 70 Earth-affecting interplanetary coronal mass ejections (ICMEs). They found that Earth-affecting CMEs in the first half of Cycle 24 are more likely to come from the northern hemisphere, but after April 2012, it reverses. They also found that in past solar cycles, CMEs from the western hemisphere were more likely to reach Earth.
Only around 50% of the ICMEs generated geomagnetic storms (GSs) during the years 1996-2017. Out of these, around 23% generated intense GSs (with Dst ≤ −100 nT) and the probability for severe storms (Dst ≤ −200 nT) was 4% (Alexakis and Mavromichalaki, 2019). Similar results were also found by Richardson and Cane (2011) for the time period 1995-2009. For our selected events the percentage of ICMEs followed by geomagnetic storms, including minor ones, is ∼67%.
In our dataset containing 49 ICMEs there are nine intense geomagnetic storms associated with them, and only one severe storm (Dst min = −223 nT). Excluding minor storms, ∼59% of the ICMEs (29 out of 49) were followed by moderate or intense geomagnetic storms.
Using a Spearman rank correlation coefficient between the Dst index and the CME speed for 33 halo CMEs from the beginning of the past solar cycle (2009-2013), Bisht et al. (2017) showed that high speed CMEs and big flares are not effective and significant parameters for the geoeffectiveness of these selected halo events. This supported our decision to eliminate the parameter related to the flare-CME association.
In our study, out of the 2,796 CMEs, there were 276 halo ones, out of which 24 were associated with geomagnetic storms, having velocities ranging from 143 to 3,163 km/s. This resulted in a 0.08 Spearman coefficient between the linear speed and the Dst index.
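For reference, the Spearman coefficient is the Pearson correlation computed on the ranks of the two variables. The sketch below is a self-contained Python version with average ranks for ties; it illustrates the statistic itself and is not the code used to obtain the value quoted above. All function names are introduced for this example.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1   # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A coefficient close to zero, such as the 0.08 reported here, indicates almost no monotonic relation between the two quantities.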
In an analysis of the interplanetary propagation of 53 fast Earth-directed halo CMEs observed by the LASCO instrument during the period January 2009-September 2015, Scolini et al. (2018) found that 82% of the CMEs arrived at Earth within the next 4 days. The events were propagated to 1 AU by means of the WSA-ENLIL+Cone model and almost all of them triggered geomagnetic storms. The average time delay in the case of our geoeffective CMEs was ∼3 days.
No other statistics of the measured CME properties have shown a noticeable dependence on the solar cycle evolution.

Concluding Remarks
We have applied a non-linear logistic regression model to a selected set of CMEs detected by LASCO in order to evaluate their geoeffectiveness, as defined by their association with a geomagnetic storm. The selected CMEs excluded "poor" and "very poor" events and CMEs with angular width less than 60°, thus obtaining a database of 2,796 events. These CMEs were divided into training (three quarters) and validation (one quarter) sets. Using a set of five independent variables (V_20R, AW_CME, MPA_CME, Acc_CME, Pos_CME) and the correlation to ICMEs and geomagnetic storms according to the ICME catalogue (Richardson and Cane, 2011), we have computed the probability that a CME will be associated with a geomagnetic storm. We normalized and standardized the input data in order to minimize the numerical errors. We have obtained success rates greater than 0.98 for all categories. However, no positive events were correctly forecast.
Besides CME-CME interaction there is now an increasing concern that stealth CMEs are also important from the space weather perspective (e.g., Nitta and Mulligan, 2017; Mishra and Srivastava, 2019). Stealth CMEs lack any low coronal signatures (see e.g., Robbrecht et al., 2009; D'Huys et al., 2014), which makes it more difficult to forecast whether they erupt from the visible part of the Sun and whether they arrive at Earth. Such CMEs are also of physical concern for other planetary magnetospheres (see e.g., Thampi et al., 2021).
As the stealth CMEs are lacking low-coronal signatures, their source regions could not be identified. In consequence, these CMEs were considered as originating from the backside of the Sun (location variable was set to 0). This implies that our model will not forecast that stealth CMEs will have any impact on the Earth.
During their journey from the Sun to the Earth, CMEs can accelerate or decelerate, deflect, rotate and deform (see e.g., Manchester et al., 2014; Manchester et al., 2017). Syed Ibrahim et al. (2019) found that the ICME transit time decreases with increasing CME initial speed, although a broad range of transit times was observed for a given CME speed. For slow CMEs (< 400 km/s), energy is transferred from the solar wind to the CMEs, while faster events (≥ 400 km/s) tend to lose their energy to the ambient medium (e.g., Soni et al., 2019).
The paragraphs above reveal the limitation of using only the CME parameters as input for the model. A possible improvement might be the addition of some weighting coefficients to increase the significance of the positive events in the training process. For a more robust analysis one also needs to take into consideration the interaction between the CMEs and the ambient solar wind during their journey to the Earth. Throughout their propagation, CME parameters such as speed and shape change considerably, and this has a big impact on their geoeffective response.
However, we consider this model to be a viable one for the purpose of predicting the association of a geomagnetic storm with a CME which arrived at Earth, based solely on the measurements of the CME's properties.
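One way to picture the weighting idea is a weighted residual sum of squares in which each class receives the same total weight. The sketch below is only an illustration of that idea under an assumed weighting scheme; the function names are hypothetical and nothing here reproduces the IDL fitting routine or a scheme proposed by the authors.

```python
def class_weights(n_positive, n_negative):
    """Weights that give the positive and negative classes equal total weight."""
    total = n_positive + n_negative
    return total / (2 * n_positive), total / (2 * n_negative)

def weighted_sse(probabilities, labels, w_pos, w_neg):
    """Residual sum of squares with a separate weight for each class."""
    return sum((w_pos if y else w_neg) * (y - p) ** 2
               for p, y in zip(probabilities, labels))

# Training-set composition quoted in the text: 27 positive, 2,070 negative events.
w_pos, w_neg = class_weights(27, 2070)
labels = [1] * 27 + [0] * 2070
never_storm = [0.0] * len(labels)
always_storm = [1.0] * len(labels)
```

Without weights, the "never storm" forecast has a residual sum of squares of 27 against 2,070 for the "always storm" forecast, so a least-squares fit is pulled strongly towards predicting no storms. With the weights above, the two trivial forecasts are penalized equally, which removes that built-in advantage.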
Another improvement of this model could be the addition of the tilt angle of the CME to the dataset, in order to better estimate the direction of the CME propagation, even though it will not take into consideration the interplanetary interactions.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS
DB-I contributed to the construction of the database, ran the model and was the main writer of the text. MM contributed to the construction of the database, to writing and reviewing the text, and to the software.

ACKNOWLEDGMENTS
The CME LASCO catalog is generated and maintained at the CDAW Data Center by NASA and The Catholic University of America in cooperation with the Naval Research Laboratory. SOHO is a project of international cooperation between ESA and NASA.