
This article was submitted to Space Physics, a section of the journal Frontiers in Astronomy and Space Sciences

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Simultaneous solar wind measurements from the solar wind monitors, WIND and ACE, differ due to the spatial and temporal structure of the solar wind. Correlation studies that use these measurements as input may infer an incorrect correlation due to uncertainties arising from this spatial and temporal structure, especially at extreme and rare solar wind values. In particular, regression analysis will lead to a regression function whose slope is biased towards the mean value of the measurement parameter. This article demonstrates this regression bias by comparing simultaneous ACE and WIND solar wind measurements. A non-linear regression analysis between them gives the impression that, on average, one spacecraft underestimates extreme values relative to the other. Using numerical experiments, we show that popular regression analysis techniques such as linear least-squares, orthogonal least-squares, and non-linear regression are not immune to this bias. Hence, when solar wind parameters are used as the independent variable in a correlation or regression analysis, random uncertainty in the independent variable can create unintended biases in the response of the dependent variable. More generally, the regression to the mean effect can impact both event-based and statistical studies of magnetospheric response to solar wind forcing.

The Earth’s magnetosphere-ionosphere system is primarily driven by the solar wind. Hence, measurements of the solar wind and their interpretation are crucial in our attempt to understand the near-Earth space environment. At the time of writing this report, two spacecraft, ACE and WIND, have been measuring solar wind parameters for over 20 years from outside the magnetospheric bow shock. Many event-based studies, statistical studies, and simulations use these measurements as input. Many assume that the solar wind measured by these monitors situated at the L1 Lagrange point ultimately drives the magnetosphere system.

However, comparing measurements of the solar wind time-shifted to the bow shock shows random differences between the spacecraft (

In this manuscript, we refer to these uncertainties as measurement uncertainties. They arise from

For instance, in event-based studies, the solar wind driver estimated from L1 measurements may at times not be the one driving the magnetosphere-ionosphere response being investigated. However, one may believe that multi-event and large-scale statistical studies can avoid this difficulty posed by random errors and provide us with the average response of the planet to solar wind driving. The reasoning goes that “underestimates will cancel overestimates” for random errors when estimating averages. Such studies belong to the class of regression analysis, where average associations and relationships between solar wind parameters and geomagnetic parameters are inferred from observations. In fact, many modern machine learning studies are non-parametric, non-linear regression analyses carried out for multiple variables using large data sets (

In this report, we show direct evidence for such regression biases by comparing measurements of the solar wind propagated to the bow shock made by two spacecraft via a simple non-linear regression analysis (i.e., calculating the conditional expectation of one spacecraft measurement given the other). If the solar wind monitors all measured the same value, the average measurement of one spacecraft given the measurement of the other (regression curve) would be a straight line with a 45° slope. However, since their measurements differ, albeit randomly, we observe a bias in the slope of the regression curve such that it bends towards the mean of the independent variable. The bias can be severe at extreme values.
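The effect described above can be illustrated with a minimal synthetic sketch. It is not the analysis used in this article: the two "monitors", the unit-variance Gaussian "true" parameter, the noise level of 0.5, and the helper name `binned_conditional_mean` are all assumptions made for illustration. Both monitors observe the same true value with independent noise, and the conditional expectation of one given the other is estimated by averaging within bins.

```python
import numpy as np

def binned_conditional_mean(x, y, edges, min_count=50):
    """Mean of y within each bin of x: a simple estimate of E[y | x]."""
    idx = np.digitize(x, edges) - 1
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.full(centers.size, np.nan)
    for k in range(centers.size):
        in_bin = idx == k
        if in_bin.sum() >= min_count:  # skip sparsely populated bins
            means[k] = y[in_bin].mean()
    return centers, means

def fitted_slope(x, y):
    """Slope of a straight line fitted through the finite bin means."""
    ok = np.isfinite(y)
    return np.polyfit(x[ok], y[ok], 1)[0]

rng = np.random.default_rng(0)
n = 200_000
truth = rng.normal(0.0, 1.0, n)           # hypothetical "true" parameter
mon_a = truth + rng.normal(0.0, 0.5, n)   # monitor A: truth + independent noise
mon_b = truth + rng.normal(0.0, 0.5, n)   # monitor B: truth + independent noise

edges = np.linspace(-3.0, 3.0, 25)
xa, b_given_a = binned_conditional_mean(mon_a, mon_b, edges)
xb, a_given_b = binned_conditional_mean(mon_b, mon_a, edges)

slope_b_given_a = fitted_slope(xa, b_given_a)
slope_a_given_b = fitted_slope(xb, a_given_b)
# For this Gaussian setup, theory gives var(truth) / (var(truth) + var(noise)).
expected_slope = 1.0 / (1.0 + 0.5**2)
print(slope_b_given_a, slope_a_given_b, expected_slope)
```

In this sketch both regression directions give a slope below one, so each monitor appears to "underestimate" the other at extreme values even though neither has a systematic bias.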

Before presenting the evidence for this bias from solar wind measurements in

Like

Initially, we assume the random variable

Regression bias in non-linear regression analysis between normally distributed random variables, with uncorrelated Gaussian noise.

However, when the relationship between the two variables is unknown, it is common to rely on regression analysis to infer their relationship. Regression analysis is a broad category of techniques used to find an association between two or more variables. Linear regression is the most familiar type of regression analysis, especially the method of ordinary linear least-squares that minimizes the sum of squared differences between the data points and a unique line on the plot. Suppose the relationship between

An approximate and common method of calculating the conditional expectation

True values of

In

In

1. ^{∗ }= ^{∗ }−

It follows that,

. Where

2.

Demonstrating regression to the mean of the true value of an erroneous measurement. The black line is the underlying probability distribution of a random variable

The regression bias, quantified by the attenuation factor in the linear least-squares regression, is unaffected by uncertainty in the dependent variable

When

Similar layout as

When ^{2}; hence the noise fraction shown in

Similar layout as ^{2}.

By definition, the log-normal distribution

Comparing the effect of regression to the mean on non-linear regression, linear least-squares regression, and orthogonal regression function. Similar to ^{2}. _{2} added to

A relatively popular method considered to be capable of avoiding regression bias is the orthogonal regression function. The orange line in

Uncertainties are commonly characterized by referring to the standard deviation or variance of ^{∗ }− ⟨^{∗}⟩ = _{1}(^{2} for the top while _{2}(_{1}/_{2}/_{1} and _{2} i.e., _{1}) and _{2}). Here the common metric used to quantify unbiased noise, the standard deviation, has the value of 0.5 for _{1} and 1 for _{2}. Since _{1}) < _{2}) one may assume that there is less noise in _{1} is correlated with _{2} is not.

The severity of the regression bias at extremes is not determined only by the standard deviation of the uncertainty _{1}) where _{1} is correlated Gaussian noise. _{1}/_{1}) ∝ ^{2}. _{1}, with the standard deviation of _{1} = 0.5. _{2} added to _{2}/_{2}) = _{2}, with the standard deviation of _{2} = 1.

The previous section demonstrates that uncertainty in the independent variable can lead to a bias in the regression function. Such biases are unavoidable whether we use non-linear regression, linear least-squares regression, or orthogonal linear regression. However, we can correct the bias with a quantitative knowledge of the uncertainties, its direct or indirect correlation with the independent variable, and the probability distribution underlying the independent variable. In this section, we show regression biases in comparisons between solar wind monitors and suggest that at least part of these results are from random uncertainty in solar wind measurements rather than systematic instrument biases.
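As a rough numerical check of the statement that orthogonal regression does not by itself remove the bias, the sketch below compares the ordinary least-squares slope with the orthogonal (total least-squares) slope on synthetic data. The data, noise levels, and function names are illustrative assumptions, and the orthogonal estimator used here is the textbook form that implicitly assumes equal error variance in both variables.

```python
import numpy as np

def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]

def orthogonal_slope(x, y):
    """Orthogonal (total least-squares) slope; assumes equal error variance in x and y."""
    c = np.cov(x, y)
    sxx, syy, sxy = c[0, 0], c[1, 1], c[0, 1]
    return (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4.0 * sxy**2)) / (2.0 * sxy)

rng = np.random.default_rng(1)
n = 500_000
truth = rng.normal(0.0, 1.0, n)
x_noisy = truth + rng.normal(0.0, 1.0, n)  # unit-variance noise in the independent variable
y_clean = truth                            # true relation y = x (slope 1), no noise in y
y_noisy = truth + rng.normal(0.0, 1.0, n)  # same noise level as in x

ols = ols_slope(x_noisy, y_clean)                  # theory: 1 / (1 + 1) = 0.5
orth_unequal = orthogonal_slope(x_noisy, y_clean)  # theory: (sqrt(5) - 1) / 2, about 0.618
orth_equal = orthogonal_slope(x_noisy, y_noisy)    # theory: 1.0
print(ols, orth_unequal, orth_equal)
```

Under these assumptions the orthogonal slope recovers the true value only when the error variances in the two variables really are equal; otherwise it remains biased, although less so than the ordinary least-squares slope.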

The solar wind monitors we use are the ACE and WIND satellites. They mostly measure solar wind plasma and magnetic fields upstream of the Earth’s magnetospheric bow shock. We use 1-min spacecraft-specific data compiled by the OMNI database, which are time-shifted to the bow shock using a propagation model. Below, we examine non-linear regression between ACE and WIND measurements of multiple solar wind parameters. The regression curves should lie along the line of equality if both spacecraft measure the same solar wind plasma and magnetic field on average without uncertainty. However, that is not the case. Substantial regression biases towards the mean of the parameter can be observed for extreme values, especially when the monitors are far apart.

V_{z} GSE measurements given WIND V_{z} GSE and vice versa, shown by the magenta line, has a slope reduced towards the mean. The regression curve in V_{z} GSE on average compared to WIND for extreme values. However, the latter suggests that WIND underestimates V_{z} GSE on average compared to ACE. We can explain the contradiction if we suppose that the biases of these regression curves come from similar uncertainty in both ACE and WIND measurements, as discussed concerning V_{y} GSE. At large values of ACE V_{y} ∼ 200 km/s, on average WIND measures a

Regression bias in solar wind velocity V_{z} and V_{y} in GSE coordinates. V_{z}. The dashed black line is the line of equality, and the magenta line is the conditional expectation V_{z} vs. ACE V_{z}. However, the magenta line is the reverse regression function

The primary cause of this non-trivial regression bias is the uncertainty stemming from the spatial and temporal separation of the measurements. As a result, the two spacecraft do not see the same solar wind magnetic field or plasma most of the time. A useful measure of whether a downstream spacecraft measures the same plasma element previously seen by an upstream spacecraft is the impact parameter (IP). For WIND and ACE, the IP is the “minimum distance experienced between WIND moving at 30 km/s in Y and plasma element moving at 390 km/s in X” (

B_{z} GSM and vice-versa for all data points in the year 2002. In 2002, WIND was not yet parked in its L1 orbit, and as a result, the IP between ACE and WIND is significant for most measurements. B_{z} and vice versa for IP less than 60 R_{E}, implying that they both likely see similar solar wind plasma. An IP of 60 R_{E} is considered to be the largest separation at which WIND and ACE will see similar plasma and magnetic fields (R_{E} is about 30%. Hence, for ∼70% of the time, the two spacecraft do not measure the same plasma or field.

Regression bias in solar wind magnetic field Z-GSM component. The bias decreases when the data are restricted to smaller impact parameters (IP) between ACE and WIND. B_{z} vs. WIND B_{z}. The measurements used were from the year 2002. The dashed black line is the line of equality, and the magenta line is the conditional expectation R_{E}.

R_{E}. The first panel shows the regression function of ACE given WIND measurements, while the second panel plots the reverse: WIND given ACE measurements. We see that there is a regression bias with a decreasing slope with increasing density for

Regression bias in solar wind proton density. It reduces when filtering the data to IP^{ACE}|^{WIND}) in magenta, and the same restricted to only data with IP ^{WIND}|^{ACE}), and reveals similar bias towards the mean value.

For the regression functions shown here, the non-linear decrease in the slope with increasing density is due to the log-normal distribution of density, similar to the numerical experiment described in

The IMF clock angle is an essential solar wind parameter determining the extent of solar wind energy coupling to the magnetosphere. The rate of the day-side reconnection, in part, is influenced by the relative orientation of the solar wind magnetic field direction (modified by the magnetosheath). For example, in simple magnetic reconnection models, two oppositely directed magnetic fields brought together by moving plasma drive reconnection. Hence, a southward IMF can generally trigger day-side reconnection at the sub-solar point, while a northward IMF does not. As a result, many proposed solar wind driver functions, which estimate the energy coupling between the solar wind and the magnetosphere, are some functions of the IMF clock angle (

The IMF clock angle is defined as the angle between the IMF vector projected on the GSM Y-Z plane and the geomagnetic north: θ_{cl} = atan2(B_{Y}, B_{Z}) where −180° < θ_{cl} < 180°. In this manuscript, we have constructed θ_{cl} to range from 0° to 360° with 0° pointing towards B_{Z} north. In B_{Y}/B_{Z} of ACE vs. WIND and plot the conditional expectation of the ACE B_{Y}/B_{Z} given WIND B_{Y}/B_{Z} (magenta line). The blue line is the same non-linear regression function but with measurements where ACE and WIND have an impact parameter less than 60 R_{E}. |B_{y}/B_{z}| ≳ 2). Large values of the B_{Y}/B_{Z} ratio are mostly a result of small B_{Z} values. The latter corresponds to

Regression bias in solar wind IMF clock angle. R_{E}. R_{E}.

R_{E} in both plots. The conditional expectation is calculated using directional statistics, as an arithmetic mean is inappropriate for angles. Here, the mean is calculated by first converting each IMF clock angle into a unit complex number through Euler’s formula, which accounts for how angles wrap around at 360°. The arithmetic mean of the resulting complex numbers is then calculated, and the argument of this mean is converted back to an angle to obtain the conditional expectation.
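The circular averaging procedure described above can be sketched as follows. The function name `circular_mean_deg` and the sample angles are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction of angles in degrees, returned in the range [0, 360)."""
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))
    z = np.exp(1j * theta)           # Euler's formula: unit vectors on the complex plane
    mean_angle = np.angle(z.mean())  # argument of the arithmetic mean of the vectors
    return np.rad2deg(mean_angle) % 360.0

def angular_distance_deg(a, b):
    """Smallest absolute difference between two angles in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

sample = [350.0, 10.0]               # two clock angles on either side of north
circ = circular_mean_deg(sample)     # close to 0 degrees (equivalently 360)
naive = float(np.mean(sample))       # 180 degrees, which points the wrong way
print(circ, naive)
```

For a conditional expectation, the same function would be applied to the angles falling within each bin of the other spacecraft's clock angle.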

To explore the nature of the bias in detail, R_{E}. When the monitors are closer together, the regression bias is smaller for all ACE IMF clock angles.

Polar plots of regression bias in solar wind IMF clock angle. R_{E}, implying ACE and WIND are likely measuring similar plasma. R_{E}.

The regression bias peaks at ∼ +7° around

Close to the local pdf minimum

_{z} (^{2}(_{cl}/2) is shown in

B_{s}(B_{z} ≥ 0) = 0 and B_{s}(B_{z} < 0) = −B_{z}, instead we define E_{sw} is, therefore, the product of solar wind velocity and negative IMF B_{z} in GSM coordinates. E_{sw} estimates with WIND E_{sw} and vice versa during 2002. Although the regression is carried out through the entire range of E_{sw}, the figure shows only E_{sw} > 0 as it is the dawn-dusk component of the solar wind electric field. In 2002, the WIND spacecraft was far from L1 and had not yet arrived at the L1 orbit. The non-linear regression curves in both show a bias with a lower slope from 0 to 15 mV/m. At higher values of the driver, there are fewer data points, and hence there is substantial uncertainty in the regression curves. However, we observe a non-linear decrease in the average WIND E_{sw} measurements in E_{sw} measurements (magenta line). The regression function has considerably less bias when restricted to measurements where the impact parameter is less than 60 R_{E}, suggesting that the bias is largely a result of the spatial separation between the monitors.
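For concreteness, a driver of this kind can be computed as in the sketch below. The function name, argument names, and the `rectify` switch are illustrative assumptions; the only physics used is the product of solar wind speed and negative B_{z} described above, together with the unit conversion 1 km/s × 1 nT = 10^{-3} mV/m.

```python
import numpy as np

def e_sw_mv_per_m(v_kms, bz_nt, rectify=False):
    """Product of solar wind speed (km/s) and negative IMF Bz (nT), in mV/m.

    With rectify=True a half-wave rectifier is applied, so northward IMF
    (Bz >= 0) gives zero; with rectify=False the signed product is kept.
    """
    b_south = -np.asarray(bz_nt, dtype=float)
    if rectify:
        b_south = np.maximum(b_south, 0.0)
    # km/s * nT = 1e3 m/s * 1e-9 T = 1e-6 V/m = 1e-3 mV/m
    return np.asarray(v_kms, dtype=float) * b_south * 1e-3

southward = e_sw_mv_per_m(400.0, -10.0)                   # 4.0 mV/m
northward_signed = e_sw_mv_per_m(400.0, 5.0)              # -2.0 mV/m
northward_rectified = e_sw_mv_per_m(400.0, 5.0, rectify=True)  # 0.0 mV/m
print(southward, northward_signed, northward_rectified)
```

Keeping the signed product lets a regression run over the full range of the driver, while the rectified form matches the conventional southward-only definition.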

Regression bias in solar wind driver function E_{sw} = V_{sw}B_{south}, where V_{sw}B_{south} between 15 and 20 mV/m are highlighted with larger dots.

For example, consider the data points highlighted using larger dots in E_{sw} values between 15 and 20 R_{E}, with WIND being far away from ACE E_{sw} is much higher than the rarer CME-induced high value of E_{sw}, WIND is more likely to see a smaller E_{sw} (due to their high probability of occurrence) than ACE which is measuring a high value (with a low probability of occurrence). The bias caused by this event is removed easily by filtering for measurements with impact parameters less than 60 R_{E}. Similar regression bias is observed with other solar wind driver functions as well. An example of the bias in the merging electric field E_{m} = V_{sw}B_{T} sin^{2}(θ_{cl}/2) is shown in the V_{sw} is the solar wind speed in

Results in section 3 show that regression bias exists for important solar wind parameters such as IMF B_{z}, clock angle θ_{cl} and solar wind proton number density

Uncertainties in complex parameters such as solar wind driver functions, which are a combination of solar wind parameters, may be correlated with the parameter’s value. Consider the example of the merging electric field: E_{m} = VB_{T} sin^{2}(θ_{cl}/2). An uncertainty Δθ_{cl} will result in an erroneous merging electric field θ_{cl} and Δ

Therefore, the uncertainty in the merging electric field: E_{m} for a given fractional uncertainty of a small IMF clock angle.
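As a worked sketch of this point, assume the merging electric field has the form quoted earlier, E_{m} = V B_{T} sin^{2}(θ_{cl}/2), and that only the clock angle carries a small error Δθ_{cl} (θ_{cl} is our notation for the clock angle, and V and B_{T} are treated as exact). A first-order expansion then gives:

```latex
\Delta E_m \approx \frac{\partial E_m}{\partial \theta_{cl}}\,\Delta\theta_{cl}
           = \tfrac{1}{2}\, V B_T \sin\theta_{cl}\,\Delta\theta_{cl},
\qquad
\frac{\Delta E_m}{E_m} \approx \cot\!\left(\frac{\theta_{cl}}{2}\right)\Delta\theta_{cl}
\;\longrightarrow\; 2\,\frac{\Delta\theta_{cl}}{\theta_{cl}}
\quad (\theta_{cl} \ll 1).
```

Under these assumptions, a fixed fractional uncertainty in a small clock angle produces an uncertainty in E_{m} that scales with E_{m} itself, which is the sense in which the uncertainty is correlated with the parameter's value.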

Many solar wind driver functions are empirically constructed formulas and are not necessarily derived from physical principles. Hence, the true solar wind driver may differ from these functions systematically, randomly, or both. It is easy to imagine that the estimate of the solar wind drivers using upstream solar wind monitors differs randomly from the platonic “true” driver function that affects the Earth’s response. Suppose the driver function is in the form of the merging electric field. In that case, random uncertainty in one of the parameters can lead to correlated uncertainties in the merging electric field. However, if, instead, they are in the form of a sum of parameters like V_{sw} + 56B_{z} (

Random uncertainties in the solar wind drivers are not just limited to spatial and temporal uncertainty in the solar wind measurements and instrumental errors (

The “regression towards the mean effect” may not only be relevant to statistical regression analysis. It affects individual studies of extreme solar wind driving and the Earth’s response to it. The reason for this is that the regression bias affects the entire conditional probability distribution of the measurements being compared. Hence, when we infer the Earth’s response to an extreme solar wind driving, it is likely that the actual value of the solar wind driver is lower and closer to its mean value. Hence, we may be underestimating the effect of the solar wind driving of geomagnetic activity even for a single event or case study.

A more precise way to describe the “regression towards the mean effect” is perhaps apparent in

The natural question arising from our analysis is what we can do to correct or mitigate regression bias. The two primary directions are 1) to quantify the uncertainty and calibrate the data to compensate for the bias, or 2) to improve the quality of the data by reducing uncertainty. For the case of ordinary linear least-squares regression, orthogonal regression that considers uncertainty in both dependent and independent variables can correct the bias. However, for non-linear regression, these methods may be insufficient. Therefore, a careful analysis of the correlated uncertainties and stochastic properties of the measured parameters is necessary to construct error models that estimate the regression bias. After this, one can apply the technique of regression calibration to the uncertain measurements and calculate the likely true values to correct for the bias in the inferred relationship. Many more techniques exist and are discussed extensively in

The main challenge in constructing error models for regression calibration is quantifying the uncertainties in the measurement parameters. In most cases, the uncertainties involved are not just instrumental errors but uncertainties that stem from the implicit assumptions made in interpreting measurements. For example, in the case of the solar wind driver functions, random uncertainties stem from our assumptions about: 1) solar wind propagation models, 2) solar wind structure, 3) solar wind interaction with bow-shock and magnetosheath plasma, 4) valid solar wind and magnetosphere state parameters. More assumptions may exist, but the first step towards quantifying random uncertainty in solar wind parameters (including driver functions) is to identify the assumptions and then estimate their contribution to the uncertainty through physics or mathematical models.
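To make the idea of regression calibration concrete, the following is a minimal sketch for the simplest possible case: a linear relationship, Gaussian variables, and an error variance that is assumed to be known. All numbers and variable names are hypothetical, and real solar wind data would need an error model suited to their distributions and to any correlated uncertainties.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
x_true = rng.normal(5.0, 2.0, n)             # unobserved true driver (hypothetical)
sigma_u = 1.5                                # measurement error std, assumed known
w = x_true + rng.normal(0.0, sigma_u, n)     # what the monitor reports
y = 3.0 * x_true + rng.normal(0.0, 1.0, n)   # response; the true slope is 3

# Naive regression on the noisy measurement is attenuated towards zero.
naive_slope = np.polyfit(w, y, 1)[0]

# Regression calibration: replace w by an estimate of E[x | w], then regress.
reliability = (w.var() - sigma_u**2) / w.var()   # estimate of var(x) / var(w)
x_hat = w.mean() + reliability * (w - w.mean())  # E[x | w] for the Gaussian case
calibrated_slope = np.polyfit(x_hat, y, 1)[0]
print(naive_slope, calibrated_slope)
```

In this idealized setting the calibrated slope returns to the true value; the correction is only as good as the assumed error variance and distribution, which is why quantifying the uncertainties is the hard part.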

We used simple numerical experiments to demonstrate the statistical phenomenon of regression towards the mean, which leads to biases in the correlation between measurement parameters. We showed evidence for such biases while comparing simultaneous 1-min resolved propagation delay-corrected ACE and WIND measurements of several solar wind parameters upstream of the magnetosphere bow-shock. The regression biases were significant for extreme values of the measurement parameters. For example, when WIND measures ^{−3} and θ_{cl} = 30° respectively. This regression bias reduces when selecting measurements where ACE and WIND are nearby and in similar solar wind plasma.

These results suggest that regression biases may exist in statistical and event-based solar-wind/magnetosphere coupling studies, where the magnetosphere’s response to solar wind driving is inferred from measurements. The bias may become significant for rare and extreme driving conditions and if the uncertainties in the driver functions correlate with the solar wind strengths. We can reliably correct the regression bias only by knowing the stochastic properties of the parameters used in the study and their uncertainties. Not accounting for the effect of these uncertainties may lead to misinterpreting the bias (which can sometimes be non-linear) as systematic measurement bias or physical processes. One such possible misinterpretation could be the saturation of geomagnetic indices observed with increasing solar wind driving (

The datasets analyzed for this study, and the corresponding MATLAB code used to visualize the plots in this article can be found in the repository

NS conceptualized this project, performed the analysis, and wrote the manuscript. DS supervised and conceptualized the work, and reviewed the manuscript.

NASA Cooperative Agreement 80NSSC21M0180G: Partnership for Heliophysics and Space Environment Research (NS); NASA Heliophysics Participating Investigator Program under Grant WBS516741.01.24.01.03 (DS).

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

We thank Bob Robinson at the Catholic University of America for discussions and support. We also wish to acknowledge Joe Borovsky, Maria-Theresia Walach, Varsha Subramanyan, Dogacan Ozturk, Banafsheh Ferdousi, Gonzalo Cucho-Padin and Abigail R. Azari for valuable discussions.

The Supplementary Material for this article can be found online at: