How to Test the Association Between Baseline Performance Level and the Modulatory Effects of Non-Invasive Brain Stimulation Techniques

Behavioral effects of non-invasive brain stimulation (NIBS) techniques can change dramatically as a function of different factors (e.g., stimulation intensity, timing of stimulation). In this framework, there has recently been growing interest in inter-individual differences in baseline performance and in how they relate to the behavioral effects of NIBS. However, assessing how baseline performance level is associated with the behavioral effects of brain stimulation raises crucial methodological issues. How can we test whether performance at baseline predicts the effects of NIBS, when NIBS effects themselves are estimated with reference to baseline performance? In this perspective article, we discuss the limitations of widely used strategies for analyzing the association between baseline value and NIBS effects, and review solutions that properly address this type of question.


INTRODUCTION
Converging evidence demonstrates that the behavioral effects of transcranial magnetic stimulation (TMS) can change dramatically as a function of different factors, such as stimulation intensity (Moliadze et al., 2003; Abrahamyan et al., 2011, 2015), timing of stimulation (Kammer, 2007; de Graaf et al., 2014; Chiau et al., 2017), and the initial brain "state" when stimulation is applied (Siebner, 2004; Silvanto and Pascual-Leone, 2008; Ruzzoli et al., 2010; Schwarzkopf et al., 2011; Perini et al., 2012; Romei et al., 2016). In this framework, there has recently been growing interest in considering inter-individual differences in baseline performance when describing the impact of TMS. Although few studies have directly investigated the role of inter-individual differences in determining the behavioral effects of non-invasive brain stimulation techniques (NIBS), there is currently a strong drive to explore how NIBS effects covary with individual characteristics, especially with performance at baseline (see Silvanto et al., 2018).
A consistent body of independent and recent evidence suggests that baseline performance modulates TMS effects (Schwarzkopf et al., 2011; Painter et al., 2015; Emrich et al., 2017; Juan et al., 2017; Paracampo et al., 2018; Silvanto et al., 2018). Furthermore, the effects of other brain stimulation techniques, such as transcranial direct current stimulation (tDCS) or transcranial random noise stimulation (tRNS), also seem to interact with baseline performance level (Jones and Berryhill, 2012; Tseng et al., 2012; Hsu et al., 2014, 2016; Benwell et al., 2015; Learmonth et al., 2015; Juan et al., 2017; Penton et al., 2017; Schaal et al., 2017; Yang and Banissy, 2017; see also Vergallito et al., 2022 for a recent review). Together, these findings have been interpreted as indicating the importance of adopting an individual differences approach when describing the effect of NIBS. This is indeed a potentially important issue for brain stimulation studies: even when no modulation in performance emerges at the group level, a deeper analysis focusing on individual differences may disclose stimulation effects characterizing specific classes of individuals (Silvanto et al., 2018). According to this view, baseline performance can be seen as an indirect measure of neural excitability that, in interaction with TMS intensity, contributes to the behavioral outcome (Silvanto et al., 2018). The facilitatory vs. inhibitory effect of TMS as a function of neuronal excitability is a well-established mechanism, and it is consistently observed when TMS is applied during a behavioral task following a predictable manipulation of the initial neural state, such as adaptation or priming (see  for a review). State-dependent TMS effects in paradigms based on priming/adaptation have been observed in a range of different domains, from number and letter processing (Kadosh et al., 2010; Cattaneo Z. et al., 2010; Renzi et al., 2011) to action observation (Cattaneo, 2010; Cattaneo L. et al., 2010; Jacquet and Avenanti, 2015) and perception of emotion (Mazzoni et al., 2017).
Assessing how baseline performance level (and brain state) determines the behavioral effects of brain stimulation techniques is therefore an important question, which raises crucial methodological issues. How should the association between baseline value and subsequent change be assessed? In other words, how can we test whether performance at baseline is predictive of the effect of NIBS? An approach that has typically been used to provide evidence of an association between baseline performance and its change after stimulation is the correlation approach. This consists in regressing or correlating the magnitude of the induced stimulation effect (defined as the performance in the effective TMS/tDCS condition minus the performance in the baseline/sham condition) with the baseline level of performance (sham stimulation) (Emrich et al., 2017; Penton et al., 2017; Yang and Banissy, 2017; Paracampo et al., 2018; Silvanto et al., 2018; Diana et al., 2021; Wu et al., 2021). Another conceptually similar approach is the categorization approach. It consists in categorizing subjects according to threshold values, such as the median baseline performance (i.e., median split), and subsequently comparing the effect of NIBS in terms of changes in the behavioral outcome (defined as the active TMS/tDCS condition minus the baseline performance) across the two subgroups (i.e., "low" performers vs. "high" performers) (Tseng et al., 2012; Hsu et al., 2014, 2016; Benwell et al., 2015; Learmonth et al., 2015; Juan et al., 2017; Schaal et al., 2017; Silvanto et al., 2018). However, these approaches are subject to severe biases in estimating the effects. Although such biases are well documented (Oldham, 1962; Tu et al., 2005; Chiolero et al., 2013), they have been neglected in several TMS/tDCS studies.
We first illustrate the biases connected to these methods, and we conclude by discussing techniques that have been proposed to investigate baseline modulatory effects without incurring such biases.

Biases of the Correlation Approach
The correlation approach consists in correlating a baseline with a deviation from the baseline or, equivalently, in regressing the deviation from the baseline on the baseline. One issue with this strategy is that it does not take into account that the estimate of the deviation from a baseline depends on the baseline itself (Oldham, 1962). This issue is known as mathematical coupling, and it can arise when a correlation is estimated between two variables that share a common source of variation (Blance et al., 2005). Let us denote as $x_i$ the observed baseline performance of the i-th individual and as $y_i$ the performance observed after NIBS. The deviation of i's performance from the baseline is computed as $d_i = y_i - x_i$. The relationship between baseline performance and NIBS effect can then be estimated as the correlation $r_{d,x} = r_{y-x,x}$. We should suspect mathematical coupling on seeing that x contributes to both of the variables being correlated. Since x enters the baseline with a positive sign and the deviation with a negative sign, the expected correlation is negative (Spearman, 1913).
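The size of the coupling can be made explicit with a short derivation. As a sketch, assume only that x and y have standard deviations $\sigma_x$ and $\sigma_y$ and correlation $\rho_{xy}$ (symbols introduced here for illustration, not taken from the cited sources):

```latex
\begin{aligned}
\operatorname{cov}(y - x,\, x) &= \operatorname{cov}(y, x) - \operatorname{var}(x)
  = \rho_{xy}\,\sigma_x \sigma_y - \sigma_x^2, \\
\operatorname{var}(y - x) &= \sigma_x^2 + \sigma_y^2 - 2\rho_{xy}\,\sigma_x \sigma_y, \\
r_{y-x,x} &= \frac{\operatorname{cov}(y - x,\, x)}{\sigma_x \sqrt{\operatorname{var}(y - x)}}
  = \frac{\rho_{xy}\,\sigma_y - \sigma_x}{\sqrt{\sigma_x^2 + \sigma_y^2 - 2\rho_{xy}\,\sigma_x \sigma_y}}.
\end{aligned}
```

When the two variances are equal, the last expression reduces to $-\sqrt{(1 - \rho_{xy})/2}$, which is negative whenever $\rho_{xy} < 1$ and equals $-1/\sqrt{2}$ when x and y are uncorrelated.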
A simple numeric example is probably the most effective way to illustrate how dramatic the effects of mathematical coupling can be (Oldham, 1962). We can use the R statistical language (R Core Team, 2021) to generate random data representing the performance of N = 50 subjects in the baseline (x) and experimental stimulation (y) conditions.
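The listing itself is not reproduced in this text. The following is a minimal reconstruction based on the line-by-line description given in the next paragraph. The exact function calls (rnorm, t.test with paired = TRUE, cor.test) are assumptions, so the output may not match the reported values to the last digit.

```r
set.seed(1)                  # line 1: fix the random number generator
x <- rnorm(50)               # line 2: baseline performance, N(0, 1)
y <- rnorm(50)               # line 3: performance after NIBS, N(0, 1)
t.test(x, y, paired = TRUE)  # line 4: effect of NIBS (baseline vs. after)
cor.test(x, y)               # line 5: correlation between x and y
cor.test(y - x, x)           # line 6: deviation from baseline vs. baseline
```

The six lines are kept in the order used by the description, with explanatory comments placed at the end of each line so that the line references remain valid.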
The first line, set.seed(1), fixes the random number generation procedure, so that readers can reproduce our exact results on their own computers. Lines 2 and 3 generate the data at baseline and after NIBS. In this example, all data points are independently sampled from a standard normal distribution with $\mu = 0$ and $\sigma = 1$. Thus, the data come from a population in which there is no relationship between the variables involved and no effect of neurostimulation whatsoever. In short, the data come from a population in which the null hypothesis is true for all parameters of interest. Indeed, at line 4 we test the effect of NIBS by comparing performance before and after stimulation and obtain a null result, t(49) = −0.09, p = 0.93, as expected. Similarly, at line 5 we test the correlation between x and y and obtain a null result, r = −0.039, p = 0.79 (Figure 1A). However, at line 6 we test the correlation between the baseline and the deviation from the baseline, and we obtain $r_{y-x,x} = -0.67$, p < 0.001 (Figure 1B). If we repeated the example with a different random seed, we would obtain slightly different results each time. On average, it can be demonstrated that our results would converge toward the value $r_{y-x,x} = -1/\sqrt{2} \approx -0.707$ (Spearman, 1913; Chiolero et al., 2013).¹ Thus, under the null hypothesis of no relationship and no effect of neurostimulation, a researcher using the correlation approach should expect to find a correlation between the baseline and the deviation that can be considered very large (Cohen, 1988).
Another bias connected to the correlation approach is regression toward the mean due to measurement error at baseline (Nesselroade et al., 1980; Blomqvist, 1987; Tu et al., 2005). Individual performance is always assessed with a certain degree of error. Variations in observed performance can reflect transient and non-systematic factors (tiredness, distraction, etc.), which introduce noise in the assessment. The observed performance at baseline of the i-th subject can be decomposed as $x_i = X_i + e_{x_i}$, that is, the sum of the true performance $X_i$ and the measurement error $e_{x_i}$. The same holds for the performance after NIBS, $y_i = Y_i + e_{y_i}$. A researcher's aim would be to estimate the true correlation, which is the correlation involving the true latent performance, $r_{Y-X,X}$, but the researcher would typically approximate that value by estimating the correlation involving observed performance, $r_{y-x,x}$. However, the observed difference $y_i - x_i$ is equal to $Y_i - X_i + e_{y_i} - e_{x_i}$. The correlation $r_{y-x,x}$ is thus affected by the measurement error $e_{x_i}$ being present in both the independent and the dependent variable, with opposite signs. In particular, Blomqvist (1987) has shown that the relationship between the observed correlation (or regression slope) and the true value is $r_{y-x,x} = r_{Y-X,X}(1 - k) - k$, where $k$ is the measurement error, that is, the ratio of the error variance of the observed baseline performance, $\sigma^2_{e_x}$, to the total variance of the observed baseline performance, $\sigma^2_x$. For example, let us assume a situation in which the true correlation is very low, $r_{Y-X,X} = 0.01$, and the error variance in performance assessment is 40%, i.e., $k = 0.4$ (a value that is not uncommon in tasks used in the field, e.g., Fan et al., 2002). The researcher would expect to observe a correlation of $r_{y-x,x} = 0.01 \times 0.6 - 0.4 = -0.394$, which would be regarded as significantly different from zero (p < 0.05) in a sample of more than N = 25 participants.
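The worked example above can be checked numerically. The sketch below assumes the relation between observed and true values is $r_{y-x,x} = r_{Y-X,X}(1 - k) - k$, which is the form consistent with the value of −0.394 reported in the text. The function names are hypothetical, and the significance check assumes the usual two-sided t-test on a correlation coefficient.

```r
# Expected observed correlation, given the true correlation and the
# proportion of error variance k at baseline (hypothetical helper name).
expected_observed_r <- function(r_true, k) {
  stopifnot(k >= 0, k < 1)
  r_true * (1 - k) - k
}

# Inverse of the relation above: recover the true value from an observed one.
corrected_r <- function(r_obs, k) {
  stopifnot(k >= 0, k < 1)
  (r_obs + k) / (1 - k)
}

# Two-sided p-value of a correlation r observed in a sample of size n.
p_for_n <- function(r, n) {
  t_val <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_val), df = n - 2)
}

r_obs <- expected_observed_r(r_true = 0.01, k = 0.40)
print(r_obs)                          # expected observed correlation
print(corrected_r(r_obs, k = 0.40))   # round trip back to the true value

# Smallest sample size at which r_obs would reach p < .05
n_grid <- 5:100
min_n <- min(n_grid[p_for_n(r_obs, n_grid) < 0.05])
print(min_n)
```

The inverse function is included only to show that the same relation can be solved for the true value once k is known.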
Interestingly, both biases lead to the same type of result, namely a negative correlation between baseline and change. This is exactly what most of the studies in the field reviewed above reported.

Biases of the Categorization Approach
The categorization approach consists in categorizing subjects into two groups, one including those with higher baseline performance (e.g., above the median) and the other including the remaining subjects. The deviation in performance is then compared between the two groups (e.g., using a t-test or ANOVA). The categorization approach avoids mathematical coupling but is nonetheless affected by regression toward the mean. As for the correlation approach, an easy way to understand why the categorization approach is problematic is to consider what would happen if it were applied under the null hypothesis. Let us define the null hypothesis as the one in which neurostimulation has no effect whatsoever and the true performance of all participants is the same. For simplicity, let us assume that the true performance takes the value zero (i.e., $X_i = Y_i = 0, \forall i$). In this situation, any variance in the observed performance is just measurement noise. Following the typical categorization approach, we would nonetheless perform a median or mean split, relying on the observed baseline performance, and divide participants into high performers (highP) and low performers (lowP). By construction, the observed baseline performance of the highP group will always be higher than that of the lowP group, but this will not be true of the performance after neurostimulation. Therefore, we will typically observe a performance increase after stimulation for the lowP group and a performance decrease for the highP group.
¹ If we used an unstandardized regression approach, our results would converge toward a slope $b = -1$ (Blomqvist, 1987), which is equivalent to a correlation of $-1/\sqrt{2}$. This can be seen by considering that the population variances of x and y are $\sigma^2_x = \sigma^2_y = 1$ and their covariance is $\sigma_{xy} = 0$. The standard deviation of $y - x$ is thus $\sigma_{y-x} = \sqrt{\sigma^2_x + \sigma^2_y - 2\sigma_{xy}} = \sqrt{2}$. The regression slope $b_{y-x,x}$ can be standardized with the formula $r_{y-x,x} = b_{y-x,x}\,\sigma_x / \sigma_{y-x} = -1/\sqrt{2}$.
² This value can be obtained by considering that $y_{lowP}$ and $y_{highP}$ have been sampled from a standard normal distribution. Since the median of a standard normal distribution is zero, $x_{highP}$ and $x_{lowP}$ follow a half-normal distribution, with mean $M = \pm\sqrt{2/\pi}$ and $SD = \sqrt{(\pi - 2)/\pi}$ (Olmos et al., 2012). The expected value of the effect size $d_z$ can then be calculated as the mean of the differences $y_{lowP} - x_{lowP}$ or $y_{highP} - x_{highP}$, which are respectively $+\sqrt{2/\pi}$ and $-\sqrt{2/\pi}$, divided by the standard deviation of such differences.
Figure 1 (caption, partial): The sample is split into high performers (highP) and low performers (lowP), and the baseline performance is compared with the performance after NIBS. Bars represent ±1 SE.
This bias can also be easily illustrated by continuing the simple numeric example used above. In particular, lines 7-10 in the code below perform the median split, separating the x and y variables for those who have better or worse observed baseline performance. Lines 11-12 perform a paired-samples t-test to examine changes in performance in the lowP group and calculate the effect size $d_z$ (Cohen, 1988; Perugini et al., 2018). Lines 13-14 replicate the analysis for the highP group. The results of the t-test show, as expected, a significant performance improvement for the lowP group, t(24) = −2.90, p = 0.008, $d_z = 0.58$, as well as a significant decrease in performance for the highP group, t(24) = 2.89, p = 0.008, $d_z = -0.58$ (Figure 1C). This is, of course, an example based on a few randomly generated data points: if one repeated this example with different random data, the effect size would converge toward the values $d_z = \pm 1/\sqrt{\pi - 1}$ (i.e., ±0.68),² an effect that Cohen considers above medium size [i.e., "one large enough to be visible to the naked eye" (Cohen, 1988)], with a positive sign for the lowP group and a negative sign for the highP group.
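The listing for the median-split analysis is likewise not reproduced in this text. The sketch below reconstructs it from the description above and regenerates the data so that it runs on its own. The variable names (is_low, x_lowP, dz_lowP, and so on) and the computation of $d_z$ as the mean of the paired differences divided by their standard deviation are assumptions, and the line layout does not match the original numbering exactly.

```r
set.seed(1)
x <- rnorm(50)   # baseline performance (same data as in the first example)
y <- rnorm(50)   # performance after NIBS

# Median split on the observed baseline (described as lines 7-10 in the text)
is_low  <- x < median(x)
x_lowP  <- x[is_low]
y_lowP  <- y[is_low]
x_highP <- x[!is_low]
y_highP <- y[!is_low]

# Low performers: paired t-test and effect size d_z (lines 11-12 in the text)
print(t.test(x_lowP, y_lowP, paired = TRUE))
dz_lowP <- mean(y_lowP - x_lowP) / sd(y_lowP - x_lowP)
print(dz_lowP)

# High performers: same analysis (lines 13-14 in the text)
print(t.test(x_highP, y_highP, paired = TRUE))
dz_highP <- mean(y_highP - x_highP) / sd(y_highP - x_highP)
print(dz_highP)
```

Because x and y are independent draws, any apparent improvement in the lowP group and decline in the highP group in this output is produced by the split itself and not by an effect of stimulation.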

Suggested Approaches
We have shown how a researcher using either the correlation approach or the categorization approach could easily believe they have found a potentially interesting effect, even in a situation in which no effect is present, simply because of mathematical coupling and regression to the mean. However, how baseline performance modulates NIBS effects is a very interesting research question and should not be neglected. Methods for testing such effects have been developed that reduce the impact of these systematic biases. The first method was proposed by Oldham (1962), and it consists in simply correlating the mean (or, equivalently, the sum) of the performance at baseline and after the stimulation, i.e., $(x + y)/2$, with the performance change, $y - x$. Although this method might appear very similar to the correlation approach illustrated above, it can be demonstrated that it removes the mathematical coupling (Tu and Gilthorpe, 2007). This method can be used in any situation in which the correlation approach is used, by simply changing one of the terms.
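As a minimal sketch of Oldham's method, the snippet below compares it with the correlation approach on data generated under the null hypothesis. The function name oldham_cor is hypothetical, and the change is computed as y − x to match the definition of the deviation used earlier.

```r
# Oldham's method: correlate the change with the mean of the two
# assessments instead of with the baseline alone (hypothetical helper).
oldham_cor <- function(x, y) {
  cor.test(y - x, (x + y) / 2)
}

set.seed(1)
x <- rnorm(50)   # baseline, no true effect
y <- rnorm(50)   # after NIBS, no true effect

naive  <- cor.test(y - x, x)   # correlation approach (mathematically coupled)
oldham <- oldham_cor(x, y)     # Oldham's method

print(c(naive  = unname(naive$estimate),
        oldham = unname(oldham$estimate)))
```

With independent x and y of equal variance, the first estimate is expected to be strongly negative, whereas the second is expected to be close to zero.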
It has been shown that Oldham's method is equivalent to testing a change in variance between x and y (Tu and Gilthorpe, 2007) and that a differential effect of NIBS according to the baseline implies a change in variance (Chiolero et al., 2013). An alternative test similar to Oldham's method is to directly test the difference between the variances of x and y (Tu and Gilthorpe, 2007). However, an important limitation of both this and Oldham's method is that any factor increasing or decreasing variance after NIBS other than the genuine stimulation effects, namely a change in error variance after NIBS, could lead to spurious conclusions (Chiolero et al., 2013).
² (continued) The standard deviation of such differences is in both cases $\sqrt{(2\pi - 2)/\pi}$. The expected effect size is then given by $d_z = \pm\sqrt{2/\pi} \big/ \sqrt{(2\pi - 2)/\pi} = \pm 1/\sqrt{\pi - 1}$.
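The link between Oldham's method and a test of variances can be seen from the covariance of the two quantities being correlated. The following identity is a standard property of covariances, written here as an illustration and not reproduced from the cited sources:

```latex
\operatorname{cov}\!\left(y - x,\ \frac{x + y}{2}\right)
  = \frac{1}{2}\Big[\operatorname{cov}(y, x) + \operatorname{var}(y)
      - \operatorname{var}(x) - \operatorname{cov}(x, y)\Big]
  = \frac{1}{2}\Big[\operatorname{var}(y) - \operatorname{var}(x)\Big]
```

The correlation between change and mean is therefore zero exactly when the variances of x and y are equal, which is also why any unrelated change in error variance after NIBS affects the result.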
In the above discussion, we have shown that if x and y are unrelated, the expected correlation between the baseline and the deviation from the baseline is $r_{y-x,x} = -1/\sqrt{2}$. Researchers could then wonder whether it would be possible to test the correlation observed in their samples against this value instead of zero. The issue is slightly complicated by the fact that the expected correlation is not always $-1/\sqrt{2}$, but depends on the correlation between x and y. Tu and colleagues showed that the correct value can be calculated as $-\sqrt{(1 - r_{xy})/2}$. This method showed performance comparable to Oldham's method in simulation. Like Oldham's method, this strategy assumes that error variance in the assessment of performance does not differ before and after NIBS.
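One possible way to carry out such a test is sketched below. It is an assumption that Fisher's z-transformation is an appropriate way to compare the observed correlation with the null value. The function name is hypothetical, the null value is treated as fixed even though $r_{xy}$ is itself estimated from the sample, and equal variances of x and y are assumed.

```r
# Test the observed correlation between change and baseline against the
# value expected from mathematical coupling alone (hypothetical helper).
test_against_coupling_null <- function(x, y) {
  n      <- length(x)
  r_obs  <- cor(y - x, x)                  # observed correlation
  r_null <- -sqrt((1 - cor(x, y)) / 2)     # expected value under the null
  # Fisher z-transformation of both correlations
  z <- (atanh(r_obs) - atanh(r_null)) * sqrt(n - 3)
  p <- 2 * pnorm(-abs(z))                  # two-sided p-value
  c(r_obs = r_obs, r_null = r_null, z = z, p = p)
}

set.seed(1)
x <- rnorm(50)   # baseline, no true effect
y <- rnorm(50)   # after NIBS, no true effect
print(test_against_coupling_null(x, y))
```

Testing against this value instead of zero asks whether the observed correlation departs from what coupling alone would produce.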
Another method has been proposed by Blomqvist (1987), which corrects the distortion introduced by regression to the mean due to measurement error in baseline performance. This method requires estimating the parameter k mentioned above, to recover the true unbiased correlation from the observed correlation or regression slope using the formula³ $r_{Y-X,X} = (r_{y-x,x} + k)/(1 - k)$. Parameter k, the measurement error, can be estimated as one minus the reliability of the test used for assessing performance (see Parsons et al., 2019 for guidance on how to estimate reliability in cognitive tests), and should be obtained on data independent of those used for the baseline (Tu and Gilthorpe, 2007). A limitation of Blomqvist's method is that it does not correct for regression to the mean due to factors other than measurement error, such as that due to genuine heterogeneity in the responses of patients to treatments (Tu and Gilthorpe, 2007).
Methods based on multilevel linear models have also been suggested to obtain unbiased estimates. In particular, if many repeated measures over time are available and one is interested in estimating whether the (e.g., linear) trend in change over time is related to the baseline, one can employ multilevel linear models and estimate the correlation between the random intercept (i.e., the inter-individual variance in the baseline performance) and the random slope (i.e., the inter-individual variance in the deviation from the slope) (Byth and Cox, 2005; Chiolero et al., 2013). This is also possible if only two assessments of performance are available, but estimating such models requires constraining the error variance to zero to make the model identified (Blance et al., 2005). When using mixed models, it is crucial to center the predictor variable (i.e., time should be coded as −0.5 before NIBS and +0.5 after NIBS, not as 0 and 1); otherwise, estimates will be vulnerable to mathematical coupling (Blance et al., 2005). Unlike the other methods reviewed, this solution allows testing more elaborate models that also include covariates (Blance et al., 2005).
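The role of centering can be illustrated without fitting a mixed model. With only two assessments, each participant's line through the two points has an intercept and a slope that depend on how time is coded. The sketch below, which uses a hypothetical helper and simulated null data, computes these per-participant values under both codings. It is not a multilevel model, only an illustration of why the coding matters.

```r
# Intercept and slope of the line through (t_before, x) and (t_after, y),
# computed separately for each participant (hypothetical helper).
coef_two_points <- function(x, y, t_before, t_after) {
  slope     <- (y - x) / (t_after - t_before)
  intercept <- x - slope * t_before
  data.frame(intercept = intercept, slope = slope)
}

set.seed(1)
x <- rnorm(50)   # baseline, no true effect
y <- rnorm(50)   # after NIBS, no true effect

uncentered <- coef_two_points(x, y, t_before = 0,    t_after = 1)
centered   <- coef_two_points(x, y, t_before = -0.5, t_after = 0.5)

# With 0/1 coding the intercept is the baseline x, so the intercept-slope
# correlation is the coupled correlation between x and y - x.
r_uncentered <- cor(uncentered$intercept, uncentered$slope)

# With -0.5/+0.5 coding the intercept is the mean (x + y) / 2, so the
# correlation coincides with the one used in Oldham's method.
r_centered <- cor(centered$intercept, centered$slope)

print(c(uncentered = r_uncentered, centered = r_centered))
```

Under centered coding the intercept is the participant's mean performance, which is why the intercept-slope correlation no longer inherits the coupling between baseline and change.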

CONCLUSIONS
The main goal of this perspective article was to clarify the main biases connected to widely used methods for examining the association between baseline performance and NIBS effects, and to review solutions proposed in the literature. In particular, we have shown that mathematical coupling and regression to the mean can have large distorting effects on estimates, leading to extremely biased conclusions even when the null hypothesis is true. We also reviewed several solutions to mitigate such biases. None of the methods reviewed can be considered a perfect solution, and whether one of these methods is superior to the others is still debated (Hayes, 1988; Tu et al., 2005; Tu and Gilthorpe, 2007; Chiolero et al., 2013). However, any of these methods will be superior to both the correlation and the categorization approaches that have been used in the field of NIBS. In situations in which it is difficult to determine which biases are more likely to affect one's estimate, we suggest applying different methods (e.g., Oldham's and Blomqvist's methods) and inspecting the results after considering different sources of bias.
We wish to stress that the biases and the solutions reviewed here are not recent findings. Some of them have been known for more than fifty years (Oldham, 1962; Blomqvist, 1987). Furthermore, these biases are not strictly specific to NIBS, but are relevant whenever one is interested in examining the relationships between baseline levels and deviations from such levels. Nonetheless, knowledge of such biases and solutions does not seem to be effectively integrated in the NIBS literature. The present work thus provides a strong contribution to a deeper understanding of the non-linear effects observed in brain stimulation studies (Schwarzkopf et al., 2011; Jones and Berryhill, 2012; Tseng et al., 2012; Hsu et al., 2014; Benwell et al., 2015; Painter et al., 2015; Learmonth et al., 2015; Emrich et al., 2017; Penton et al., 2017; Schaal et al., 2017; Yang and Banissy, 2017; Paracampo et al., 2018; Silvanto et al., 2018), and represents a step forward toward a full exploitation of the potential of brain stimulation techniques.

AUTHOR CONTRIBUTIONS
CL, LC, and GC: conceptualization and writing-review and editing. CL and GC: manuscript preparation. All authors contributed to the article and approved the submitted version.