Enhancing Statistical Inference in Psychological Research via Prospective and Retrospective Design Analysis

In the past two decades, psychological science has experienced an unprecedented replicability crisis, which has uncovered several issues. Among others, the use and misuse of statistical inference plays a key role in this crisis. Indeed, statistical inference is too often viewed as an isolated procedure limited to the analysis of data that have already been collected. Instead, statistical reasoning is necessary both at the planning stage and when interpreting the results of a research project. Based on these considerations, we build on and further develop an idea proposed by Gelman and Carlin (2014) termed “prospective and retrospective design analysis.” Rather than focusing only on the statistical significance of a result and on the classical control of type I and type II errors, a comprehensive design analysis involves reasoning about what can be considered a plausible effect size. Furthermore, it introduces two relevant inferential risks: the exaggeration ratio or Type M error (i.e., the predictable average overestimation of an effect that emerges as statistically significant) and the sign error or Type S error (i.e., the risk that a statistically significant effect is estimated in the wrong direction). Another important aspect of design analysis is that it can be usefully carried out both in the planning phase of a study and for the evaluation of studies that have already been conducted, thus increasing researchers' awareness during all phases of a research project. To illustrate the benefits of a design analysis to the widest possible audience, we use a familiar example in psychology where the researcher is interested in analyzing the differences between two independent groups considering Cohen's d as an effect size measure. We examine the case in which the plausible effect size is formalized as a single value, and we propose a method in which uncertainty concerning the magnitude of the effect is formalized via probability distributions. 
Through several examples and an application to a real case study, we show that, even though a design analysis requires significant effort, it has the potential to contribute to planning more robust and replicable studies. Finally, future developments in the Bayesian framework are discussed.
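The inferential risks described above can be made concrete with a short simulation. The sketch below is our own hypothetical illustration (it is not the `design_analysis()` function presented in Appendix B): given a plausible true effect d and a per-group sample size n, it estimates power, the Type S error, and the Type M error (exaggeration ratio) for the two-independent-groups case.

```r
# Minimal simulation sketch of a retrospective design analysis for the
# two-independent-groups case (hypothetical code, not the PRDA functions):
# simulate many studies under a plausible true effect d, keep the
# statistically significant ones, and summarize the inferential risks.
design_sketch <- function(d, n, sig.level = 0.05, n_sim = 5000) {
  est <- numeric(n_sim)
  sig <- logical(n_sim)
  for (i in seq_len(n_sim)) {
    g1 <- rnorm(n, mean = d)                    # group sampled under true effect d
    g2 <- rnorm(n, mean = 0)
    s_pooled <- sqrt((var(g1) + var(g2)) / 2)   # equal n: simple average of variances
    est[i] <- (mean(g1) - mean(g2)) / s_pooled  # estimated Cohen's d
    sig[i] <- t.test(g1, g2, var.equal = TRUE)$p.value < sig.level
  }
  c(power  = mean(sig),                   # probability of a significant result
    type_s = mean(est[sig] < 0),          # significant, but in the wrong direction
    type_m = mean(abs(est[sig])) / d)     # average exaggeration ratio
}

set.seed(2020)
design_sketch(d = 0.2, n = 33)  # an underpowered design yields an inflated Type M
```

With a small plausible effect and a small sample, power is low and the significant estimates substantially overestimate the true effect, which is exactly the pattern a retrospective design analysis is meant to reveal.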

Cohen's d is a standardized measure of effect size that allows researchers to express differences in terms of the variability of the phenomenon of interest, irrespective of the original measurement unit. It is a useful solution when researchers utilize raw units that are quite arbitrary or lack meaning outside their investigation (Cohen, 1988). A Cohen's d of 0.1 means that the difference between the two population means is one-tenth of the common standard deviation. Borenstein et al. (2009) underline the importance of distinguishing between δ, the population Cohen's d value, and d, the Cohen's d value estimated from the sampled groups, given by:

$$ d = \frac{\bar{X}_A - \bar{X}_B}{S_{pooled}} $$

In the numerator, $\bar{X}_A$ and $\bar{X}_B$ are the sample means in the two groups. In the denominator, $S_{pooled}$ is the pooled standard deviation:

$$ S_{pooled} = \sqrt{\frac{(n_A - 1)\,S_A^2 + (n_B - 1)\,S_B^2}{n_A + n_B - 2}} $$

where $n_A$ and $n_B$ are the two sample sizes, and $S_A^2$ and $S_B^2$ are the variances in the two groups.
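The two formulas above translate directly into R. The helper name `cohen_d` is our own illustrative choice, not part of the PRDA.R code described in Appendix B:

```r
# Illustrative helper (hypothetical name, not part of PRDA.R): estimated
# Cohen's d for two independent samples, using the pooled standard deviation.
cohen_d <- function(x_a, x_b) {
  n_a <- length(x_a)
  n_b <- length(x_b)
  s_pooled <- sqrt(((n_a - 1) * var(x_a) + (n_b - 1) * var(x_b)) /
                     (n_a + n_b - 2))
  (mean(x_a) - mean(x_b)) / s_pooled
}

cohen_d(c(1, 2, 3, 4, 5), c(0, 1, 2, 3, 4))  # mean difference 1, pooled SD sqrt(2.5): about 0.632
```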

Cohen's d interpretation
In well-established areas of study, defining a relevant effect size in terms of Cohen's d presents no particular difficulty. The population σ is typically already known or easy to estimate, and the differences of interest are readily defined from the research context. Thus, researchers may already know which effect size is of interest when evaluating, for example, the effectiveness of a specific treatment. On the contrary, in less well-known areas, or when newly developed measures are employed, defining an effect size of interest may not be so simple.
In these cases, Cohen (1988) proposed some conventional operational definitions to interpret effect sizes. He suggested indicative values of d for "small", "medium", and "large" effect sizes.
• Small effect size: d = .2. This refers to small differences that are difficult to detect, such as approximately the size of the difference in mean height between 15- and 16-year-old girls.
• Medium effect size: d = .5. This refers to differences that are "large enough to be visible to the naked eye" (p. 26), for example, the magnitude of the difference in height between 14- and 18-year-old girls.
• Large effect size: d = .8. This refers to very obvious differences, such as the mean difference in height between 13- and 18-year-old girls.
Another way to interpret and make sense of Cohen's d values is to consider the Common Language effect size statistic (CL; Ruscio, 2008) or Cohen's measure of non-overlap U3 (Cohen, 1988). The former is defined as the probability that a randomly chosen member of population B scores higher than a randomly chosen member of population A. The latter is defined as the percentage of population B that exceeds the mean of population A. Figure S1 shows CL and U3 values for small, medium, and large effect sizes.
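Under the usual assumptions behind Cohen's d (two normal populations with equal variances), both quantities have closed forms that can be computed directly in R; the helper names `cl` and `u3` below are ours:

```r
# Closed-form CL and U3 for a given population Cohen's d, assuming two
# normal populations with equal variances (helper names are our own).
cl <- function(d) pnorm(d / sqrt(2))  # P(random member of B > random member of A)
u3 <- function(d) pnorm(d)            # proportion of B above the mean of A

d <- c(small = 0.2, medium = 0.5, large = 0.8)
round(cl(d), 2)  # 0.56 0.64 0.71
round(u3(d), 2)  # 0.58 0.69 0.79
```

These values reproduce the probabilities reported for the small, medium, and large benchmarks.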
However, as suggested by Cohen (1988, p. 25), "The terms small, medium, and large are relative, not only to each other but to the area of behavioural science or even more particularly to the specific content and research method being employed in any given investigation". These values are only conventional references to be used in the absence of any other information. Researchers should aim to define their own criteria, according to their specific research objectives and the related cost-benefit ratio. In some fields, even small changes could result in valuable gains.
Finally, it is important to underline an aspect that is often neglected when dealing with Cohen's d: it depends on the pooled standard deviation (i.e., larger standard deviations are associated with lower values of Cohen's d). Given that the pooled standard deviation partly reflects the accuracy of the measure used in a study, in the planning phase researchers should select measures that are as accurate as possible. Furthermore, when evaluating the effect sizes of other studies, considerations about the accuracy of the utilized measure(s) should always be taken into account.

Figure S1. (A) In the case of a small effect (d = .2), there is a 56% probability that a random subject from population B has a higher score than a random subject from population A (CL = .56), and 58% of population B is above the mean of population A (U3 = .58). (B) In the case of a medium effect (d = .5), there is a 64% probability that a random subject from population B has a higher score than a random subject from population A (CL = .64), and 69% of population B is above the mean of population A (U3 = .69). (C) In the case of a large effect (d = .8), there is a 71% probability that a random subject from population B has a higher score than a random subject from population A (CL = .71), and 79% of population B is above the mean of population A (U3 = .79).

APPENDIX B: R FUNCTIONS FOR DESIGN ANALYSIS

Preliminary notes
The R functions presented in the paper to perform design analysis are:
• design_analysis()
• design_est()
These functions are described in detail in the following section, and their code (i.e., PRDA.R) is available at the Open Science Framework (OSF) at the link https://osf.io/j8gsf/files/.
In the last section of this Appendix, all examples in the paper are also reproduced using the aforementioned functions. It should be noted that results might slightly differ because both functions follow a simulation approach. To obtain more stable results, it is possible to increase the default number of iterations.
Readers can use the functions to easily perform prospective and retrospective design analysis on their own data.1
Furthermore, the R code can be used as a starting point to extend design analysis to more complex cases than the one presented here (i.e., the difference between two independent groups, considering Cohen's d as the effect-size measure), which were beyond the scope of this paper.
1 Please note that a first version of a package, called PRDA, to perform prospective and retrospective design analysis is also available at https://github.com/masspastore/PRDA.

R functions
To use our R functions, first download the file PRDA.R at the link https://osf.io/j8gsf/files/. To load the functions, simply type:

> source("PRDA.R")

The function design_analysis() runs prospective and retrospective design analysis according to a Cohen's d (d) and a fixed Type I error (sig.level). Specifically, if the user specifies:
• power (power), it performs a prospective design analysis
• the sample size per group (n), it performs a retrospective design analysis
Note: It is necessary to provide either power or n. The argument rangen has a default from 2 to 1000; note that rangen is used only for prospective design analysis.

The function design_est()
> design_est( n1, n2 = n1, target_d = NULL, target_d_limits = NULL,
+   distribution = c("uniform","normal"), k = 1/6, sig.level = 0.05,
+   B = 500, B0 = 500, return_data = FALSE )

The function design_est() performs a retrospective design analysis according to a plausible interval for Cohen's d (see target_d_limits) or to a fixed Cohen's d (see target_d) and a fixed Type I error (sig.level). Different sample sizes can be specified for the two groups. Note: It is necessary to provide either target_d or target_d_limits.

Function arguments
• n1 = sample size of the first group
• n2 = sample size of the second group. Default is n1
• target_d_limits = vector of two values specifying the plausible interval for Cohen's d
• distribution = a character string specifying the probability distribution associated with the plausible interval for Cohen's d; must be one of "uniform" or "normal"
• k = if "normal" is specified as distribution, k is used to define the standard deviation of the doubly truncated normal distribution. Specifically, the standard deviation is calculated as the length of the plausible interval times k.
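As a quick illustration of how the k argument maps a plausible interval to a standard deviation (the interval [0.2, 0.8] below is our own illustrative choice, not a default of design_est()):

```r
# With the default k = 1/6, the standard deviation of the doubly truncated
# normal is the length of the plausible interval times k.
target_d_limits <- c(0.2, 0.8)   # illustrative plausible interval for Cohen's d
k <- 1/6                         # default value of the k argument
sd_trunc <- diff(target_d_limits) * k
sd_trunc  # 0.1
```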