Manipulating the Alpha Level Cannot Cure Significance Testing

We argue that making accept/reject decisions on scientific hypotheses, including a recent call for changing the canonical alpha level from 0.05 to 0.005, is deleterious for the discovery of new findings and the progress of science. Given that blanket and variable alpha levels both are problematic, it is sensible to dispense with significance testing altogether. There are alternatives that address study design and sample size much more directly than significance testing does; but none of the statistical tools should be taken as the new magic method giving clear-cut mechanical answers. Inference should not be based on single studies at all, but on cumulative evidence from multiple independent studies. When evaluating the strength of the evidence, we should consider, for example, auxiliary assumptions, the strength of the experimental design, and implications for applications. To boil all this down to a binary decision based on a p-value threshold of 0.05, 0.01, 0.005, or anything else, is not acceptable.

Keywords: statistical significance, null hypothesis testing, p-value, significance testing, decision making
Many researchers have criticized null hypothesis significance testing, though many have defended it too (see Balluerka et al., 2005, for a review). Sometimes, it is recommended that the alpha level be reduced to a more conservative value, to lower the Type I error rate. For example, Melton (1962), the editor of Journal of Experimental Psychology from 1950 to 1962, favored an alpha level of 0.01 over the typical 0.05 alpha level. More recently, Benjamin et al. (2018) recommended shifting to 0.005, consistent with Melton's comment that even the 0.01 level might not be "sufficiently impressive" to warrant publication (p. 554). In addition, Benjamin et al. (2018) stipulated that the 0.005 alpha level should be for new findings but were vague about what to do with findings that are not new. Though not necessarily endorsing significance testing as the preferred inferential statistical procedure (many of the authors apparently favor Bayesian procedures), Benjamin et al. (2018) did argue that using a 0.005 cutoff would fix much of what is wrong with significance testing. Unfortunately, as we will demonstrate, the problems with significance tests cannot be importantly mitigated merely by having a more conservative rejection criterion, and some problems are exacerbated by adopting a more conservative criterion.
We commence with some claims on the part of Benjamin et al. (2018). For example, they wrote ". . . changing the P value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance." If significance testing, at any p-value threshold, is as badly flawed as we will maintain it is (see also Amrhein et al., 2017; Greenland, 2017), these reasons are clearly insufficient to justify merely changing the cutoff. Consider another claim: "The new significance threshold will help researchers and readers to understand and communicate evidence more accurately." But if researchers have understanding and communication problems with a 0.05 threshold, it is unclear how using a 0.005 threshold will eliminate these problems. And consider yet another claim: "Authors and readers can themselves take the initiative by describing and interpreting results more appropriately in light of the new proposed definition of statistical significance." Again, it is not clear how adopting a 0.005 threshold will allow authors and readers to take the initiative with respect to better data interpretation. Thus, even prior to a discussion of our main arguments, there is reason for the reader to be suspicious of hasty claims with no empirical support.
With the foregoing out of the way, consider that a basic problem with tests of significance is that the goal is to reject a null hypothesis. This goal seems to demand, if one is a Bayesian, that the posterior probability of the null hypothesis should be low given the obtained finding. But the p-value one obtains is the probability of the finding, and of more extreme findings, given that the null hypothesis and all other assumptions about the model were correct (Greenland et al., 2016; Greenland, 2017), and one would need to make an invalid inverse inference to draw a conclusion about the probability of the null hypothesis given the finding. And if one is a frequentist, there is no way to traverse the logical gap from the probability of the finding and of more extreme findings, given the null hypothesis, to a decision about whether one should accept or reject the null hypothesis (Briggs, 2016; Trafimow, 2017). We accept that, by frequentist logic, the probability of a Type I error really is lower if we use a 0.005 cutoff for p than a 0.05 cutoff, all else being equal. We also accept the Bayesian argument by Benjamin et al. (2018) that the null hypothesis is less likely if p = 0.005 than if p = 0.05, all else being equal. Finally, we acknowledge that Benjamin et al. (2018) provided a service for science by further stimulating debate about significance testing. But there are important issues Benjamin et al. (2018) seem not to have considered, discussed in the following sections.
Trafimow and Earp (2017) argued against the general notion of setting an alpha level to make decisions to reject or not reject null hypotheses, and the arguments retain their force even if the alpha level is reduced to 0.005. In some ways, the reduction worsens matters. One problem is that p-values have sampling variability, as do other statistics (Cumming, 2012). But the p-value is special in that it is designed to look like pure noise if the null hypothesis and all other model assumptions are correct, for in that case the p-value is uniformly distributed on [0,1] (Greenland, 2018). Under an alternative hypothesis, its distribution is shifted downwards, with the probability of p falling below the chosen cutoff being the power of the test. Because the actual power of typical studies is not very high, when the alternative is correct it will be largely a matter of luck whether the sampled p-value is below the chosen alpha level. When, as is often the case, the power is much below 50% (Smaldino and McElreath, 2016), the researcher is unlikely to re-sample a p-value below a significance threshold upon replication, as there may be many more p-values above than below the threshold in the p-value distribution (Goodman, 1992; Senn, 2002; Halsey et al., 2015). This problem gets worse as the cutoff is lowered, since for a constant sample size, the power drops with the cutoff.
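The dependence of power on the cutoff can be illustrated with a short calculation. The sketch below is only an illustration under assumptions that are not in the text: a two-sided one-sample z-test with known unit variance, and a hypothetical standardized effect of 0.5 with 20 observations. The function name `two_sided_power` is ours.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def two_sided_power(effect_size: float, n: int, alpha: float) -> float:
    """Power of a two-sided one-sample z-test with known unit variance.

    effect_size is the true mean in standard deviation units (an assumed,
    illustrative quantity); n is the sample size; alpha is the cutoff.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * n ** 0.5  # mean of the z statistic under the alternative
    upper = 1 - _STD_NORMAL.cdf(z_crit - shift)  # P(z > z_crit)
    lower = _STD_NORMAL.cdf(-z_crit - shift)     # P(z < -z_crit)
    return upper + lower


if __name__ == "__main__":
    # Same hypothetical study, three cutoffs: power falls as alpha is lowered.
    for alpha in (0.05, 0.01, 0.005):
        print(f"alpha={alpha}: power={two_sided_power(0.5, 20, alpha):.3f}")
```

With the effect size set to zero the function returns alpha itself, which is the uniform-distribution property mentioned above: under a correct null model, the chance that p falls below any cutoff equals that cutoff.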

REGRESSION AND REPLICABILITY
Even if one did not use a cutoff, the phenomenon of regression to the mean suggests that the p-value obtained in a replication experiment is likely to regress to whatever the mean p-value would be if many replications were performed. How much regression should occur? When the null hypothesis is incorrect, that depends on how variable the point estimates and thus the p-values are.
Furthermore, the variability of p-values results in poor correlation across replications. Based on data placed online by the Open Science Collaboration (2015; https://osf.io/fgjvw), Trafimow and de Boer (submitted) calculated a correlation of only 0.004 between p-values obtained in the original cohort of studies and p-values obtained in the replication cohort, as compared to the expected correlation of zero if all the null hypotheses and models used to compute the p-values were correct (and thus all the p-values were uniformly distributed).
There are several possible reasons for the low correlation, including that most of the studied associations may have in fact been nearly null, so that the p-values remained primarily a function of noise and thus a near-zero correlation should be expected. But even if many or most of the associations were far from null, thus shifting the p-values downward toward zero and creating a positive correlation on replication, that correlation will remain low due not only to the large random error in p-values, but also to imperfect replication methodology and the nonlinear relation between p-values and effect sizes ("correcting" the correlation for attenuation due to restriction of range, in the original cohort of studies, increases the correlation to 0.01, which is still low). Also, if most of the tested null hypotheses were false, the low p-value replicability as evidenced by the Open Science Collaboration could be attributed, in part, to the publication bias caused by having a publishing criterion based on p-values (Locascio, 2017a). But if one wishes to make such an attribution, although it may provide a justification for using p-values in a hypothetical scientific universe where p-values from false nulls are more replicable because of a lack of publication bias, the attribution provides yet another important reason to avoid any sort of publishing criteria based on p-values or other statistical results. Thus, the obtained p-value in an original study has little to do with the p-value obtained in a replication experiment (which is just what the actual theory of p-values says should be the case). The best prediction would be a p-value for the replication experiment being vastly closer to the mean of the p-value distribution than to the p-value obtained in the original experiment. Under any hypothesis, the lower the p-value published in the original experiment (e.g., 0.001 rather than 0.01), the more likely it represents a greater distance of the p-value from the p-value mean, implying increased regression to the mean.
All this means that binary decisions, based on p-values, about rejection or acceptance of hypotheses, about the strength of the evidence (Fisher, 1925, 1973), or about the severity of the test (Mayo, 1996), will be unreliable decisions. This could be argued to be a good reason not to use p-values at all, or at least not to use them for making decisions on whether or not to judge scientific hypotheses as being correct.
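The regression of p-values toward their mean can be made concrete with a small simulation. This is a rough sketch under assumptions of our own, not an analysis of the Open Science Collaboration data: every study is a z-test whose statistic is drawn from a normal distribution with the same mean `shift` (a hypothetical value of 2.0 gives roughly 50% power at the 0.05 cutoff), only originals below the cutoff are "published," and each published original is replicated once under identical conditions.

```python
import random
from statistics import NormalDist, mean

_STD_NORMAL = NormalDist()


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic."""
    return 2 * (1 - _STD_NORMAL.cdf(abs(z)))


def simulate_replications(shift, alpha, n_studies=50_000, seed=1):
    """Simulate originals that passed the cutoff and one replication of each.

    Returns (mean original p, mean replication p, share of replications
    with p < alpha). shift is the assumed mean of the z statistic.
    """
    rng = random.Random(seed)
    original_p, replication_p = [], []
    for _ in range(n_studies):
        p_orig = two_sided_p(rng.gauss(shift, 1.0))
        if p_orig < alpha:  # only originals below the cutoff are retained
            original_p.append(p_orig)
            replication_p.append(two_sided_p(rng.gauss(shift, 1.0)))
    success_rate = sum(p < alpha for p in replication_p) / len(replication_p)
    return mean(original_p), mean(replication_p), success_rate


if __name__ == "__main__":
    for alpha in (0.05, 0.005):
        orig, rep, rate = simulate_replications(shift=2.0, alpha=alpha)
        print(f"alpha={alpha}: mean original p={orig:.4f}, "
              f"mean replication p={rep:.4f}, replications below alpha={rate:.2f}")
```

Because the replication is independent of the original, its p-value is drawn from the full p-value distribution rather than from the selected low tail, so the replication p-values are on average much larger than the published ones. In this setup the share of replications below the cutoff is simply the power at that cutoff, and so it is smaller at 0.005 than at 0.05.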

ERROR RATES AND VARIABLE ALPHA LEVELS
Another disadvantage of using any set alpha level for publication is that the relative importance of Type I and Type II errors might differ across studies within or between areas and researchers (Trafimow and Earp, 2017). Setting a blanket level of either 0.05 or 0.005, or anything else, forces researchers to pretend that the relative importance of Type I and Type II errors is constant. Benjamin et al. (2018) try to justify their recommendation to reduce to the 0.005 level by pointing out a few areas of science that use very low alpha levels, but this observation is just as consistent with the idea that a blanket level across science is undesirable. And there are good reasons why variation across fields and topics is to be expected: A wide variety of factors can influence the relative importance of Type I and Type II errors, thereby rendering any blanket recommendation undesirable. These factors may include the clarity of the theory, auxiliary assumptions, practical or applied concerns, or experimental rigor. Indeed, Miller and Ulrich (2016) showed how these and other factors have a direct bearing on the final research payoff. There is an impressive literature attesting to the difficulties in setting a blanket level recommendation (e.g., Buhl-Mortensen, 1996; Lemons et al., 1997; Lemons and Victor, 2008; Lieberman and Cunningham, 2009; Myhr, 2010; Rice and Trafimow, 2010; Mudge et al., 2012; Lakens et al., 2018).
However, we do not argue that every researcher should get to set her own alpha level for each study, as recommended by Neyman and Pearson (1933) and Lakens et al. (2018), because that has problems too (Trafimow and Earp, 2017). For example, with variable thresholds, many old problems with significance testing remain unsolved, such as the problems of regression to the mean of p-values, inflation of effect sizes (the "winner's curse," see below), selective reporting and publication bias, and the general disadvantage of forcing decisions too quickly rather than considering cumulative evidence across experiments. In view of all the uncertainty surrounding statistical inference (Greenland, 2017, 2018), we strongly doubt that we could successfully "control" error rates if only we would justify our alpha level and other decisions in advance of a study, as Lakens et al. (2018) seem to suggest in their comment to Benjamin et al. (2018). Nonetheless, Lakens et al. (2018) conclude that "the term 'statistically significant' should no longer be used." We agree, but we think that significance testing with a justified alpha is still significance testing, whether the term "significance" is used or not.
Given that blanket and variable alpha levels both are problematic, it is sensible not to redefine statistical significance, but to dispense with significance testing altogether, as suggested by McShane et al. (2018) and , two other comments to Benjamin et al. (2018).

DEFINING REPLICABILITY
Yet another disadvantage pertains to what Benjamin et al. (2018) touted as the main advantage of their proposal, that published findings will be more replicable using the 0.005 than the 0.05 alpha level. This depends on what is meant by "replicate" (see Lykken, 1968, for some definitions). If one insists on the same alpha level for the original study and the replication study, then we see no reason to believe that there will be more successful replications using the 0.005 level than using the 0.05 level. In fact, the statistical regression argument made earlier suggests that the regression issue is made even worse using 0.005 than using 0.05. Alternatively, as Benjamin et al. (2018) seem to suggest, one could use 0.005 for the original study and 0.05 for the replication study. In this case, we agree that the combination of 0.005 and 0.05 will create fewer unsuccessful replications than the combination of 0.05 and 0.05 for the initial and replication studies, respectively. However, this comes at a high price in arbitrariness. Suppose that two studies come in at p < 0.005 and p < 0.05, respectively. This would count as a successful replication. In contrast, suppose that the two studies come in at p < 0.05 and p < 0.005, respectively. Only the second study would count, and the combination would not qualify as indicating a successful replication. Insisting that setting a cutoff of 0.005 renders research more replicable would demand much more specificity with respect to how to conceptualize replicability.
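The order dependence of the mixed-threshold rule can be stated as a one-line predicate. The sketch below encodes the 0.005/0.05 combination described above; the function name and the example p-values (0.004 and 0.04) are hypothetical illustrations, not values from any study.

```python
def counts_as_replication(p_original: float, p_replication: float,
                          alpha_original: float = 0.005,
                          alpha_replication: float = 0.05) -> bool:
    """Apply the mixed-threshold rule: the original study must pass the
    stricter cutoff and the replication must pass the looser one."""
    return p_original < alpha_original and p_replication < alpha_replication


if __name__ == "__main__":
    # The same two p-values, in opposite chronological order.
    print(counts_as_replication(0.004, 0.04))  # original passes 0.005, replication passes 0.05
    print(counts_as_replication(0.04, 0.004))  # original fails 0.005, so the pair does not count
```

The two calls use identical evidence and differ only in which study happened to be run first, which is the arbitrariness at issue.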
In addition, we do not see a single replication success or failure as definitive. If one wishes to make a strong case for replication success or failure, multiple replication attempts are desirable. As is attested to by recent successful replication studies in cognitive psychology (Zwaan et al., 2017) and social sciences (Mullinix et al., 2015), the quality of the theory and the degree to which model assumptions are met will importantly influence replicability.

QUESTIONING THE ASSUMPTIONS
The discussion thus far is under the pretense that the assumptions underlying the interpretation of p-values are true. But how likely is this? Berk and Freedman (2003) have made a strong case that the assumptions of random and independent sampling from a population are rarely true. The problems are particularly salient in the clinical sciences, where the falsity of the assumptions, as well as the divergences between statistical and clinical significance, are particularly obvious and dramatic (Bhardwaj et al., 2004; Ferrill et al., 2010; Fethney, 2010; Page, 2014). However, statistical tests test not only hypotheses but also countless assumptions and the entire environment in which research takes place (Greenland, 2017, 2018). The problem of likely false assumptions, in combination with the other problems already discussed, renders the illusory garnering of truth from p-values, or from any other statistical method, yet more dramatic.

THE POPULATION EFFECT SIZE
Let us continue with the significance and replication issues, reverting to the pretense that model assumptions are correct, while keeping in mind that this is unlikely. Consider that as matters now stand using tests of significance with the 0.05 criterion, the population effect size plays an important role both in obtaining statistical significance (all else being equal, the sample effect size will be larger if the population effect size is larger) and in obtaining statistical significance twice for a successful replication. Switching to the 0.005 cutoff would not lessen the importance of the population effect size, and would increase its importance unless sample sizes increased substantially from those currently used. And there is good reason to reject the notion that replicability should depend on the population effect size. To see this quickly, consider one of the most important science experiments of all time, by Michelson and Morley (1887). They used their interferometer to test whether the universe is filled with a luminiferous ether that allows light to travel to Earth from the stars. Their sample effect size was very small, and physicists accept that the population effect size is zero because there is no luminiferous ether. Using traditional tests of significance with either a 0.05 or 0.005 cutoff, replicating Michelson and Morley would be problematic (see Sawilowsky, 2003, for a discussion of this experiment in the context of hypothesis testing). And yet physicists consider the experiment to be highly replicable (see also Meehl, 1967). Any proposal that features p-value rejection criteria forces the replication probability to be impacted by the population effect size, and so must be rejected if we accept the notion that replicability should not depend on population effect size.
In addition, with an alpha level of 0.005, large effect sizes would be more important for publication, and researchers might lean much more toward "obvious" research than toward testing creative ideas where there is more of a risk of small effects and of p-values that fail to meet the 0.005 bar. Very likely, a reason null results are so difficult to publish in sciences such as psychology is that the tradition of using p-value cutoffs is so ingrained. It would be beneficial to terminate this tradition.

ACCURACY OF PUBLISHED EFFECT SIZES
It is desirable that published facts in scientific literatures accurately reflect reality. Consider again the regression issue. The more stringent the criterion level for publishing, the more distance there is from a finding that passes the criterion to the mean, and so there is an increasing regression effect. Even at the 0.05 alpha level, researchers have long recognized that published effect sizes likely do not reflect reality, or at least not the reality that would be seen if there were many replications of each experiment and all were published (see Briggs, 2016; Grice, 2017; Hyman, 2017; Kline, 2017; Locascio, 2017a,b; Marks, 2017 for a recent discussion of this problem). Under reasonable sample sizes and reasonable population effect sizes, it is the abnormally large sample effect sizes that result in p-values that meet the 0.05 level, or the 0.005 level, or any other alpha level, as is obvious from the standpoint of statistical regression. And with typically low sample sizes, statistically significant effects often are overestimates of population effect sizes, which is called "effect size inflation," "truth inflation," or "winner's curse" (Amrhein et al., 2017). Effect size overestimation was empirically demonstrated in the Open Science Collaboration (2015), where the average effect size in the replication cohort of studies was dramatically reduced from the average effect size in the original cohort (from 0.403 to 0.197). Changing to a more stringent 0.005 cutoff would result in yet worse effect size overestimation (Button et al., 2013). The importance of having published effect sizes accurately reflect population effect sizes contradicts the use of threshold criteria and of significance tests, at any alpha level.
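Effect size inflation under a publication cutoff can be sketched by simulation. The code below rests on assumptions that are ours alone: a one-sample design with known unit variance, a hypothetical true standardized effect of 0.3, and 20 observations per study, so that the sample estimate is normal with standard error 1/sqrt(n). It averages only those estimates that pass a two-sided cutoff.

```python
import random
from statistics import NormalDist, mean


def mean_significant_estimate(true_effect, n, alpha, n_studies=100_000, seed=7):
    """Average effect estimate among simulated studies with p < alpha.

    Each estimate is drawn from N(true_effect, 1/n), an assumed
    known-variance one-sample model; significance is two-sided.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = 1 / n ** 0.5
    significant = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) / se > z_crit:  # passes the publication cutoff
            significant.append(estimate)
    return mean(significant)


if __name__ == "__main__":
    true_effect = 0.3  # hypothetical population effect size
    for alpha in (0.05, 0.005):
        published = mean_significant_estimate(true_effect, 20, alpha)
        print(f"alpha={alpha}: mean published estimate={published:.3f} "
              f"(true effect {true_effect})")
```

In this setup the mean of the estimates that pass the cutoff exceeds the true effect, and the excess is larger at 0.005 than at 0.05, because only estimates from further out in the tail survive the stricter filter.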

SAMPLE SIZE AND ALTERNATIVES TO SIGNIFICANCE TESTING
We stress that replication depends largely on sample size, but there are factors that interfere with researchers using the large sample sizes necessary for good sampling precision and replicability. In addition to the obvious costs of obtaining large sample sizes, there may be an underappreciation of how much sample size matters (Vankov et al., 2014), incentives that favor novelty over replicability (Nosek et al., 2012), and a prevalent misconception that the complement of p-values measures replicability (Cohen, 1994; Thompson, 1996; Greenland et al., 2016). A focus on sample size suggests an alternative to significance testing. Trafimow (2017; Trafimow and MacDonald, 2017) suggested a procedure as follows: The researcher specifies how close she wishes the sample statistics to be to their corresponding population parameters, and the desired probability of being that close. Trafimow's equations can be used to obtain the necessary sample size to meet this closeness specification. The researcher then obtains the necessary sample size, computes the descriptive statistics, and takes them as accurate estimates of population parameters (provisionally on new data, of course; an optimal way to obtain reliable estimation is via robust methods, see Huber, 1972; Tukey, 1979; Rousseeuw, 1991; Portnoy and He, 2000; Erceg-Hurn et al., 2013; Field and Wilcox, 2017). Similar methods have long existed in which sample size is based on the desired maximum width for confidence intervals.
This closeness procedure stresses (a) deciding what it takes to believe that the sample statistics are good estimates of the population parameters before data collection rather than afterwards, and (b) obtaining a large enough sample size to be confident that the obtained sample statistics really are within specified distances of corresponding population parameters. The procedure also does not promote publication bias because there is no cutoff for publication decisions. And the closeness procedure is not the same as traditional power analysis: First, the goal of traditional power analysis is to find the sample size needed to have a good chance of obtaining a statistically significant p-value. Second, traditional power analysis is strongly influenced by the expected effect size, whereas the closeness procedure is uninfluenced by the expected effect size under normal (Gaussian) models.
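For the simplest case, the closeness calculation can be sketched as follows. This is a minimal illustration that assumes a single mean sampled from a normal population; the formula used, n = (z / f)^2 with z the standard normal quantile at (1 + c) / 2, follows from the sampling distribution of the mean and is our rendering of that case rather than a quotation of the cited equations. Here f is the desired closeness as a fraction of the population standard deviation and c is the desired probability of being that close; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def closeness_sample_size(f: float, confidence: float) -> int:
    """Smallest n such that the sample mean falls within f standard
    deviations of the population mean with probability `confidence`.

    Assumes a normal population and a single mean (illustrative case only).
    """
    if f <= 0 or not 0 < confidence < 1:
        raise ValueError("f must be positive and confidence must lie in (0, 1)")
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided normal quantile
    return ceil((z / f) ** 2)


if __name__ == "__main__":
    # Within one tenth of a standard deviation, with 95% probability.
    print(closeness_sample_size(0.1, 0.95))  # 385
    # A looser closeness requirement needs far fewer participants.
    print(closeness_sample_size(0.2, 0.95))  # 97
```

The function takes no expected effect size as input, which reflects the contrast with power analysis drawn above: the required sample size depends only on how close and how confident the researcher wishes to be.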
The larger point is that there are creative alternatives to significance testing that confront the sample size issue much more directly than significance testing does. The "statistical toolbox" (Gigerenzer and Marewski, 2015) further includes, for example, confidence intervals (which should rather be renamed and be used as "compatibility intervals"; see Greenland, 2018), equivalence tests, p-values as continuous measures of refutational evidence against a model (Greenland, 2018), likelihood ratios, Bayesian methods, or information criteria. And in manufacturing or quality control situations, Neyman-Pearson decisions can also make sense (Bradley and Brand, 2016).
But for scientific exploration, none of those tools should become the new magic method giving clear-cut mechanical answers (Cohen, 1994), because every selection criterion will ignore uncertainty in favor of binary decision making and thus produce the same problems as those caused by significance testing. Using a threshold for the Bayes factor, for example, will result in a similar dilemma as with a threshold for the p-value: as Konijn et al. (2015) suggested, "God would love a Bayes factor of 3.01 nearly as much as a Bayes factor of 2.99." Finally, inference should not be based on single studies at all (Neyman and Pearson, 1933; Fisher, 1937; Greenland, 2017), nor on replications from the same lab, but on cumulative evidence from multiple independent studies. It is desirable to obtain precise estimates in those studies, but a more important goal is to eliminate publication bias by including wide confidence intervals and small effects in the literature, without which the cumulative evidence will be distorted (Amrhein et al., 2017). Along these lines, Briggs (2016) argues for abandoning parameter-based inference and adopting purely predictive, and therefore verifiable, probability models, and Greenland (2017) sees "a dire need to get away from inferential statistics and hew more closely to descriptions of study procedures, data collection [...], and the resulting data."

CONCLUSION
It seems appropriate to conclude with the basic issue that has been with us from the beginning. Should p-values and p-value thresholds, or any other statistical tool, be used as the main criterion for making publication decisions, or decisions on accepting or rejecting hypotheses? The mere fact that researchers are concerned with replication, however it is conceptualized, indicates an appreciation that single studies are rarely definitive and rarely justify a final decision. When evaluating the strength of the evidence, sophisticated researchers consider, in an admittedly subjective way, theoretical considerations such as scope, explanatory breadth, and predictive power; the worth of the auxiliary assumptions connecting nonobservational terms in theories to observational terms in empirical hypotheses; the strength of the experimental design; and implications for applications. To boil all this down to a binary decision based on a p-value threshold of 0.05, 0.01, 0.005, or anything else, is not acceptable.

AUTHOR CONTRIBUTIONS
All authors listed have made a direct contribution to the paper or endorse its content, and approved it for publication.

ACKNOWLEDGMENTS
We thank Sander Greenland and Rink Hoekstra for comments and discussions. MG acknowledges support from VEGA 2/0047/15 grant. RvdS was supported by a grant from the Netherlands organization for scientific research: NWO-VIDI-45-14-006. Publication was financially supported by grant 156294 from the Swiss National Science Foundation to VA.