Moving Beyond Traditional Null Hypothesis Testing: Evaluating Expectations Directly

This mini-review illustrates that testing the traditional null hypothesis is not always the appropriate strategy. Half in jest, we discuss Aristotle's scientific investigations into the shape of the earth in the context of evaluating the traditional null hypothesis. We conclude that Aristotle was actually interested in evaluating informative hypotheses. In contemporary science the situation is not much different. That is, many researchers have no particular interest in the traditional null hypothesis. More can be learned from data by evaluating specific expectations, or so-called informative hypotheses, than by testing the traditional null hypothesis. These informative hypotheses will be introduced while providing an overview of the literature on evaluating informative hypotheses.

What is "Wrong" With the traditional null hypothesis?
Cohen (1994) aptly summarized the criticism of traditional null hypothesis testing in the title of his paper "The earth is round (p < 0.05)." Let us elaborate on his criticism using an example inspired by this title originally meant to instruct and entertain.
The question of the shape of the earth was a recurring issue in scientific debate during the era of Aristotle (384-322 BC; see Rusell, 1997). By that time, the Greek idea that the earth was round dominated scientific thinking. The only serious opponents were the atomists Leucippus and Democritus, who still believed that the earth was a flat disk floating in the ocean, as certain ancient Mesopotamian philosophers had maintained. Now let us embark on some historical science fiction to tell the story of how Aristotle in his scientific investigations might have used different ways of evaluating hypotheses 1 .
We propose that in order to falsify the old Mesopotamian hypothesis, Aristotle might have used an approach based on testing the traditional null hypothesis: $H_0$: The shape of the earth is a flat disk; $H_1$: The shape of the earth is not a flat disk.
Clearly, these are not statistical hypotheses, and no actual statistical inference could have been carried out; they are purely designed to serve as an example.
So, in the set up of our reverse science fiction, Aristotle would have gathered data about the shape of the earth and found evidence against the null hypothesis, for example: stars that were seen in

introduction
The present mini-review argues that testing the traditional null hypothesis is not always the appropriate strategy. That is, many researchers have no particular interest in the hypothesis "nothing is going on" (Cohen, 1990). So why test a hypothesis one is not really interested in? The APA stresses in its publication manual that null hypothesis testing should be a starting point for statistical analyses: "Reporting elements such as effect sizes and confidence intervals are needed to convey the most complete meaning of the results" (American Psychological Association, 2001, p. 33; see also Fidler, 2002). In the current paper we go beyond this first step of reporting effect sizes and confidence intervals, arguing that specific expectations should be evaluated directly. As Osborne (2010) stated: "The world doesn't need another journal promulgating 20th century thinking, genuflecting at the altar of p < 0.05. I challenge us to challenge tradition" (p. 3). This is exactly what we set out to do in the current paper. Statistical tools for the evaluation of informative hypotheses are becoming available and are more often used in applications. We provide an overview of the current state of affairs for the evaluation of informative hypotheses. But first we argue, half in jest, what is "wrong" with the traditional null hypothesis and introduce the informative hypothesis.
One important prior note has to be made. Researchers like Wagenmakers et al. (2008) criticize T-tests for rendering no legitimate results and argue that p-values are prone to misinterpretation. Others, such as Coulson et al. (2010), or Fidler and Thompson (2001), explicitly argue against solely reporting p-values and argue for using confidence intervals. Along similar lines, Rosenthal et al. (2000) propose using focused contrasts, which could be used to evaluate expectations directly. However, in the current paper we will focus on developments in statistics that move beyond using confidence intervals, effect sizes, and planned contrasts.
In such a direct comparison the conclusion will be more informative.

What does this historical example teach us?
Evaluating specific expectations directly produces more useful results than sequentially testing traditional null hypotheses against catch-all rivals. We argue that researchers are often interested in the evaluation of informative hypotheses and already know that the traditional null hypothesis is an unrealistic hypothesis. This presupposes that prior knowledge often is available; if this is not the case, testing the traditional null hypothesis is appropriate. In most applied articles, however, prior knowledge is indeed available in the form of specific expectations about the ordering of statistical parameters. Let us illustrate this using an example of Van de Schoot et al. (2010). The authors investigated the association between popularity and antisocial behavior in a large sample of young adolescents from preparatory vocational schools (VMBO) in the Netherlands. In this setting, young adolescents are at increased risk of becoming (more) antisocial. Five so-called sociometric status groups were defined in terms of a combination of social preference and social impact: a popular, rejected, neglected, controversial, and an average group of adolescents. Each sociometric status group was characterized by distinct behavioral patterns which influenced the quality of social relations. For example, peer rejection was found to be related to antisocial behavior, whereas popular adolescents tended to be considered as well-known, attractive, athletic, and socially competent, although this group could also be antisocial, as was shown by Van de Schoot et al. (2010).
Suppose we want to compare these five sociometric status groups on the number of committed offenses reported to the police last year (minor theft, violence, and so on). Let the group means on the number of committed offenses be denoted by $\mu_1$ for the popular group, $\mu_2$ for the rejected group, $\mu_3$ for the neglected group, $\mu_4$ for the controversial group, and $\mu_5$ for the average group. Different types of hypotheses can be formulated; these are used in the procedures described in the remainder of this paper.
First, informative hypotheses can be formulated, denoted by $H_{I_1}, H_{I_2}, \ldots, H_{I_N}$ for a set of N hypotheses. These hypotheses contain information about the ordering of the parameters in a model, in our example the five means. Such expectations about the ordering of parameters can stem from previous studies, a literature review or even academic debate. Consider an imaginary hypothesis with inequalities between the five mean scores, $H_{I_1}: \mu_3 < \mu_1 < \mu_5 < \mu_2 < \mu_4$, where the neglected group is expected to commit fewer offenses compared to the popular group, who in turn are expected to commit fewer offenses compared to the average group, and so on. If no information is available about the ordering, this is denoted by a comma. Another expectation could be the hypothesis $H_{I_2}: \mu_3 < \{\mu_1, \mu_5, \mu_2\} < \mu_4$, where the neglected group is expected to commit fewer offenses compared to the popular, average, and rejected groups. There is no expected ordering between these three groups, but all three are expected to commit fewer offenses than the controversial group. The research question would be which of the two informative hypotheses receives most support from the data.
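The two informative hypotheses can be read as sets of order constraints on the five means. The following minimal sketch, which is not part of the original study, encodes each hypothesis as a check on a set of group means. The function names and the numeric values are hypothetical and serve only to show how the notation maps onto concrete comparisons.

```python
# Group means are keyed as in the text: 1 = popular, 2 = rejected,
# 3 = neglected, 4 = controversial, 5 = average.

def satisfies_h_i1(mu):
    """H_I1: mu3 < mu1 < mu5 < mu2 < mu4 (a complete ordering)."""
    return mu[3] < mu[1] < mu[5] < mu[2] < mu[4]


def satisfies_h_i2(mu):
    """H_I2: mu3 < {mu1, mu5, mu2} < mu4 (no ordering among the middle three)."""
    middle = (mu[1], mu[5], mu[2])
    return mu[3] < min(middle) and max(middle) < mu[4]


# Hypothetical means, invented for illustration only.
fully_ordered = {1: 0.8, 2: 1.1, 3: 0.3, 4: 1.9, 5: 0.9}
partly_ordered = {1: 1.2, 2: 0.7, 3: 0.3, 4: 1.9, 5: 0.9}

for label, means in (("fully ordered", fully_ordered),
                     ("partly ordered", partly_ordered)):
    print(label, satisfies_h_i1(means), satisfies_h_i2(means))
```

Any set of means that satisfies the first hypothesis also satisfies the second, which is one way to see that the first hypothesis is the more restrictive of the two.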
Egypt were not seen in countries north of Egypt, while stars that never were beyond the range of observation in northern Europe were seen to rise and set in Egypt. Such observations could not be taken as evidence of a flat earth. H 0 would have been rejected, leading Aristotle to conclude that the earth cannot be represented by a flat disk.
In actual fact, Aristotle agreed with Pythagoras (582 to ca. 507 BC), who believed that all astronomical objects have a spherical shape, including the earth. So, once again embarking on an episode of imaginary history, Aristotle might also have tested: $H_0'$: The shape of the earth is a sphere; $H_1'$: The shape of the earth is not a sphere. Now, imagine that Aristotle continued his search for data and that he gathered data yielding evidence against (!) the null hypothesis 2 : while standing on a mountain top, he noticed that the Earth's surface has many irregularities and concluded that if enough irregularities could be observed, this might provide just enough evidence to reject the null hypothesis. And so it might have happened that Aristotle once again rejected the null hypothesis, concluding that the earth is not a sphere [Cohen: "The earth is round (p < 0.05)"].
What can be learned from this conclusion? Not much! Both hypothesis tests reject the traditional null hypotheses $H_0$ and $H_0'$. As a next step, following the Neyman-Pearson procedure of hypothesis testing, we could tentatively adopt the alternative hypotheses $H_1$ and $H_1'$. This procedure tells us that the earth is neither a flat disk nor a sphere and consequently we remain ignorant of the earth's actual shape. This ignorance is a result of the "catch-all" alternative hypothesis as proposed by Neyman and Pearson (1967). Unfortunately, the catch-all includes all shapes that are non-flat and non-spherical, for example pear-shaped 3 .
Rather than using the hypothesis tests given above, we might argue that Aristotle was actually interested in evaluating: $H_A$: The shape of the earth is a flat disk, versus $H_B$: The shape of the earth is a sphere.

2 At the time, no one was able to see the earth as a whole and know it to be a sphere by direct observation. But it was possible to derive some conclusions from the hypothesis that the earth is a sphere and use these to test the null hypothesis. For example, one could predict that if someone sailed west for a sufficient amount of time, this person would return to the original starting point (Magellan did this). Or one could predict that if the earth was a sphere, ships at sea would first show their sails above the horizon, and then later, as they sailed closer, their hulls (Galileo observed this). These precise predictions, if exactly confirmed, would establish a provisional objective reality for the idea that the earth is a sphere.

3 Admittedly, not all methodologists would agree on this point. In response to Aristotle's imagined disappointment, Popper would have argued that this insight is all that Aristotelian science, or any science for that matter, can hope for. When it comes to general hypotheses, or hypotheses that are beyond the reach of direct verification, we can only be sure of their falsification. Direct positive evidence for hypotheses about the shape of the earth cannot be obtained, so there would be no reason for Aristotle to be disappointed. Popper would have argued that as there is no way to prove that the earth is spherical from direct verification, we can only hypothesize that it has the shape of a sphere. Since Aristotle found evidence demonstrating that the earth is not spherical, this hypothesis is rejected. In fact, according to Popperian reasoning, Aristotle should rejoice in the fact that at least he now knows the earth is not a sphere!

combination with inequality constraints imposed on regression coefficients.
The methodology consists of several steps to be performed with the aid of commonly used software, Mplus (Muthén and Muthén, 2007) 6 . Van de Schoot and Strohmeier (in press) introduce the methodology to non-statisticians and show that using this method results in a power gain. That is, fewer participants are needed to obtain a significant effect compared to a default chi-square test.

model selection approach
A second way of evaluating an informative hypothesis is to use a model selection approach. This is not a test of the model in the sense of hypothesis testing; rather, it is a comparison of statistical models using a trade-off between model fit and model complexity. Several competing statistical models may be ranked according to their value on the model selection tool used, and the one with the best trade-off is the winner of the model selection competition.
There is a variety of model selection procedures commonly used in practical applications, most notably Akaike's information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), and the deviance information criterion (DIC; Spiegelhalter et al., 2002). Problems with these standard model selection tools in the context of evaluating informative hypotheses arise because the tools are not equipped to deal with inequality constraints (Mulder et al., 2009a; Van de Schoot et al., under review-b). Although the model selection tools differ in their expression, the result always consists of two parts: the likelihood of the best fitting hypothesis within the model is a measure of model fit; and an expression containing the number of (effective) parameters of the model is a measure of complexity. The greater the number of dimensions, the greater the compensation for model complexity becomes. So, adding a parameter should be accompanied by an increase in model fit to compensate for the increase in complexity. The problem is that the expression of complexity is based on the number of parameters in the model and cannot take inequality constraints into account. That is, $H_{I_1}: \mu_3 < \mu_1 < \mu_5 < \mu_2 < \mu_4$ and $H_{I_2}: \mu_3 < \{\mu_1, \mu_5, \mu_2\} < \mu_4$ would receive the same measure for complexity, which is unwanted, because $H_{I_1}$ is more parsimonious than $H_{I_2}$, due to the additional restrictions imposed on the five means.
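The complexity problem can be made concrete with a small sketch. It assumes, purely for illustration, an ANOVA model with six parameters (five means plus one residual variance) and a hypothetical sample size of 500; neither number comes from the text. The standard AIC and BIC penalty terms are then identical for both informative hypotheses, even though the first hypothesis admits only 1 of the 120 possible orderings of the means and the second admits 6.

```python
import math
from itertools import permutations


def aic_penalty(n_params):
    """Complexity term of the AIC: 2k."""
    return 2.0 * n_params


def bic_penalty(n_params, n_obs):
    """Complexity term of the BIC: k * ln(n)."""
    return n_params * math.log(n_obs)


def satisfies_h_i1(mu):
    return mu[3] < mu[1] < mu[5] < mu[2] < mu[4]


def satisfies_h_i2(mu):
    middle = (mu[1], mu[5], mu[2])
    return mu[3] < min(middle) and max(middle) < mu[4]


def admitted_orderings(constraint):
    """Count how many of the 5! orderings of the means a hypothesis allows."""
    count = 0
    for ranks in permutations(range(1, 6)):
        mu = dict(zip(range(1, 6), ranks))  # mu[g] = rank of group g
        if constraint(mu):
            count += 1
    return count


N_PARAMS = 6   # assumption: five means plus one residual variance
N_OBS = 500    # hypothetical sample size

# The penalty depends only on the parameter count, not on the constraints,
# so it is the same number for H_I1 and H_I2.
print("AIC penalty:", aic_penalty(N_PARAMS))
print("BIC penalty:", bic_penalty(N_PARAMS, N_OBS))
print("orderings admitted by H_I1:", admitted_orderings(satisfies_h_i1))
print("orderings admitted by H_I2:", admitted_orderings(satisfies_h_i2))
```

The count of admitted orderings is only one simple way to express how restrictive a hypothesis is; it is used here to illustrate the argument, not as a replacement for the criteria discussed in the text.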
Alternative model selection tools have been proposed in the literature. First, an alternative model selection procedure is the paired-comparison information criterion (PCIC) proposed by Dayton (1998, 2003), with an application in Taylor et al. (2007). The PCIC is an exploratory approach which computes a default model selection tool for all logically possible subsets of group orderings. Originally, only source code for the programming language GAUSS was available for the PCIC (Dayton, 2001), but the PCIC has since been made available in a user-friendly interface 7 . The disadvantage of the PCIC is that it is an exploratory approach.
Second, there is the traditional null hypothesis (denoted by $H_0$), which states that nothing is going on and all groups have the same score, $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$. Third, if no constraints are imposed on any of the means and any ordering is equally likely, the hypothesis is called a "catch-all" alternative hypothesis, or an unconstrained hypothesis (denoted by $H_U$): $H_U: \mu_1, \mu_2, \mu_3, \mu_4, \mu_5$. In the next section we present an overview of possible alternatives for traditional null hypothesis testing to evaluate one or more informative hypotheses.

evaluating informative hypotheses
Different procedures are described in a range of sources that allow for the evaluation of informative hypotheses. We present an overview of technical papers, software, and applications for two types of approaches: (1) hypothesis testing approaches and (2) model selection approaches. Note that we limit ourselves to a discussion of papers where software is available for applied researchers.

hypothesis testing approach
Some approaches reported in the literature render a p-value for the comparison of $H_I$ with $H_0$ or with $H_U$. First, an adaptation of the traditional F-test for analysis of variance (ANOVA) was proposed by Silvapulle et al. (2002, see also Silvapulle and Sen, 2004), called the F-bar test. It is a confirmatory method to test one single informative hypothesis in two steps, for example: $H_0$ versus $H_{I_1}: \mu_3 < \mu_1 < \mu_5 < \mu_2 < \mu_4$, and $H_{I_1}$ versus $H_U$, where in the second hypothesis test $H_{I_1}$ serves as the null hypothesis. Software for the F-bar test is described in , but applications have not yet, to our knowledge, been reported in the literature. Application of the F-bar test is easy using the software 4 and the results are comparable with a classical F-test. The disadvantage is that only one single informative hypothesis at a time can be evaluated, and only for univariate ANOVA.
Testing informative hypotheses for structural equation models (SEM) is described in Stoel et al. (2006), where constraints are imposed on variance terms to obtain only positive values (see also Gonzalez and Griffin, 2001). A likelihood ratio test is used and the software is available in the statistical package R (R Development Core Team, 2005) 5 .
The procedure described in Van de Schoot et al. (2010) also makes use of a likelihood ratio test, but goes one step further than Stoel et al. (2006). A parametric bootstrap procedure is used in

Second, the literature also contains one modification of the AIC that can be used in the context of inequality constrained ANOVA models. It is called the order-restricted information criterion (ORIC; Anraku, 1999; Kuiper et al., in press) with an application in Hothorn et al. (2009). It can be used for the evaluation of models differing in the order restrictions among a set of means. Inequality constraints are taken into account in the estimation of the likelihood and in the penalty term of the ORIC. Software for ORIC is described in . The ORIC is as yet only available for ANOVA models, but a generalization is under construction.
Alternatives for the BIC and the DIC are under construction: see Romeijn et al. (under review) and Van de Schoot et al. (under review-a), respectively.
Finally, one other method of model selection, which is receiving more and more attention in the literature, involves the evaluation of informative hypotheses using Bayes factors. In this method each (informative) hypothesis of interest is provided with a "degree of support" which tells us exactly how much support there is for each of the hypotheses under investigation. This process involves collecting evidence that is meant to provide support for or against a given hypothesis; as evidence accumulates, the degree of support for a hypothesis increases or decreases.
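To give a feel for how such a degree of support might be computed, the sketch below uses one formulation found in the Bayesian literature on inequality constraints: the Bayes factor of an informative hypothesis against the unconstrained hypothesis is approximated by the proportion of posterior draws that agree with the constraints (fit) divided by the proportion of prior draws that agree with them (complexity). This formulation is an assumption here, as the text does not spell out a formula. The prior, the posterior means, and the posterior standard deviation are all hypothetical, and the independent normal posterior is a simplification rather than the method of any cited paper.

```python
import numpy as np


def agrees_with_h_i1(mu):
    """Row-wise check of H_I1: mu3 < mu1 < mu5 < mu2 < mu4."""
    m1, m2, m3, m4, m5 = mu.T
    return (m3 < m1) & (m1 < m5) & (m5 < m2) & (m2 < m4)


def agrees_with_h_i2(mu):
    """Row-wise check of H_I2: mu3 < {mu1, mu5, mu2} < mu4."""
    m1, m2, m3, m4, m5 = mu.T
    lowest_middle = np.minimum(np.minimum(m1, m5), m2)
    highest_middle = np.maximum(np.maximum(m1, m5), m2)
    return (m3 < lowest_middle) & (highest_middle < m4)


def support_vs_unconstrained(constraint, prior_draws, posterior_draws):
    """Return (fit, complexity, Bayes factor against H_U)."""
    complexity = constraint(prior_draws).mean()
    fit = constraint(posterior_draws).mean()
    return fit, complexity, fit / complexity


rng = np.random.default_rng(2011)
n_draws = 200_000

# Assumption: the same vague prior for every mean, so that every
# ordering of the five means is equally likely a priori.
prior_draws = rng.normal(0.0, 10.0, size=(n_draws, 5))

# Hypothetical posterior: independent normals around made-up group means
# (columns are mu1..mu5), with a made-up posterior standard deviation.
posterior_means = np.array([0.8, 1.1, 0.3, 1.9, 0.9])
posterior_draws = rng.normal(posterior_means, 0.15, size=(n_draws, 5))

fit_1, complexity_1, bf_1u = support_vs_unconstrained(
    agrees_with_h_i1, prior_draws, posterior_draws)
fit_2, complexity_2, bf_2u = support_vs_unconstrained(
    agrees_with_h_i2, prior_draws, posterior_draws)

print("H_I1: fit, complexity, BF vs H_U:", fit_1, complexity_1, bf_1u)
print("H_I2: fit, complexity, BF vs H_U:", fit_2, complexity_2, bf_2u)
print("BF of H_I1 vs H_I2:", bf_1u / bf_2u)
```

In this sketch the complexity of the fully ordered hypothesis comes out close to 1/120 and that of the partly ordered hypothesis close to 6/120, so the more restrictive hypothesis is rewarded more strongly when the data agree with it. This is the kind of distinction that a penalty based only on the number of parameters cannot make.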
The methodology of evaluating a set of inequality constrained hypotheses has proven to be a flexible tool that can deal with many types of constraints. We refer to the book of Hoijtink et al. (2008b), and the papers of Van de Schoot et al. (in press) and Van de Schoot et al. (2011) as a first step for interested readers. For a philosophical background, see Romeijn and Van de Schoot (2008) and for more information on hypothesis elicitation, see Van Wesel et al. (under review). Various papers describe comparisons between traditional null hypothesis testing and Bayesian evaluation of informative hypotheses; see Hoijtink et al. (2008b), Hoijtink and Klugkist (2007), and Van de Schoot et al. (2011).

conclusion
Statistics has come a long way since the early beginnings of testing the traditional null hypothesis of "nothing is going on." Developments in statistics, in particular specific developments in the evaluation of informative hypotheses, allow researchers to directly evaluate their expectations specified with inequality constraints. This mini-review illustrates that testing the traditional null hypothesis is not always an appropriate strategy. We argued that more can be learned from data by evaluating informative hypotheses than by testing the traditional null hypothesis. These informative hypotheses were introduced by means of an example. Finally, we presented the current state of affairs in the area of evaluating informative hypotheses.