Replication Requires Psychological Rather than Statistical Hypotheses: The Case of Eye Movements Enhancing Word Recollection

Can an experiment be replicated in a mechanical fashion without considering the processes underlying the initial results? Here I will consider a non-replication of Saccade-Induced Retrieval Enhancement (SIRE) and argue that it results from focusing on statistical instead of on substantive process hypotheses. In particular, the theoretical integration of SIRE with Eye Movement Desensitization and Reprocessing (EMDR) therapy provides clues about when the memory enhancement should occur. A relatively large memory enhancement effect should be observed in participants with a consistent (i.e., extreme right or left) handedness (a) when they are explicitly instructed to retrieve and imagine the memories during the eye manipulation, and (b) for emotionally negative material. A finer theoretical analysis may thus well explain the contrast between the original SIRE studies and the non-replication. The findings from preregistered confirmatory research (i.e., focusing solely on statistical hypotheses) should also be considered preliminary, representing shifts on a gradual scale of evidence, and awaiting interpretation in terms of theoretical hypotheses. Stronger, but still not definitive, conclusions are better postponed until after multi-study meta-analyses with theoretically motivated moderator variables have been performed.

Statistical tests serve as a means for making short-term decisions about statistical hypotheses, each of which can in turn correspond to an infinitely large number of different substantive process hypotheses. Such decisions may be highly fallible due to the occurrence not only of bias and questionable research practices (QRPs) but also of inadvertent technical and data processing errors. Even if the effect exists, moreover, it is expected to fail to reach significance a number of times (i.e., the power never equals one), and the false negative rate may even be higher than the false positive rate (cf. Fiedler et al., 2012). No conclusion about non-significant effects can be drawn either way, and probably a preference for the initial hypothesis should be kept (cf. Dienes, 2014). Moreover, even if an effect becomes significant, the finding may frequently be false (i.e., even without any bias or QRPs; Ioannidis, 2005). Stroebe et al. (2012) were equally pessimistic about the self-corrective power in psychology and other fields of science (e.g., even physics), but limited their analysis to scientific fraud cases, for which they noted that short-term self-correction by replication and peer review did not seem to work. Confirmatory research practices would thus only filter out a percentage of false findings (and miss many "true" findings), but the remainder probably requires a longer and more gradual process of theoretical considerations and comparisons to other findings.
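The claim that significant findings may frequently be false even without bias can be made concrete with the positive predictive value from Ioannidis (2005). The following is a minimal sketch, not a reanalysis of any study: the function name and the input values (prior odds, power, alpha) are illustrative assumptions and are not estimates for SIRE.

```python
def positive_predictive_value(prior_odds, power, alpha=0.05):
    """Share of significant findings that are true, assuming no bias.

    prior_odds: ratio of true to false hypotheses among those tested
                (R in Ioannidis, 2005); an assumed input, not a measured one.
    power:      probability of a significant result when the effect exists.
    alpha:      probability of a significant result when it does not.
    """
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)


def false_negative_rate(power):
    """Probability of missing an effect that exists (beta = 1 - power)."""
    return 1.0 - power


# Hypothetical scenario: one in eleven tested hypotheses is true and
# power is 50%. Both numbers are assumptions chosen for illustration.
ppv = positive_predictive_value(prior_odds=0.1, power=0.5)
beta = false_negative_rate(power=0.5)
print(f"PPV = {ppv:.2f}, false negative rate = {beta:.2f}")
```

Under these assumed inputs, the false negative rate far exceeds the nominal false positive rate, and a substantial share of significant findings is false without any misconduct being involved.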
I would argue that self-correction in science, including that of bona fide but false claims, mostly does not result from active discussions or confirmatory research, but from a more passive quasi-Darwinian selection of ideas and hypotheses working on longer time scales (e.g., generations of scientists). Evolutionary development probably represents the most powerful optimization process available, and may well also apply to science (cf. Holland, 1975; Dawkins, 1986). The above fallibility of statistical tests severely limits their contribution to the quasi-evolutionary selection process. More often, false hypotheses are ignored in the long run (i.e., become "extinct"), whereas hypotheses that are more consistently supported by the evidence and fit in ongoing discussions have higher chances of survival and reproduction. The single-experiment support or rejection of a statistical hypothesis in the purely confirmatory view is better replaced by a multi-experiment weighing of psychological hypotheses, which can be represented as different levels of a theoretically motivated moderator variable in a meta-analysis (cf. Ioannidis, 2005). Even after a meta-analysis, one cannot be completely sure that one's decisions about particular hypotheses are "true." This is not too dissimilar from the optimization process performed by evolution: scientific development can also get stuck in local optima, not being able to reach even fitter solutions.
Replication studies conforming to confirmatory standards may still be useful instruments, as long as they are sufficiently theoretically informed. In a purely mechanical view on replication, however, researchers try to reproduce statistical outcomes of tasks rather than predictions of well-specified theories (cf. Klein, 2014). These replicators run the risk of neglecting important moderators that may also not have been recognized explicitly in the initial, to-be-replicated study. Such a hidden variable may inadvertently have been set to different values in the original and the replication studies, which can even lead to opposing outcomes. If in this case falsification is erroneously concluded from non-replication, this may hamper the development of science rather than foster it. It is certainly true that the hidden moderators invoked by researchers whose findings were not replicated may sometimes appear trivial (e.g., testing in cubicles) and unrelated to theory (for this critique, see Yong, 2012; Klein, 2014), but I will discuss an example of a non-replication where these variables could have been derived a priori from prominent theories in the field. The non-replication by Matzke et al. (2015) of Saccade-Induced Retrieval Enhancement (SIRE; e.g., Lyle et al., 2008) seems to suffer from such a theoretical neglect. Determining whether the initial result or the non-replication is "false" does not seem possible by statistical tests alone, but also requires consideration of the underlying process hypotheses and their associated hidden variables.

EYE MOVEMENTS, MEMORY, AND EMOTION
Two main fields of eye movement (EM) research, sharing an interest in memory processing after short periods of EMs, were linked by Matzke et al. (2015). SIRE investigates the enhancement of predominantly emotionally neutral memories after executing EMs (e.g., Lyle et al., 2008). Eye Movement Desensitization and Reprocessing therapy (EMDR; Shapiro, 1989) deals with the emotional processing of traumatic and anxious memories due to EMs (e.g., Armstrong and Vaughan, 1996; Lee and Cuijpers, 2013). The original SIRE studies did not refer to EMDR, although the eye manipulation, involving a 30 s period of EMs at a 1 s pace, was very similar to the therapeutic procedure. The to-be-retrieved material, moreover, did not consist of traumatic memories, as in EMDR therapy, but of low-to-medium frequency, largely affectively neutral, words.
In contrast to the growing confidence in the effectiveness of EMDR (e.g., van den Hout and Engelhard, 2012; Lee and Cuijpers, 2013), the evidence for SIRE has suffered from the non-replication by Matzke et al. (2015). These authors joined in an adversarial-collaboration replication study as proponents or skeptics of SIRE and could not reproduce the memory enhancement obtained by Lyle et al. (2008). Bayesian statistics revealed that the observed data were 15 times more likely under H0 (i.e., no difference) than under H1 (i.e., a difference in memory performance between eye manipulation conditions). The proponents in this study were not convinced by this single failure to replicate, but the skeptics even raised the possibility of bias and QRPs on the side of the SIRE research community to explain the initial finding. This conclusion does not seem warranted due to the high rate of false positives (cf. Ioannidis, 2005) and false negatives (cf. Fiedler et al., 2012) even in the absence of bias and QRPs. In addition, it disregards theoretical reasons for the discrepancy.
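A Bayes factor shifts the odds between two hypotheses rather than settling the matter. As a hedged illustration, the sketch below converts a Bayes factor in favor of H0 into a posterior probability for different prior beliefs. The function name and the prior probabilities are assumptions made for illustration only; only the factor of 15 comes from the text above.

```python
def posterior_probability_h0(bf01, prior_prob_h0=0.5):
    """Posterior probability of H0 given a Bayes factor favoring H0.

    bf01:          how many times more likely the data are under H0 than H1.
    prior_prob_h0: belief in H0 before seeing the data (assumed input).
    """
    prior_odds = prior_prob_h0 / (1.0 - prior_prob_h0)
    posterior_odds = bf01 * prior_odds  # Bayes' rule in odds form
    return posterior_odds / (1.0 + posterior_odds)


# The same evidence weighed by a neutral reader and by a reader who,
# on the basis of earlier studies, initially favored H1 (assumed priors).
for prior in (0.5, 0.2):
    posterior = posterior_probability_h0(15.0, prior_prob_h0=prior)
    print(f"prior P(H0) = {prior:.1f} -> posterior P(H0) = {posterior:.3f}")
```

The result depends on the prior, which is one way to express that a single study moves a position along a gradual scale of evidence.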
Only one account of SIRE was considered in the non-replication (i.e., the hemispheric interaction hypothesis; Lyle et al., 2008, 2012), which had previously been dismissed by the proponents in the adversarial collaboration (Samara et al., 2011). Other influential accounts, primarily for EMDR, such as the working memory account (Andrade et al., 1997) or the orienting response account (Armstrong and Vaughan, 1996; Stickgold, 2002), as well as the newer top-down attentional control account from the SIRE domain (Edlin and Lyle, 2013; Lyle and Edlin, 2015) were completely ignored. Elsewhere (Phaf, submitted), I have identified crucial hidden variables based on the linking of theoretical accounts for SIRE and EMDR that may well explain the contrast between the original SIRE findings and the non-replication.
Two variables suggest themselves from the application of EMDR accounts to SIRE. Neither the retrieval and re-imagining of memories during EMs nor the emotionality of the memories was deemed important in SIRE research. Matzke et al. (2015) even explicitly suppressed the former influence by including a recency buffer at the end of the study list, and moreover strictly selected for affectively neutral words. Because Lyle et al. (2008, 2012) had the EMs performed immediately after study, some recently presented words may still have been active during the EMs. The absence of selection for neutrality there also meant that there could have been an unknown proportion of negative words in the list. In a non-preregistered (i.e., exploratory, in the statistical classification of Wagenmakers et al., 2012) study, an explicit retrieval instruction during the eye manipulation and the strict selection of negative material strongly amplified memory enhancement, far exceeding the effect sizes commonly reported for SIRE (Phaf, submitted). However, this experiment also cannot yield conclusive evidence concerning the crucial hidden factors, but should be followed up by further research that explicitly compares instructions to re-imagine with attempts to suppress such retrieval. The control over, lack of control over, or attempts to actively suppress memory re-activation during EMs could then serve as a moderator variable in meta-analyses of SIRE. The valence of the studied material, as well as the absence of control over valence, could be another moderator variable. To corroborate the present hypotheses, the largest effect sizes should be obtained with memory (re-activation) during EMs and for negative material. Instead of getting bogged down in a statistical impasse, such, probably exploratory, research would eventually advance our understanding of SIRE and may even help to improve EMDR.

REPLICATION REQUIRES THEORY
Statistical testing is not a goal in itself in psychology, but the development of theory is. The confirmatory type of research proposed by Wagenmakers et al. (2012) tries to validate the statistical tests, but does not necessarily provide meaning to the results. Without a theoretical specification of the hypotheses even significant findings can mean anything, and their application (e.g., in case of practical interventions) may remain "magical" (as has been argued for EMDR, McNally, 1999). The non-replication of Matzke et al. (2015) provides an example of not sufficiently addressing theory. Process hypotheses could have been derived here not only from EMDR but also from other potential sources (e.g., visual attention, working memory). This a-theoretical stance is fostered by an over-reliance on statistical tests. The practice of only describing test statistics but not actual results (e.g., means and measures of variance) in results sections of research papers (e.g., many studies had to be excluded for this reason from the meta-analysis of Phaf et al., 2014) further illustrates the frequent prioritization of mechanical statistical testing over theoretical analysis. The emphasis in these papers should shift from establishing that "something is there" to estimating and explaining what exactly is happening in the results.
Too often statistical testing acts as a stop criterion, which consists of the simple decision rule that an effect is there if it is significant and not there if it is non-significant, taken to indicate that no further theoretical analysis is needed. The non-replication of Matzke et al. (2015) may have reached this stop criterion even earlier, not after the tests were performed but in the initial stage when the tests were planned and preregistered. Even more theoretical work is required, however, after non-significance than after significance. If one considers a theoretical hypothesis to be refuted by non-significance, a superior alternative should always be formulated according to modern philosophy of science (e.g., Lakatos, 1970). There can be no hypothesis abandonment without hypothesis replacement. The utilization of the stop criterion distinguishes mechanical replication attempts of statistical hypotheses from theoretically informed replication attempts of substantive hypotheses. Some researchers even use it as a tool for relieving them from the burden of having to delve into a largely confusing abundance of prior findings and hypotheses. The stop criterion frequently results in what Ioannidis (2005) calls the Proteus phenomenon, in which squarely contradicting, but both significant, sets of results are published shortly after one another, sometimes even in the same journal, without referring to each other. The opposing findings are not necessarily caused by bias or QRPs in one of the studies, but may simply reflect the majority of bona fide significant findings being false, as Ioannidis argues. Another unfortunate consequence of this criterion is that the same research is often repeated over and over again (i.e., "the wheel is reinvented"), sometimes with slight modifications or (e.g., neuro-imaging) additions, while the researchers remain unaware of previous work. Due to the frequent application of the statistical stop criterion for theoretical analysis, psychology often does not seem to learn from its own research.
In terms of substantive hypotheses, classical null hypothesis statistical testing performs a kind of inverse, rather than direct, falsification. Instead of trying to falsify a concrete hypothesis, one tries to establish evidence against there being nothing there. After rejecting the null hypothesis, one claims that this rejection supports one's proposed hypothesis, which may take any form other than the null. The H1 thus extends to an infinite range of theories, and could better be renamed H∞ to recognize the theoretical indifference of this hypothesis. This contrasts sharply with the dominant falsification practice in, for instance, physics. Here a non-trivial hypothesis is disconfirmed when the values predicted by theory fall outside the uncertainty interval around the observed results (e.g., Taylor, 1982). This approach compares predicted and actual results and concludes non-falsification in the absence rather than presence of a difference.
The physics approach to data analysis also entails more attention to measurement accuracy than in psychology. Non-significance in classical null hypothesis statistical testing more often indicates a lack of measurement accuracy than an absence of difference, however small it may be (cf. Cohen, 1990). Physical theories are undoubtedly among the most numerical and highly developed in the whole of science, and therefore probably better suited to this approach than psychological theories. The rigid application of statistical hypothesis-testing, however, seems to have aggravated the neglect of theory in psychology.
A single experiment simply cannot serve to decide conclusively whether a claim is false or not (cf. Hauer, 2004; Ioannidis, 2005). It merely adds weight, proportional to the accuracy of its measurements, to one or the other position. A publication of a new effect should be considered suggestive, but certainly not definitive "proof" (cf. Phaf et al., 2014). In the words of Medawar (1991): "In the outcome science is not a collection of facts or of unquestionable generalizations, but a logically connected network of hypotheses which represent our current opinion about what the real world is like." (p. 98) Scientific exaggeration is often required by funding agencies for research marketing purposes (also called "valorization" at Dutch universities), but may induce QRPs and even fraud. Scientific prudence and modesty seem better ways to reach a durable development of science. Confirmatory researchers may inadvertently add to this exaggeration, because they are inclined to think of science as collecting conclusive, sometimes even "proven," facts, whereas history has shown it to consist of ongoing discussions with continuous weight shifts between alternative hypotheses (cf. Lakatos, 1970).
If null hypothesis statistical testing detracts from psychological hypotheses and even induces a false sense of certainty, why not abandon null hypothesis statistical testing altogether (cf. Cumming, 2014)? The reporting of only effect sizes and confidence intervals (CIs) may actually reduce publication bias, because the latter is based more often on significance levels than on effect sizes (cf. Simonsohn et al., 2014). In addition, these estimation statistics are more informative, because they, similar to physics, focus on what the effect is rather than on what it is not. CIs should be used as an indication of measurement accuracy rather than for making decisions on whether some unspecified "effect" is there or not (e.g., contains zero; see Gardner and Altman, 1986). The latter decisions are highly fallible (a majority is probably "false," see Ioannidis, 2005; Fiedler et al., 2012), and we need other, more theoretical, arguments to determine the level of support for a hypothesis provided by a set of results. Stronger, but still not infallible, conclusions are better postponed until after meta-analyses on the proposed hypotheses have been performed (Schmidt, 1996). These meta-analyses have the additional advantage of identifying publication bias and being able to correct for it with the Trim-and-Fill method (Duval and Tweedie, 2000), or possibly with the p-curve method (Simonsohn et al., 2014). Only when effect size and the extent of publication bias can be judged in a meta-analysis can one have more confidence in a finding.
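The estimation approach can be sketched in a few lines. The example below computes a standardized mean difference (Cohen's d) with an approximate 95% CI from group means, standard deviations, and sample sizes, which are exactly the descriptive results that a meta-analysis needs. It is a minimal sketch under stated assumptions: the function name is hypothetical, the standard error uses a common large-sample approximation, and the input numbers are invented for illustration and do not come from any SIRE study.

```python
import math


def cohens_d_with_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Cohen's d for two independent groups with an approximate CI.

    Returns (d, lower, upper). The interval relies on a normal
    approximation to the sampling distribution of d.
    """
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    d = (mean1 - mean2) / math.sqrt(pooled_var)
    # Large-sample approximation of the standard error of d.
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, d - z * se, d + z * se


# Invented numbers: the same mean difference measured with a small
# and with a large sample.
for n in (20, 200):
    d, lower, upper = cohens_d_with_ci(10.0, 4.0, n, 8.0, 4.0, n)
    print(f"n = {n} per group: d = {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

The point estimate is identical in both cases; only the width of the interval, and thus the measurement accuracy, differs. Reading the interval as accuracy rather than as a yes/no verdict is the use of CIs advocated above.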
The primary aim of this comment is to juxtapose the statistically oriented approach and a more theoretically oriented approach. The statistical approach of Wagenmakers et al. (2012) entails a two-way classification into either exploratory or hypothesis-confirmatory research. The latter type can only have a binary outcome with respect to the decision being made: the hypothesis is either confirmed or not. To arrive at such an outcome, a replication attempt must rely on the original research having uncovered and made explicit all relevant processes (i.e., an exhaustive theoretical analysis). All other types of research fall in the exploratory category, even when they further develop the theory starting from quite specific hypotheses. Merely confirming preregistered hypotheses has, however, never yielded new hypotheses, whereas unexpected findings stimulating further investigations do have that capacity and may even be the royal road to scientific innovation (e.g., Lehrer, 2009). Calling it undirected exploratory research, moreover, also does not do justice to the gradual progress-by-adjustment type of research (cf. Lakatos, 1970). The latter type of research is often guided by well-specified and concrete process hypotheses, which may be far superior to merely expecting a difference. Although I think we should try to move away from null hypothesis statistical testing (cf. Cumming, 2014), in the meantime a statistical approach to experimental psychology should become more theoretically oriented and include a third category: hypothesis-guided research.

AUTHOR CONTRIBUTIONS
The author confirms being the sole contributor of this work and approved it for publication.