Edited by: Rene Zeelenberg, Erasmus University Rotterdam, Netherlands

Reviewed by: Guido P. H. Band, Leiden University, Netherlands; Christoph Stahl, University of Cologne, Germany

*Correspondence: Mark Rotteveel, Social Psychology Program, Department of Psychology, Faculty of Behavioural and Social Sciences, University of Amsterdam, Weesperplein 4, 1018 XA, Amsterdam, Netherlands

This article was submitted to Cognition, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Within the literature on emotion and behavioral action, studies on approach-avoidance occupy a prominent place. Several experimental paradigms feature successful conceptual replications, but many original studies have not yet been replicated directly. We present such a direct replication attempt of two seminal experiments originally conducted by Chen and Bargh (1999).

Several prominent psychological theories (Frijda,

In the field of emotion and approach-avoidance behavior, the Chen and Bargh (

In their second experiment, CB manipulated congruency within participants and eliminated the explicit evaluative instruction. Specifically, participants were instructed to respond to the mere presentation of the attitude objects; in one block of trials, participants had to push the lever (i.e., execute an “away-from-yourself” movement), and in another block they had to pull the lever (i.e., execute a “toward-yourself” movement). The results again demonstrated a congruency effect: pulling was faster for positively valenced attitude objects, and pushing was faster for negatively valenced attitude objects. On the basis of these results, CB concluded that (1) attitude objects are automatically evaluated; and (2) attitude objects automatically trigger a behavioral tendency to approach or avoid.

Since the publication of CB, numerous papers have been published in which approach and avoidance behavior was studied; however, the automatic link between affective evaluation and approach-avoidance tendencies was often simply taken for granted. To complicate matters, different results have been obtained using different experimental paradigms such as the manikin task (e.g., De Houwer et al.,

Replication is at the core of the scientific effort to further our understanding of the empirical world. Many effects do replicate reliably across laboratories in psychology (e.g., Simons,

Direct replications benefit from preregistration of the design and analysis plan, ensuring a clean separation between which analyses are pre-planned (i.e., confirmatory, hypothesis-testing) and which analyses are exploratory (i.e., hypothesis-generating).

Our decision to replicate the CB studies was motivated in part by a recent meta-analysis on approach and avoidance behavior including 29 published studies and 81 effect sizes (Phaf et al., ^{1}

In direct replication studies it is essential to be able to quantify evidence in favor of the null hypothesis. In addition, it is desirable to collect data until the results are compelling. Neither desideratum can be accomplished within the framework of frequentist statistics, and this is why our analysis of both experiments will rely on hypothesis testing using the Bayes factor (e.g., Edwards et al.,

A frequentist analysis would start with an assessment of the effect size of Experiment 1 from CB which would then form the basis of a power analysis to determine the number of participants that yields a specific probability for rejecting the null hypothesis when it is false. This frequentist analysis plan is needlessly constraining and potentially wasteful: the experiment cannot continue after the planned number of participants has been tested, and it cannot stop even when the data yield a compelling result earlier than expected (e.g., Wagenmakers,

BF_{01} = Bayes factor in favor of the null hypothesis; H0 = null hypothesis; H1 = alternative hypothesis. The horizontal black dashed line indicates completely ambiguous evidence, and the horizontal dashed gray lines indicate strong evidence either in favor of the null hypothesis [i.e., log(BF_{01}) ≥ log(10)] or in favor of the alternative hypothesis [i.e., log(BF_{01}) ≤ log(1/10)].

BF_{01} = Bayes factor in favor of the null hypothesis; H0 = null hypothesis; H1 = alternative hypothesis. The horizontal black dashed line indicates completely ambiguous evidence, and the horizontal dashed gray lines indicate strong evidence either in favor of the null hypothesis [i.e., log(BF_{01}) ≥ log(10)] or in favor of the alternative hypothesis [i.e., log(BF_{01}) ≤ log(1/10)].

Based on the above considerations, our sampling plan was as follows: We set out to collect a minimum of 20 participants in each between-subject condition (i.e., the congruent and incongruent condition, for a minimum of 40 participants in total). Next we planned to monitor the Bayes factor and stop the experiment whenever both critical hypothesis tests (detailed below) reached a Bayes factor that could be considered “strong” evidence (Jeffreys,
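This sampling plan can be sketched as a simple monitoring loop. In the sketch below, the practical maximum is a hypothetical safeguard we added for illustration; it is not part of the preregistration:

```python
# Illustrative sketch of the stopping rule described above: after a minimum
# number of participants, stop once BOTH critical Bayes factors show strong
# evidence (BF01 >= 10, favoring H0, or BF01 <= 1/10, favoring H1).
# The cap n_max is a hypothetical resource limit, not from the paper.
def is_strong(bf01: float) -> bool:
    return bf01 >= 10 or bf01 <= 1 / 10

def should_stop(bf01_good: float, bf01_bad: float, n_tested: int,
                n_min: int = 40, n_max: int = 200) -> bool:
    if n_tested < n_min:
        return False            # always test the preregistered minimum
    if n_tested >= n_max:
        return True             # hypothetical resource limit
    return is_strong(bf01_good) and is_strong(bf01_bad)
```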

We recruited 100 students (23 male; mean age = 21.2 years, SD = 0.42; Congruent: 10 male; Incongruent: 13 male) from the University of Amsterdam. All participants were rewarded with course credits or €5. Only students with Dutch as their native language were allowed to participate; one participant did not meet this criterion and was excluded from further analysis. All participants were informed about the procedure with an information brochure and subsequently signed an informed consent form.

Participants were seated in a dimly lit room, approximately one arm's length from a computer monitor. A 100 cm long lever (see Figure ^{2}

After reading the information brochure and signing an informed consent form, participants were seated in front of the computer screen with the lever next to their dominant hand, after which they read the procedure of the experiment off the screen. Participants were asked to classify the targets presented on the screen as either “good” or “bad” (for the exact wording of the specific lever movement instructions see Appendix

Before the start of the actual experiment, we confirmed that participants understood the instructions by having them perform 10 practice trials with 10 separate targets that were not part of the 78 target words (except for “stereo,” see Table

During the experiment, each target remained on screen until the participant pulled or pushed the lever beyond the angle necessary to trigger the response switches (15.6° and 15.3°, respectively). The computer recorded the time between the appearance of the target, the onset of the lever movement, and the triggering of the response switch, as specified before; it also recorded whether the lever had been pulled or pushed. After every response the target disappeared, and 4 s later the next trial commenced with a new target at the center of the screen. The targets were presented in random order, each appearing only once. After the last trial, the experimenter returned to the room to thank and debrief the participant.

Based on the reasoning of CB and our own pilot tests, all trials with latencies greater than 3000 ms or smaller than 300 ms were excluded from further analysis (pulling with “good” judgments: 1.7%; pulling with “bad” judgments: 2.2%; pushing with “good” judgments: 1%; pushing with “bad” judgments: 1.2%). These outlier-removal criteria had been specified in the preregistration document. Whereas CB removed only latencies greater than 4000 ms, we reduced this cutoff because pilot testing showed that in our setup the lever can easily be pushed or pulled well within 4000 ms. As in CB, and as specified in the preregistration document, the dependent measure for all analyses was the mean log(10)-transformed response latency for every participant. Results are reported as untransformed response latencies. The crucial hypothesis concerns the interaction that describes the congruency effect. Specifically, the congruency effect can be decomposed into two directional hypotheses: the first hypothesis states that participants respond faster to a positive target by pulling rather than pushing the lever; the second states that participants respond faster to a negative target by pushing rather than pulling the lever. As specified in the preregistration document, the two crucial hypotheses were assessed separately by means of two default Bayes factors for unpaired, one-sided t-tests.
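The preprocessing steps above (trimming latencies outside 300–3000 ms, then averaging the log10-transformed latencies per participant and design cell) can be sketched as follows; the data-frame column names are hypothetical placeholders, not the study's actual variable names:

```python
# Sketch of the preprocessing described above. Column names ("participant",
# "evaluation", "movement", "rt_ms") are hypothetical placeholders.
import numpy as np
import pandas as pd

def preprocess(trials: pd.DataFrame) -> pd.DataFrame:
    # Exclude latencies below 300 ms or above 3000 ms.
    kept = trials[trials["rt_ms"].between(300, 3000)].copy()
    # The analyses use mean log10-transformed latencies per participant/cell.
    kept["log_rt"] = np.log10(kept["rt_ms"])
    return (kept.groupby(["participant", "evaluation", "movement"],
                         as_index=False)["log_rt"].mean())
```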

As described above, Bayes factors quantify the support that the data provide for the null hypothesis vis-a-vis the alternative hypothesis. Support in favor of the alternative hypotheses constitutes support in favor of the effects reported by CB in their Experiment 1.

For “good” evaluations, pulling the lever was a little faster than pushing. Nonetheless, the preregistered Bayesian tests favored the null hypothesis: BF_{01} = 4.51 for “good” evaluations (i.e., the data for “good” evaluations are almost five times more likely under the null hypothesis than under the alternative hypothesis) and BF_{01} = 1.95 for “bad” evaluations (i.e., the data for “bad” evaluations are almost twice as likely under the null hypothesis as under the alternative hypothesis). Figures

^{*}In Experiment 1, response latencies reflect good vs. bad judgments, whereas response latencies in Experiment 2 reflect responses to good vs. bad words.

 | Experiment 1: Pull | Experiment 1: Push | Experiment 2: Pull | Experiment 2: Push |
Good^{*} | 1147 (29) | 1165 (30) | 562 (13) | 571 (14) |
Bad^{*} | 1267 (39) | 1204 (35) | 574 (12) | 562 (13) |

To probe the robustness of our conclusions, we varied the shape of the prior for the effect size under the alternative hypothesis. Figures

log(BF_{01}) is plotted as a function of the scale parameter r of the Cauchy prior for the effect size under the alternative hypothesis. The dot indicates the result from the default prior, the horizontal black dashed line indicates completely ambiguous evidence, and the horizontal dashed gray lines indicate strong evidence either in favor of the null hypothesis [i.e., log(BF_{01}) ≥ log(10)] or in favor of the alternative hypothesis [i.e., log(BF_{01}) ≤ log(1/10)].

log(BF_{01}) is plotted as a function of the scale parameter r of the Cauchy prior for the effect size under the alternative hypothesis. The dot indicates the result from the default prior, the horizontal black dashed line indicates completely ambiguous evidence, and the horizontal dashed gray lines indicate strong evidence either in favor of the null hypothesis [i.e., log(BF_{01}) ≥ log(10)] or in favor of the alternative hypothesis [i.e., log(BF_{01}) ≤ log(1/10)].

For completeness, we also analyzed the data using a frequentist repeated measures 2 (Evaluation: “Good” vs. “Bad”) × 2 (Instruction: Congruent vs. Incongruent) analysis of variance (ANOVA). Although congruent lever movements were faster, the main effect of instruction was not significant [F_{(1, 97)} < 1]. There was a main effect of evaluation [F_{(1, 97)} = 36.52, η^{2}_{p} = 0.274; M_{good} = 1156 ms, SE_{good} = 21; M_{bad} = 1236 ms, SE_{bad} = 26]. This main effect of judgment was the opposite of that obtained by CB (i.e., “bad” evaluations were faster than “good” evaluations). As shown in the table, the Evaluation × Instruction interaction was marginally significant [F_{(1, 97)} = 3.02, η^{2}_{p} = 0.030]: Pulling the lever with “good” evaluations was somewhat faster than pushing [F_{(1, 97)} = 0.18], and pushing the lever with “bad” evaluations was somewhat faster than pulling [F_{(1, 97)} = 1.38,

The importance of the two-way interaction was also assessed with the help of a Bayesian ANOVA (Rouder et al.), which yielded only ambiguous evidence for the absence of the interaction (BF_{01} = 1.20).

In sum, our preregistered Bayesian hypothesis tests yielded evidence in favor of the null hypothesis, although the strength of this evidence was not compelling. The exploratory Bayesian ANOVA suggested that the data do not favor the alternative hypothesis over the null hypothesis. From a Bayesian perspective, the data certainly did not support the hypothesis as proposed by CB although our experiment included almost twice as many participants (

The discrepancy between the outcome of the frequentist and Bayesian hypothesis tests arguably reflects the shortcomings of

In comparison with the (corrected for publication bias) small to medium sized effect size reported in Phaf et al. (

The horizontal dashed gray lines indicate strong evidence either in favor of the null hypothesis [i.e., log(BF_{01}) ≥ log(10)] or in favor of the alternative hypothesis [i.e., log(BF_{01}) ≤ log(1/10)].

A frequentist analysis would start with an assessment of the effect size of Experiment 2 from CB which would then form the basis of a power analysis. As for Experiment 1, however, our analysis is based on monitoring the Bayes factors of the critical hypothesis tests (detailed below).

Specifically, our sampling plan was as follows: We first set out to collect a minimum of 30 participants in a within-subject design. Next we planned to monitor the Bayes factors and stop the experiment whenever both critical hypothesis tests (detailed below) reached a Bayes factor that could be considered “strong” evidence (Jeffreys,

We recruited 56 students from the University of Amsterdam. Six participants were excluded for the following reasons: three participants did not operate the lever as instructed; two participants did not receive the correct instructions due to technical failure; and one left-handed participant was excluded because the experimental setup was not positioned correctly (i.e., on the left side). The remaining 50 participants (10 male; mean age = 21.3 years, SD = 3.5) were all native Dutch speakers and had not participated in Experiment 1. Participants were rewarded with course credits or €5. All participants were informed about the procedure with an information brochure and subsequently signed an informed consent form.

The same materials used in Experiment 1 were also used in Experiment 2, except that “worms” was now included in the stimulus set, resulting in a total of 78 targets (i.e., 39 positive targets and 39 negative targets). The procedure differed only with respect to the instructions given (see Appendix

After half of the trials had been completed, a text appeared on screen informing participants that the instructions would now change and that they had to switch lever movement direction from pushing to pulling (or vice versa). Additionally, the experimenter returned to the room to explain the new instructions and to ensure that participants had understood them. Across all participants in CB's Experiment 2, the targets were presented in a fixed order. Although this was not made explicit in CB, it may have been done to ensure an equal number of positive and negative objects, as well as an equal number of weak attitude objects, in both conditions (for details see CB). We followed these constraints but presented our targets in a semi-random fashion: targets were randomly drawn without replacement from two lists containing 19 positive and 20 negative targets, or 20 positive and 19 negative targets, respectively. The order of the two lists was the same for every participant.

Our data analysis closely followed that of Experiment 1, the main exception being that the design was fully within-subjects rather than between-subjects with regard to the association between the affective valence of the targets and the specific lever movement. As outlined in the preregistration document, we followed the reasoning of CB and treated response times above 1500 ms and below 300 ms as outliers, excluding them from the analysis (pulling with positive words: 2.4%; pulling with negative words: 1.2%; pushing with positive words: 1.6%; pushing with negative words: 1.7%). As in Experiment 1, and as outlined in the preregistration document, the dependent measure was the mean log(10)-transformed response latency for every participant. The crucial hypothesis (i.e., the alternative hypothesis) concerned the interaction that describes the congruency effect. Specifically, the congruency effect can be decomposed into two directional hypotheses: the first hypothesis states that participants respond faster to a positive target by pulling rather than pushing the lever; the second states that participants respond faster to a negative target by pushing rather than pulling the lever. Both hypotheses were assessed separately by means of two default Bayes factors for paired, one-sided t-tests.

Bayes factors quantify the support that the data provide for the null hypothesis vis-a-vis the alternative hypothesis. Support in favor of the alternative hypotheses constitutes support in favor of the effects reported by CB in their Experiment 2.

As shown in the table, the congruency effect was absent. For positive words, the preregistered Bayesian test yielded BF_{01} = 3.10; for negative words, it yielded BF_{01} = 1.11. In other words, for both positive and negative words we obtained “anecdotal” evidence (Jeffreys,

BF_{01} = Bayes factor in favor of the null hypothesis; H0 = null hypothesis; H1 = alternative hypothesis. The horizontal black dashed line indicates completely ambiguous evidence, and the horizontal dashed gray lines indicate strong evidence either in favor of the null hypothesis [i.e., log(BF_{01}) ≥ log(10)] or in favor of the alternative hypothesis [i.e., log(BF_{01}) ≤ log(1/10)].

BF_{01} = Bayes factor in favor of the null hypothesis; H0 = null hypothesis; H1 = alternative hypothesis. The horizontal black dashed line indicates completely ambiguous evidence, and the horizontal dashed gray lines indicate strong evidence either in favor of the null hypothesis [i.e., log(BF_{01}) ≥ log(10)] or in favor of the alternative hypothesis [i.e., log(BF_{01}) ≤ log(1/10)].

To probe the robustness of our conclusions, we varied the shape of the prior for the effect size under the alternative hypothesis. Figures

log(BF_{01}) is plotted as a function of the scale parameter r of the Cauchy prior for the effect size under the alternative hypothesis. The dot indicates the result from the default prior, the horizontal black dashed line indicates completely ambiguous evidence, and the horizontal dashed gray lines indicate strong evidence either in favor of the null hypothesis [i.e., log(BF_{01}) ≥ log(10)] or in favor of the alternative hypothesis [i.e., log(BF_{01}) ≤ log(1/10)].

log(BF_{01}) is plotted as a function of the scale parameter r of the Cauchy prior for the effect size under the alternative hypothesis. The dot indicates the result from the default prior, the horizontal black dashed line indicates completely ambiguous evidence, and the horizontal dashed gray lines indicate strong evidence either in favor of the null hypothesis [i.e., log(BF_{01}) ≥ log(10)] or in favor of the alternative hypothesis [i.e., log(BF_{01}) ≤ log(1/10)].

For completeness, we also analyzed the data using a frequentist ANOVA (see the table). Congruent lever movements were somewhat faster than incongruent lever movements [t_{(49)} = 1.713], corresponding to a marginally significant interaction [F_{(1, 49)} = 2.93, η^{2}_{p} = 0.057]: Pulling the lever with positive words was a little faster than pushing [t_{(49)} = 1.054], and pushing the lever with negative words was a little faster than pulling [t_{(49)} = 1.742,

The importance of the two-way interaction was also assessed with the help of a Bayesian ANOVA (Rouder et al.), which yielded BF_{01} = 4.28; that is, the observed data are 4.28 times more likely under the model without the interaction than under the model with the interaction.

In sum, as was the case for Experiment 1, our preregistered Bayesian hypothesis tests yielded evidence in favor of the null hypothesis, although the strength of this evidence was rather modest. Figure

Our attempts to replicate the CB experiments did not succeed: for both replication attempts, the preregistered Bayesian hypothesis tests showed that the data provided more evidence for the null hypotheses than for the alternative hypotheses. The strength of this evidence is certainly not compelling, but the results do suggest that additional direct preregistered replications of the CB experiments are called for.

Using a frequentist ANOVA, exploratory analyses of Experiments 1 and 2 revealed a weak indication of congruency, expressed in interaction effects that were both marginally significant. These results were not corroborated by a Bayesian ANOVA, however, which again provided weak evidence in favor of the absence of an interaction. This inconsistency arises as a result of the statistical peculiarities of

Although we attempted to duplicate the original experimental setup as accurately as possible, there were of course small differences in setup and procedure that could perhaps account for the differences in results. First, in our setup participants had to move their hand somewhat farther than in the original setup. This could have made the movements less fluent and may, for instance, have interfered with the congruency effect. However, when we tried out both trajectories ourselves, we did not notice any more interference in the longer trajectory we used than in the shorter one used by CB; hence, we do not believe this difference can explain the discrepant results. Moreover, the latencies obtained in our experiments were faster than the original latencies, suggesting that if any such interference took place, it certainly did not slow down our participants. Second, we used fewer words in our experiments (77 and 78 out of 92, respectively) than CB (i.e., 82 out of 92), assuming that targets used for practice trials in the original experiment were not used in the actual experiments and reported results. This difference resulted from our efforts to make our stimulus set resemble the original stimulus set used by CB as closely as possible, so we do not think it can explain the discrepant results either. Third, in our experiments targets were presented randomly (Experiment 1) or semi-randomly (Experiment 2), whereas the original authors used a single randomly ordered list of words in both experiments. If this difference could explain the discrepant results, we should probably conclude that the original findings are due to experimental noise alone, but this would again contrast with our findings and the findings in Phaf et al. (

It seems clear that, although we failed to replicate CB with our preregistered Bayesian analyses, we cannot conclude that there is no link between affective evaluation and approach-avoidance behavior. First, the evidence in favor of the null hypothesis is not compelling and stems from only two experiments. Second, using exploratory frequentist statistics, we found weak evidence for this link. Third, a recent meta-analysis (Phaf et al.,

In addition, we were also unable to replicate the main effect of evaluative judgment (Experiment 1) and affective meaning (Experiment 2) that was obtained by CB. In Experiment 1 of CB, negative evaluations were faster than positive evaluations; in Experiment 2 of CB, participants responded faster to negative than to positive words. In contrast, in our Experiment 1 positive evaluations were faster than negative evaluations, and in our Experiment 2 there was no evidence that affective meaning influenced lever movements in the absence of explicit affective evaluation. Together with the findings reported above, this suggests that the pattern of results obtained by CB may be more fragile than previously thought.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

This work was funded in part by the ERC grant “Bayes or Bust.” We thank Bert Molenkamp and Coos Hakvoort for their help with the hardware and software for the experiments.

The Supplementary Material for this article can be found online at:


^{2}We contacted John A. Bargh for help in designing our experiments.