On the reliability of retrieval-induced forgetting

Rowland, Christopher A.; Bates, Lauren E.; DeLosh, Edward L.

doi:10.3389/fpsyg.2014.01343

ORIGINAL RESEARCH article

Front. Psychol., 21 November 2014

Sec. Cognition

Volume 5 - 2014 | https://doi.org/10.3389/fpsyg.2014.01343

This article is part of the Research Topic: Replication Attempts of Important Results in the Study of Cognition.

Christopher A. Rowland^*

Lauren E. Bates

Edward L. DeLosh

Psychology, Colorado State University, Fort Collins, CO, USA

Memory is modified through the act of retrieval. Although retrieving a target piece of information may strengthen the retrieved information itself, it may also serve to weaken retention of related information. This phenomenon, termed retrieval-induced forgetting, has garnered substantial interest for its implications as to why forgetting occurs. The present study attempted to replicate the seminal work by Anderson et al. (1994) on retrieval-induced forgetting, given the apparent sensitivity of the effect to certain deviations from the original paradigm developed to study the phenomenon. The study extends the conditions under which retrieval-induced forgetting has been examined by utilizing both a traditional college undergraduate sample (Experiment 1) and a more diverse internet sample (Experiment 2). In addition, Experiment 3 details a replication attempt of retrieval-induced forgetting using Anderson and Spellman's (1995) independent cue procedure. Retrieval-induced forgetting was observed when using the traditional retrieval practice paradigm with undergraduate (Experiment 1) and internet (Experiment 2) samples, though the effect was not found when using the independent cue procedure (Experiment 3). Thus, the study provides an indication of the robustness of retrieval-induced forgetting to deviations from the traditional college undergraduate samples that have been used in the majority of existing research on the effect.

Introduction

Retrieving information from memory affects the later memorability of the retrieved target itself, but also has consequences for the memorability of non-target, related information. The present paper focuses on the latter. Empirical research suggests that when information is retrieved, related information can become less memorable as a consequence of the retrieval attempt. This phenomenon is known as retrieval-induced forgetting (RIF; Anderson et al., 1994). Investigations of RIF have had a major impact on our understanding of human memory, with a particularly strong influence on theories of forgetting.

Interference theory, one of the prominent theories of forgetting that has been developed over the last several decades, assumes that forgetting is due to retrieval failure rather than a direct weakening or loss of stored information. Retrieval failure, in turn, is thought to be a consequence of the interference that occurs when information competes for retrieval. Such views have been instantiated in detailed mathematical models, such as the Search of Associative Memory (SAM) model (Raaijmakers and Shiffrin, 1981; Mensink and Raaijmakers, 1988). In the case of SAM, information in long-term memory does not lose the strength of its representation over time, but may instead be inaccessible at a given moment, given a specific cue, when other information in memory is also strongly associated to the same cue (also see Nairne, 2002). As such, forgetting can be conceptualized as an inability to remember a target at a particular instant due to retrieval competition, competition that results in interference between a specific cue and a specific target.

Interference accounts of forgetting have been successful in explaining numerous memory phenomena (Raaijmakers and Jakab, 2013). Despite this success, Anderson and colleagues elaborated on a competing theory of forgetting that quickly gained traction. In their seminal study, Anderson et al. (1994) developed a methodology, termed the retrieval practice paradigm, which yielded data supportive of an inhibitory theory of forgetting. By way of comparison, interference theory suggests that information may be inaccessible when a particular cue or set of cues is present, without any impact on the stored memory itself. That is, the association between a cue and a target may be weakened or blocked, but the target item representation remains unaffected. Inhibition theory, in contrast, suggests that forgetting can occur due to the direct weakening or suppression of information in memory. During retrieval, the representation of competing information may be suppressed, making that information less likely to be remembered in the future. Thorough reviews of interference and inhibition theories of forgetting can be found elsewhere (e.g., Anderson, 2003; Storm and Levy, 2012; Raaijmakers and Jakab, 2013), but the key point for the present purposes is that the phenomenon of retrieval-induced forgetting has sparked substantial debate in cognitive psychology regarding the nature of memory and forgetting.

Anderson et al. (1994) drew support for inhibition theory using their retrieval practice paradigm. In this paradigm, participants begin by studying lists of category exemplars (e.g., ANIMAL–CAT, ANIMAL–DOG, etc.) for a number of different categories. Following initial study, a retrieval practice phase is administered. Half of the items, selected from half of the lists initially studied, are subjected to repeated cued recall tests (e.g., ANIMAL–C___). From this manipulation, three classes of items emerge: those that are in lists which receive retrieval practice and are themselves practiced (RP+ items); those that are in lists receiving retrieval practice, but are not themselves practiced (RP− items); and those in lists that do not receive retrieval practice at all (NRP items). Typically a delay (e.g., 20 min) is administered after the retrieval practice phase, during which participants complete a distracter task. This is followed by a final memory test for the originally presented category exemplars. Retrieval-induced forgetting refers to the finding that RP− items are remembered at a lower frequency than NRP (control) items. The finding is notable as RP− and NRP items both receive identical treatment (i.e., exposure only at initial study), suggesting the differential performance at final test results from the RP− items being categorically related to those items that received retrieval practice (RP+ items). According to inhibition theory, the retrieval of RP+ items during practice has the effect of suppressing competing responses (non-target, related category exemplars, i.e., RP− items), thereby making the RP− items less accessible at final test relative to the NRP baseline items.

In contrast, interference accounts of RIF specify that any observed impairment of RP− items results not from the suppression of target representations in memory, but from interference resulting from the strengthening of category cues to RP+ items. That is, during retrieval practice, RP+ items become more strongly linked to category cues, and as such, those cues become less effective at cueing RP− items due to their strong association to RP+ items. An additional non-inhibition-based account of RIF draws attention to the importance of context in influencing item accessibility (Jonker et al., 2013). This context-based account suggests that RIF will be observed when a shift in context occurs between initial study and retrieval practice, and additionally, when the retrieval practice context is reactivated during the final test. An NRP category as a cue at final test will reinstantiate the initial study context (as this is the only point in the retrieval practice paradigm that NRP lists are exposed), thereby allowing access to NRP items. However, given a retrieval practice category cue, the retrieval practice phase context will likely be instantiated in preference to the initial study phase (e.g., as it occurs temporally closer to the final test phase), thereby selectively facilitating access to RP+ items that were re-exposed during retrieval practice, but failing to facilitate access to RP− items that are only linked to the initial study context. Under these circumstances, internal context cues will lead to a relative bias toward sampling RP+ items, and a relative detriment to RP− items, thereby resulting in RIF. As such, both inhibitory and non-inhibitory mechanisms have been proposed to explain the empirical observation of RIF.

The original Anderson et al. (1994) paper has received considerable attention in the literature, having been cited over 700 times, and the retrieval practice paradigm has gained widespread usage, typically with only minor deviations to the base procedure. In many cases, this has resulted in replications of the RIF phenomenon (e.g., see Storm and Levy, 2012). Notably, however, there have also been a number of failures to replicate under circumstances that closely resemble the original retrieval practice procedure, or with just minor deviations from the original methods. Moreover, given that failures to replicate an effect are rarely published (Rosenthal, 1979), it is unclear precisely how many failed (or successful) replications have been conducted and remain unreported.

RIF appears to be sensitive to a number of moderating factors. Following from the original retrieval practice paradigm (Anderson et al., 1994), RIF is commonly assessed using a category- or category plus stem–cued recall procedure. The types of cues available at final test have implications for the theoretical accounts of RIF. For example, inhibitory accounts predict that forgetting occurs due to suppression occurring at an item-specific level, and thus RIF should be observed given any final test format when assessing specific targets in memory. However, interference accounts typically only predict RIF when using final test cues that were also used during retrieval practice, and thus are not “independent” (i.e., associated with other learned items; though note there may be more subtle nuances in determining whether a given cue should be considered independent or not, e.g., Camp et al., 2007). Variations in the format of the final test have yielded inconsistent results, however. Butler et al. (2001) did not attain a RIF effect in various fragment completion tasks, with and without category cues. Rowland and DeLosh (2014) report a lack of RIF when using free recall final tests, whereas Koustaal et al. (1999) show a RIF effect in free recall. The latter authors and others (e.g., Carroll et al., 2007) failed to find RIF for final recognition tests, however, although this pattern appears to be inconsistent across the literature, as well (cf. Gomez-Ariza et al., 2005; Spitzer and Bauml, 2007). In short, variations in final test format have produced disparate results that are hard to reconcile with any single theoretical account of RIF.

RIF is also sensitive to the strength of the semantic relationship between competing targets. In accordance with the inhibition account of RIF, there must be sufficient competition between possible targets given a cue to induce suppression. Accordingly, RIF does not reliably appear with weak exemplars (Anderson et al., 1994, Experiment 2). At the other extreme, RIF may also fail to emerge if competing targets are highly similar (Shivde and Anderson, 2001; Bauml and Hartinger, 2002; Goodmon and Anderson, 2011). Thus, RIF does not seem to follow a monotonic relationship with the strength of the relationship between competing targets. Consequently, the pattern of results expected given a specific stimulus set can be difficult to predict a priori (see Raaijmakers and Jakab, 2013, p. 103, for a related issue).

Additional failures to replicate have been reported when there were slight deviations to the original retrieval practice procedure. This has been the case when a long retention interval (e.g., 24 h) has been employed (MacLeod and Macrae, 2001; Saunders and MacLeod, 2002), when certain types of implicit final tests are used (Perfect et al., 2002), when different cues are used at final test than those employed during retrieval practice (e.g., Perfect et al., 2004; Camp et al., 2007), and when speeded responses are required during final testing (Verde and Perfect, 2011). Similarly, instructing participants to engage in an integration strategy during encoding (i.e., to intentionally relate items to one another), can yield a null RIF effect (Anderson and McCulloch, 1999; Smith and Hunt, 2000; Bauml and Hartinger, 2002), as can the use of prose materials in certain circumstances (Little et al., 2011). Furthermore, RIF may be dependent on mood state and stress level, such that inducing a negative mood (Bauml and Kuhnbandner, 2007), or high stress (Koessler et al., 2009) in participants can eliminate the RIF effect.

In a similar vein, RIF appears somewhat sensitive to the participant population used in a given study. Like much research in psychology, many RIF studies have been conducted with predominantly healthy, college-enrolled participants. However, the effect appears mitigated or eliminated in clinically depressed patients (Groome and Sterkaj, 2010), and similarly, does not emerge as consistently in ADHD patients (Storm and White, 2010), populations that may have impaired inhibitory control (and thus may not be expected to show as large a RIF effect according to inhibitory accounts of RIF). As such, although RIF has been observed in a substantial number of instances (Anderson, 2003; see Storm and Levy, 2012), it has also failed to be observed under conditions that deviate slightly from the original retrieval practice paradigm, or with changes in population.

An additional point of interest concerns demonstrations of a finding in stark contrast to RIF, termed retrieval-induced facilitation (RIFA). Studies of RIFA utilize a paradigm highly similar, and in some cases identical, to the retrieval practice paradigm. For example, Chan et al. (2006, Experiment 1) had participants study a prose passage about toucans from which two related sets of questions (Sets A and B) were derived. Some participants then engaged in retrieval practice over a subset of questions (Set A) about the passage. On a final test, performance on the previously unexposed but related questions (Set B) was facilitated by virtue of initial testing, such that Set B recall was greater than in a comparison condition where participants initially restudied (rather than retrieved) the Set A questions. Similar patterns of results have been reported in the memory literature (Chan, 2009, 2010; Cranney et al., 2009; Rowland and DeLosh, 2014), in addition to the literature on adjunct questions in educational research (in which answering questions during or after reading text materials may facilitate, rather than inhibit, the learning of related but untested information; see Hamaker, 1989, for a review).

Although the circumstances in which RIF fails to emerge, and similarly, in which RIF reverses to RIFA, may be viewed as boundary conditions of the RIF effect, it is important to verify and better establish the reliability and magnitude of the core RIF effect itself. Across the literature, the magnitude of RIF appears to vary widely, likely resulting from both sampling error and a variety of moderating factors. In the case of the latter, theorists have identified possible moderators that can explain some of the null RIF results in a manner consistent with inhibition theory (see, e.g., Anderson, 2003), but such explanations have also been called into question based on inconsistencies in experimental results (Raaijmakers and Jakab, 2013). Regardless of theoretical orientation, one possibility is that the RIF effect is highly sensitive to the experimental paradigm employed, such that subtle changes in method can produce large variations in results. Alternatively, the RIF effect that arises from the original RIF paradigm may be reliable but relatively small, making it difficult to detect without substantial power. Thus, a high-powered replication of the original Anderson et al. (1994) study can help to establish an estimate of the effect size of RIF, whether negative, null, slight, or substantial. This, in turn, may serve as a baseline for research that seeks to specify the key variables that influence the emergence of RIF, and isolate the conditions under which RIF reverses to a facilitation effect. Some recent investigations have started to examine such issues (e.g., Chan, 2009; Little et al., 2011), and although promising, there are a number of unanswered questions given the variety of circumstances in which RIF has failed to be observed.

In addition to replicating Anderson et al. (1994) under conditions resembling the original study (i.e., using the same methods and sampling from a similar college population), an additional replication with a more diverse population may be illuminating, given some of the individual differences that have already been identified. To this end, Experiments 1 and 2 are designed to provide highly powered replications of Anderson et al. (1994, Experiment 1). Experiment 1 was conducted as per Anderson et al. (1994), sampling from an undergraduate college population. Experiment 2 sampled from a more varied population via the internet using Amazon Mechanical Turk. Internet sampling is becoming more common in psychological research (Mason and Suri, 2011), a trend that is likely to become more prevalent, given the development and growth of tools and platforms (e.g., Amazon Mechanical Turk) that facilitate the ability of researchers to effectively and conveniently utilize internet sampling (Buhrmester et al., 2011; Mason and Suri, 2011). Internet samples tend to differ in demographic characteristics from traditional university subject pools, with internet samples being more diverse on a number of dimensions (Gosling et al., 2004; Buhrmester et al., 2011). Although an emerging literature has suggested that internet sampling procedures yield results consistent with traditional lab-based studies in certain well-established cognitive tasks (e.g., Paolacci et al., 2010), more validation is needed (Buhrmester et al., 2011). As such, a high powered RIF replication attempt with an internet sample can provide a valuable contribution to both our understanding of RIF, and more generally, the burgeoning literature on internet sampling for psychological research.

Along with two replication attempts of Anderson et al. (1994), an additional attempt to replicate a RIF effect using the independent cue procedure was conducted as Experiment 3, following Anderson and Spellman (1995, Experiment 2). The independent cue procedure is a modification to the core retrieval practice paradigm, designed to differentiate between interference and inhibitory contributions to RIF. In brief, participants follow the base retrieval practice paradigm, but at final test are cued to recall learned items by the use of novel cues, rather than cues that have established associations during the learning and retrieval practice phases of the experiment. The logic behind this method derives from the goal of disentangling the potential contributions of interference and inhibition to RIF. In the standard retrieval practice paradigm, participants practice retrieval on RP+ items that belong to specific categories, and then at final test are asked to recall items from those categories using the categories themselves as cues. If RIF is observed such that RP− items (from the same categories as the RP+ items) are recalled at lower frequencies than NRP items, the increased forgetting of RP− items may reflect item-specific suppression (i.e., inhibition of RP− items as they compete for retrieval with RP+ items), but alternatively, may reflect interference. That is, strengthening the association between cues (i.e., categories) and RP+ items during retrieval practice may lessen the later likelihood of recalling RP− items because the associations between the cue and RP+ items interfere with one's access to the RP− items linked to the same cue (i.e., the category). In other words, retrieval practice may not lead to suppression of target information in memory, but rather could simply weaken access to certain information as a consequence of weakening the effectiveness of available cues.

The independent cue procedure from Anderson and Spellman (1995) attempts to separate interference and inhibition effects by using novel cues that have not been differentially strengthened to RP+ items during the course of the experiment. Thus, associative interference effects are presumed to be mitigated, allowing one to interpret an observation of RIF as reflecting inhibitory processes. Experiment 3 thus attempted to replicate Anderson and Spellman (1995, Experiment 2), in order to contribute a test of cue-independent RIF to the existing literature.

Experiment 1

Experiment 1 was designed to closely replicate Anderson et al. (1994; Experiment 1). Aside from increasing the number of participants in the study, the number of stimuli learned by participants was also increased in an effort to reduce variance and maximize power.

Method

Participants

An a priori power analysis using the G*Power software program (Faul et al., 2007) determined a required sample size of 70 participants to detect a small to medium sized forgetting effect (d = 0.4) with 0.95 power. The observed effect size from Anderson et al. (1994) could not be computed due to insufficient data, and thus d = 0.4 was used as an estimate. Following the power analysis, we planned to recruit 70 undergraduate psychology students at Colorado State University. Note that with the participant session scheduling method we utilized (i.e., soliciting groups of participants at a given time), we ended up receiving data from 72 participants, and the results are presented with this full data set.
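As a rough check on the sample size above, the calculation can be sketched with a normal approximation. The text does not report the alpha level or whether the test was one- or two-tailed, so the one-tailed alpha of 0.05 used here is an assumption, and the function name is hypothetical. The approximation is not the exact noncentral t computation that a program such as G*Power performs, so it is expected to land slightly below the reported value.

```python
import math
from statistics import NormalDist


def approx_paired_n(d, power, alpha=0.05, one_tailed=True):
    """Normal-approximation sample size for a paired (one-sample) t-test.

    Assumption: alpha and tail are NOT stated in the text; the defaults
    here are illustrative only.
    """
    z = NormalDist()
    # Critical value for the chosen alpha and tail
    z_alpha = z.inv_cdf(1 - alpha) if one_tailed else z.inv_cdf(1 - alpha / 2)
    # Quantile corresponding to the desired power
    z_power = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / d) ** 2)


print(approx_paired_n(d=0.4, power=0.95))                    # 68
print(approx_paired_n(d=0.4, power=0.95, one_tailed=False))  # 82
```

Under these assumptions the approximation gives 68 participants, close to the 70 reported; an exact t-based calculation generally asks for a few more participants than the normal approximation does.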

Design

The experiment utilized a within-participant design, manipulating item type (i.e., retrieval-practice status: RP+, RP−, and NRP, described below). Participants were randomly assigned to one of four counterbalancing conditions in which the categories presented during the practice phase were varied. The factor of retrieval-practice status had three levels, following Anderson et al. (1994): RP+ items, which were exemplars belonging to a tested category that were practiced a total of three times during the initial test phase; RP− items, which were exemplars belonging to a tested category that were not shown during the practice phase; and NRP items, which were exemplars belonging to non-tested categories. As such, items were counterbalanced according to list type (retrieval practice lists or not), and item type within lists (RP+ or RP−).
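The three item types can be illustrated with a minimal sketch. Several details here are assumptions rather than reported facts: the category and exemplar names are placeholders, six exemplars per category is inferred from the six study blocks described in the Procedure, and the split of half the categories and half the exemplars within them follows the general description of the paradigm in the Introduction. The label "RP-" uses an ASCII hyphen for convenience.

```python
from collections import Counter


def assign_item_types(categories, practiced, n_practiced_per_category=3):
    """Label each (category, exemplar) pair as RP+, RP-, or NRP.

    categories: dict mapping category name -> list of exemplars
    practiced:  set of category names that receive retrieval practice
    Assumption: the first n exemplars of a practiced category are the
    practiced ones; a real design would rotate this across conditions.
    """
    labels = {}
    for category, exemplars in categories.items():
        for index, exemplar in enumerate(exemplars):
            if category not in practiced:
                labels[(category, exemplar)] = "NRP"
            elif index < n_practiced_per_category:
                labels[(category, exemplar)] = "RP+"
            else:
                labels[(category, exemplar)] = "RP-"
    return labels


# Placeholder materials: 16 experimental categories with 6 exemplars each
categories = {f"CAT{i}": [f"CAT{i}_ITEM{j}" for j in range(6)] for i in range(16)}
practiced = {f"CAT{i}" for i in range(8)}

counts = Counter(assign_item_types(categories, practiced).values())
print(counts["RP+"], counts["RP-"], counts["NRP"])  # 24 24 48
```

Because RP− and NRP items are both seen only at study, the only difference between them in this scheme is whether their category received practice, which is the comparison that defines RIF.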

Materials

Category selection. Eighteen categories, two of which served as fillers, were drawn from published norms (Marshall and Cofer, 1970). Of the 18 categories, eight of them were taken directly from Appendix C of Anderson et al. (1994). The other ten categories that were created for this experiment were chosen under the same selection criteria as the original eight categories, following Anderson et al. (1994).

Exemplar selection. For the eight categories obtained from Anderson et al. (1994), all strong exemplars were selected for use in this experiment. For the remaining ten categories, exemplars were chosen to have an average rank order of eight according to the Battig and Montague (1969) category norms. Exemplars were low frequency, non-compound, unambiguous words with an average word frequency of 12 occurrences per million (Kucera and Francis, 1967). No two exemplars began with the same first two letters within a category, ensuring that each cue (i.e., the first two letters of each exemplar) in the practice phase was unique. In addition, the effectiveness of each cue was assessed by measuring versatility (Solso and Juel, 1980), yielding a mean versatility value of 244 (see Anderson et al., 1994, for elaboration on versatility values).

Procedure

The experiment consisted of four consecutive phases: a learning phase, a retrieval practice phase, a distracter phase, and a final category-cued recall phase in which the category names acted as cues for the previously studied exemplars. In the learning phase, subjects were instructed that they would be exposed to a series of word pairs containing a category and a word belonging to that category (i.e., an exemplar, e.g., Fruit: Banana), and that they would have 5 s to study each pair for a later test. After each pair had been presented for 5 s on a PowerPoint presentation, an auditory signal indicated advancement to the next word pair. The order of exemplars was determined by blocked randomization in which each block contained one exemplar from each of the 18 categories. This resulted in six blocks of 18 items each. The order within each block was randomized, except in the first block, where items from the filler categories appeared first, and in the last block, where items from the filler categories appeared last.
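The blocked randomization of the learning phase can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used to build the slides: the function name and the placeholder category names are hypothetical, every category is assumed to contribute the same number of exemplars, and the assignment of exemplars to blocks is assumed to be random, which the text does not specify.

```python
import random


def build_study_order(exemplars_by_category, filler_categories, seed=None):
    """Return a list of blocks; each block holds one (category, exemplar)
    pair per category. Fillers lead the first block and close the last."""
    rng = random.Random(seed)
    n_blocks = len(next(iter(exemplars_by_category.values())))
    # Assumption: which exemplar lands in which block is random
    shuffled = {
        category: rng.sample(items, len(items))
        for category, items in exemplars_by_category.items()
    }
    blocks = []
    for b in range(n_blocks):
        fillers = [(c, shuffled[c][b]) for c in filler_categories]
        others = [(c, shuffled[c][b]) for c in shuffled if c not in filler_categories]
        rng.shuffle(fillers)
        rng.shuffle(others)
        if b == 0:
            block = fillers + others      # fillers first in the first block
        elif b == n_blocks - 1:
            block = others + fillers      # fillers last in the last block
        else:
            block = others + fillers
            rng.shuffle(block)            # fully random in middle blocks
        blocks.append(block)
    return blocks


# Placeholder materials: 18 categories (2 fillers) with 6 exemplars each
materials = {f"C{i}": [f"C{i}_E{j}" for j in range(6)] for i in range(18)}
study_blocks = build_study_order(materials, ["C0", "C17"], seed=1)
print(len(study_blocks), len(study_blocks[0]))  # 6 18
```

Placing the filler items at the very start and end of the list keeps the experimental items out of the positions most affected by primacy and recency.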

Next, during the retrieval practice phase, participants were shown a series of category-exemplar pair fragments with the first two letters of each exemplar as a retrieval cue (e.g., Fruit: Ba_____). Participants were instructed to write down the word with the missing letters by thinking back to the initial learning phase and recalling the exemplar that was previously shown. Ten seconds were provided per word pair, after which an auditory signal indicated that a new word pair would be shown. Category-exemplar pairs were randomly positioned, with at least one other exemplar shown between any two presentations of the same exemplar. Participants were exposed to each of the RP+ category-exemplar pairs three times during the practice phase. As in the learning phase, the first and last items of the practice phase belonged to the filler categories to mitigate primacy and recency effects. After completion of the retrieval practice phase, participants engaged in a 20 min operation span (OSPAN) task (Unsworth et al., 2005), serving as a distractor. The distractor OSPAN task was presented on the PowerPoint presentation and participants were instructed to record their answers on a separate answer sheet provided by an experimenter.

In the final test phase, participants were provided with the category names from the previously learned lists, and were given instructions to recall as many previously studied exemplars as possible for each category. Each of the 18 category names was presented sequentially, with 30 s given for recall of exemplars from each category before proceeding to the next category. The order of the experimental category cues was randomized, with the filler categories provided first and last (i.e., lists 1 and 18).

Known differences from original study

Few differences exist between our design and that of Anderson et al. (1994). However, one difference within our study was the exclusion of weak exemplars (i.e., we utilized only the “strong exemplar” condition from Anderson et al., 1994, Experiment 1), given that the strong exemplar condition produced the more robust forgetting effect in the original report. One further deviation from the original study concerns the number of word lists employed. We doubled the eight experimental lists used by Anderson et al. (1994) to 16 lists (i.e., 2 filler, 16 experimental) in our study. The goal of this modification was to reduce variability and increase power.

Confirmatory Analysis Plan

Our analysis plans were designed to focus on the result(s) of interest for each experiment. For Experiment 1 we first conducted a planned comparison t-test between RP− and NRP item final recall performance, anticipating lower RP− performance (i.e., a RIF effect). Such a planned comparison provides the strongest means by which to detect the effect of interest. Second, we ran a one-way repeated measures ANOVA on final recall performance for all three types of items (RP+, RP−, NRP), with Bonferroni corrected post-hoc tests as warranted, in order to more traditionally assess the impact of the retrieval practice phase on final retention.

An additional method for evaluating replication outcomes proposed by Simonsohn (2014) centers on examining the effect sizes observed in a replication attempt against an expected minimum effect size that would be observed in the original study given an arbitrarily low amount of power. This method involves determining whether the replication attempt yields an effect that differs from the null (as per a traditional null-hypothesis significance test), and additionally whether it differs from an effect size based on the expected minimum effect that would be observed in the original study with 0.33 power: the “d_33% null” (refer to Simonsohn, 2014, for further elaboration on the logic and potential benefits of this method). If a replication attempt yields an effect both larger than the null, and at the same time not significantly smaller than the d_33% null, the replication is considered successful. On the other hand, replication attempts that fail to reject the null (i.e., are non-significant), but at the same time reject the d_33% null (i.e., are reliably smaller than d_33%) can be considered “informative” failures to replicate. In contrast, failing to reject both the null and the d_33% null provides less information, and can be considered “uninformative.” Thus, we considered the standard statistical test outcomes, along with the results of the Simonsohn (2014) method to inform the outcome of each experiment.
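The decision rule described in this paragraph can be summarized in a short sketch. The function name, argument names, and return strings are hypothetical, and the two inputs are assumed to be the outcomes of the two significance tests. The paragraph does not describe the case in which both nulls are rejected, so the sketch flags that case rather than guessing at a label.

```python
def classify_replication(rejects_null, rejects_d33_null):
    """Classify a replication outcome from two test results.

    rejects_null:     True if the replication effect differs from zero
    rejects_d33_null: True if the replication effect is reliably
                      smaller than d_33%
    """
    if rejects_null and not rejects_d33_null:
        return "successful replication"
    if not rejects_null and rejects_d33_null:
        return "informative failure to replicate"
    if not rejects_null and not rejects_d33_null:
        return "uninformative"
    # Significant, yet reliably smaller than d_33%: not covered above
    return "not covered by the description above"


print(classify_replication(True, False))   # successful replication
print(classify_replication(False, True))   # informative failure to replicate
print(classify_replication(False, False))  # uninformative
```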

Participant data were excluded only if responses demonstrated an obvious lack of responding to the experimental procedure during the retrieval practice, distractor, or final recall phase (e.g., no responses given, unrelated responses given).

Results

Final test performance is reported in Figure 1. A planned comparison examining final recall of RP− and NRP items yielded a reliable RIF effect, t(71) = 2.59, p = 0.01, with NRP items recalled at a higher frequency (M = 0.45, SE = 0.01) than RP− items (M = 0.41, SE = 0.02), d = 0.31. A repeated measures ANOVA comparing all three item types yielded differences between final recall frequencies, F(2, 142) = 228.87, p < 0.01, η²_p = 0.76. Post-hoc comparisons revealed superior recall of RP+ items (M = 0.79, SE = 0.02) to both RP− and NRP items, p's < 0.01, d's = 1.95 and 2.33, respectively.
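The reported effect sizes can be cross-checked against the test statistics with two standard conversions. How the effect sizes were actually computed is not stated, so the formulas below are assumptions: d is taken as t divided by the square root of n, one common convention for within-participant comparisons, and partial eta squared is recovered from F and its degrees of freedom. The function names are hypothetical.

```python
import math


def cohens_d_from_paired_t(t, n):
    """Within-participant effect size from a paired t statistic: d = t / sqrt(n)."""
    return t / math.sqrt(n)


def partial_eta_squared(f, df_effect, df_error):
    """Partial eta squared from F: (F * df1) / (F * df1 + df2)."""
    return (f * df_effect) / (f * df_effect + df_error)


# Values taken from the Results paragraph above (n = 72, so df = 71)
print(round(cohens_d_from_paired_t(2.59, 72), 2))     # 0.31
print(round(partial_eta_squared(228.87, 2, 142), 2))  # 0.76
```

Both conversions reproduce the reported values of d = 0.31 and η²_p = 0.76, which suggests the statistics are internally consistent under these conventions.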

FIGURE 1

Figure 1. Final recall test performance for each item type in Experiments 1 and 2.

We next applied the Simonsohn (2014) method to compare our RIF effect result with that of the original study. Based on the sample size from Anderson et al. (1994, Experiment 2; n = 36), we calculated d_33% as 0.26. Given the significant RIF effect observed in Experiment 1, and effect size d = 0.31, we can conclude that the replication was successful.
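The d_33% value can be approximated with a short calculation. This sketch rests on assumptions not stated in the text: a two-tailed alpha of 0.05, a within-participant (one-sample) test, and a normal approximation in place of the exact t distribution. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def approx_d33(n, alpha=0.05, power=0.33):
    """Effect size that a one-sample test with n participants would
    detect with the given power (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)  # two-tailed critical value (assumed)
    z_power = z.inv_cdf(power)         # negative when power is below 0.5
    return (z_crit + z_power) / math.sqrt(n)


print(round(approx_d33(36), 2))  # 0.25
```

The approximation gives roughly 0.25, slightly below the 0.26 reported. A gap in this direction is expected, because an exact calculation uses the larger critical value of the t distribution with 35 degrees of freedom.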

Experiment 2

The goal of Experiment 2 was to provide a further replication attempt of Anderson et al. (1994; Experiment 1) through the use of a broader sample. To this end, we utilized an internet sampling procedure, as opposed to soliciting participants from our undergraduate participant pool, as was done in Experiment 1 and in the original study.