Assessing Boundary Conditions of the Testing Effect: On the Relative Efficacy of Covert vs. Overt Retrieval

Sundqvist, Max L.; Mäntylä, Timo; Jönsson, Fredrik U.

doi:10.3389/fpsyg.2017.01018

ORIGINAL RESEARCH article

Front. Psychol., 21 June 2017

Sec. Cognition

Volume 8 - 2017 | https://doi.org/10.3389/fpsyg.2017.01018

Department of Psychology, Stockholm University, Stockholm, Sweden

Repeated testing during learning often improves later memory, which is often referred to as the testing effect. To clarify its boundary conditions, we examined whether the testing effect was selectively affected by covert (retrieved but not articulated) or overt (retrieved and articulated) response format. In Experiments 1 and 2, we compared immediate (5 min) and delayed (1 week) cued recall for paired associates following study-only, covert, and overt conditions, including two types of overt articulation (typing and writing). A clear testing effect was observed in both experiments, but with no selective effects of response format. In Experiments 3 and 4, we compared covert and overt retrieval under blocked and random list orders. The effect sizes were small in both experiments, but there was a significant effect of response format, with overt retrieval showing better final recall performance than covert retrieval. There were no significant effects of blocked vs. random list orders with respect to the testing effect produced. Taken together, these findings suggest that, under specific circumstances, overt retrieval may lead to a greater testing effect than that of covert retrieval, but because of small effect sizes, it appears that the testing effect is mainly the result of retrieval processes and that articulation has fairly little to add to its magnitude in a paired-associates learning paradigm.

Introduction

A wealth of research has shown that individuals who repeatedly test memory during learning will perform better on a later recall test than those who spend an equal amount of time repeatedly studying the same material, a phenomenon often referred to as the testing effect (e.g., Gates, 1917; Carrier and Pashler, 1992; see Roediger and Karpicke, 2006, for a review). This kind of self-testing has several advantages in terms of learning, monitoring and regulation: It acts as a diagnostic test of the ongoing learning process which may in turn help to direct further studying efforts to where they are most needed (Metcalfe, 2009). Perhaps more importantly, it may also boost memory itself, as evidenced by the testing effect.

Although the testing effect itself is a robust phenomenon, its boundary conditions are less well understood. While the testing effect has been found in a multitude of materials (e.g., Wheeler and Roediger, 1992; Carpenter and DeLosh, 2006; Roediger and Karpicke, 2006; Carpenter and Pashler, 2007; Kang et al., 2007; Karpicke and Roediger, 2007), all these findings are based on the same response format, namely an overt testing procedure. When tested during learning, participants’ memory is typically assessed by having them overtly articulate the correct answer, for instance by typing it on a keyboard or saying it out loud. If the answer is not articulated, there is no way, experimentally speaking, of scoring these responses. In everyday settings, however, many students will likely engage in retrieval practice that is entirely covert, that is, an answer that is retrieved and produced internally by thinking it, but with no overt articulation of that information. For this reason, it is important to know if there is a relative advantage in terms of the efficacy of these response formats, as it has implications not only for understanding the testing effect itself, but also for the development of optimal learning and teaching instructions. Dunlosky et al. (2013) reviewed the effectiveness of various learning techniques, and found that retrieval practice was among the few that had high utility (i.e., the effect was robust and generalized widely). The testing effect is clearly a robust effect, and it can explain why retrieval practice is of such high utility as a learning technique, but it remains to be seen whether the relative efficacy of covert and overt retrieval is also a robust phenomenon, or if it only exists in experimental settings that do not generalize to real-world settings – if it exists at all. One way of assessing this is to consider effect sizes. If they are small (e.g., Cohen’s d < 0.3; Cohen, 1988), the real-world implications are also limited (see Table 1).
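As a rough illustration of the effect-size criterion mentioned above, the following sketch computes Cohen's d for two independent groups using the pooled standard deviation. The function name and the recall scores are hypothetical; they are not data from any of the studies discussed here, and a within-subjects design would call for a different standardizer.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = stdev(group_a) ** 2  # sample variance (n - 1 denominator)
    var_b = stdev(group_b) ** 2
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical proportions of correctly recalled targets per participant.
overt_scores = [0.70, 0.65, 0.80, 0.75, 0.60]
covert_scores = [0.68, 0.62, 0.77, 0.74, 0.59]
print(cohens_d(overt_scores, covert_scores))
```

A value below the 0.3 threshold named above would, on this reasoning, indicate limited real-world implications.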

TABLE 1. The relative efficacy of overt vs. covert retrieval as reported in 13 experiments.

The testing effect has been demonstrated for a large number of materials and testing formats (see Roediger et al., 2010, for a review), although the specific response format of the test (during learning) has received little attention. Most testing-effect experiments utilize either free recall or cued recall in various forms of overt testing. However, some studies have reported a testing effect following a covert retrieval practice (e.g., Carpenter et al., 2006; Carpenter and Pashler, 2007; Carpenter et al., 2008; Kang, 2010; Jönsson et al., 2014).

Similar findings have also been made in metamemory research, where the act of judging the degree to which something has been learned (i.e., judgments of learning, JOLs; Nelson and Dunlosky, 1991) seems to improve memory itself. When a JOL is made after a delay, it elicits an attempted retrieval of the sought-after information, and successful retrieval is associated with a testing effect (Spellman and Bjork, 1992). If the JOL was made immediately after study, the information would likely be available in short-term memory, and therefore no testing effect should be produced because there was no retrieval attempt (e.g., Nelson and Dunlosky, 1991). In other words, delayed JOLs should produce testing effects because they entail covert retrieval (e.g., Sundqvist et al., 2012; Jönsson et al., 2014; Akdoğan et al., 2015; see Rhodes and Tauber, 2011, for a review), although not all studies confirm this. For instance, Tauber et al. (2015) found that while both delayed JOLs and delayed testing entail covert retrieval, delayed JOLs only had a minor effect on final test performance. Most studies on the delayed JOL effect do not directly compare memory performance following covert and overt retrieval, but they nonetheless provide evidence that a testing effect can be produced by covert retrieval alone, which raises the question of whether articulation adds anything to its magnitude.

Should Response Format Affect the Magnitude of the Testing Effect?

Although the hypothesis of response format as a moderator for the testing effect is relatively novel, many previous studies have examined the relationship between modality and memory (e.g., Penney, 1975). However, these studies are mainly concerned with the modality of presentation, rather than response (Harvey and Beaman, 2007). Gardiner et al. (1977) examined memory performance after having learned words, by either saying them out loud or writing them down, or both, and found that word recognition was more accurate for the participants who had both spoken and written the words, compared to any of the groups that only spoke or wrote the words. Their interpretation of these results was that (successful) retrieval of some information should strengthen the memory trace for that information, and that the various ways of articulating the answer (e.g., by saying it out loud or writing it down) would cause qualitative differences in the recoding of the trace, such that auditory, articulatory, kinesthetic, or visual attributes become part of the trace, depending on the mode of articulation. This line of reasoning is closely related to the production effect (e.g., Ozubko and MacLeod, 2010), whereby saying a word aloud during learning can enhance memory, compared to reading it silently. A reasonable explanation of the production effect is that the creation of a verbal cue (that is not present when only reading the word) facilitates future retrieval (MacLeod et al., 2010). Although the production effect is concerned with encoding, rather than retrieval, it is reasonable to suspect that the mechanism driving the production effect could also cause overt retrieval to be more beneficial for memory than covert retrieval. If this is the case, we should also expect this relative advantage for various forms of overt retrieval (which all entail articulation), relative to covert retrieval, in a testing effect paradigm. 
Like the production effect, the generation effect (e.g., Jacoby, 1978; Slamecka and Graf, 1978) also posits enhanced memory performance as a result of articulation. However, as demonstrated by Karpicke and Zaromb (2010), the testing effect and the generation effect differ by mode of retrieval, such that intentional retrieval is more beneficial for retention than generation (or production) under incidental retrieval instructions. That is, it matters whether retrieval is its own goal, or simply some means of completing some other task. Given this account, covert and overt retrieval should produce testing effects of equal magnitudes. So, the production effect and the generation effect, although highly similar, have different implications for the relative efficacy of covert vs. overt retrieval, which is all the more reason to further investigate the testing effects produced by different response formats.

At this point, we may ask why covert and overt retrieval should give rise to testing effects of different magnitudes at all. A possible explanation comes from the transfer-appropriate processing (TAP) account of the testing effect (see Roediger and Karpicke, 2006), which states that greater congruency between encoding and retrieval increases the likelihood of successful retrieval. Since final testing is virtually always overt, the TAP hypothesis would predict that overt testing should produce a stronger testing effect than covert testing.

Yet another, perhaps less intriguing explanation is simply the amount of time dedicated to processes involved in the retrieval and articulation of information. As mentioned earlier, if the only difference between covert and overt retrieval is the act of articulation, then we may assume that overt retrieval typically should take longer than covert retrieval simply because it takes additional time to articulate the information that has just been retrieved. This time could be regarded as additional exposure to the information itself, which is likely to increase the memory strength for that information, which in turn boosts the testing effect. Naturally, this can be avoided by having equated exposure times for both covert and overt retrieval conditions. Nonetheless, if we were to provide study advice to students on the basis of the findings in the testing effect literature, the explanation for this relative efficacy becomes rather irrelevant; overt testing should be preferred over covert testing, even if the associated benefit is only due to differences in exposure or processing time, simply because what matters is the memorial benefit itself – not the reason why it exists.

There are four studies, of particular relevance to this work, that have investigated the relative efficacy of covert and overt retrieval on the testing effect (Izawa, 1976; Putnam and Roediger, 2013; Smith et al., 2013; Jönsson et al., 2014). Izawa (1976) had subjects undergo cycles of studying and testing, where testing was either silent (covert) or vocalized (overt). At final recall, covert and overt testing conditions performed equally well, meaning that there was no difference in the magnitude of the testing effect produced by covert vs. overt testing (although there were short-term effects of vocalization). Jönsson et al. (2014) found that overt retrieval produced stronger testing effects than covert retrieval, although the effect size was small (Cohen’s d = 0.21). Specifically, in the first of two experiments of their study, they found a response format by retention interval interaction, indicating a testing effect. However, the interaction was mainly driven by differences between the study-only and overt response format conditions, and there was no significant difference between covert and overt conditions at the 1-week retention interval. In the second experiment, covert retrieval was compared to overt retrieval in a within-subjects design (as opposed to Experiment 1, which manipulated response format between subjects), and a main effect of response mode was found, such that overt retrieval was more beneficial than covert retrieval in terms of memory performance. Putnam and Roediger (2013) found mixed evidence of response format, such that overt retrieval led to better final recall in only one of three experiments (Experiment 1 of their study failed to replicate a testing effect, as the restudy condition was confounded by the addition of item-wise JOLs following restudy). Smith et al. (2013) found no difference in free recall performance of items that had been tested overtly or covertly during learning. 
Taken together, these inconclusive results warrant further investigation of the role of response format on the testing effect.

The Current Study

The purpose of the current study was to explore possible factors that could help explain and reconcile the disparate results within this field (e.g., Putnam and Roediger, 2013; Smith et al., 2013; Jönsson et al., 2014). Given the design and findings of these studies, there were five main considerations that governed the overall design of the four experiments of this paper:

First of all, a covert retrieval condition would need to be included, with which to compare an overt retrieval condition. This comparison was the main focus of this paper, and is therefore included in all four experiments.

Second, a study-only condition would need to be compared to a study-test condition, across a short and a long retention interval, simply to replicate a testing effect. This would serve mainly as a confirmation that the stimulus material and the tests used would indeed produce a testing effect.

Third, there are different overt response formats that may affect the outcome differentially, meaning that overt retrieval could be subdivided into two or more conditions, such as typing and handwriting. This was done for two reasons: (i) from the point of view of the TAP account of the testing effect, the magnitude of the testing effect may depend on the level of congruency between the circumstances during learning and the circumstances during testing. If response formats differed (or were the same) during learning and final testing, this would allow not only for comparison between covert and overt retrieval, with respect to the testing effect produced, but also within different forms of overt retrieval, or different levels of TAP congruency, and (ii) based on research on the production effect, articulation may be beneficial for memory under certain circumstances, but as evidenced from research on haptics and handwriting, not all forms of articulation may benefit memory the same way (see Mangen and Velay, 2010 for a review). For instance, handwriting appears to be more beneficial to memory than typing on a keyboard (Mangen et al., 2015; although not all studies have found this advantage, e.g., Vaughn et al., 1992) because the level of embodied cognition involved in handwriting is believed to be higher than in the case of typing, which is thought to make memory for handwritten information more distinct and rich in terms of sensorimotor and visual content. Thus, if the magnitude of the testing effect depends on articulation, through some mechanism that has yet to be fully explicated, we should expect that different modes of articulation will boost the testing effect to different extents, given the findings of Mangen et al. (2015). 
If articulation does not contribute to the magnitude of the testing effect, we should expect to observe no differences in memory performance between covert and overt retrieval, regardless of how the articulation was carried out in the overt conditions. This was the aim of Experiments 1 and 2.

Fourth, the different response formats needed to be tested either in blocks, as was the case in both Putnam and Roediger (2013) and Jönsson et al. (2014), or in a random order for each trial, as in Experiment 3 of this paper. Rowland et al. (2014) investigated the effects of mixed vs. pure lists in a testing effect paradigm, and found no differences in the magnitude of the testing effects created by either kind of list. However, their finding that the testing effect itself is unaffected by list order does not necessarily mean that the relative efficacy of covert vs. overt retrieval is also unaffected by list order. For instance, Jonker et al. (2014) found a production effect for items read either silently or aloud, but only for mixed (i.e., random) and not pure (i.e., blocked) lists. While this finding pertains more to the item-order account (see McDaniel and Bugg, 2008) than the testing effect, it is an example of differences in memory performance as a function of list order. Moreover, the list order manipulation is of interest because it is directly connected to the way participants perceive the tasks of either covertly or overtly retrieving information. In the sense that covert retrieval is identical to overt retrieval – the difference being a lack of articulation – we can reasonably assume that the overt retrieval process, until the point of articulation, is very similar to the covert retrieval process, if not identical. However, built into this assumption is that participants are not able to anticipate whether the information that has just been covertly retrieved will also need to be overtly articulated. In cued-recall tests that present items in blocks of covert or overt tests, participants are very likely to understand that several items will be tested in the same way (i.e., covertly or overtly) until a change takes place, after which the response format will again remain the same for several items. 
This design creates a possibility for participants to adopt different retrieval strategies, criteria, or thresholds for giving an affirmative response. If covert and overt testing are instead carried out in random order, participants will have no way of knowing whether the information that is initially retrieved will also need to be articulated overtly.

Jönsson et al. (2014) investigated this possibility by comparing response latencies for overt and covert retrieval during learning and found no differences between the two response formats. This would suggest that covert and overt retrieval indeed involve similar retrieval processes. However, two processes that are equal in duration are not necessarily identical in all other regards. Therefore, testing items covertly and overtly either in blocks or in a random order may provide an explanation for the relative efficacy of covert and overt retrieval that does not pertain to the act of articulation. The rationale is that if the testing effect is only driven by retrieval, covert and overt retrieval should create testing effects of equal magnitude, especially in the case of random testing order, for reasons stated above. For tests given in covert and overt blocks, the retrieval process itself might differ by response format. This would explain the advantage for overt retrieval found by Jönsson et al. (2014), as the result of differences in retrieval processes rather than an added memorial benefit by means of articulation. If, on the other hand, this advantage is due to articulation, we should not expect differences between tests given in blocked or random order. This was the aim of Experiment 3.

Fifth, and finally, the distinction between blocked and random testing order applies only to designs where response format was manipulated within subjects, as a between-subject design would assign only one response format to each subject (i.e., one block of tests with one response format) and therefore, there could be no such condition. For this reason, the testing order (i.e., blocked vs. random) itself would need to be manipulated both within and between subjects, such that some participants experienced both random and blocked testing, and others only one of the two. Again, this was done to investigate whether the retrieval processes involved in both covert and overt retrieval could indeed be considered identical. In a false memory paradigm, Huff et al. (2015) manipulated list order both within and between subjects, and found that free recall performance was better for blocked than for random lists, but only when list order was manipulated within subjects, indicating that there may be carryover effects when testing participants in both blocked and random orders. So if, for instance, we observed a difference in the testing effects created by overt and covert retrieval, depending on whether the participants were subjected to either random or blocked testing – or both – we could conclude that one testing order had an influence on the other. This could happen either by a random test affecting a subsequent blocked test, or vice versa. For this reason, the sequences of testing would need to be fully counterbalanced to avoid order effects. This was the aim of Experiment 4.

Experiment 1

In Experiment 1, we sought to compare cued recall performance with respect to both a short (∼5 min) and a long (1 week) retention interval, as well as four different learning conditions (study-only vs. covert vs. typing vs. writing). The inclusion of a study-only condition, which is similar to a control condition, was simply a way of ensuring that the given design did in fact produce a testing effect. In addition to cued recall performance, we also measured response latencies to establish whether they differ by modes of retrieval and/or articulation.

Method

Participants, Design, and Materials

Thirty-two participants (11 males), with a mean age of 27.19 years (SD = 8.95, range 19–59), were recruited from various academic disciplines and different universities, institutes and colleges throughout the municipality of Stockholm. For their participation in the study, they received either course credit or a movie voucher.

The experiment was designed using E-Prime 2.0 Professional software (Psychology Software Tools, Pittsburgh, PA, United States) and was run on desktop computers. The stimulus list consisted of 48 word pairs (e.g., flicka - pojke) taken from Swedish Associations Norms (Shaps et al., 1976) that had similar association values (varying from one to three). The association values were computed by Shaps et al. (1976) from a task in which participants reported the first word they associated with a word they were presented with. An association value of x meant that out of 100 participants, x individuals reported a specific word associated with a target word. All items in the stimulus list had an association value of two.

Procedure

Participants were presented with a written consent form, and a general description of the experiment was provided. After the computer script was started, their age and gender were entered, and all further instructions were thereafter displayed on the computer screen. The experiment consisted of three phases:

Study phase

In the study phase, participants studied each word pair individually for 6 s, in a random order, and the full list was presented three times in total. Between blocks of 48 items, a distractor task was given, in which participants would verify as many mathematical expressions as possible in 30 s. This was done by pressing “1” on the keyboard for a correct mathematical expression, and “0” for an incorrect expression.

Testing phase

The testing phase contained four separate conditions that were manipulated within subjects. The 48 items were randomly, but evenly, assigned to four conditions, meaning that each condition contained a subset of 12 items which were all displayed or tested in a random order. The four conditions were covert, type, write, and study-only. The study-only condition contained a fourth opportunity to study each item after the study phase. For the other three conditions, a two-step testing procedure was adopted, with slight variations depending on condition.

First, a cue word was shown to the participants. This was the left word in the word pair, and participants were instructed to try to remember the right (target) word. If they believed they would be able to answer, they would press the ENTER key within 5 s. If this was not done, the script would move on to the next item. If, however, ENTER was pressed within 5 s, participants would either write or type their answer, or do nothing at all, depending on condition. The script ensured that each item would be presented for a total of 12 s, so pressing ENTER after 5 s would leave 7 s to give an answer. Similarly, pressing ENTER after 3 s would leave 9 s to provide the answer, and so on.

In the covert condition, pressing ENTER meant that one would have to wait for the remainder of the 12-s period for that particular item. Although time-consuming, this was the only way to ensure that exposure time did not differ between different items and conditions. For this reason, the items in the study condition were also displayed for 12 s.

In the type condition, participants were prompted to type their answer on the keyboard after they had pressed ENTER the first time. When finished, they would submit their answer by pressing ENTER again.

In the write condition, participants would instead write their answer (i.e., the target word) on a sheet of paper in front of them, and then press ENTER again. Apart from the way the answer was articulated, the procedure was identical to that of the type condition.
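The timing rule of the two-step testing procedure can be summarized in a minimal sketch. The constant and function names are hypothetical and are not taken from the E-Prime script; the sketch assumes that a press at exactly 5 s still counts as being within the decision window, as the 5 s/7 s example in the text suggests.

```python
TRIAL_DURATION_S = 12.0   # total presentation time per item
DECISION_WINDOW_S = 5.0   # time allowed for the first ENTER press


def remaining_response_time(enter_latency_s):
    """Seconds left for answering (or waiting, in the covert condition).

    Returns None when ENTER was not pressed within the decision window,
    in which case no answer is collected for that item.
    """
    if enter_latency_s is None or enter_latency_s > DECISION_WINDOW_S:
        return None
    return TRIAL_DURATION_S - enter_latency_s


print(remaining_response_time(5.0))  # 7.0
print(remaining_response_time(3.0))  # 9.0
```

Because the remaining time is always the complement of the first-press latency, total exposure is held constant across the covert, type, and write conditions.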

Final recall phase

After having completed the testing phase, participants were given an on-line typing speed test¹, in which the task was to copy a template text verbatim in 1 min. When 1 min had passed, a score was given that reflected the number of words that the participant had correctly copied. This test was taken three times, and the highest of the three scores was noted.

The typing test served as a short retention interval for the first of the two final cued-recall tests. For these tests, six words from each condition (i.e., half of the items) were selected randomly to be tested at the short (5 min) retention interval, and the remaining six at the long (7 days) retention interval. A cue word was shown and participants were given 15 s to type their answer on a keyboard and press ENTER to submit the answer. After 1 week, participants returned to take the final cued recall test, which contained the other half of the items.
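The item allocation described in the Method (48 pairs, 12 per learning condition, six per retention interval) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original E-Prime script: the function name, the placeholder pair labels, and the seeded random generator are all hypothetical.

```python
import random

CONDITIONS = ("study-only", "covert", "type", "write")


def allocate_items(pairs, rng):
    """Randomly assign pairs evenly to conditions, then split each
    condition's items in half for the short and long retention tests."""
    if len(pairs) % (2 * len(CONDITIONS)) != 0:
        raise ValueError("item count must divide evenly into condition halves")
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    per_condition = len(shuffled) // len(CONDITIONS)
    schedule = {}
    for i, condition in enumerate(CONDITIONS):
        items = shuffled[i * per_condition:(i + 1) * per_condition]
        half = len(items) // 2
        # Items are already in random order, so slicing yields a random split.
        schedule[condition] = {"short": items[:half], "long": items[half:]}
    return schedule


pairs = [f"pair_{n:02d}" for n in range(48)]  # placeholder labels
schedule = allocate_items(pairs, random.Random(2017))
for condition, tests in schedule.items():
    print(condition, len(tests["short"]), len(tests["long"]))
```

Each item thus appears in exactly one learning condition and in exactly one of the two final tests.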

Results

An alpha level of 0.05 was used, and for the analyses of variance (ANOVA) effect sizes are denoted by partial eta squared ($η_{p}^{2}$) or Cohen’s d.
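For readers who wish to check reported effect sizes, partial eta squared can be recovered from an F ratio and its degrees of freedom through the standard identity $η_{p}^{2}$ = (F × df_effect) / (F × df_effect + df_error). The helper below is a hypothetical convenience function, not part of the original analysis; the example inputs are the F values reported in the Final Cued Recall analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Retention interval, response format, and their interaction (Experiment 1).
print(round(partial_eta_squared(178.05, 1, 31), 2))  # 0.85
print(round(partial_eta_squared(5.98, 3, 93), 2))    # 0.16
print(round(partial_eta_squared(8.97, 3, 93), 2))    # 0.22
```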

Cued Recall during Learning

Given the design of the experiment, data only allowed for comparison of the cued recall performance during learning, between two of the four learning conditions. This is because no articulation took place for the study-only and covert conditions. Remember that during learning, the participants pressed the ENTER button when (and if) they had recalled an item, but thereafter, only the type and write conditions allowed participants to articulate their responses (for the study-only condition, no action was required from the participants). However, upon closer inspection of the number of ENTER presses associated with each condition, there appears to be little difference at least in the proportion of affirmative responses across conditions. On average, subjects pressed ENTER equally often for items that belonged to the covert (F_3,96 = 20.35, $η_{p}^{2}$ = 0.39, p = 0.001), type (M = 10.00; SD = 2.00) and write (M = 10.03; SD = 2.02) conditions, that is, roughly 84% of all trials.

There was no significant difference in the cued recall performance between the type condition (M = 0.76; SD = 0.20) and the write condition (M = 0.74; SD = 0.26) during learning, t₃₁ < 1. For affirmative responses (i.e., ENTER presses within the specified time frame), recall was generally high for both the type (M = 0.92; SD = 0.17) and write (M = 0.86; SD = 0.22) conditions. Again, these differences were not significant.

Final Cued Recall

A response format × retention interval repeated measures ANOVA on cued recall data showed significant main effects of retention interval, F_1,31 = 178.05, $η_{p}^{2}$ = 0.85, p = 0.001, and response format, F_3,93 = 5.98, $η_{p}^{2}$ = 0.16, p = 0.001, as well as their interaction, F_3,93 = 8.97, $η_{p}^{2}$ = 0.22, p = 0.001. As can be seen in Table 2, the conditions study-only, covert, and write did not differ at the short retention interval, although the type condition differed significantly from the covert (t₃₁ = 2.58, p = 0.015) and write (t₃₁ = 2.92, p = 0.006) conditions, but not the study-only condition, t₃₁ < 1. At the long retention interval, the cued recall performance of the conditions covert, type, and write all differed significantly from the study-only condition (covert: t₃₁ = 4.48, p = 0.001; type: t₃₁ = 3.49, p = 0.001; write: t₃₁ = 4.56, p = 0.001) but not from each other, ts₃₁ < 1.

TABLE 2. Cued recall performance as a function of the response format and retention interval (with standard deviations in parentheses).

Sidak post hoc comparisons revealed that the mean of the study-only condition differed significantly from those of all other conditions (covert: M_I-J = 0.10; SE = 0.03, p < 0.05; type: M_I-J = 0.12; SE = 0.03, p < 0.01; write: M_I-J = 0.08; SE = 0.03, p < 0.05). This suggests that the effect was mainly driven by the study-only condition relative to the other three conditions.

Response Latencies during Learning and Final Recall

The study-only condition had no measurable response latencies during the learning phase (both cue and target words were shown for 12 s), and was thus excluded from this comparison. For the short and long retention intervals, however, the response latencies of all four conditions are displayed in Table 3 below. As response latency measurements often yield non-normally distributed data (as was the case in this experiment), the response latencies are presented as medians rather than means.

TABLE 3. Median response latencies, in milliseconds, during learning and final recall.

Participants had longer response latencies after the long than after the short retention interval, which is to be expected as a result of forgetting. A Wilcoxon signed-ranks test indicated that at the short retention interval, the response latencies for items that were only studied tended to be higher than those of items that were tested, but only with respect to the write condition, and this difference did not reach significance at the 0.05 alpha level, Z = 1.81, p = 0.07. No significant differences were found at the long retention interval.
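The relation between the reported Z statistic and its p value can be verified with a short sketch that converts a standard normal Z into a two-tailed p value. The function name is hypothetical, and the sketch assumes the normal approximation that is commonly used when a Wilcoxon test is reported with a Z statistic.

```python
from math import erfc, sqrt


def two_tailed_p_from_z(z):
    """Two-tailed p value for a standard normal test statistic."""
    return erfc(abs(z) / sqrt(2))


print(round(two_tailed_p_from_z(1.81), 2))  # 0.07
```

With Z = 1.81, the two-tailed p value rounds to 0.07, which lies above the alpha level of 0.05 used throughout.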

Discussion

The results of Experiment 1 showed a clear testing effect, but its magnitude was not affected by response format, in that the covert and overt (i.e., type and write) conditions showed comparable levels of delayed recall, measured in terms of response accuracy and latency. However, these findings do not rule out the possibility of a difference in the relative efficacy of covert vs. overt retrieval. Specifically, a testing effect produced by only one testing session (during learning) may not be sufficiently sensitive for detecting potential effects of response format. This possibility was further investigated in Experiment 2.

Experiment 2

In Experiment 2, we wanted to ascertain whether the lack of difference in cued recall performance, with respect to the covert vs. type vs. write conditions, would remain even if the magnitude of the testing effect itself was increased. To this end, we included three consecutive testing sessions during initial learning.