The Power of Feedback Revisited: A Meta-Analysis of Educational Feedback Research

Wisniewski, Benedikt; Zierer, Klaus; Hattie, John

doi:10.3389/fpsyg.2019.03087

REVIEW article

Front. Psychol., 22 January 2020

Sec. Educational Psychology

Volume 10 - 2019 | https://doi.org/10.3389/fpsyg.2019.03087

1. Department of School Pedagogy, University of Augsburg, Augsburg, Germany
2. Melbourne Graduate School of Education, University of Melbourne, Parkville, VIC, Australia

Abstract

A meta-analysis (435 studies, k = 994, N > 61,000) of empirical research on the effects of feedback on student learning was conducted with the purpose of replicating and expanding the Visible Learning research (Hattie and Timperley, 2007; Hattie, 2009; Hattie and Zierer, 2019) from meta-synthesis. Overall results based on a random-effects model indicate a medium effect (d = 0.48) of feedback on student learning, but the significant heterogeneity in the data shows that feedback cannot be understood as a single consistent form of treatment. A moderator analysis revealed that the impact is substantially influenced by the information content conveyed. Furthermore, feedback has higher impact on cognitive and motor skills outcomes than on motivational and behavioral outcomes. We discuss these findings in the light of the assumptions made in The power of feedback (Hattie and Timperley, 2007). In general, the results suggest that feedback has rightly become a focus of teaching research and practice. However, they also point toward the necessity of interpreting different forms of feedback as independent measures.

Introduction

Feedback is information provided by an agent regarding aspects of one’s performance or understanding (Hattie and Timperley, 2007). There is an extensive body of research on this subject: Kluger and DeNisi (1996) conducted one of the most comprehensive reviews, based on 131 studies and over 12,000 participants, and reported an average effect of 0.38, noting that about a third of the effects were negative. More specifically, in the classroom domain, Hattie and Timperley (2007), Hattie (2009), and Hattie and Zierer (2019) conducted meta-syntheses relating to the effects of feedback on student achievement (which we refer to as Visible Learning research). These indicated a high effect (between 0.70 and 0.79) of feedback on student achievement in general. However, the authors noted the considerable variance of effects, identifying those forms of feedback as powerful that aid students in building cues and checking erroneous hypotheses and ideas, resulting in the development of more effective information processing strategies and understanding (Hattie and Timperley, 2007).

Given the impact of the Visible Learning research (over 25,000 citations on Google Scholar), it is important to ask whether the results presented on the effectiveness of feedback and the variables which moderate this effectiveness will stand up to scrutiny. A comprehensive meta-analysis on educational feedback which integrates the existing primary studies is still a desideratum.

Key Proposals of the Visible Learning Research

Sadler (1989) claimed that the main purpose of feedback is to reduce discrepancies between current understandings and performance and a goal. From this, Hattie and Timperley (2007) argued that feedback can have different perspectives: “feed-up” (comparison of the actual status with a target status, providing information to students and teachers about the learning goals to be accomplished), “feed-back” (comparison of the actual status with a previous status, providing information to students and teachers about what they have accomplished relative to some expected standard or prior performance), and “feed-forward” (explanation of the target status based on the actual status, providing information to students and teachers that leads to an adaptation of learning in the form of enhanced challenges, more self-regulation over the learning process, greater fluency and automaticity, more strategies and processes to work on the tasks, deeper understanding, and more information about what is and what is not understood). Additionally, feedback can be differentiated according to its level of cognitive complexity: It can refer to a task, a process, one’s self-regulation, or one’s self. Task level feedback means that someone receives feedback about the content, facts, or surface information (How well have the tasks been completed and understood? Is the result of a task correct or incorrect?). Feedback at the level of process means that a person receives feedback on the strategies of his or her performance. Feedback at this level is aimed at the processing of information that is necessary to understand or complete a certain task (What needs to be done to understand and master the tasks?). Feedback at the level of self-regulation means that someone receives feedback about the individual’s regulation of the strategies they are using in their performance.
In contrast to process level feedback, feedback on this level does not provide information on choosing or developing strategies but on monitoring the use of strategies in the learning process. It aims at a greater skill in self-evaluation or confidence to engage further on a task (What can be done to manage, guide and monitor your way of action?). The self-level focuses on the personal characteristics of the feedback recipient (often praise about the person). One of the arguments about the variability is that feedback needs to focus on the appropriate question and level of cognitive complexity; if not, the message can easily be ignored, misunderstood, and of low value to the recipient. Generally, it has been shown that the majority of feedback in classes is task feedback, the most received and interpreted is about “where to next,” and the least effective is self or praise feedback (Hattie and Timperley, 2007).

Effectiveness of Feedback

Hattie and Timperley (2007) made basic assumptions with respect to variables that moderate the effectiveness of feedback on student achievement. The type of feedback was found to be decisive, with praise, punishment, rewards, and corrective feedback all having low or low to medium effects on average, but corrective feedback being highly effective for enhancing the learning of new skills and tasks. With regard to the feedback channel, video/audio and computer-assisted feedback were compared. For both forms, the synthesis showed medium high to high effects. It was also noted that specific written comments are more effective than providing grades. Hattie and Timperley (2007) also investigated the timing of feedback (immediate/delayed) and the valence (positive/negative feedback), reporting inconsistent results. It was proposed that forms of feedback with a lack of information value have low effects on student achievement.

Methodological Considerations

As noted, the major research method in the Visible Learning research is synthesizing meta-analyses. The unit of analysis was the individual meta-analysis, and each meta-analysis was given the same weight, regardless of the number of studies or sample size, using a fixed-effect model for the integration. This approach allows general assumptions to be made about the effectiveness of feedback without the need to look at every single primary study, but it brings with it some restrictions, which are addressed in the following:

Firstly, the use of a fixed-effect model may not be appropriate. A meaningful interpretation of the mean of integrated effects with this model is only possible if these effects are homogeneous (Hedges and Olkin, 1985). Because previous research on feedback includes studies that differ in variants of treatment, age of participants, school type, etc., it is highly likely that the effect size varies from study to study, which is not taken into account by a fixed-effect model. By contrast, under the random-effects model, we do not assume one true effect but try to estimate the mean of a distribution of effects. The effect sizes of the studies are assumed to represent a random sample from a particular distribution of these effect sizes (Borenstein et al., 2010). The random-effects model incorporates the systematic variation of effect sizes into the weighting scheme, assuming the variation to depend on factors that are unknown or that cannot be taken into account. Using the random-effects model, the variance for each primary study is in most cases larger than under the fixed-effect model because it consists of the fixed-effect variance plus a variance component τ². This also results in larger confidence intervals.
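
The contrast between the two models can be illustrated with a minimal sketch. It assumes the DerSimonian-Laird estimator for τ², which is one common choice and is not stated in the text; the function names and the example values are hypothetical and serve only to show why the random-effects variance and confidence interval are at least as large as their fixed-effect counterparts.

```python
import math


def fixed_effect(effects, variances):
    """Inverse-variance pooled mean and its variance under a fixed-effect model."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total
    return mean, 1.0 / total


def dersimonian_laird_tau2(effects, variances):
    """Between-study variance tau^2 (DerSimonian-Laird; an assumption of this sketch)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean, _ = fixed_effect(effects, variances)
    q = sum(w * (d - mean) ** 2 for w, d in zip(weights, effects))
    c = total - sum(w * w for w in weights) / total
    return max(0.0, (q - (len(effects) - 1)) / c)


def random_effects(effects, variances):
    """Pooled mean, its variance, and tau^2 under a random-effects model."""
    tau2 = dersimonian_laird_tau2(effects, variances)
    # Each study variance is the fixed-effect variance plus tau^2.
    mean, var = fixed_effect(effects, [v + tau2 for v in variances])
    return mean, var, tau2


def confidence_interval(mean, var, z=1.96):
    """Approximate 95% confidence interval based on the normal distribution."""
    half_width = z * math.sqrt(var)
    return mean - half_width, mean + half_width


# Hypothetical effect sizes and sampling variances, for illustration only.
example_effects = [0.10, 0.35, 0.60, 0.95, 0.20]
example_variances = [0.01, 0.02, 0.015, 0.03, 0.01]
fe_mean, fe_var = fixed_effect(example_effects, example_variances)
re_mean, re_var, tau2 = random_effects(example_effects, example_variances)
```

With heterogeneous inputs such as these, τ² is positive, so every study receives a larger variance and the pooled interval widens.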

Secondly, a source of distortion when using a synthesis approach results from overlapping samples of studies. By integrating a number of meta-analyses dealing with effects of feedback interventions without checking every single primary study, there is a high probability that the samples of primary studies integrated in these meta-analyses are not independent of each other, but at least some primary studies were integrated in more than one meta-analysis. Therefore, these would have to be considered as duplicates (primary studies that are included in the result of the synthesis more than once) and consequently cause a distortion. In contrast to meta-synthesis, a meta-analytical approach allows duplicates to be removed and thereby prevents a distortion of results.

The question arises whether synthesizing research on feedback on different levels, from different perspectives, and in different directions, and compressing this research into a single effect size value, leads to interpretable results. In contrast to a synthesis approach, the meta-analysis of primary studies makes it possible to weight study effects, consider the issues of systematic variation of effect sizes, remove duplicate studies, and search for moderator variables based on study characteristics. Therefore, a meta-analysis is likely to produce more precise results.

Research Questions

One of the most consistent findings about the power of feedback is the remarkable variability of effects. The existing research has identified several relevant moderators like timing and specificity of the goals and task complexity (Kluger and DeNisi, 1996) and sought to understand how recipients (e.g., students, teachers) receive and understand feedback, how to frame feedback to maximize this reception, and the more critical aspects of feedback that optimize its reception and use (Hattie and Clarke, 2018; Brooks et al., 2019).

The purpose of the present study was to integrate the primary studies that provide information on feedback effects on student learning (achievement, motivation, behavior), with a meta-analytic approach that takes into account the methodological problems described in the previous part and to compare the results to the results of the Visible Learning research. Therefore, the study also investigates the differences between meta-synthesis and meta-analysis.

In particular, our study addressed the following research questions:

RQ1: What is the overall effect of feedback on student learning based on an integration of each of the primary studies within all of the meta-analyses used in the Visible Learning research?

RQ2: To what extent is the effect of feedback moderated by specific feedback characteristics?

Method

General Procedure

This meta-analysis is a quantitative integration of empirical research comparing the effects of feedback on student learning outcomes. The typical strategy is (1) to compute a summary effect for all primary studies, (2) to calculate the heterogeneity of the summary effect, and (3) in case of heterogeneity between studies, to identify study characteristics (i.e., moderators) that may account for a part of or all of that heterogeneity. In detail, and as suggested by Moher et al. (2009), we

• specified the study and reported characteristics, making the criteria for eligibility transparent,
• described all information sources and the process for selecting studies,
• described methods of data extraction from the studies,
• described methods used for assessing risk of bias of individual studies,
• stated the principal summary measures,
• described the methods of handling data and combining results of studies, and
• described methods of additional analyses (sensitivity and moderator analyses).

The following procedure was employed in this review (see Figure 1): First, we identified primary studies from existing meta-analyses and decided whether to include these based on four inclusion criteria. Then we developed a coding scheme to compare the effects of different feedback interventions. In the next step, we defined an effect size for each primary study or study part, either by extracting it from an existing meta-analysis or (when this was not possible) calculating it from information provided in the respective primary study.

FIGURE 1

We used the random-effects model for integration of the effect sizes that met our inclusion criteria to calculate an average effect size for all studies, and, in a next step, for subgroups defined by our coding scheme. We checked for heterogeneity across the studies and conducted outlier analysis and moderator analysis to assist in reducing the heterogeneity of effect sizes.

Identification of Studies and Inclusion Criteria

To identify the primary studies for our meta-analysis, we searched 32 existing meta-analyses that were used in the context of the Visible Learning research for information on primary studies that included relevant data for integration (effect sizes, sample sizes). To be included, each study had to meet the following inclusion criteria: It had to

• contain an empirical comparison of a form of feedback intervention between an experimental and a control group, or a pre-post comparison;
• report constitutive elements to calculate an effect size (e.g., include means, standard deviations, and sample sizes);
• report at least one dependent variable related to student learning (achievement, motivation, or behavioral change); and
• have an identifiable educational context (data obtained with samples of students or teachers in a kindergarten, primary school, secondary school, college, or university).

The inclusion criteria are comparable to the criteria that were used to include meta-analyses in the Visible Learning research syntheses but allow the exclusion of studies from meta-analyses that encompass both an educational and a non-educational context (Wiersma, 1992; Kluger and DeNisi, 1996; Standley, 1996).

We included studies with controlled designs as well as pre-post-test designs, and this became a moderator to investigate any differences related to design (Slavin, 2008). Whenever existing meta-analyses reported the relevant statistical data from the primary studies, we used these data. When no relevant statistical data from primary studies were reported, we contacted the authors of the meta-analyses via e-mail and asked them to provide the missing information. Four authors responded, and three of them provided the effect sizes and sample sizes of the primary studies which they used in their meta-analysis. When no relevant data were reported in a meta-analysis and authors did not provide them, we reconstructed the effect sizes directly from the primary studies whenever possible. Some studies were excluded, either because we could not reconstruct their effect size or for other reasons as specified in Table 1.

TABLE 1

Authors	Year	Status	Access to effect sizes and sample sizes
Azevedo and Bernard	1995	Included	From meta-analysis
Bangert-Drowns et al.	1991	Included	From meta-analysis
Biber et al.	2011	Included	From meta-analysis
Brown	2014	Excluded	Data not available
Getsie et al.	1985	Excluded	No effect sizes indicated; no individual references provided
Graham et al.	2015	Partially included	From meta-analysis
Kang and Han	2015	Included	From meta-analysis
Kluger and DeNisi	1996	Partially included	Original data received from authors; studies that do not deal with educational context excluded
Kulik and Kulik	1988	Partially included	33 sample size values missing; reconstruction from the primary studies not possible
L’Hommedieu et al.	1990	Included	From meta-analysis
Li	2010	Included	From meta-analysis
Lysakowski and Walberg	1980	Excluded	No effect sizes indicated, effect sizes and sample sizes not reconstructable from original studies
Lysakowski and Walberg	1981	Partially included	44 of 54 primary studies excluded because they do not deal with feedback on the relevant outcomes; effect sizes reconstructed from primary studies
Lyster and Saito	2010	Partially included	Effect sizes reconstructed from original studies
Menges and Brinko	1986	Partially included	Missing values reconstructed from primary studies
Miller	2003	Included	Updated set of studies was used (Miller and Pan, 2012), instead of the eleven effects from eight studies used by Miller (2003), 31 effects from 13 studies were integrated
Neubert	1998	Partially included	Effect sizes/sample sizes reconstructed from primary studies
Rummel and Feinberg	1988	Partially included	38 of 45 studies excluded because they do not deal with the relevant outcomes
Russell and Spada	2006	Included	From meta-analysis
Schimmel	1983	Excluded	Data not available
Skiba, Casey, and Center	1985	Excluded	Data no longer available (even directly from authors)
Standley	1996	Partially included	82 of 98 studies excluded because they do not deal with a school context
Swanson and Lussier	2001	Partially included	Effect sizes/sample sizes reconstructed from primary studies
Tenenbaum and Goldring	1989	Included	Statistical data from meta-analysis, but no references of integrated studies provided
Travlos and Pratt	1995	Partially included	From meta-analysis
Truscott	2007	Included	From meta-analysis
van der Kleij et al.	2015	Included	From meta-analysis
Walberg	1982	Excluded	No effect sizes and sample sizes indicated; reconstruction of data no longer possible
Wiersma	1992	Partially included	10 of 20 studies excluded because they do not deal with an educational context
Wilkinson	1981	Excluded	Data not available
Witt et al.	2006	Excluded	No data on feedback effects
Yeany and Miller	1983	Partially included	45 of 49 studies excluded because data was not reconstructable

Existing meta-analyses investigating the factor feedback.

Coding of Study Features

To be able to identify characteristics that influence the impact of feedback, a coding scheme was developed. It includes the following categories of study features: publication type (i.e., journal article, dissertation), outcome measure (i.e., cognitive, motivational, physical, behavioral), type of feedback (i.e., reinforcement/punishment, corrective, high-information), feedback channel (i.e., written, oral, video-, audio- or computer-assisted), and direction (i.e., teacher > learner, learner > teacher). Some of the study features of interest had to be dropped (i.e., perspective of feedback, way of measuring the outcome) because there were insufficient data, or the feature could not be defined based on the article abstracts. Generally, the study features for our coding scheme are oriented toward Hattie and Timperley’s (2007) coding features.

We analyzed inter-coder consistency to ensure reliability among coders by randomly selecting 10% of the studies and having them coded separately by two coders. Based on this, we assessed intercoder reliability of each coding variable. For the 6 moderator variables, Krippendorff’s alpha ranged from 0.81 to 0.99 and was therefore above the acceptable level (Krippendorff, 2004). The two coders then discussed and resolved remaining disagreements and established an operational rule that provided precise criteria for the coding of studies according to each moderator variable. The lead author then used these operational rules to code the rest of the studies.
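
As an illustration of the reliability check, the following minimal sketch computes Krippendorff’s alpha for two coders. It assumes nominal categories and no missing codes, neither of which is specified in the text; the function name and the example codes are hypothetical.

```python
from collections import Counter


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal data, no missing values."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must code the same units.")
    n_values = 2 * len(coder_a)  # total number of codes assigned
    value_counts = Counter(coder_a) + Counter(coder_b)

    # Observed disagreement: each disagreeing unit contributes two
    # off-diagonal entries to the coincidence matrix.
    disagreements = sum(1 for a, b in zip(coder_a, coder_b) if a != b)
    observed = 2 * disagreements / n_values

    # Expected disagreement from the marginal frequencies of the categories.
    expected = sum(
        value_counts[c] * value_counts[k]
        for c in value_counts
        for k in value_counts
        if c != k
    ) / (n_values * (n_values - 1))
    if expected == 0:
        raise ValueError("Alpha is undefined when only one category is used.")
    return 1.0 - observed / expected


# Hypothetical codes for the moderator "feedback channel" on four studies.
codes_a = ["oral", "oral", "written", "written"]
codes_b = ["oral", "written", "written", "written"]
alpha = krippendorff_alpha_nominal(codes_a, codes_b)
```

One disagreement in four units already lowers alpha well below the range of 0.81 to 0.99 reported above, which shows how strict the coefficient is for small samples.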

Calculation of Effect Sizes

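For the computation of effect sizes, tests for heterogeneity, and the analysis of moderator variables, we used the Meta and Metafor packages for R (R Core Team, 2017). To compare study results, Cohen’s (1988) d effect size measure was applied. This is calculated as

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s}$$

with the pooled standard deviation of

$$s = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}$$

where $\bar{x}_1$ and $\bar{x}_2$ are the means, $s_1$ and $s_2$ the standard deviations, and $n_1$ and $n_2$ the sample sizes of the two groups being compared.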

Hedges and Olkin (1985) demonstrated that the unsystematic error variance of a primary study is determined by the variance of the effect size. The higher the variance, the less precise the study effect. Because study effects that have higher precision are to be weighted more strongly than effects that have lower precision, the inverse of the variance of a study effect, taken relative to the sum of the inverse variances of all k primary studies, serves as a correction factor (Rustenbach, 2003). The inverse variance weight is calculated as

$$w_i = \frac{1 / v_i}{\sum_{j=1}^{k} 1 / v_j}$$

where $v_i$ is the variance of the effect size of study i.

The average weighted effect size d is the sum of all weighted effect sizes of the k primary studies. In the fixed-effect model, the variance is derived from the individual study variances (5). In the random-effects model, the variance consists of a first component (5) and a second component, τ² (6), which is the variance of the effect size distribution.
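
The effect size computation and weighting can be sketched as follows. This is a minimal illustration, not the Meta or Metafor implementation: the function names are hypothetical, and the sampling variance of d uses a common large-sample approximation that is an assumption of this sketch.

```python
import math


def pooled_sd(n1, sd1, n2, sd2):
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference between group 1 and group 2."""
    return (mean1 - mean2) / pooled_sd(n1, sd1, n2, sd2)


def d_variance(d, n1, n2):
    """Large-sample approximation of the sampling variance of d (assumed here)."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def normalized_inverse_variance_weights(variances):
    """Inverse-variance weights scaled so that they sum to 1."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [w / total for w in inverse]


def weighted_average_effect(effects, variances):
    """Sum of the weighted effect sizes of all primary studies."""
    weights = normalized_inverse_variance_weights(variances)
    return sum(w * d for w, d in zip(weights, effects))


# Hypothetical study: feedback group vs. control group on a test score.
d_example = cohens_d(105.0, 10.0, 30, 100.0, 10.0, 30)
v_example = d_variance(d_example, 30, 30)
```

Because the weights are normalized, more precise studies pull the average toward their own effect in proportion to their share of the total precision.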

Integration Model

The model of random effects (Hedges and Vevea, 1998) was used to integrate the study effects. With a random-effects model we attempted to generalize findings beyond the included studies by assuming that the selected studies are random samples from a larger population (Cheung et al., 2012). Consequently, study effects may vary within a single study and between individual studies, hence no common population value is assumed.

The random-effects model takes two variance components into account. These are the sum of the individual standard errors of the study effects resulting from the sample basis of the individual studies, and the variation in the random selection of the effect sizes for the meta-analysis. A meaningful interpretation of average effect sizes from several primary studies does not necessarily require homogeneity (i.e., that the variation of the study effects is solely random; Rustenbach, 2003). The basic assumption here is that differences in effect sizes within the sample are due to sampling errors as well as systematic variation.

The integration of multiple effect sizes does not only require independence of the primary studies included in the meta-analyses, but also independence of the observed effects reported in the primary studies. The second assumption is violated when sampling errors and/or true effects are correlated. This can be the case when studies report more than one effect and these effects stem from comparisons with a common control group (multiple treatment studies, Gleser and Olkin, 2009). To adequately integrate statistically dependent effect sizes, there are different approaches, for example selecting one effect size per study, averaging all effects reported in one study, or conducting multivariate meta-analysis (which requires knowledge of the underlying covariance structure among effect sizes). If a study reported more than one effect size and the multiple outcomes could not be treated as independent from each other (because they used one common sample), we accounted for this non-independence by robust variance estimation (RVE, Sidik and Jonkman, 2006; Hedges et al., 2010). This method allows the integration of statistically dependent effect sizes within a meta-analysis without knowledge of the underlying covariance structure among effect sizes.
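
To make the dependency problem concrete, the sketch below implements the simplest of the approaches listed above, averaging all effects reported in one study. It is not robust variance estimation, which the authors actually used; the function name and the example records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_effects_per_study(records):
    """Collapse (study_id, effect_size) pairs to one mean effect per study."""
    by_study = defaultdict(list)
    for study_id, effect in records:
        by_study[study_id].append(effect)
    return {study_id: mean(effects) for study_id, effects in by_study.items()}


# Hypothetical records: study "A" reports two effects against one control group.
example_records = [("A", 0.2), ("A", 0.4), ("B", 0.5)]
per_study = average_effects_per_study(example_records)
```

Averaging discards within-study information, which is one reason a method such as RVE that keeps all effect sizes may be preferred.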

Bias and Heterogeneity

Possible selection bias was tested by means of a funnel plot, a scatter diagram that plots the treatment effect on the x-axis against the study size on the y-axis, and by means of a normal quantile plot, in which the observed effect sizes are compared with the expected values of the effect sizes drawn from a normal distribution. Additionally, the regression test of Egger et al. (1997) was used to detect funnel plot asymmetry.
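
A minimal sketch of the regression test is given below. It assumes the classical formulation, in which the standard normal deviate (effect divided by its standard error) is regressed on precision (the inverse standard error) and the intercept is examined; the function name and the example values are hypothetical, and the p-value is omitted because it requires a t-distribution with k − 2 degrees of freedom.

```python
import math


def egger_regression(effects, standard_errors):
    """Return the intercept of Egger's regression and its standard error."""
    precision = [1.0 / se for se in standard_errors]
    deviate = [d / se for d, se in zip(effects, standard_errors)]
    n = len(precision)
    if n < 3:
        raise ValueError("At least three studies are needed.")

    # Ordinary least squares of the deviate on the precision.
    x_bar = sum(precision) / n
    y_bar = sum(deviate) / n
    sxx = sum((x - x_bar) ** 2 for x in precision)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(precision, deviate))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Standard error of the intercept from the residual variance.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(precision, deviate))
    residual_variance = sse / (n - 2)
    intercept_se = math.sqrt(residual_variance * (1.0 / n + x_bar ** 2 / sxx))
    return intercept, intercept_se


# Hypothetical data in which less precise studies report larger effects.
example_effects = [0.22, 0.38, 0.63, 0.78, 1.02]
example_ses = [0.1, 0.2, 0.3, 0.4, 0.5]
intercept, intercept_se = egger_regression(example_effects, example_ses)
t_statistic = intercept / intercept_se
```

An intercept that is clearly different from zero indicates that small, imprecise studies deviate systematically from large ones, which is the pattern a funnel plot shows as asymmetry.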

A Q-test (Shadish and Haddock, 1994) was performed to test the homogeneity of the observed effect sizes.

The test variable Q is χ²-distributed with k − 1 degrees of freedom. Q can be used to check whether effect sizes of a group are homogeneous or whether at least one of the effect sizes differs significantly from the others. In order to be able to provide information about the degree of heterogeneity, I² was computed (Deeks et al., 2008). I² is a measure of the degree of heterogeneity among a set of studies along a 0–100% scale and can be interpreted as moderate for values between 30 and 60%, substantial for values over 50%, and considerable for values over 75% (Deeks et al., 2008).
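
For reference, Q and I² are commonly written as shown below. The notation is an assumption of this note and follows the usual inverse-variance convention: d_i is the effect size of study i, v_i its sampling variance, and k the number of effect sizes.

```latex
% \bar{d}: inverse-variance weighted mean effect (fixed-effect weights)
\bar{d} = \frac{\sum_{i=1}^{k} d_i / v_i}{\sum_{i=1}^{k} 1 / v_i}, \qquad
Q = \sum_{i=1}^{k} \frac{(d_i - \bar{d})^2}{v_i}, \qquad
I^2 = \max\!\left(0,\; \frac{Q - (k - 1)}{Q}\right) \times 100\%
```

Under homogeneity, Q is expected to be close to its degrees of freedom k − 1, so I² expresses the share of Q that exceeds this expectation.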

Outlier Analysis

By definition, no outliers exist in the random-effects model because the individual study effects are not based on a constant population mean. Extreme values are attributed to natural variation. An outlier analysis, however, can serve to identify unusual primary studies. We used the method of adjusted standardized residuals to determine whether effect sizes have inflated variance. An adjusted residual is the deviation of an individual study effect from the adjusted mean effect, i.e., the mean effect of all other study effects. Adjusted standardized residuals follow the normal distribution and are therefore significantly different from 0 when their absolute value exceeds 1.96. They are conventionally classified as extreme values when their absolute value exceeds 2 (Hedges and Olkin, 1985).
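
The leave-one-out logic of adjusted standardized residuals can be sketched as follows. To keep the example short, τ² is passed in as a fixed value rather than re-estimated for each omitted study; this simplification, the function name, and the example values are assumptions of the sketch.

```python
import math


def adjusted_standardized_residuals(effects, variances, tau2=0.0):
    """Standardized deviation of each effect from the mean of all other effects."""
    total_variances = [v + tau2 for v in variances]
    residuals = []
    for i, (d_i, v_i) in enumerate(zip(effects, total_variances)):
        # Pooled mean and its variance computed without study i.
        others = [
            (d, v)
            for j, (d, v) in enumerate(zip(effects, total_variances))
            if j != i
        ]
        weights = [1.0 / v for _, v in others]
        mean_others = sum(w * d for w, (d, _) in zip(weights, others)) / sum(weights)
        variance_of_mean = 1.0 / sum(weights)
        residuals.append((d_i - mean_others) / math.sqrt(v_i + variance_of_mean))
    return residuals


# Hypothetical data: nine similar effects and one unusually large effect.
example_effects = [0.40, 0.45, 0.50, 0.55, 0.60, 0.45, 0.50, 0.55, 0.50, 2.00]
example_variances = [0.04] * 10
z_values = adjusted_standardized_residuals(example_effects, example_variances, tau2=0.01)
flagged = [i for i, z in enumerate(z_values) if abs(z) > 1.96]
```

Leaving the study under inspection out of the mean prevents an extreme effect from masking itself by pulling the pooled estimate toward its own value.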

Moderator Analysis

For heterogeneous data sets, suitable moderator variables must be used for a more meaningful interpretation. In extreme cases, this can lead to a division into k factor levels if none of the primary studies can be integrated into a homogeneous group. Q_B reflects the amount of heterogeneity that can be attributed to the moderator variable, whereas Q_W provides information on the amount of heterogeneity that remains within the moderator category. The actual suitability of a moderator variable within a fixed-effect model is demonstrated by the fact that homogeneous effect sizes are present within the primary study group defined by it (Q_Wempirical < Q_Wcritical) and at the same time the average effect sizes of the individual groups differ significantly from each other (Q_Bempirical > Q_Bcritical). If both conditions are fulfilled, homogeneous factor levels are present, which are defined by moderator variables, leading to a meaningful separation of the primary studies. However, by this definition, homogeneity of effect sizes within hypothesized moderator groups will occur rarely in real data, which means that fixed-effect models are rarely appropriate for real data in meta-analyses and random-effects models should be preferred (Hunter and Schmidt, 2000). In random-effects models, it can be tested if moderators are suitable for reducing heterogeneity (the random-effects model then becoming a mixed-effects model), but without assuming homogeneity (Viechtbauer, 2007). Therefore, we used the article abstracts of the primary studies to define meaningful moderator variables and set the moderator values for each primary study according to our coding scheme. The following moderators were used:

Research Design

Studies with control groups were separated from studies with a pre-post-test design. Effect sizes from pre-post designs are generally less reliable and less informative about the effects of the intervention because they are likely to be influenced by confounding variables (Morris and DeShon, 2002).

Publication Type

The type of publication (journal article or dissertation) was used as a moderator. Published studies may be prone to having larger effect sizes than unpublished studies because they are less likely to be rejected when they present significant results (Light and Pillemer, 1986).

Outcome Measure

The Visible Learning research investigated the impact of the factor feedback on student achievement. However, not all primary studies that were integrated in the meta-analyses contain an achievement outcome measure. Consequently, for our meta-analysis, we differentiated between four types of outcome measure: cognitive (including student achievement, retention, cognitive test performance), motivational (including intrinsic motivation, locus of control, self-efficacy and persistence), physical (development of motor skills) and behavioral (student behavior in classrooms, discipline).

Type of Feedback

A further distinction was made between different types of feedback, namely reinforcement/punishment, corrective feedback, and high-information feedback. Forms of reinforcement and punishment apply pleasant or aversive consequences to increase or decrease the frequency of a desired response or behavior. These forms of feedback provide a minimum amount of information on task level and no information on the levels of process or self-regulation. Corrective forms of feedback typically contain information about the task level in the form of “right or wrong” and the provision of the correct answer to the task. Feedback not only refers to how successfully a skill was performed (knowledge of results), but also to how a skill is performed (knowledge of performance). For some forms of feedback, e.g., modeling, additional information is provided on how the skill could be performed more successfully. Feedback was classified as high-information feedback when it was constituted by information as described for corrective feedback and additionally contained information on self-regulation from monitoring attention, emotions, or motivation during the learning process.

Feedback Channel

Some studies investigated the effects of feedback according to the channel by which it is provided. Hence, we distinguished between three forms: oral, written, and video-, audio-, or computer-assisted feedback.

Feedback Direction

This moderator refers to who gives and who receives feedback. We differentiated between feedback that is given by teachers to students, feedback that is given by students to teachers, and feedback that is given by students to students.
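
The partition of heterogeneity into Q_B and Q_W described under Moderator Analysis can be sketched for a categorical moderator as follows. The sketch uses fixed-effect weights, and the function names and example values are hypothetical; comparing the results with critical χ² values is left out because it requires a χ² quantile function.

```python
def weighted_mean(effects, variances):
    """Inverse-variance weighted mean of a set of effect sizes."""
    weights = [1.0 / v for v in variances]
    return sum(w * d for w, d in zip(weights, effects)) / sum(weights)


def q_statistic(effects, variances):
    """Weighted sum of squared deviations from the weighted mean."""
    mean = weighted_mean(effects, variances)
    return sum((d - mean) ** 2 / v for d, v in zip(effects, variances))


def partition_q(groups):
    """Split total heterogeneity into between-group and within-group parts.

    groups maps each moderator level to a list of (effect, variance) pairs.
    """
    all_effects = [d for studies in groups.values() for d, _ in studies]
    all_variances = [v for studies in groups.values() for _, v in studies]
    q_total = q_statistic(all_effects, all_variances)
    q_within = sum(
        q_statistic([d for d, _ in studies], [v for _, v in studies])
        for studies in groups.values()
    )
    return q_total, q_total - q_within, q_within


# Hypothetical effects grouped by the moderator "type of feedback".
example_groups = {
    "reinforcement": [(0.20, 0.02), (0.30, 0.03)],
    "corrective": [(0.45, 0.02), (0.50, 0.025)],
    "high_information": [(0.90, 0.03), (1.00, 0.04)],
}
q_total, q_between, q_within = partition_q(example_groups)
```

A moderator is informative when Q_B is large relative to its degrees of freedom (number of levels minus one) while Q_W is small relative to its own.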

Results

Identification of Studies

Our search strategy yielded 732 primary studies (see Figure 1). After the selection process, in the final data set, 994 effect sizes from 435 studies (listed in the Supplementary Appendix), including about 61,000 subjects, were used for our meta-analysis.

Figure 2 shows the distribution of the included effect sizes by year of publication. The median publication year is 1985. Fifteen percent of the integrated effect sizes are taken from studies published in the last 15 years.

FIGURE 2

General Impact of Feedback

The integration of all study effects with the random-effects model leads to a weighted average effect size of d = 0.55. Of the effects, 17% were negative. The confidence interval ranges from 0.48 to 0.62. Cohen’s U3, the percentage of scores in the experimental groups that exceed the average score in the control groups, is 70%. The probability of homogeneous effects is <0.001 with Q = 7,339 (df = 993) and I² = 86.47%.
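
The reported heterogeneity and U3 values can be checked with a short calculation. The sketch assumes the usual definition of I² in terms of Q and its degrees of freedom, and it computes U3 as the standard normal distribution function evaluated at d, which presupposes normally distributed scores with equal variances; the function names are hypothetical.

```python
from statistics import NormalDist


def i_squared(q, df):
    """Percentage of variability in effects attributable to heterogeneity."""
    return max(0.0, (q - df) / q) * 100.0


def cohens_u3(d):
    """Percentage of treatment-group scores above the control-group mean."""
    return NormalDist().cdf(d) * 100.0


i2_reported = i_squared(7339, 993)  # Q and df as reported above
u3_reported = cohens_u3(0.55)       # d as reported above
```

With these inputs, I² rounds to 86.47%, matching the reported value, and U3 comes out at roughly 71%, in line with the reported 70%.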