REVIEW article

Front. Educ., 10 April 2018

Sec. Assessment, Testing and Applied Measurement

Volume 3 - 2018 | https://doi.org/10.3389/feduc.2018.00022

Appropriate Criteria: Key to Effective Rubrics

  • Department of Educational Foundations and Leadership, Duquesne University, Pittsburgh, PA, United States


Abstract

True rubrics feature criteria appropriate to an assessment's purpose, and they describe these criteria across a continuum of performance levels. The presence of both criteria and performance level descriptions distinguishes rubrics from other kinds of evaluation tools (e.g., checklists, rating scales). This paper reviewed studies of rubrics in higher education from 2005 to 2017. Most of the rubrics studied in higher education to date have been analytic (considering each criterion separately) and descriptive, typically with four or five performance levels. Other types of rubrics have also been studied, and some studies called their assessment tool a “rubric” when in fact it was a rating scale. Further, for a few (7 out of 51) rubrics, performance level descriptions used rating-scale language or counted occurrences of elements instead of describing quality. Rubrics using this kind of language may be expected to be more useful for grading than for learning. Finally, no relationship was found between type or quality of rubric and study results. All studies described positive outcomes for rubric use.

A rubric articulates expectations for student work by listing criteria for the work and performance level descriptions across a continuum of quality (Andrade, 2000; Arter and Chappuis, 2006). Thus, a rubric has two parts: criteria that express what to look for in the work and performance level descriptions that describe what instantiations of those criteria look like in work at varying quality levels, from low to high.

Other assessment tools, like rating scales and checklists, are sometimes confused with rubrics. Rubrics, checklists, and rating scales all have criteria; the scale is what distinguishes them. Checklists ask for dichotomous decisions (typically has/doesn't have or yes/no) for each criterion. Rating scales ask for decisions across a scale that does not describe the performance. Common rating scales include numerical scales (e.g., 1–5), evaluative scales (e.g., Excellent-Good-Fair-Poor), and frequency scales (e.g., Always-Usually-Sometimes-Never). Frequency scales are sometimes useful for ratings of behavior, but none of the rating scales offer students a description of the quality of their performance that they can easily use to envision their next steps in learning. The purpose of this paper is to investigate the types of rubrics that have been studied in higher education.

Rubrics have been analyzed in several different ways. One important characteristic of rubrics is whether they are general or task-specific (Arter and McTighe, 2001; Arter and Chappuis, 2006; Brookhart, 2013). General rubrics apply to a family of similar tasks (e.g., persuasive writing prompts, mathematics problem solving). For example, a general rubric for an essay on characterization might include a performance level description that reads, “Used relevant textual evidence to support conclusions about a character.” Task-specific rubrics specify the specific facts, concepts, and/or procedures that students' responses to a task should contain. For example, a task-specific rubric for the characterization essay might specify which pieces of textual evidence the student should have located and what conclusions the student should have drawn from this evidence. The generality of the rubric is perhaps the most important characteristic, because general rubrics can be shared with students and used for learning as well as for grading.

The prevailing hypothesis about how rubrics help students is that they make explicit both the expectations for student work and, more generally, what learning looks like (Andrade, 2000; Arter and McTighe, 2001; Arter and Chappuis, 2006; Bell et al., 2013; Brookhart, 2013; Nordrum et al., 2013; Panadero and Jonsson, 2013). In this way, rubrics play a role in the formative learning cycle (Where am I going? Where am I now? Where to next? Hattie and Timperley, 2007) and support student agency and self-regulation (Andrade, 2010). Some research has borne out this idea, showing that rubrics do make expectations explicit for students (Jonsson, 2014; Prins et al., 2016) and that students do use rubrics for this purpose (Andrade and Du, 2005; Garcia-Ros, 2011). General rubrics should be written with descriptive language, as opposed to evaluative language (e.g., excellent, poor), because descriptive language helps students envision where they are in their learning and where they should go next.

Another important way to characterize rubrics is whether they are analytic or holistic. Analytic rubrics consider criteria one at a time, which means they are better for feedback to students (Arter and McTighe, 2001; Arter and Chappuis, 2006; Brookhart, 2013; Brookhart and Nitko, 2019). Holistic rubrics consider all the criteria simultaneously, requiring only one decision on one scale. This means they are better for grading, or for times when students will not need to use feedback, because making only one decision is quicker and less cognitively demanding than making several.

Rubrics have been characterized by the number of criteria and number of levels they use. The number of criteria should be linked to the intended learning outcome(s) to be assessed, and the number of levels should be related to the types of decisions that need to be made and to the number of reliable distinctions in student work that are possible and helpful.

Dawson (2017) recently summarized a set of 14 rubric design elements that characterize both the rubrics themselves and their use in context. His intent was to provide more precision to discussions about rubrics and to future research in the area. His 14 areas included: specificity, secrecy, exemplars, scoring strategy, evaluative criteria, quality levels, quality definitions, judgment complexity, users and uses, creators, quality processes, accompanying feedback information, presentation, and explanation. In Dawson's terms, this study focused on specificity, evaluative criteria, quality levels, quality definitions, quality processes, and presentation (how the information is displayed).

Four recent literature reviews (Jonsson and Svingby, 2007; Reddy and Andrade, 2010; Panadero and Jonsson, 2013; Brookhart and Chen, 2015) summarize research on rubrics. Brookhart and Chen (2015) updated Jonsson and Svingby's (2007) comprehensive literature review. Panadero and Jonsson (2013) specifically addressed the use of rubrics in formative assessment and the fact that formative assessment begins with students understanding expectations. They posited that rubrics help improve student learning through several mechanisms (p. 138): increasing transparency, reducing anxiety, aiding the feedback process, improving student self-efficacy, or supporting student self-regulation.

Reddy and Andrade (2010) addressed the use of rubrics in post-secondary education specifically. They noted that rubrics have the potential to identify needs in courses and programs, and have been found to support learning (although not in all studies). They found that the validity and reliability of rubrics can be established, but this is not always done in higher education applications of rubrics. Finally, they found that some higher education faculty may resist the use of rubrics, which may be linked to a limited understanding of the purposes of rubrics. Students generally perceive that rubrics serve purposes of learning and achievement, while some faculty members think of rubrics primarily as grading schemes (p. 439). In fact, rubrics are not as easy to use for grading as some traditional rating or point schemes; the reason to use rubrics is that they can support learning and align learning with grading.

Some criticisms and challenges for rubrics have been noted. Nordrum et al. (2013) summarized words of caution from several scholars about the potential for the criteria used in rubrics to be subjective or vague, or to narrow students' understandings of learning (see also Torrance, 2007). In a backhanded way, these criticisms support the thesis of this review, namely, that appropriate criteria are the key to the effectiveness of a rubric. Such criticisms are reasonable and get their traction from the fact that many ineffective or poor-quality rubrics exist that do have vague or narrow criteria. A particularly dramatic example of this happens when the criteria in a rubric are about following the directions for an assignment rather than describing learning (e.g., “has three sources” rather than “uses a variety of relevant, credible sources”). Rubrics of this kind misdirect student efforts and mis-measure learning.

Sadler (2014) argued that codification of qualities of good work into criteria cannot mean the same thing in all contexts and cannot be specific enough to guide student thinking. He suggests instantiation instead of codification, describing a process of induction where the qualities of good work are inferred from a body of work samples. In fact, this method is already used in classrooms when teachers seek to clarify criteria for rubrics (Arter and Chappuis, 2006) or when teachers co-create rubrics with students (Andrade and Heritage, 2017).

Purpose of the study

A number of scholars have published studies of the reliability, validity, and/or effectiveness of rubrics in higher education and provided the rubrics themselves for inspection. This allows for the investigation of several research questions, including:

  • What are the types and quality of the rubrics studied in higher education?

  • Are there any relationships between the type and quality of these rubrics and reported reliability, validity, and/or effects on learning and motivation?

Question 1 was of interest because, after doing the previous review (Brookhart and Chen, 2015), I became aware that not all of the assessment tools in studies that claimed to be about rubrics were characterized by both criteria and performance level descriptions, as for true rubrics (Andrade, 2000). The purpose of Research Question 1 was simply to describe the distribution of assessment tool types in a systematic manner.

Question 2 was of interest from a learning perspective. Various types of assessment tools can be used reliably (Brookhart and Nitko, 2019) and be valid for specific purposes. An additional claim, however, is made about true rubrics. Because the performance level descriptions describe performance across a continuum of work quality, rubrics are intended to be useful for students' learning (Andrade, 2000; Brookhart, 2013). The criteria and performance level descriptions, together, can help students conceptualize their learning goal, focus on important aspects of learning and performance, and envision where they are in their learning and what they should try to improve (Falchikov and Boud, 1989). Thus I hypothesized that there would not be a relationship between type of rubric and conventional reliability and validity evidence. However, I did expect a relationship between type of rubric and the effects of rubrics on learning and motivation, expecting true descriptive rubrics to support student learning better than the other types of tools.

Method

This study is a literature review. Study selection began with the database of studies selected for Brookhart and Chen (2015), a previous review of literature on rubrics from 2005 to 2013. Thirty-six studies from that review were done in the context of higher education. I conducted an electronic search for articles published from 2013 to 2017 in the ERIC database. This yielded 10 additional studies, for a total of 46 studies. The 46 studies have the following characteristics: (a) conducted in higher education, (b) studied the rubrics (i.e., did not just use the rubrics to study something else, or give a description of “how-to-do-rubrics”), and (c) included the rubrics in the article.

There are two reasons for limiting the studies to the higher education context. One, most published studies of rubrics have been conducted in higher education. I do not think this means fewer rubrics are being used in the K-12 context; I observe a lot of rubric use in K-12. Higher education users, however, are more likely to do a formal review of some kind and publish their results. Thus the number of available studies was large enough to support a review. Two, given that more published information on rubrics exists in higher education than K-12, limiting the review to higher education holds constant one possible source of complexity in understanding rubric use, because all of the students are adult learners. Rubrics used with K-12 students must be written at an appropriate developmental or educational level. The reason for limiting the studies to ones that included a copy of the rubrics in the article was that the analysis for this review required classifying the type and characteristics of the rubrics themselves.

Information about the 46 studies was entered into a spreadsheet. Information noted about the studies included country, level (undergraduate or graduate), type (rubric, rating scale, or point scheme), how the rubric considered criteria (analytic or holistic), whether the performance level descriptors were truly descriptive or used rating scale and/or numerical language in the levels, type of construct assessed by the rubrics (cognitive or behavioral), whether the rubrics were used with students or just by instructors for grading, sample, study method (e.g., case study, quasi-experimental), and findings. Descriptive and summary information about these classifications and study descriptions was used to address the research questions.
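As an illustration only, the sketch below shows one way a coding record of this kind could be represented; the field names and example values are hypothetical and are not taken from the actual spreadsheet used for this review.

```python
from dataclasses import dataclass

# Hypothetical structure for one coded study; the fields mirror the categories
# listed in the paragraph above, but the names and example values are illustrative.
@dataclass
class StudyRecord:
    citation: str             # e.g., "Author, Year"
    country: str
    level: str                # "undergraduate" or "graduate"
    tool_type: str            # "rubric", "rating scale", or "point scheme"
    criteria_handling: str    # "analytic" or "holistic"
    descriptive_plds: bool    # True if performance level descriptions describe quality
    construct: str            # "cognitive" or "behavioral"
    used_with_students: bool  # False if used only by instructors for grading
    sample: str
    method: str               # e.g., "case study", "quasi-experimental"
    findings: str

# Illustrative entry (not an actual coding from the review):
example = StudyRecord(
    citation="Hypothetical study, 2016",
    country="Netherlands",
    level="undergraduate",
    tool_type="rubric",
    criteria_handling="analytic",
    descriptive_plds=True,
    construct="cognitive",
    used_with_students=True,
    sample="105 students",
    method="questionnaire study",
    findings="students reported that the rubric clarified expectations",
)
```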

As an example of what is meant by descriptive language in a rubric, consider this excerpt from Prins et al. (2016). This is the performance level description for Level 3 of the criterion Manuscript Structure from a rubric for research theses (p. 133):

  • All elements are logically connected and keypoints within sections are organized. Research questions, hypotheses, research design, results, inferences and evaluations are related and form a consistent and concise argumentation.

Notice that a key characteristic of the language in this performance level description is that it describes the work. Thus for students who aspire to this high level, the rubric depicts for them what their work needs to look like in order to reach that goal.

In contrast, if performance level descriptions are written in evaluative language (for example, if the performance level description above had read, “The paper shows excellent manuscript structure”), the rubric does not give students the information they need to further their learning. Rubrics written in evaluative language do not give students a depiction of work at that level and, therefore, do not provide a clear description of the learning goal. An example of evaluative language used in a rubric can be found in the performance level descriptions for one of the criteria of an oral communication rubric (Avanzino, 2010, p. 109). This is the performance level description for Level 2 (Adequate) on the criterion of Delivery:

  • Speaker's delivery style/use of notes (manuscript or extemporaneous) is average; inconsistent focus on audience.

Notice that the key word in the first part of the performance level description, “average,” does not give any information to the student about what average delivery looks like in regard to style and use of notes. The second part of the performance level description, “inconsistent focus on audience,” is descriptive and gives students information about what Level 2 performance looks like in regard to audience focus.

Results and discussion

The 46 studies yielded 51 different rubrics because several studies included more than one rubric. The two sections below take up results for each research question in turn.

Type and quality of rubrics

Table 1 displays counts of the type and quality of rubrics found in the studies. Most of the rubrics (29 out of 51, 57%) were analytic, descriptive rubrics. This means they considered the criteria separately, requiring a separate decision about work quality for each criterion. In addition, it means that the performance level descriptions used descriptive, as opposed to evaluative, language, which is expected to be more supportive of learning. Most commonly, these rubrics described four (14) or five (8) performance levels.

Table 1

| Type | How criteria are considered | Performance level descriptions used descriptive language | Performance level descriptions included rating-scale language and/or relied on counting occurrences | Total |
|---|---|---|---|---|
| General Rubrics | Analytic | 1 level: 1; 3 levels: 3; 4 levels: 14; 5 levels: 8; 6 levels: 2; 8 levels: 1 (subtotal 29) | 4 levels: 4; 5 levels: 1; 7 levels: 1 (subtotal 6) | 35 |
| General Rubrics | Holistic | 4 levels: 3; 5 levels: 1 (subtotal 4) | 5 levels: 1 | 5 |
| Task-Specific Rubrics | Analytic | 2 levels: 1 | | 1 |
| Task-Specific Rubrics | Holistic | 1 level: 2 | | 2 |
| Rating Scale | Analytic | | 5 | 5 |
| Point Scheme | Holistic | | 3 | 3 |
| Total | | 36 | 15 | 51 |

Types of rubrics used in studies of rubrics in higher education.

Number of rubrics does not equal number of studies because some studies had more than one rubric.

General rubrics are general enough to apply to a family of similar tasks and can be shared with students. Task-specific rubrics apply to just one task and cannot be shared with students.

Analytic rubrics consider each criterion separately. Holistic rubrics consider all criteria simultaneously.

Rating scales require ratings on criteria using a judgmental scale. Examples include numeric scales (e.g., 1–5), frequency scales (e.g., always-usually-sometimes-never), and evaluative scales (e.g., excellent-good-fair-poor).

Point schemes are schemes to score tasks by assigning points to various aspects of students' responses.

Four of the 51 rubrics (8%) were holistic, descriptive rubrics. This means they considered the criteria simultaneously, requiring one decision about work quality across all criteria at once. In addition, the performance level descriptions used the desired descriptive language.

Three of the rubrics were descriptive and task-specific. One of these was an analytic rubric and two were holistic rubrics. None of the three could be shared with students, because they would “give away” answers. Such rubrics are more useful for grading than for formative assessment supporting learning. This does not necessarily mean the rubrics were of low quality, because they served well the grading function for which they were designed. However, they represent a missed opportunity to support learning as well as grading.

A few of the rubrics were not written in a descriptive manner. Six of the analytic rubrics and one of the holistic rubrics used rating scale language and/or listed counts of occurrences of elements in the work, instead of describing the quality of student learning and performance. Thus 7 out of 51 (14%) of the rubrics were not of the quality that is expected to be best for student learning (Arter and McTighe, 2001; Arter and Chappuis, 2006; Andrade, 2010; Brookhart, 2013).

Finally, eight of the 51 rubrics (16%) were not rubrics but rather rating scales (5) or point schemes for grading (3). It is possible that the authors were not aware of the more nuanced meaning of “rubric” currently used by educators and used the term in a more generic way to mean any scoring scheme.

As the heart of Research Question 1 was about the potential of the rubrics used to contribute to student learning, I also coded the studies according to whether the rubrics were used with students or whether they were just used by instructors for grading. Of the 46 studies, 26 (56%) reported using the rubrics with students and 20 (43%) did not use rubrics with students but rather used them only for grading.

Relation of rubric type to reliability, validity, and learning

Different studies reported different characteristics of their rubrics. I charted studies that reported evidence for the reliability of information from rubrics (Table 2) and the validity of information from rubrics (Table 3). For the sake of completeness, Table 4 lists six studies that presented their work with rubrics in a descriptive case-study style that did not fit easily into Table 2, Table 3, or Table 5 (below), which reports the effects of rubrics on learning. With the inclusion of Table 4, readers have descriptions of all 51 rubrics in all 46 studies reported under Research Question 1.

Table 2

Study Level Rubric topic & description Sample Reliability evidence
Avanzino, 2010 Undergraduate Oral communication Analytic rubric with 3 criteria, 3 levels with mostly descriptive plds 230 speeches (112 individual, 118 group) κ = 0.92
Britton et al., 2017 Undergraduate Team-Q Rubric for individual teamwork skills Final version: 5 criteria, each with behavioral descriptions, rated with a 5-level frequency scale (never to always) 70 students in a theater history and literature course, 24 of whom gave full consent External rater ICC 0.76 Research assistants ICC 0.77 Peers (4–5 per group) ICC 0.79 For revised rubric: internal consistency of self-ratings α = 0.91; internal consistency of peer-ratings α = 0.97
Chasteen et al., 2012 Undergraduate Physics, electromagnetism Detailed task-specific point schemes for each task 103 students in 3 courses (final version), 432 students in 14 courses during test development κ = 0.41 consistency between criteria α = 0.82
Cho et al., 2006 Undergraduate, graduate SWoRD Writing Rubrics Analytic rubric with 3 criteria and 7 levels; plds were somewhat descriptive but relied on counting (e.g., “all but one argument…”) or rating-scale language 708 students in 16 courses over 3 years from 4 universities Untrained raters: single rater ICCs 0.17–0.56; multiple rater ICCs 0.45–0.88. Compared reliability from student and instructor perspectives
Ciorba and Smith, 2009 Undergraduate Music – Instrumental and vocal performance Analytic rubric with 3 criteria and descriptive plds at 5 levels 28 panels of judges, 359 music students' performances inter-judge consistency, median α = 0.89
DeWever et al., 2011 Undergraduate Group work
Analytic rubric with 4 criteria and descriptive plds at 4 levels
659 students in 2 years, in groups of 8–9 (81 groups) Untrained raters: single rater ICCs 0.33–0.50 (individual criteria), 0.50–0.60 (total score)
Garcia-Ros, 2011 Undergraduate Oral presentation
14 criteria organized into 4 areas. 4 levels (0-3) with descriptive plds
64 educational psychology students exact agreement = 66% adjacent agreement = 98% κ = 0.36 exact agreement κ = 0.80 adjacent agreement median r = 0.89
Kocakülah, 2010 Undergraduate Newton's Laws of Motion problem solving Rubric style point scheme; Analytic rubric with 6 criteria and descriptive plds at 5 levels, but points vary depending on the criterion 153 physics students in 4 classes Untrained raters single rater ICCs, 0.14, 0.38 multiple rater ICCs, 0.93, 0.98 instructor's consistency between 2 forms, median α = 0.76
Lewis et al., 2008 Undergraduate Acute care treatment planning Analytic rubric with 4 criteria and descriptive plds at 4 levels 22 students, 5 clinical educators, 1 academic faculty Expert raters Single rater ICC = 0.32
Menéndez-Varela and Gregori-Giralt, 2016 Undergraduate Service learning projects 2 analytic rubrics. Content: 4 criteria, 4 levels each, w/ descriptive plds. Oral presentation: 5 criteria, 4 levels, descriptive plds except for time 84 history of art students Project content α increased from 0.67 (at stage 2 of study) to 0.93 (at stage 3 of study); α for oral presentation skills was 0.77
Newman et al., 2009 Graduate faculty Peer assessment of faculty teaching Rating scale, 1–5 (excellent through does not demonstrate criterion), on 11 criteria 14 resource faculty Expert raters Single rater ICC = 0.27 (total score)
Nicholson et al., 2009 Undergraduate Nurse clinical performance in operating suite Analytic rubric with 12 criteria and descriptive plds at 4 levels. Descriptions required inferences (e.g., “would require some prompting and assistance,” p. 75). 40 pre-op nurses rating 3 videos Expert raters: single rater ICCs 0.51–0.61; multiple rater ICC = 0.98
Pagano et al., 2008 Undergraduate Writing (College composition) Analytic rubric, 6 levels with descriptive plds at 3 of the levels (1–2, 3–4, 5–6) 6 institutions year 1, 5 institutions year 2 Adjacent agreement = 74%
Reddy, 2011 Graduate Business Cases, Business Projects Business case study rubric (4 dim); business project rubric (7 dim), each with descriptive plds at 4 levels 35 instructors, 95 business students, 2 institutions Exact agreement 0.61–0.99 Single rater ICCs 0.90–0.95 Multiple rater ICCs 0.71–0.99
Rochford and Borchert, 2011 Graduate Business case analysis
Analytic rubric, 10 criteria, organized into 4 “subobjectives” using a 1–5 scale with descriptive plds for 1, 3, and 5.
Case analysis assignments in MBA program capstone course Multiple rater ICC = 0.96
Schamber and Mahoney, 2006 Undergraduate Critical thinking
5 criteria (for each section of the paper) based on Facione and Facione (1996), with descriptive plds at 5 levels
2002, 30 papers; 2003, 30 papers Median r = 0.90
Schreiber et al., 2012 Undergraduate Public Speaking Competence Rubric Analytic rubric with 9 criteria (+2 optional), with descriptive plds at 5 levels Study 1, 5 coders, 45 speeches; Study 2, 3 undergraduate + 1 faculty coder, 50 speeches Expert raters Multiple rater ICCs 0.91, 0.93
Stellmack et al., 2009 Undergraduate Writing APA-style introductions Analytic rubric with 8 criteria with descriptive plds at 4 levels 40 papers, 3 researcher/graders Interrater agreement exact = 0.37, adjacent = 0.90 Intrarater agreement exact = 0.78, adjacent = 0.98 κ = 0.33
Timmerman et al., 2011 Undergraduate Science writing
Analytic rubric with 15 criteria and descriptive plds at 4 levels
142 lab reports, 9 trained and 8 'natural' graduate student raters Generalizability for relative decisions = 0.85
Wald et al., 2012 Graduate Reflective writing
Analytic rubric with 5 criteria (+1 optional) and descriptive plds at 4 levels
10–60 narratives over 5 trials Single-rater ICCs 0.51–0.75 Inter-judge consistency, median α = 0.77
Wallace et al., 2011 Undergraduate Astronomy – Cosmology
Task-specific, holistic rubrics for each test item, with 5 levels
65 responses from 21 students, 9 items Exact agreement, overall score = 83% κ = 0.76, weighted κ = 0.82

Reliability evidence for rubrics.

plds, Performance Level Descriptions.

Table 3

Study Level Rubric topic & description Sample Validity evidence
Avanzino, 2010 Undergraduate Oral communication
Analytic rubric with 3 criteria, 3 levels with mostly descriptive plds
230 speeches (112 individual, 118 group) Based on student learning outcomes; Subject expert review
Bauer and Cole, 2012 Undergraduate Chemistry guided-inquiry activities Rating scale, 0-3, on 15 indicators of POGIL (process oriented guided inquiry learning) 60 science faculty, 4 manipulated versions of the task Rubric was sensitive enough to distinguish four versions of the activity
Britton et al., 2017 Undergraduate Team-Q Rubric for individual teamwork skills Final version: 5 criteria, each with behavioral descriptions, rated with a 5-level frequency scale (never to always) 70 students in a theater history and literature course, 24 of whom gave full consent Factor analysis yielded a one-factor solution
Chasteen et al., 2012 Undergraduate Physics, electromagnetism
Detailed task-specific point schemes for each task
103 students in 3 courses (final version), 432 students in 14 courses during test development Expert feedback; student interviews; student results differed by course (could differentiate types of instruction); criterion-related evidence (to physics grades)
Cho et al., 2006 Undergraduate, graduate Writing
Analytic rubric with 3 criteria and 7 levels; plds were somewhat descriptive but relied on counting (e.g., “all but one argument…”) or rating-scale language
708 students in 16 courses over 3 years from 4 universities Correlations of student ratings with instructor and expert ratings
Ciorba and Smith, 2009 Undergraduate Music – Instrumental and vocal performance Analytic rubric with 3 criteria and descriptive plds at 5 levels 28 panels of judges, 359 music students' performances Scores rose by year (Fr-Soph-Jr-Sr); Scale intercorrelations (internal validity evidence)
Garcia-Ros, 2011 Undergraduate Oral presentation
14 criteria organized into 4 areas. 4 levels (0–3) with descriptive plds
64 educational psychology students Students' perceptions
Hancock and Brundage, 2010 Graduate Graduate Student Development Profile for Speech-Language Pathology students Pilot 26 first year students, then applied whole-program Demonstrated student growth over time; Faculty perceptions
Jonsson, 2014 Graduate 3 rubrics Survey construction rubric in epidemiology: analytic, general rubric, 2 criteria, 4 levels with plds for each; House inspection rubric in real estate program: more like a checklist, w/ multiple criteria and a tally of facts and reasoning for each; Patient communication rubric in dental program: indicators for each of several criteria 13 statistics students in an epidemiology program, 105 real estate students, 48 dental students Students found the rubrics transparent and useful. Criteria were aligned with assignments, “thereby inviting the students to use the rubrics as guides to performance, as well as tools for self-assessment and reflection” (p. 849). Results were interpreted to mean that rubrics made assessment expectations explicit for students.
Kocakülah, 2010 Undergraduate Physics – Newton's Laws of Motion problems Rubric style point scheme; Analytic rubric with 6 criteria and descriptive plds at 5 levels, but points vary depending on the criterion 153 physics students in 4 classes Students' mean peer scores were same as Instructor scores
Latifa et al., 2015 Undergraduate Practical Rating Rubric of Speaking Test Holistic grading rubric with 5 levels (0-4), 5 criteria, mostly counting (e.g., percentage of errors) 12 English speaking lecturers in several institutions in Indonesia Lecturers found the grading scale easy to use. Authors asserted they compared it with analytic scoring.
Menéndez-Varela and Gregori-Giralt, 2016 Undergraduate Service learning projects
2 analytic rubrics. Content: 4 criteria, 4 levels each, w/ descriptive plds. Oral presentation: 5 criteria, 4 levels, descriptive plds except for time
84 history of art students Three factors: Project content, Oral presentation skills, and Difficulty
Moni et al., 2005 Undergraduate Concept maps – Physiology
Study was done using original “rubric,” which was a point scheme for the concept map task. Revised rubric was an analytic rubric, 3 criteria, 5 levels, descriptive plds, based on student & faculty feedback
62 students, 2 faculty (plus 1 faculty advisor) Student perceptions; Faculty perceptions
Pagano et al., 2008 Undergraduate Writing (College composition) Analytic rubric, 6 levels with descriptive plds at 3 of the levels (1-2, 3-4, 5-6) 6 institutions year 1, 5 institutions year 2 Scores increased from early to late in the semester
Prins et al., 2016 Undergraduate Research theses in education Analytic rubric, 6 criteria, 3 levels, descriptive plds for levels 2 “must have” and 3 “nice to have” (where 1 was assumed to be “does not have”) 105 students Studied student use and perceptions via questionnaire. Students felt rubrics had 4 functions (based on a factor analysis of questionnaire). Students who got lower grades on the task reported beginning to apply the rubric's criteria later. Faculty wanted another level to distinguish good from excellent work.
Reddy, 2011 Graduate Business Cases, Business Projects Business case study rubric (4 dim); business project rubric (7 dim), each with descriptive plds at 4 levels 35 instructors, 95 business students, 2 institutions Expert review; Student perceptions
Rezaei and Lovorn, 2010 Graduate Writing Analytic rubrics with 5 criteria and descriptive plds at 4 levels; descriptions somewhat inferential (e.g., “limited understanding”) 467 graduate students Quasi-experiment investigating influence of construct-irrelevant factors
Schreiber et al., 2012 Undergraduate Public Speaking Competence Rubric Analytic rubric with 9 criteria & 2 optional criteria, with descriptive plds at 5 levels Study 1, 5 coders, 45 speeches; Study 2, 3 undergraduate + 1 faculty coder, 50 speeches Factor analysis (internal structure evidence); Criterion-related evidence (correlation of rubric scores for speeches with grades assigned to the speeches using different scoring schemes during the semester)
Stellmack et al., 2009 Undergraduate Writing APA-style introductions Analytic rubric with 8 dimensions with descriptive plds at 4 levels 40 papers, 3 researcher/graders Criterion-related evidence (Spearman correlation with independent judge)
Timmerman et al., 2011 Undergraduate Science writing Analytic rubric with 15 criteria and descriptive plds at 4 levels 142 lab reports, 9 trained and 8 'natural' graduate student raters Grader (graduate student) perceptions; Faculty (expert) review
Urios et al., 2015 Undergraduate Teamwork and oral & written communication skills, in a chemical engineering degree 3 main criteria and subcriteria, with rating-scale language in 2 to 4 levels under each, mostly about surface features 2 groups, 30 students in each, 1 teacher & teaching assistant in each Validation questionnaire. Students lacked knowledge of the use of rubrics, lacked adaptability and were somewhat resistant. Also “lack of commitment and proactivity in the teaching/learning process” p. 147.
Wald et al., 2012 Graduate Reflective writing Analytic rubric with 5 criteria (+1 optional) and descriptive plds at 4 levels 10–60 narratives over 5 trials Rubric content based on literature
Wallace et al., 2011 Undergraduate Astronomy – Cosmology Task-specific, holistic rubrics for each test item, with 5 levels 65 responses from 21 students, 9 items Rubric content based on student responses to tasks
Young, 2013 Undergraduate Physiotherapy clinical demonstrations Holistic proforma used mostly rating-scale language, 5 levels, with some highly inferential description, 1/2 page; Analytic rubric was very complicated, more of a point scheme, 5 criteria (+safety pass/fail), 5 levels; rating required counting behaviors listed from the standards, 3 pages 67 students Students' self-efficacy to grade was greater for the proforma than the rubric. Students felt rubric aided evaluation more than proforma at first (when they needed the behaviors listed explicitly) but changed in perception of competence to use the proforma by the end of the semester. Rubric was more useful for learning, but proforma was easier to use to score.

Validity evidence for rubrics.

plds, Performance Level Descriptions.

Table 4

Study Level Rubric topic & description Sample
Bissell and Lemons, 2006 Undergraduate Introductory Biology Paper-and-Pencil Tasks Detailed task-specific point schemes for grading biology paper-and-pencil tasks 150 students in 1 introductory biology course
Bowen, 2017 Undergraduate Visual Literacy Competency Holistic rubric with 5 levels based on the SOLO taxonomy 2 courses, popular culture & visual rhetoric; applied rubric to 1 assignment in each course
Davidowitz et al., 2005 Undergraduate Rubric for flow diagrams in chemistry labs Analytic rubric with plds using mostly rating-scale language (some descriptive) in 4 levels 133 flow diagrams from 16 students
Dinur and Sherman, 2009 Undergraduate Business Case Study Presentation
3 rubrics, 2 of which were true rubrics. Content rubric was a 1–5 rating scale on 9 criteria; Oral presentation rubric was an analytic rubric with plds using frequency-scale language on 4 levels of 4 criteria; Written assignment rubric was an analytic rubric with 8 criteria (only 1 of which was about content) and descriptive plds at 4 levels
159 business students
Fraser et al., 2005 Undergraduate Business Writing
Analytic rubric with 6 criteria and descriptive plds at 5 levels
Results summarized, sample size not given
Knight, 2006 Undergraduate Information Literacy (Annotated Bibliographies) Analytic rubric with 5 criteria and descriptive plds at 3 levels, but the descriptions include a lot of counting elements 260 bibliographies with 10 citations in each

Descriptive case studies about developing and using rubrics.

plds, Performance Level Descriptions.

Table 5

Study Level Rubric topic & description Sample Design Findings
Andrade and Du, 2005 Undergraduate Educational Psychology Learning Vignettes Performance Rubric, Analytic rubric with 6 weighted criteria and descriptive plds at 4 levels 14 teacher education students who had used rubrics in Ed Psych Focus groups Students used rubrics to determine teacher's expectations, plan production, check their work in progress, and guide and reflect on feedback. Some students only checked the A and B levels of the rubric, and some saw rubrics as a way to “give teachers what they want.”
Ash et al., 2005 Undergraduate Service learning objectives, Critical thinking Holistic rubric for service learning objectives, listed according to level of thinking, 0–4 (0 not described), so the learning objectives formed the descriptions; Holistic critical thinking rubric, 4 levels, 8 simultaneous criteria, descriptive plds 14 students in 2 classes Pre-experimental Improvement across drafts was noted, with the Academic criterion being the most difficult for students. Improvement in first drafts across the semester was also noted, but smaller, and again the Academic criterion was the hardest.
Britton et al., 2017 Undergraduate Team-Q Rubric for individual teamwork skills Final version: 5 criteria, each with behavioral descriptions, rated with a 5-level frequency scale (never to always) 70 students in a theater history and literature course, 24 of whom gave full consent Instrument development Significant improvement in teamwork skills from first time to second time in both self-ratings and peer ratings. External ratings improved from Time 1 to Time 2 but not significantly so.
Howell, 2011 Undergraduate Juvenile delinquency course assignment rubric Holistic grading rubric, somewhat task-specific, plds for each of 4 levels, which were then converted to points for grading 80 students in 2 sections of the instructor's own course Quasi-experimental Controlling for college year, criminal justice major (vs. not), pretest score and gender, being in the treatment group (having rubrics provided with the assignment) predicted achievement (β = 0.488). The only other large predictor was college year. Student achievement was higher when rubrics were used.
Howell, 2014 Undergraduate Juvenile delinquency course assignment rubric Holistic grading rubric, somewhat task-specific, plds for each of 4 levels, which were then converted to points for grading 76 students in 2 sections of the instructor's own course Quasi-experimental Treatment group (completed an assignment using a grading rubric) scored higher than comparison group (same assignment, no rubric). Regression showed rubric used contributed significantly after controlling for baseline course knowledge and gpa.
Kerby and Romine, 2010 Undergraduate & graduate Oral communications and presentation Analytic rubric with 8 criteria and descriptive plds at 3 levels 1 business accounting program Case study Oral presentation skills improved from sophomore to senior years, did not further improve in graduate level, which the researchers attributed to more complex material to present.
Kocakülah, 2010 Undergraduate Newton's Laws of Motion problem solving Rubric-style point scheme; Analytic rubric with 6 criteria and descriptive plds at 5 levels, but points vary depending on the criterion 153 physics students in 4 classes Quasi-experimental Students who took part in designing and using a rubric performed better in solving problems than those who had the same instruction but no rubric.
McCormick et al., 2007 Undergraduate Self-assessment of Executive Leadership Analytic rubric with 6 criteria and 8 levels (0–7), with descriptive plds at levels 2, 4, and 6 44 seniors in a leadership education course Pre-experimental Student perceived competence increased over the semester. Half of the students accurately estimated their competence (based on final exam), the other half underestimated their competence.
Menéndez-Varela and Gregori-Giralt, 2016 Undergraduate Service learning projects 2 analytic rubrics. Content: 4 criteria, 4 levels each, w/descriptive plds. Oral presentation: 5 criteria, 4 levels, descriptive plds except for time 84 history of art students Validity study Significant increase in scores (quality of projects) from stage 1 to stage 3 of the study, overall and for each of 5 raters individually; work quality increased as rubric use was repeated
Petkov and Petkova, 2006 Undergraduate Business Projects 13 criteria grouped into 4 areas, with rating-scale language at 4 levels 20 students fall (rubric), 20 students spring (no rubric) Pre-experimental Rubrics group achievement was higher than the comparison group.
Reynolds-Keefer, 2010 Undergraduate Writing Analytic rubric with 5 criteria and descriptive plds for 6 levels 45 ed psych students Open-ended questionnaire Pre-service teachers who used rubrics as students reported being more likely to use rubrics in their own teaching.
Ritchie, 2016 Undergraduate Oral presentations in biology “Rubric” was really a rating scale with 15 criteria organized under content, organization, & delivery, scored 1–5, “poor/absent” to “no change needed” 39 students in 2 sections (1 w/rubric self-assessment & 1 without); each gave 2 presentations Pre-experimental Students in the self-assessment-with-rubrics group improved more in the 2nd presentation, with less variability. All viewed their videotaped presentation (cf. 47% of the control group). Peer assessment was accurate (compared with the instructor); self-assessment was not.
Vandenberg et al., 2010 Undergraduate Financial analysis project Analytic rubric with 5 criteria and descriptive plds for 5 levels 49 students in 3 sections of the course Pre-experimental Students who used rubrics scored significantly higher on two of three sections of the project. Students with rubrics felt the requirements of the assignment were more clearly communicated than those without.

Studies of the effects of rubric use on student learning and motivation to learn.

Plds, Performance Level Descriptions.

Reliability was most commonly studied as inter-rater reliability, arguably the most important for rubrics because judgment is involved in matching student work with performance level descriptions, or as internal consistency among criteria. Construct validity was addressed with a variety of methods, from expert review to factor analysis; some studies also addressed consequential evidence for validity with student or faculty questionnaires. No discernible patterns were found that indicated one form of rubric was preferable to another in regard to reliability or validity. Although this conforms to my hypothesis, this result may also arise partly because most of the studies reported positive results and experiences with rubrics, no matter what type of rubric was used.
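For readers interpreting the coefficients reported in Table 2, the two statistics most often reported can be summarized with their standard textbook definitions; these formulas are given here only as background and are not taken from the reviewed studies.

```latex
% Cohen's kappa: chance-corrected agreement between two raters,
% where p_o is the observed proportion of agreement and p_e the
% proportion of agreement expected by chance.
\kappa = \frac{p_o - p_e}{1 - p_e}

% Cronbach's alpha: internal consistency across k criteria (items),
% where \sigma_i^2 is the variance of scores on criterion i and
% \sigma_X^2 is the variance of total scores.
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right)
```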

Table 5 describes 13 studies of the effects of rubrics on learning or motivation, all with positive results. Learning was most commonly operationalized as improvement in student work. Motivation was typically operationalized as student responses to questionnaires. In these studies as well, no discernible pattern was found regarding type of rubric. Despite the logical and learning-based arguments made in the literature and summarized in the introduction to this article, rubrics with descriptive performance level descriptions and rubrics with evaluative ones both led to at least some positive results for students. Eight of these studies used descriptive rubrics and five used evaluative rubrics. It is possible that the lack of association of type of rubric with study findings is a result of publication bias, because most of the studies had good things to say about rubrics and their effects. The small sample size (13 studies) may also be an issue.

Conclusions

Rubrics are becoming increasingly evident as part of assessment in higher education. Evidence for that claim is simply the growing number of published studies investigating rubrics and the assertions made in those studies about rising interest in rubrics.

Research Question 1 asked about the type and quality of rubrics published in studies of rubrics in higher education. The number of criteria varies widely depending on the rubric and its purpose. Three, four, and five are the most common numbers of levels. While most of the rubrics are descriptive—the type of rubrics generally expected to be most useful for learning—many are not. Perhaps most surprising, and potentially troubling, is that only 56% of the studies reported using rubrics with students. If all that is required is a grading scheme, traditional point schemes or rating scales are easier for instructors to use. The value of a rubric lies in its formative potential (Panadero and Jonsson, 2013), where the same tool that students can use to learn and monitor their learning is then used for grading and final evaluation by instructors.

Research Question 2 asked whether rubric type and quality were related to measurement quality (reliability and validity) or effects on learning and motivation to learn. Among studies in this review, reported reliability and validity was not related to type of rubric. Reported effects on learning and/or motivation were not related to type of rubric. The discussion above speculated that part of the reason for these findings might be publication bias, because only studies with good effects—whatever the type of rubric they used—were reported.

However, we should not dismiss all the results with a hand-wave about publication bias. All of the tools in the studies of rubrics—true rubrics, rating scales, checklists—had criteria. The differences were in the type of scale and scale descriptions used. Criteria lay out for students and instructors what is expected in student work and, by extension, what it looks like when evidence of intended learning has been produced. Several of the articles stated explicitly that the point of rubrics was to make assignment expectations explicit (e.g., Andrade and Du, 2005; Fraser et al., 2005; Reynolds-Keefer, 2010; Vandenberg et al., 2010; Jonsson, 2014; Prins et al., 2016). The criteria are the assignment expectations: the qualities the final work should display. The performance level descriptions instantiate those expectations at different levels of competence. Thus, one firm conclusion from this review is that appropriate criteria are the key to effective rubrics. Trivial or surface-level criteria will not draw learning goals for students as clearly as substantive criteria. Students will try to produce what is expected of them. If the criterion is simply having or counting something in their work (e.g., “has 5 paragraphs”), students need not pay attention to the quality of what their work has. If the criterion is substantive (e.g., “states a compelling thesis”), attention to quality becomes part of the work.

It is likely that appropriate performance level descriptions are also key for effective rubrics, but this review did not establish this fact. A major recommendation for future research is to design studies that investigate how students use the performance level descriptions as they work, in monitoring their work, and in their self-assessment judgments. Future research might also focus on two additional characteristics of rubrics (Dawson, 2017): users and uses and judgment complexity. Several studies in this review established that students use rubrics to make expectations explicit. However, in only 56% of the studies were rubrics used with students, thus missing the opportunity to take advantage of this important rubric function. Therefore, it seems important to seek additional understanding of users and uses of rubrics. In this review, judgment complexity was a clear issue for one study (Young, 2013). In that study, a complex rubric was found more useful for learning, but a holistic rating scale was easier to use once the learning had occurred. This hint from one study suggests that different degrees of judgment complexity might be more useful in different stages of learning.

Rubrics are one way to make learning expectations explicit for learners. Appropriate criteria are key. More research is needed that establishes how performance level descriptions function during learning and, more generally, how students use rubrics for learning, not just that they do.

Statements

Author contributions

The author confirms being the sole contributor of this work and approved it for publication.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Andrade H. G. (2000). Using rubrics to promote thinking and learning. Educational Leadership 57, 13–18. Available online at: http://www.ascd.org/publications/educational-leadership/feb00/vol57/num05/Using-Rubrics-to-Promote-Thinking-and-Learning.aspx
2. Andrade H. Du Y. (2005). Student perspectives on rubric-referenced assessment. Pract. Assess. Res. Eval. 10, 1–11. Available online at: http://pareonline.net/pdf/v10n3.pdf
3. Andrade H. Heritage M. (2017). Using Assessment to Enhance Learning, Achievement, and Academic Self-Regulation. New York, NY: Routledge.
4. Andrade H. L. (2010). Students as the definitive source of formative assessment: academic self-assessment and the self-regulation of learning, in Handbook of Formative Assessment, eds H. L. Andrade and G. J. Cizek (New York, NY: Routledge), 90–105.
5. Arter J. A. Chappuis J. (2006). Creating and Recognizing Quality Rubrics. Boston: Pearson.
6. Arter J. A. McTighe J. (2001). Scoring Rubrics in the Classroom: Using Performance Criteria for Assessing and Improving Student Performance. Thousand Oaks, CA: Corwin.
7. Ash S. L. Clayton P. H. Atkinson M. P. (2005). Integrating reflection and assessment to capture and improve student learning. Mich. J. Comm. Serv. Learn. 11, 49–60. Available online at: http://hdl.handle.net/2027/spo.3239521.0011.204
8. Avanzino S. (2010). Starting from scratch and getting somewhere: assessment of oral communication proficiency in general education across lower and upper division courses. Commun. Teach. 24, 91–110. doi: 10.1080/17404621003680898
9. Bauer C. F. Cole R. (2012). Validation of an assessment rubric via controlled modification of a classroom activity. J. Chem. Educ. 89, 1104–1108. doi: 10.1021/ed2003324
10. Bell A. Mladenovic R. Price M. (2013). Students' perceptions of the usefulness of marking guides, grade descriptors and annotated exemplars. Assess. Eval. High. Educ. 38, 769–788. doi: 10.1080/02602938.2012.714738
11. Bissell A. N. Lemons P. R. (2006). A new method for assessing critical thinking in the classroom. BioScience 56, 66–72. doi: 10.1641/0006-3568(2006)056[0066:ANMFAC]2.0.CO;2
12. Bowen T. (2017). Assessing visual literacy: a case study of developing a rubric for identifying and applying criteria to undergraduate student learning. Teach. High. Educ. 22, 705–719. doi: 10.1080/13562517.2017.1289507
13. Britton E. Simper N. Leger A. Stephenson J. (2017). Assessing teamwork in undergraduate education: a measurement tool to evaluate individual teamwork skills. Assess. Eval. High. Educ. 42, 378–397. doi: 10.1080/02602938.2015.1116497
14. Brookhart S. M. (2013). How to Create and Use Rubrics for Formative Assessment and Grading. Alexandria, VA: ASCD.
15. Brookhart S. M. Chen F. (2015). The quality and effectiveness of descriptive rubrics. Educ. Rev. 67, 343–368. doi: 10.1080/00131911.2014.929565
16. Brookhart S. M. Nitko A. J. (2019). Educational Assessment of Students, 8th Edn. Boston, MA: Pearson.
17. Chasteen S. V. Pepper R. E. Caballero M. D. Pollock S. J. Perkins K. K. (2012). Colorado Upper-Division Electrostatics diagnostic: a conceptual assessment for the junior level. Phys. Rev. Spec. Top. Phys. Educ. Res. 8:020108. doi: 10.1103/PhysRevSTPER.8.020108
18. Cho K. Schunn C. D. Wilson R. W. (2006). Validity and reliability of scaffolded peer assessment of writing from instructor and student perspectives. J. Educ. Psychol. 98, 891–901. doi: 10.1037/0022-0663.98.4.891
19. Ciorba C. R. Smith N. Y. (2009). Measurement of instrumental and vocal undergraduate performance juries using a multidimensional assessment rubric. J. Res. Music Educ. 57, 5–15. doi: 10.1177/0022429409333405
20. Davidowitz B. Rollnick M. Fakudze C. (2005). Development and application of a rubric for analysis of novice students' laboratory flow diagrams. Int. J. Sci. Educ. 27, 43–59. doi: 10.1080/0950069042000243754
21. Dawson P. (2017). Assessment rubrics: towards clearer and more replicable design, research and practice. Assess. Eval. High. Educ. 42, 347–360. doi: 10.1080/02602938.2015.1111294
22. DeWever B. Van Keer H. Schellens T. Valke M. (2011). Assessing collaboration in a wiki: the reliability of university students' peer assessment. Internet High. Educ. 14, 201–206. doi: 10.1016/j.iheduc.2011.07.003
23. Dinur A. Sherman H. (2009). Incorporating outcomes assessment and rubrics into case instruction. J. Behav. Appl. Manag. 10, 291–311.
24. Facione N. C. Facione P. A. (1996). Externalizing the critical thinking in knowledge development and clinical judgment. Nurs. Outlook 44, 129–136. doi: 10.1016/S0029-6554(06)80005-9
25. Falchikov N. Boud D. (1989). Student self-assessment in higher education: a meta-analysis. Rev. Educ. Res. 59, 395–430.
26. Fraser L. Harich K. Norby J. Brzovic K. Rizkallah T. Loewy D. (2005). Diagnostic and value-added assessment of business writing. Bus. Commun. Q. 68, 290–305. doi: 10.1177/1080569905279405
27. Garcia-Ros R. (2011). Analysis and validation of a rubric to assess oral presentation skills in university contexts. Electr. J. Res. Educ. Psychol. 9, 1043–1062.
28. Hancock A. B. Brundage S. B. (2010). Formative feedback, rubrics, and assessment of professional competency through a speech-language pathology graduate program. J. All. Health 39, 110–119.
29. Hattie J. Timperley H. (2007). The power of feedback. Rev. Educ. Res. 77, 81–112. doi: 10.3102/003465430298487
30. Howell R. J. (2011). Exploring the impact of grading rubrics on academic performance: findings from a quasi-experimental, pre-post evaluation. J. Excell. Coll. Teach. 22, 31–49.
31. Howell R. J. (2014). Grading rubrics: hoopla or help? Innov. Educ. Teach. Int. 51, 400–410. doi: 10.1080/14703297.2013.785252
32. Jonsson A. (2014). Rubrics as a way of providing transparency in assessment. Assess. Eval. High. Educ. 39, 840–852. doi: 10.1080/02602938.2013.875117
33. Jonsson A. Svingby G. (2007). The use of scoring rubrics: reliability, validity and educational consequences. Educ. Res. Rev. 2, 130–144. doi: 10.1016/j.edurev.2007.05.002
34. Kerby D. Romine J. (2010). Develop oral presentation skills through accounting curriculum design and course-embedded assessment. J. Educ. Bus. 85, 172–179. doi: 10.1080/08832320903252389
35. Knight L. A. (2006). Using rubrics to assess information literacy. Ref. Serv. Rev. 34, 43–55. doi: 10.1108/00907320610640752
36. Kocakülah M. (2010). Development and application of a rubric for evaluating students' performance on Newton's Laws of Motion. J. Sci. Educ. Technol. 19, 146–164. doi: 10.1007/s10956-009-9188-9
37. Latifa A. Rahman A. Hamra A. Jabu B. Nur R. (2015). Developing a practical rating rubric of speaking test for university students of English in Parepare, Indonesia. Engl. Lang. Teach. 8, 166–177. doi: 10.5539/elt.v8n6p166
38. Lewis L. K. Stiller K. Hardy F. (2008). A clinical assessment tool used for physiotherapy students—is it reliable? Physiother. Theory Pract. 24, 121–134. doi: 10.1080/09593980701508894
39. McCormick M. J. Dooley K. E. Lindner J. R. Cummins R. L. (2007). Perceived growth versus actual growth in executive leadership competencies: an application of the stair-step behaviorally anchored evaluation approach. J. Agric. Educ. 48, 23–35. doi: 10.5032/jae.2007.02023
40. Menéndez-Varela J. Gregori-Giralt E. (2016). The contribution of rubrics to the validity of performance assessment: a study of the conservation-restoration and design undergraduate degrees. Assess. Eval. High. Educ. 41, 228–244. doi: 10.1080/02602938.2014.998169
41. Moni R. W. Beswick E. Moni K. B. (2005). Using student feedback to construct an assessment rubric for a concept map in physiology. Adv. Physiol. Educ. 29, 197–203. doi: 10.1152/advan.00066.2004
42. Newman L. R. Lown B. A. Jones R. N. Johansson A. Schwartzstein R. M. (2009). Developing a peer assessment of lecturing instrument: lessons learned. Acad. Med. 84, 1104–1110. doi: 10.1097/ACM.0b013e3181ad18f9
43. Nicholson P. Gillis S. Dunning A. M. (2009). The use of scoring rubrics to determine clinical performance in the operating suite. Nurse Educ. Today 29, 73–82. doi: 10.1016/j.nedt.2008.06.011
44. Nordrum L. Evans K. Gustafsson M. (2013). Comparing student learning experiences of in-text commentary and rubric-articulated feedback: strategies for formative assessment. Assess. Eval. High. Educ. 38, 919–940. doi: 10.1080/02602938.2012.758229
45. Pagano N. Bernhardt S. A. Reynolds D. Williams M. McCurrie M. (2008). An inter-institutional model for college writing assessment. Coll. Composition Commun. 60, 285–320.
46. Panadero E. Jonsson A. (2013). The use of scoring rubrics for formative assessment purposes revisited: a review. Educ. Res. Rev. 9, 129–144. doi: 10.1016/j.edurev.2013.01.002
47. Petkov D. Petkova O. (2006). Development of scoring rubrics for IS projects as an assessment tool. Issues Informing Sci. Inform. Technol. 3, 499–510. doi: 10.28945/910
48. Prins F. J. de Kleijn R. van Tartwijk J. (2016). Students' use of a rubric for research theses. Assess. Eval. High. Educ. 42, 128–150. doi: 10.1080/02602938.2015.1085954
49. Reddy M. Y. (2011). Design and development of rubrics to improve assessment outcomes: a pilot study in a master's level Business program in India. Qual. Assur. Educ. 19, 84–104. doi: 10.1108/09684881111107771
50. Reddy Y. Andrade H. (2010). A review of rubric use in higher education. Assess. Eval. High. Educ. 35, 435–448. doi: 10.1080/02602930902862859
51. Reynolds-Keefer L. (2010). Rubric-referenced assessment in teacher preparation: an opportunity to learn by using. Pract. Assess. Res. Eval. 15, 1–9. Available online at: http://pareonline.net/getvn.asp?v=15&n=8
52. Rezaei A. Lovorn M. (2010). Reliability and validity of rubrics for assessment through writing. Assess. Writing 15, 18–39. doi: 10.1016/j.asw.2010.01.003
53. Ritchie S. M. (2016). Self-assessment of video-recorded presentations: does it improve skills? Act. Learn. High. Educ. 17, 207–221. doi: 10.1177/1469787416654807
54. Rochford L. Borchert P. S. (2011). Assessing higher level learning: developing rubrics for case analysis. J. Educ. Bus. 86, 258–265. doi: 10.1080/08832323.2010.512319
55. Sadler D. R. (2014). The futility of attempting to codify academic achievement standards. High. Educ. 67, 273–288. doi: 10.1007/s10734-013-9649-1
56. Schamber J. F. Mahoney S. L. (2006). Assessing and improving the quality of group critical thinking exhibited in the final projects of collaborative learning groups. J. Gen. Educ. 55, 103–137. doi: 10.1353/jge.2006.0025
57. Schreiber L. M. Paul G. D. Shibley L. R. (2012). The development and test of the public speaking competence rubric. Commun. Educ. 61, 205–233. doi: 10.1080/03634523.2012.670709
58. Stellmack M. A. Konheim-Kalkstein Y. L. Manor J. E. Massey A. R. Schmitz J. P. (2009). An assessment of reliability and validity of a rubric for grading APA-style introductions. Teach. Psychol. 36, 102–107. doi: 10.1080/00986280902739776
59. Timmerman B. E. C. Strickland D. C. Johnson R. L. Payne J. R. (2011). Development of a 'universal' rubric for assessing undergraduates' scientific reasoning skills using scientific writing. Assess. Eval. High. Educ. 36, 509–547. doi: 10.1080/02602930903540991
60. Torrance H. (2007). Assessment as learning? How the use of explicit learning objectives, assessment criteria and feedback in post-secondary education and training can come to dominate learning. Assess. Educ. 14, 281–294. doi: 10.1080/09695940701591867
61. Urios M. I. Rangel E. R. Tomàs R. B. Salvador J. T. García F. C. Piquer C. F. (2015). Generic skills development and learning/assessment process: use of rubrics and student validation. J. Technol. Sci. Educ. 5, 107–121. doi: 10.3926/jotse.147
62. Vandenberg A. Stollak M. McKeag L. Obermann D. (2010). GPS in the classroom: using rubrics to increase student achievement. Res. High. Educ. J. 9, 1–10. Available online at: http://www.aabri.com/manuscripts/10522.pdf
63. Wald H. S. Borkan J. M. Taylor J. S. Anthony D. Reis S. P. (2012). Fostering and evaluating reflective capacity in medical education: developing the REFLECT rubric for assessing reflective writing. Acad. Med. 87, 41–50. doi: 10.1097/ACM.0b013e31823b55fa
64. Wallace C. S. Prather E. E. Duncan D. K. (2011). A study of general education Astronomy students' understandings of cosmology. Part II. Evaluating four conceptual cosmology surveys: a classical test theory approach. Astron. Educ. Rev. 10:010107. doi: 10.3847/AER2011030
65. Young C. (2013). Initiating self-assessment strategies in novice physiotherapy students: a method case study. Assess. Eval. High. Educ. 38, 998–1011. doi: 10.1080/02602938.2013.771255

Summary

Keywords

criteria, rubrics, performance level descriptions, higher education, assessment expectations

Citation

Brookhart SM (2018) Appropriate Criteria: Key to Effective Rubrics. Front. Educ. 3:22. doi: 10.3389/feduc.2018.00022

Received

01 February 2018

Accepted

27 March 2018

Published

10 April 2018

Volume

3 - 2018

Edited by

Anders Jönsson, Kristianstad University College, Sweden

Reviewed by

Eva Marie Ingeborg Hartell, Royal Institute of Technology, Sweden; Robbert Smit, University of Teacher Education St. Gallen, Switzerland


Copyright

*Correspondence: Susan M. Brookhart

This article was submitted to Assessment, Testing and Applied Measurement, a section of the journal Frontiers in Education

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
