Edited by: Anders Jönsson, Kristianstad University, Sweden
Reviewed by: Susan M. Brookhart, Duquesne University, United States; Phillip Dawson, Deakin University, Australia
This article was submitted to Assessment, Testing and Applied Measurement, a section of the journal Frontiers in Education
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Analytic and holistic marking are typically researched as opposites, generating a mixed and inconclusive evidence base. Holistic marking is low on content validity but efficient. Analytic approaches are praised for transparency and detailed feedback. Capturing complex criteria interactions when deciding marks is claimed to be better suited to holistic approaches, whilst analytic rules are thought to be limited. Both guidance and evidence in this area remain limited to date. Drawing on the known complementary strengths of these approaches, a university department enhanced its customary holistic marking practices by introducing analytic rubrics for feedback and as an ancillary aid during marking. The customary holistic approach to deciding marks was retained in the absence of a clear rationale from the literature. Exploring the relationship between the analytic criteria and holistic marks became the focus of an exploratory study, conducted during a trial year, that used two perspectives. Firstly, following guidance from the literature, practitioners formulated analytic rules, drawing on their understanding of the role of criteria, to explain output marks by allocating weightings. Secondly, data gathered throughout the year, consisting of holistic marks and analytic judgements on criteria, were analyzed using machine learning techniques (random forests). This study reports on data from essay-based questions (exams) for years 2 and 3 of study (
Most assessment methods used in higher education elicit student-constructed responses to an open question (e.g., essays, reports, projects, presentations). These are divergent assessments in that a broad range of individual student responses can meet the desired criteria and outcomes. The implementation of divergent assessments presents challenges in marking and in ensuring that students understand expectations (Brown et al.,
One of the most salient challenges is marking, attracting much public attention and debate. Concerns range from consistency across markers within and across institutions (Bloxham et al.,
In particular the UK university sector has been urged to enhance transparency (Woolf,
Reviews on rubric research converge on key benefits of rubric use. Firstly, rubrics are powerful allies for instruction and learning (Andrade,
Literature reviews indicate inconsistent uses of the labels holistic and analytic and they are better conceptualized as a continuum (Hunter et al.,
Choosing between features of holistic and analytic marking presents challenges in practice. Holistic marking consists of forming overall judgements on student work where criteria are considered simultaneously. Links to standards may be achieved by virtue of reference to written descriptors (rubrics) and exemplars (Sadler,
Use of holistic and analytic approaches, in particular when marks are required, polarizes opinion both in the literature and in practice. Dominant discourses and research perspectives have emphasized their seemingly opposite natures. Below, inconsistencies between the evidence and some arguments used in the advocacy of holistic approaches are explored to highlight the less explored combined potential of these approaches. It will be argued that new perspectives on the matter are required in order to advance existing understanding.
The existing body of work has mainly taken comparative approaches, so far generating an inconclusive evidence base and no clear rationale for the use of either approach (Jönsson and Svingby,
Summary of key advantages and disadvantages of different types of rubrics.
Type of rubric | Definition | Advantages | Disadvantages |
Analytic | Each criterion (dimension, trait) is evaluated separately | Gives diagnostic information to teacher | More time to score than holistic rubrics |
Holistic | All criteria (dimensions, traits) are evaluated simultaneously | Scoring is faster than with analytic rubrics | Single overall score does not communicate information about what to do to improve |
In essence, holistic marking offers greater reliability (cross-marker agreement) (Jones and Alcock,
The most recent review on rubric related research (Brookhart,
The remainder of our review considers claims made in favor of either approach that remain unsubstantiated, in particular, in relation to content and structural validity. Advocacy in favor of holistic scoring and its underlying
Content validity (Messick,
The second main concern raised in the literature on analytic approaches relates to structural validity, which considers whether outcomes (marks) can be interpreted in terms of the more important criteria and learning outcomes (Messick,
It is in this particular area that, beyond these claims and very generic guidance to practitioners, there is little understanding, for example, of the rules that underlie holistic marking and the complexity of criteria interactions. Equally, understanding how these compare with analytic, expert-derived weightings (or rules) would advance our understanding of the limitations and advantages of both approaches. A very scarce evidence base, with studies applying mathematical modeling (Principal Component Analysis, PCA) to tutors' holistic rankings of theses and their stated reasons, suggests the use of idiosyncratic weightings applied to criteria (Björklund et al.,
Moreover, structural validity also considers communication and maintaining standards, that is, staff and students should also share understanding of expectations of quality and criteria (Messick,
Lastly, the complementary strengths and natures of analytic and holistic approaches are less well-understood and have been suggested as a productive avenue of research (Hunter et al.,
In sum, both analytic and holistic approaches to marking offer strengths. The review highlights areas where the role of analytic and holistic approaches, in particular in relation to structural validity, rests on advocacy and where research is needed. We draw particular attention to the largely unexplored nature of the rules underlying both holistic and analytic approaches to marking, which needs to be better understood. Much advocacy for holistic approaches rests on the shortcomings of analytic approaches rather than on evidence of the strengths of holistic ones. Treating holistic and analytic marking as opposites, in research and in the literature, leaves many questions unresolved in practice. Emphasizing their combined strengths has been less well-explored and is suggested as a way forward. A practice perspective on these literature discussions is provided below.
A university department in the United Kingdom (UK) allows the exploration of highlighted areas where research is limited and ways in which these cause tensions in practice. The literature review above provided a basis for the formulation of a project to gradually transform existing customary holistic marking practices. Rubric design and uses (marking, feedback) would be the starting point. Developing students' evaluative competence (Sadler,
The transition toward greater transparency considered the inclusion of elements from analytic approaches, on the basis of the review above, but holistic elements were retained in the absence of clear rationales. The resulting model of practice combines elements from both analytic and holistic approaches where a clear rationale existed.
A stepped approach to the cultural transformation of marking and rubric use was adopted, considering both the literature and the nature of changing marking practice.
Approaches to marking and feedback: traditional and trial year.
Aspect | Traditional approach | Trial year |
Rubric design (display) | Holistic statements of quality (overall quality for levels) | Analytic criteria with descriptor levels (matrix type display) |
Rubric use during marking | Unspecified marker use of the holistic rubrics during marking | Analytic judgements on criteria and levels would be indicated using the rubrics during marking |
Marks derivation | Holistic: all criteria considered simultaneously by marker to decide a percentage mark | Holistic (unchanged from the traditional approach) |
Rubric use for feedback (during marking) | Holistic rubric was not used in the feedback | The analytic criteria and judgements of levels indicated explicitly during marking would be included as part of the feedback provided to students |
Rubrics are recognized to have a positive impact in instruction and learning. The review has drawn attention to questions on holistic and analytic approaches to marking that, both in the literature and practice, polarize opinion with little conclusive guidance or rationale as is repeatedly echoed in reviews of the field (Jönsson and Prins,
The case provides an opportunity to examine claims in the literature favoring holistic approaches for their natural capacity to capture complex judgements in contrast with analytic approaches (e.g., too linear, compensatory and relying on expert understanding). Advancing understanding in the literature and practice on this contentious aspect is the focus of our study. We aim to explore how holistic marking and its outputs align with assumptions made about the uses of criteria and their relative importance. The objective is to inform decisions in practice on this matter whilst providing exploratory insights to advance our understanding of how output marks relate to significant criteria and the capacity of analytic and holistic approaches to represent that. The research questions are:
What are the expectations by practitioners of the importance and contribution of different criteria in marking?
What is the contribution of criteria associated with holistic marks?
How do practitioner-expected and data-derived weightings in holistic marking relate?
The study is part of a university department-wide (science subject) enhancement initiative to modify its marking practices in relation to rubric design and uses. The overall aim was to strengthen the congruence between rubrics, marking, and feedback with the intended instructional learning outcomes of the undergraduate programs of study. Understanding the importance of criteria in relation to holistic marks was the focus of investigation during a trial year.
The overall framework chosen is a case study (Yin,
Holistic marks were explored in context, and in relation to analytic criteria, in a multifaceted manner. The design, marking and rubric review stages of the transformation project and trial year offered opportunities to record staff expectations of the role of criteria, and to gather marking data with which to explore how overall (holistic) marks related to analytic criteria.
The department consists of ~700 undergraduate students and 55 academic staff. A department team of three experienced lecturers from the same science discipline in the department and an assessment adviser was set up to work collaboratively on the design of analytic rubrics and lead the implementation of changes in the trial year. The assessment adviser works across the university with all faculty on projects to enhance practice. University-wide objectives inform the work but departments lead on the implementation (e.g., pace of change, scale) given the known sensitivities of cultural transformation. In the initial stages a professor, formerly head of the department, also joined to steer and approve the direction of the work. The steps followed are informed by guidance literature (Boston,
Five experienced markers from the same department were invited to individual interviews with the assessment adviser on how they made holistic decisions during marking. The markers were invited to bring a sample of student written coursework covering a range of qualities (poor, good, and excellent) in their view. This would enable contrasts and elicitation of criteria providing a basis for their articulation of aspects that are deemed more important, of a higher level, in the eyes of markers when deciding marks, which is the custom. Semi-structured interviews were planned to elicit key criteria considered in different levels of quality:
From the best example of work, through to the lower quality, markers were prompted to explain the salient features that distinguished them: “
Prompts for further probing were used: “
This initial exercise was followed with a thematic analysis in relation to key learning outcomes. The coursework was laboratory reports which have overlapping learning outcomes with both essays and exams. This initial exercise provided a basis to describe criteria and level descriptors for all assessments which were then amended accordingly to fit different assessment methods. The pre-existing holistic marking guides were also reviewed in the process and integrated as part of the new analytic, more descriptive, rubrics. All key assessment types in the department were covered: laboratory reports, essays (coursework); essay-based exams, and projects. The study reports on essay-based questions in exams given the in-depth case study and modeling. Future publications will report on results for other assessment types. An extract of the rubric for exam marking of essay questions is shown in
Extract of the rubric designed for marking—example of criteria related to the learning outcome of critical thinking in essay questions in exams.
Learning outcome | Criterion | Excellent | Good | Satisfactory | Poor | Fail |
Critical thinking | Development of argument | The writing is structured in a logical order such that the reader can identify a very clear line of argument throughout. Evidence of independent thinking is demonstrated through the development of an original argument | The reader is able to identify an apparent line of argument, which may be better structured in some areas than others. | An attempt has been made to develop a line of argument, but this may be unconvincing or lacking in clarity in some areas | It is difficult for the reader to identify a distinct line of argument. Relationships between statements may be hard to recognize. | There is minimal or no apparent development of an argument |
Critical thinking | Critical reflection on theory and the work of others | A very clear and consistent critique of research and theory is presented throughout. The writing shows an excellent depth of understanding of how past research links together. | A clear critique of theories and literature is presented. The writing shows a good understanding of how past research links together. | Critical judgement of the value of research and theory is presented in areas, but this may be limited. The writing shows a basic understanding of how past research links together. | Critical judgement of the value of research and theory is very limited or absent. The writing shows an insufficient grasp of how past research links together. | There is very little or no evidence of reflection on past research or theory. |
The department design team concluded that since the assessment types were consistent across the program, the same criteria (and associated learning outcomes) were relevant across all years. This is also recommended in the literature to enhance a clear message to the students in their progression through the years of study (Brookhart,
Allocation of weightings to analytic rubrics is best done intuitively drawing from expert insights of practitioners into the learning outcomes and progression (Dawson,
Whilst the rubric published during the trial year would only be used for feedback during marking, the department team also recorded the weightings they would assign to each criterion, reflecting what they considered the important aspects of performance in line with instructional learning outcomes. The department leads on this project met and discussed how they thought criteria should be weighted for both the year 2 and the year 3 analytic rubrics. They were asked to consider their own experience, their understanding of department practice and the expectations of the undergraduate programs of study. Marker feedback from the interviews provided additional insights from five colleagues. Starting with expert views is the first step in building a validity argument (Messick,
The use of a small leading team of three at this stage was deliberate, as it is congruent with commonly accepted practice in our context. Typically, module leads make design decisions such as the ones described. Consultation with other markers may happen, but that is up to module leads. The whole body of practitioners was not surveyed on their expectations, as the team and the interviews offered sufficient insights into assumptions made about the relative importance of criteria in the context of the curriculum objectives. As indicated, introducing too many changes at this point could have been counterproductive.
Lastly, at this point, these weightings were not publicized since they were not relevant to the changes introduced to practice during the trial year. These were relevant to contrast, for the purpose of our investigation in the trial year, how practitioners understood relevance of criteria. Post-marking moderation of assessments in the department had been the main mechanism to check on consistency in marking. In the context of customary holistic marking being the accepted practice, it was counterintuitive to introduce discussions about weightings of criteria at this early stage in introducing changes to the marking culture and practice. Discussions in the wider marking team were planned after the trial year to consider expected relevance of criteria and the analysis of marking data.
At the start of the academic year, all new analytic rubrics were published to staff and students. Activities to engage students in understanding rubrics, quality and expectations from assessment were also implemented (see Boud et al.,
For the duration of an entire academic year, all rubrics were rolled out across all assessments in the department. Markers were instructed to:
Mark in the traditional holistic manner as they had always done.
Use the analytic rubrics to indicate performance levels against criteria during marking.
In a meeting, staff were informed of the rubrics that were to be used across the department in undergraduate assessment. All staff were invited to attend a talk in which a member of the department team outlined the evidence behind rubrics discussed in the literature review and the reasons why analytic rubrics for feedback were being introduced. The specifics of the construction of rubrics were shared (e.g., consultation, interviews with markers), and the criteria and definitions outlined. This discussion served to air any concerns with the provided rubrics. It was important to highlight the evidence-based approach taken to the decision to use rubrics and to the creation of the rubrics themselves, as this helped to increase staff confidence and trust in the rubric. Staff were shown how the rubric would be used during the marking of coursework (on Turnitin). The plan for implementation was outlined, along with tips for marking. These tips encouraged focusing on quality descriptors, avoiding comparisons between students and making quicker judgements on criteria (Brown,
Decisions on marks during marking would remain holistic (i.e., made by considering all criteria simultaneously), as was the custom. The assessment types used were the same and decisions on marks (holistic) remained unchanged; the criteria themselves had not changed but were simply made more explicit. It was therefore assumed that decision making, now supported by a more detailed analytic rubric, would enhance validity by possibly reducing the use of irrelevant criteria, which is a known risk in the previous approach to marking. Markers were marking to the same standards that were already established in the department and had been maintained through post-marking moderation. Moderation in the department provides a check on all assessments, with the module convenors controlling for consistency in marking across marking teams. Whilst greater enhancement of shared standards is on the university's agenda (e.g., introducing pre-marking marker training exercises), in the context of the changes this would be considered in the future once the new rubrics were embedded.
Initially, the introduction of analytic rubrics, as part of the marking process, aimed at providing more detailed and qualitative feedback in line with relevant criteria for the tasks therefore enhancing transparency of marking. Colleagues were informed that marks and analytic rubric judgements, during the academic year, would be analyzed with the aim to gain insights into the relationship between holistic marks and assessment criteria. It was agreed that transitioning from holistic approaches to decision making (marks) to analytic, at such a scale, would require greater clarity about why analytic rules may be needed and the nature of those rules based not only on experts' understanding of criteria but actually by exploring how the customary holistic marking operated. It is noteworthy that these decisions in the implementation of the trial year were in response to sensitivities and the existing perceptions that the traditional holistic marking and moderation processes had established the standards. Introducing analytic rubrics was already a major shift for the department and, in our experience, introducing multiple new ideas simultaneously could be counterproductive.
Rubrics were implemented across the department using diverse modes (Excel, online marking tools) depending on assessment type. For example, in the case of exam marking, markers were given Excel spreadsheets where they would record judgements against set criteria for essay questions in exams during the marking process.
The modified marking procedures served as the data collection mechanism. Marking data from the entire academic trial year generated, for each piece of work (e.g., essay):
Markers' overall judgements (holistic), summative and expressed in the customary percentage scale. Holistic marking in the department uses peg marking with percentage marks ending in 2, 5, and 8. For example, a marker could not give a mark of 63; they would have to choose between 62 and 65. This is typically recommended in the literature (Suskie,
Markers in the process of marking recorded their judgements on individual criteria and levels using the analytic rubrics alongside the holistic mark. The completed rubrics were provided for each piece of work. These resulted in a record of associated levels of performance (Fail, Poor, Satisfactory, Good, and Excellent) against each criterion in the rubric (see
As a result, the data set consisted of marks with associated judgements of criteria and different levels of performance that had all been formulated during marking. Whilst an extensive data set was gathered for all assessments across the department, this case study focuses on essay-based exam questions in years 2 and 3, during the same academic year. Year 1 exams do not include essay-based questions and were therefore not relevant.
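To make the shape of one observation concrete, the following is a minimal sketch of a marking record as described above: a holistic peg mark plus a level judged against each criterion. The class and function names are hypothetical, and the assumption that peg marks span the whole 0 to 100 range is ours; the source only states that marks end in 2, 5, or 8.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Performance levels named in the rubric, lowest to highest.
LEVELS = ("Fail", "Poor", "Satisfactory", "Good", "Excellent")

# Peg marks: percentage marks ending in 2, 5 or 8.
# Assumption: pegs cover the full 0-100 scale.
PEG_MARKS = [m for m in range(0, 101) if m % 10 in (2, 5, 8)]


def is_peg_mark(mark: int) -> bool:
    """True if the mark is one of the allowed peg points."""
    return mark in PEG_MARKS


def neighbouring_pegs(mark: int) -> Tuple[int, int]:
    """Nearest peg at or below, and at or above, a given mark."""
    lower = max((p for p in PEG_MARKS if p <= mark), default=PEG_MARKS[0])
    upper = min((p for p in PEG_MARKS if p >= mark), default=PEG_MARKS[-1])
    return lower, upper


@dataclass
class MarkingRecord:
    """One marked essay question (hypothetical structure)."""
    holistic_mark: int        # peg percentage mark decided holistically
    levels: Dict[str, str]    # criterion name -> judged level

    def __post_init__(self) -> None:
        if not is_peg_mark(self.holistic_mark):
            raise ValueError(f"{self.holistic_mark} is not a peg mark")
        unknown = [lvl for lvl in self.levels.values() if lvl not in LEVELS]
        if unknown:
            raise ValueError(f"unknown levels: {unknown}")


record = MarkingRecord(
    holistic_mark=62,
    levels={
        "Critical reflection": "Good",
        "Development of the argument": "Satisfactory",
    },
)
print(neighbouring_pegs(63))  # prints (62, 65)
```

The validation mirrors the example in the text: 63 is rejected, and the marker's choice is between its two neighbouring pegs.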
A description of the data collected during marking and used in this analysis is below (
Essay based questions, exams, students and markers.
 | Year 2 | Year 3 |
Total essay questions (exams) completed by students | 1,305 | 2,131 |
Students | 294 | 264 |
Assessors | 14 | 29 |
Mean number essay based questions from exams answered by a student | 5 (in 5 exams) | 12 (in 6 exams) |
Mean number of markers per student | 4 | 8 |
Mode (markers per student) | 5 | 10 |
Taking into account that the criteria were common according to assessment type, each year's sample was considered for analysis. All marks in each year have been treated as one large data set (i.e., not as nested variables), despite there being multiple observations per student and per marker, and across different modules. The total sample contained marks and judgements that belonged to multiple markers and students, each appearing repeatedly. As marking was anonymous, each marker treated each piece of work as an individual case and made their judgement accordingly; our goal was to model the markers' holistic judgement as they made it. As multiple different markers graded different exams by one student, marker biases would be distributed across different students and different essays, preventing marker bias being modeled as an overall effect. Additionally, the basis of holistic judgement is that it incorporates an individual's opinion, and we want to try to understand that judgement, not control for the subjectivity. Moreover, individual student characteristics were not relevant to our model. The data analysis section fully explains how the sampling method of our non-linear approach to analysis would distribute marker effects to prevent these being modeled.
General guidance on construction of analytic rules advises that experts allocate weightings or other rules (e.g., threshold criteria) based on their understanding and experience (Suskie,
The aim of the study is to provide initial insights into the significance of criteria when formulating holistic judgements (Sadler,
Metadata (e.g., course, marker and student variables) were not input into the model, as studying marker- and student-level effects was not within the scope of this study. In addition, had they been modeled, the random forest would have treated them as predictors and they would have interfered with the interpretation of the variable importance, destabilizing the insights into the different criteria.
The reliability of marks was assumed to be established via the existing departmental procedures for post-marking moderation. Usually, a proportion of the work graded by each marker, decided by module convenors, is reviewed per module. Our study is primarily concerned with understanding the relationship between criteria and marks. The departmental checks for inter-marker agreement, whilst limited, were deemed sufficient for the purpose of our analysis. The marks were treated as true marks; future studies may address this by collecting multiple judgements on each piece of work to arrive at a true mark. This is a limitation of conducting a study in a real setting.
The analytic information gathered per criterion during marking with the analytic rubric was transformed to the numerical values associated with each level of quality (Excellent through to Fail) as discussed with the design team (see results). We obtained weights for each criterion by fitting a prediction model to the data and extracting the variable importance scores (i.e., how useful each criterion was in helping the model to make the prediction; see below for more detail). The model predicted the overall holistic mark using the numerical values applied to each criterion (e.g., Excellent = 85). The prediction model we used was a random forest algorithm (Breiman et al.,
A random forest is an algorithm that produces many decision trees using different samples of the data. Decision trees make predictions by splitting the data with binary decisions, each chosen to give the largest reduction in variance within the data or subsample (Breiman et al.,
Random forests allow us to quantify how useful each of the different criteria are at making predictions using an inherent process called variable importance (Grömping,
Because of the nature of random forests, the prediction error can vary slightly each time the algorithm is run, and each tree will take a slightly different path to its prediction, depending on the data it is modeling and the variables available for it to select. To combat this potential instability of the model, we used 300 trees in the random forest. Three hundred trees were selected as this is the point at which error does not decrease any further when more trees are added to the forest (see
Mean square error (MSE) plotted against the number of trees in the random forest.
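The "different samples of the data" given to each tree can be illustrated with a short sketch. It assumes the usual bootstrap scheme (sampling rows with replacement), which is the default in Breiman-style random forests; the source does not state the sampling scheme actually used. The function name is hypothetical, and the row and tree counts are taken from the text only to give a sense of scale (1,305 essay questions, 300 trees).

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw one bootstrap sample of row indices (with replacement).

    Returns the in-bag indices used to grow a tree and the
    out-of-bag indices that the tree never sees.
    """
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_rows, n_trees = 1305, 300     # sizes borrowed from the text for scale

oob_shares = []
for _ in range(n_trees):
    in_bag, out_of_bag = bootstrap_indices(n_rows, rng)
    oob_shares.append(len(out_of_bag) / n_rows)

# On average a bit over a third of rows are left out of each tree.
mean_oob_share = sum(oob_shares) / n_trees
print(round(mean_oob_share, 2))
```

Because every tree is grown on a different mix of rows, each tree also sees a different mix of markers and students, and the left-out rows give an error estimate that can be tracked as trees are added.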
Whilst marking data were gathered as part of marking, the proposed extra analyses were conducted only with anonymized data. An ethics committee considered the anonymization procedures before the analyses were conducted. No ethical threats were posed by the proposal and therefore procedures were compliant with ethical conduct.
This section provides the results of our exploration of the contribution of criteria according, first, to the interpretation of practitioners (experts). Secondly, the results of the machine learning analyses of marking data gathered throughout the year on exam essay-based questions provide a data-derived insight into how individual criteria relate to holistic marks.
The department design team assigned weightings according to their views of what should attract marks for different criteria, at different stages, in the study program (see
Expected criterion weightings (essay-based questions in exams).
Learning outcome | Criterion | Year 2 (%) | Year 3 (%) |
Critical thinking | Critical reflection | 25 | 30 |
Critical thinking | Development of the argument | 25 | 30 |
Knowledge and understanding | Descriptions and explanations of concepts | 25 | 20 |
Knowledge and understanding | Relevance and range of the literature | 20 | 15 |
Writing skills | In text citation | 5 | 5 |
Writing skills | Structure of sentences/paragraphs | 0 | 0 |
Writing skills | Use of scientific language | 0 | 0 |
The expectations of the department design team were that markers would alter the way in which they valued criteria across different years. In earlier years, it was perceived that markers are looking out for knowledge and the ability to articulate that knowledge. Later in the degree (year 3), it is expected that students have that skill and so markers then put more value on other criteria, such as critical reflection. These thoughts were guided by the structure of the observed learning outcome (SOLO) taxonomy (Biggs and Collis,
In addition, the design team also provided numerical values for the defined levels of quality for each criterion which appeared as: Fail, Poor, Satisfactory, Good, and Excellent. The department design team used grade boundaries from degree classifications as guidance. These were: Excellent = 85, Good = 65, Satisfactory = 55, Poor = 45, and Fail = 35.
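As an illustration of what a purely analytic, linear rule would look like with these numbers, the sketch below combines the level values with the expected weightings. It is not how marks were derived in the department, where decisions remained holistic. It assumes the first column of the expected weightings table is year 2, and the function name is hypothetical.

```python
# Numerical values the design team attached to each level.
LEVEL_VALUES = {
    "Excellent": 85,
    "Good": 65,
    "Satisfactory": 55,
    "Poor": 45,
    "Fail": 35,
}

# Expected weightings (percent); assumed to be the year 2 column.
EXPECTED_WEIGHTS_Y2 = {
    "Critical reflection": 25,
    "Development of the argument": 25,
    "Descriptions and explanations of concepts": 25,
    "Relevance and range of the literature": 20,
    "In text citation": 5,
    "Structure of sentences/paragraphs": 0,
    "Use of scientific language": 0,
}


def weighted_mark(levels, weights):
    """Weighted average of level values: a linear, compensatory rule."""
    total_weight = sum(weights.values())
    weighted_sum = sum(
        weight * LEVEL_VALUES[levels[criterion]]
        for criterion, weight in weights.items()
    )
    return weighted_sum / total_weight


# Hypothetical set of judgements for one essay question.
judgements = {
    "Critical reflection": "Excellent",
    "Development of the argument": "Good",
    "Descriptions and explanations of concepts": "Good",
    "Relevance and range of the literature": "Satisfactory",
    "In text citation": "Poor",
    "Structure of sentences/paragraphs": "Good",
    "Use of scientific language": "Good",
}
print(weighted_mark(judgements, EXPECTED_WEIGHTS_Y2))  # prints 67.0
```

A rule of this kind is compensatory: a strong level on one criterion offsets a weak level on another in proportion to the weights, which is the property that the literature reviewed earlier contrasts with holistic judgement.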
The random forest analysis generated insights into the variable importance of different criteria in relation to holistic marks. In our analysis we have converted the variable importance of individual criteria to weightings associated with those criteria. This has been done bearing in mind our practitioner context and the need to present the results in a form that can be compared with the practitioner-recommended weightings. The main result, however, is the resulting ranking of the criteria, rather than the exact weightings. The resulting weightings associated with each criterion are below (
Data-derived criterion weightings (essay-based questions in exams).
Learning outcome | Criterion | Year 2 (%) | Year 3 (%) |
Critical thinking | Critical reflection | 13 | 15 |
Critical thinking | Development of the argument | 19 | 16 |
Knowledge and understanding | Descriptions and explanations of concepts | 24 | 25 |
Knowledge and understanding | Relevance and range of the literature | 27 | 27 |
Writing skills | In text citation | 8 | 8 |
Writing skills | Structure of sentences/paragraphs | 3 | 3 |
Writing skills | Use of scientific language | 6 | 6 |
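The comparison between expected and data-derived weightings can be sketched in a few lines. The conversion from variable importance to percentages is not detailed in the text; proportional rescaling, shown in `to_percent_weights`, is one simple assumption. The two dictionaries copy the first value column of each weightings table, assumed here to be year 2, and all function names are hypothetical.

```python
def to_percent_weights(importance):
    """Rescale raw importance scores so that they sum to 100."""
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    return {name: 100 * value / total for name, value in importance.items()}


def ranking(weights):
    """Criteria ordered from most to least heavily weighted."""
    return sorted(weights, key=weights.get, reverse=True)


# First value column of each table (assumed to be year 2).
EXPECTED_Y2 = {
    "Critical reflection": 25,
    "Development of the argument": 25,
    "Descriptions and explanations of concepts": 25,
    "Relevance and range of the literature": 20,
    "In text citation": 5,
    "Structure of sentences/paragraphs": 0,
    "Use of scientific language": 0,
}
DERIVED_Y2 = {
    "Critical reflection": 13,
    "Development of the argument": 19,
    "Descriptions and explanations of concepts": 24,
    "Relevance and range of the literature": 27,
    "In text citation": 8,
    "Structure of sentences/paragraphs": 3,
    "Use of scientific language": 6,
}

# Positive values: weighted more heavily in the data than expected.
differences = {c: DERIVED_Y2[c] - EXPECTED_Y2[c] for c in EXPECTED_Y2}

for criterion in ranking(DERIVED_Y2):
    print(f"{criterion}: {differences[criterion]:+d}")
```

Since the text treats the ranking as the main result, comparing the order returned by `ranking` for the two sets of weightings is arguably more informative than the exact differences.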
Decisions about students' overall marks were influenced by criteria related to knowledge and understanding (i.e., descriptions and explanations; relevance and range of literature); those criteria were more highly weighted by markers (
Because converting importance into percentage weightings reveals little about the complex interactions among criteria, we also fitted a single decision tree to the year 2 and year 3 data to better understand and break down the holistic judgments. For year 2, the tree that gave the most accurate prediction (measured using cross-validation error) had 60 splits. However, after 15 splits, error did not decrease dramatically (
Relative cross-validated error plotted against tree size (number of splits, along the top). Cp along the bottom refers to the complexity parameter used in the R package rpart to control the size of the tree (Year 2).
A single decision tree was also fitted to the year 3 data to better understand and break down the average holistic judgment being made. The most accurate tree (measured using cross-validation error) had 66 splits. However, after around 15 splits, error did not decrease (
Relative cross-validated error plotted against tree size (number of splits, along the top). Cp along the bottom refers to the complexity parameter used in the R package rpart to control the size of the tree (Year 3).
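The reasoning behind reading these plots (the most accurate tree is large, but error stops improving much after about 15 splits) can be sketched as choosing the smallest tree whose cross-validated error lies within a tolerance of the minimum. The table values, the tolerance, and the function name below are hypothetical; they are not the study's cross-validation results, and the study does not state that this exact rule was applied.

```python
def smallest_adequate_tree(cv_table, tolerance=0.02):
    """Return the fewest splits whose cross-validated error is within
    `tolerance` of the lowest error in the table.

    cv_table is a list of (number_of_splits, relative_cv_error) pairs.
    """
    lowest_error = min(error for _, error in cv_table)
    for n_splits, error in sorted(cv_table):
        if error <= lowest_error + tolerance:
            return n_splits


# Hypothetical (splits, relative cross-validated error) pairs, for illustration only.
example_cv_table = [(0, 1.00), (5, 0.62), (15, 0.44), (30, 0.435), (60, 0.43)]
chosen = smallest_adequate_tree(example_cv_table)
print(chosen)
```

With these made-up numbers the rule selects 15 splits even though the 60-split tree has the lowest error, which mirrors the trade-off between accuracy and interpretability described above.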
Decision trees are included here to illustrate the non-linear relations between criteria and holistic marks, and the criteria and cut-off points that best predict the resulting holistic marks. They illustrate how the most important criteria split the data and relate to higher or lower marks, showing the average mark, the proportion of the sample, and the criterion cut-off point. This tree has been pruned to include the first set of binary decisions and predicts 10 different marks (in the range of 23–74%). Each decision point in the tree shows the tree's prediction (mark) at that point in the structure (e.g., 62), and the percentage shows how much of the sample is in each group or “node” (e.g., 100% of the sample at the top).
The illustrative decision tree for year 2 data (
Simplified decision tree for year 2 exams, pruned after 9 splits for illustrative purposes. The top number in each node is the percentage mark prediction at that point in the tree, and the bottom number is the percentage of the sample contained within that node. Literature: Relevance and range of the literature; Concepts: Descriptions and explanations of concepts and relation between them; Argument: Development of the argument; Reflection: Critical reflection on the work of others.
Simplified decision tree for year 3 exams, pruned after 10 splits for illustrative purposes. The top number in each node is the percentage mark prediction at that point in the tree, and the bottom number is the percentage of the sample contained within that node. Literature: Relevance and range of the literature; Concepts: Descriptions and explanations of concepts and relation between them; Argument: Development of the argument; Language: Use of scientific language and style.
Both trees (year 2 and 3) use the same first split (Literature > 60?); if the answer is yes, the next split asks whether descriptions and explanations of concepts exceed 50. The tree for year 3 does not use critical reflection on theory and the work of others in its first 10 splits. In both trees, relevance and range of the literature predicts the highest and lowest marks. The decision trees are not complete, but they illustrate how the random forest algorithm uses criteria to split the data and make predictions of holistic marks.
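To make concrete how a regression tree arrives at a split such as "Literature > 60?", the sketch below searches for the cut-off on one criterion that minimises the summed squared error of the marks on either side, which is the usual splitting criterion for regression trees. The data and function names are hypothetical, and the sketch simplifies in two ways: it uses observed scores as cut-offs rather than midpoints, and it omits the bootstrap sampling and random selection of criteria that a random forest adds.

```python
def squared_error(marks):
    """Sum of squared deviations of the marks from their mean."""
    if not marks:
        return 0.0
    mean = sum(marks) / len(marks)
    return sum((mark - mean) ** 2 for mark in marks)


def best_split(scores, marks):
    """Return (cut_off, error) for the split `score > cut_off` that
    minimises the total squared error of the two resulting groups."""
    best = None
    for cut_off in sorted(set(scores))[:-1]:
        below = [m for s, m in zip(scores, marks) if s <= cut_off]
        above = [m for s, m in zip(scores, marks) if s > cut_off]
        error = squared_error(below) + squared_error(above)
        if best is None or error < best[1]:
            best = (cut_off, error)
    return best


# Hypothetical criterion scores and holistic marks, for illustration only.
literature_scores = [35, 45, 55, 65, 85, 65]
holistic_marks = [40, 48, 52, 64, 72, 66]
cut_off, error = best_split(literature_scores, holistic_marks)
print(cut_off, round(error, 2))
```

Repeating this search within each resulting group, and across all criteria, produces the nested binary decisions shown in the simplified trees.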
The graphs (
Year 2 Essay questions (exams): data-derived and expected weightings.
Year 3 Essay questions (exams): data-derived and expected weightings.
A department-wide project sought to implement rubrics and enhance their alignment with intended learning outcomes. The ultimate goal was to ensure transparency in the communication of expectations and in the use of rubrics for marking and feedback. A review of the literature informed several decisions on rubric design, notably the adoption of an analytic design for greater detail in feedback and more transparent use of criteria. The review also highlighted gaps in research: the choice between holistic and analytic approaches to deriving marks is left to practitioners. Whilst holistic marking was the custom in the department, questions were raised over its alignment with, and nature in relation to, analytic rubrics. The literature offered a complex set of perspectives with a limited evidence base, drawn mainly from comparative and often small-scale studies, which remains inconclusive and presents somewhat contradictory findings.
Research to date warning of challenges to validity associated with holistic approaches (e.g., use of irrelevant criteria, idiosyncratic rules) seemed to contradict some claims in favor of holistic approaches. Most guidance on the use of holistic or analytic approaches to deriving marks leaves the decision to practitioners. A sense of an absence of a clear rationale (Dawson,
With a view to advancing our understanding of the rationale for either approach, we note that the role of these approaches in the wider promulgation of standards in a collective (multiple markers and students) remains under-investigated. For example, proponents of holistic approaches make claims about the intrinsic validity of holistic judgement to capture complex relations amongst criteria as well as to promulgate standards in communities of practice involving students (Sadler,
The present study has provided some initial insights into the workings of holistic marking that may support further examination of claims made about the extent to which holistic approaches are adequate vehicles for the definition and promulgation of standards in departments. The study provided some insights into the purported alignment of holistic marking with instructional intended learning outcomes. In order to elicit insights into the cohesion between these aspects, analytic and holistic approaches in decision-making, marking and feedback have been combined. The case study has collected multiple sources of information. Firstly, assumptions about relevant criteria and expected weightings by a department design team were captured. Secondly, marking data were collected using an analytic rubric during an entire academic year alongside the customary marking (holistic marks). Machine learning techniques have enabled the exploration of relevant desired instructional qualities (criteria and levels) related to tutor decisions on overall marks.
Firstly, the study provided initial insights into how holistic marks relate to relevant task criteria and, by extension, to learning outcomes. Secondly, contrasts with assumptions made by a department design team provide an initial basis for discussions about alignment between marks and relevant criteria. Whilst the project encompassed all assessments across a department, this report has focused on the marking of essay-based questions in exams in two years of study (year 2 and 3).
In the design of the analytic rubrics, a department team of three members was asked to allocate weightings based on their experience and understanding of the instructional learning outcomes. In other words, the weightings rest on expert judgement, a valid approach, discussed in the reference literature, to deciding combination rules associated with analytic approaches (Messick,
The department team expressed their views on the ranking of importance of the criteria in the form of percentage weightings, which is customary practice. The team assumed that critical thinking would be of greater importance overall. Further assumptions were made about progression across years of study: in more advanced years, the increased difficulty would be reflected in higher weightings, with a larger proportion of marks awarded for performance on criteria such as critical thinking.
The study has explored holistic judgement by deploying analytic rubrics as ancillary during marking. These have provided a basis for quantifying holistic marks, narrowing down the breadth of criteria used and retaining holistic formulation of marks, both in line with recommendations from the literature. Percentage weightings, elicited using random forests analyses, indicate the existence of a ranking of importance and contribution of different criteria to overall marks. The percentage weightings provide insights into the ranking of the contributions by different criteria deemed more important when deciding marks holistically. Decision trees illustrate the non-linear relations further evidencing the ranking of criterion contributions.
The nature of the rules underlying marks reveals that criteria relating to knowledge and understanding (i.e., descriptions and explanations of concepts; relevance and range of the literature) contributed more to the overall mark than criteria related to critical thinking (i.e., development of argument, reflection on theory). Criteria related to style and writing contributed much less. Also, year 2 and 3 results were quite similar in terms of the contribution of different criteria toward the final mark.
The study illustrates how analytic criteria and information elicited during marking may be used to gain insights into the implicit weightings associated with holistic marks.
Percentage weightings are a common way to indicate rankings of importance for relevant assessment criteria. Rather than the exact weightings, our main interest is in the different rankings of criteria elicited by using these two different approaches. The second main contrast relates to assumptions made about progression between years and increasing the reward for more difficult outcomes, according to practitioners' (expert) views.
The department team expected critical thinking to attract more marks. In the final year of study, an increase in its weighting would account for the higher demand on performance. The study reveals misalignments between the assumptions and expectations about the contribution of certain criteria and how these were weighted, in effect, during holistic marking. The first important comparison is that, in general, what the department team deemed to be of greater importance (i.e., critical thinking) in fact ranked lower in the statistical analysis than criteria that were considered easier (e.g., describing concepts, explanations, selecting appropriate literature).
The year 2 and 3 data analysis also showed that the rankings of the criteria according to the data-derived weightings remain stable across years of study. The staff assumption that the criteria they regarded as more important attract more marks is not substantiated by the analysis of marking data in relation to holistic marks. Secondly, how progression occurs would need to be further explored, as the criteria weightings in different years are stable.
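The stability of the rankings can be checked directly from the data-derived weightings table. The sketch below uses the two value columns of that table, taken here to be year 2 and year 3 in that order (an assumption about column order, since the extracted table carries no column headers); the variable and function names are illustrative.

```python
# (first column, second column) of the data-derived weightings table;
# the columns are assumed to correspond to year 2 and year 3.
WEIGHTINGS = {
    "Critical reflection": (13, 15),
    "Development of the argument": (19, 16),
    "Descriptions and explanations of concepts": (24, 25),
    "Relevance and range of the literature": (27, 27),
    "In text citation": (8, 8),
    "Structure of sentences/paragraphs": (3, 3),
    "Use of scientific language": (6, 6),
}


def ranking(column):
    """Criteria ordered from highest to lowest weighting in the given column."""
    return sorted(WEIGHTINGS, key=lambda criterion: WEIGHTINGS[criterion][column], reverse=True)


same_ranking = ranking(0) == ranking(1)
print(ranking(0))
print(same_ranking)
```

Both columns yield the same order, with relevance and range of the literature first and structure of sentences/paragraphs last, even though individual weightings differ by a few percentage points.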
The study aimed primarily to provide a basis for discussions in practice concerning the status of combination rules in marking. Its findings provide a complementary perspective on the role of, and rationale for, using holistic and analytic approaches to marking. Drawn from an authentic marking setting and a large sample of marking data, they give a new perspective. Previous smaller-scale studies alerted to the threats to content validity that derive from the lack of transparency associated with holistic approaches to marking.
The study builds on previous research showing that irrelevant criteria may be used during holistic marking, thereby weakening its content validity (Harsch and Martin,
Firstly, the study has shown how, in an authentic marking situation, marking data from teams may be exploited to gain insights into holistic judgement through the use of ancillary analytic criteria. In the specific setting of the study, the findings have provided the department team with a basis for revising their assumptions and for more detailed discussions about the nature and purpose of the assessments. More widely, extending to other contexts, the study has exemplified a promising avenue both for further enquiry and for enabling faculty teams to develop their understanding of the rules underlying holistic marking practices.
Secondly, expected (practitioners) and
Thirdly, beyond the particular focus of the study, the results enable a different perspective that transcends the dominant perception of holistic and analytic approaches as opposites and reveals the potential to use them as complementary tools, as has been highlighted in some literature (e.g., Hunter et al.,
a) Retain holistic marking practice but publish the “tacit” relationship with marks, that is, declare the verified rules (i.e., what is awarded more marks).
b) Introduce analytic marking, applying explicit formulae that are also published to the community, in line with expected learning outcomes.
Lastly, in the context of the department in which the study was conducted, the review led to the replacement of holistic marking with a fully analytic method (marking and feedback). Analytic rubrics have now been introduced not only as a feedback tool but also to derive marks using a given formula. The analysis presented here provided the basis for a discussion with markers who trusted the introduction of analytic rules based on the
The case study, set deliberately in an authentic marking context, provides insights that may transfer to other contexts and conveys a sense of the combined potential of holistic and analytic approaches in marking. Despite the contributions, the case study also has limitations to be considered in future investigations. The study is part of a project that followed a stepped approach given sensitivities in the change of the assessment and marking culture in a real setting. The study has provided valuable initial exploratory insights but future studies can expand and further challenge the findings from this study.
Future studies may consider whether marker training, more intensive use of exemplars, and explicit discussion of criteria affect holistic marking and the use of criteria as reflected in the analysis of marking data. The present study used a small team of three members, in keeping with a naturalistic approach in context. This could be addressed in future studies. Explicit training on the relevant weightings of criteria was not seen as appropriate in the present study, given that holistic marking had been the tradition and was left unchanged during the trial. However, future studies might introduce explicit discussions about expected weightings, and perhaps training, to explore the influence of such measures on marking outputs and to identify whether more intense training might have achieved greater alignment between expectations and
The reliability of marks was established with existing departmental moderation mechanisms. As explained, we were interested in the relationship between criteria and overall marks, and reliability in this context was not central. Other ways of strengthening reliability and use of more robust
Future analyses, also drawing from our case, will address aspects of assessment type and context. For example, exam and coursework settings may impact on the role of different criteria and holistic marks. The exam conditions under which students wrote essay-based questions might play a role. Contrasts with coursework conditions for writing essay-based questions will be reported in follow up publications.
Additional important questions for practitioners remain unanswered. Our study has explored how the combined use of analytic and holistic approaches can offer new perspectives to uncover the underlying nature of holistic marking. Whilst we have, in the context of use, opted to use a conversion to weightings, many more combination rules should be explored.
Further investigation of different combination rules and their implications for overall marks is a significant task that future research should take up. Our exploratory study, attempting to model holistic marking, has elicited rankings of importance of criteria, translated these into criterion weightings and used illustrative decision trees. Many more perspectives are yet to be understood and brought to bear on this subject. For example, our next analysis will consider Rasch measurements, which provide insights into the discriminating power of different criteria (Suskie,
Furthermore, the criteria that the random forests selected as the most important predictors were those that best divided the data over the whole span of marks (i.e., from fail to excellent); the relevance and range of the literature criterion was the most successful at this. However, some criteria may be good at distinguishing a pass from a fail but poor at distinguishing a fail from an excellent, and it is important to recognize this as a limitation. For example, critical reflection might distinguish a good from an excellent but be less useful at predicting lower grades. Further analyses are planned to address which criteria best predict across different grade boundaries, investigating the possible non-linearity and interdependencies of the rubric criteria. Further investigation of decision trees may offer insights into which criteria could be interpreted as threshold criteria at different levels. Threshold criteria and complex rules would be difficult for markers to define without such a basis. Our decision trees offer an initial exploration of how interdependencies may be explored in future studies. Future analyses will pick apart the dependencies between the rubric criteria and evaluate whether hierarchical rules may be more appropriate. For example, questions such as whether it is valid to assess critical reflection when knowledge and understanding does not meet a certain level should be addressed.
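As an illustration of what a hierarchical (threshold) rule of this kind might look like, the sketch below lets critical reflection contribute to the mark only once knowledge and understanding reaches a threshold. Everything in it is hypothetical: the function name, the threshold of 55, and the 30% weighting for reflection are placeholders chosen for illustration, not rules derived from the study.

```python
def hierarchical_mark(knowledge, reflection, threshold=55, reflection_weight=0.3):
    """Combine two criterion scores under a threshold rule.

    Below the threshold on knowledge and understanding, critical reflection
    is not assessed and the mark rests on knowledge alone. At or above the
    threshold, the mark is a weighted mean of the two scores.
    """
    if knowledge < threshold:
        return float(knowledge)
    return (1 - reflection_weight) * knowledge + reflection_weight * reflection


# Hypothetical scripts: same reflection score, different knowledge scores.
below_threshold = hierarchical_mark(knowledge=45, reflection=85)
above_threshold = hierarchical_mark(knowledge=65, reflection=85)
print(below_threshold, above_threshold)
```

Unlike a single set of percentage weightings, a rule like this makes the contribution of one criterion depend on the level reached on another, which is the kind of interdependency the decision trees begin to expose.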
Lastly, analyses are being re-run with similar data from new cohorts to correct for possible biases. The study highlights the existence of rules underlying holistic marking and provides evidence for a potential misalignment between assumptions made by practitioners and the actual rules underlying collective holistic marking. Important aspects of validity (structural validity) come under threat if these underlying rules and assumptions are not made visible. As highlighted above, future studies should further explore how training might moderate the findings from our study. Moreover, further non-linear interactions amongst criteria might be uncovered in future work.
In sum, the case study has provided complementary insights into the nature of holistic marking. The potential for misalignment with expectations, from an instructional point of view, warrants further consideration of the implications for validity in its widest sense (structural, consequential). The study concludes that combining holistic marking and analytic criteria can offer a productive perspective on marking. Rather than arguing that either is the better option, we argue that their combination, to investigate the nature of criteria in relation to overall marks, can be enlightening for practitioners. The results should encourage practitioners to check such underlying rules in their own contexts to ensure clarity and alignment with the communication of expectations involving both markers and students. Further pointers for research have also been discussed to productively advance understanding to date.
The datasets for this study will not be made publicly available because they are sensitive marking data gathered from a university department and cannot be published.
The study and data collection procedures were approved by the Ethics Committee of the University of Nottingham. The Head of Department granted permission for the analyses of marking data, which were provided to the authors in an anonymized format.
CT and EW have led the conception and design of the study, data gathering, data analysis, and interpretation. CT has led the writing of the manuscript with section contributions by EW, RL-H, and KS. RL-H and KS have led the data analysis. All authors contributed to manuscript revision, read, and approved the submitted version.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to thank Dr. Chris Brignell (School of Mathematical Sciences, University of Nottingham) for his support and advice on the analyses in this study.
1. This also makes the use of linear models unfit for our analysis since highly related variables will explain a large percentage of the same variance in the data and this would interfere with interpretation of importance of the variables (Allison,