# ADVANCEMENTS IN TECHNOLOGY-BASED ASSESSMENT: EMERGING ITEM FORMATS, TEST DESIGNS, AND DATA SOURCES

EDITED BY: Frank Goldhammer, Ronny Scherer and Samuel Greiff

PUBLISHED IN: Frontiers in Psychology and Frontiers in Education

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-506-1 DOI 10.3389/978-2-88963-506-1



#### Topic Editors:

Frank Goldhammer, Leibniz Institute for Research and Information in Education (DIPF), Germany

Ronny Scherer, Centre for Educational Measurement, Faculty of Educational Sciences, University of Oslo, Norway

Samuel Greiff, University of Luxembourg, Luxembourg

Citation: Goldhammer, F., Scherer, R., Greiff, S., eds. (2020). Advancements in Technology-Based Assessment: Emerging Item Formats, Test Designs, and Data Sources. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-506-1

# Table of Contents


*Assessment of Unilateral Spatial Neglect Using a Free Mobile Application for Italian Clinicians*

Pietro Cipresso, Elisa Pedroli, Silvia Serino, Michelle Semonella, Cosimo Tuena, Desirée Colombo, Federica Pallavicini and Giuseppe Riva


Andrea Horbach and Torsten Zesch

*Multiple-Choice Item Distractor Development Using Topic Modeling Approaches*

Jinnie Shin, Qi Guo and Mark J. Gierl

*The Expanded Evidence-Centered Design (e-ECD) for Learning and Assessment Systems: A Framework for Incorporating Learning Goals and Processes Within Assessment Design*

Meirav Arieli-Attali, Sue Ward, Jay Thomas, Benjamin Deonovic and Alina A. von Davier


Andreas Rausch, Kristina Kögler and Jürgen Seifried

*Evaluating Different Equating Setups in the Continuous Item Pool Calibration for Computerized Adaptive Testing*

Sebastian Born, Aron Fink, Christian Spoden and Andreas Frey

*Collaborative Problem Solving: Processing Actions, Time, and Performance*

Paul De Boeck and Kathleen Scalise

*Making the Psychological Dimension of Learning Visible: Using Technology-Based Assessment to Monitor Students' Cognitive Development*

Gyöngyvér Molnár and Benő Csapó

*The Skilled, the Knowledgeable, and the Motivated: Investigating the Strategic Allocation of Time on Task in a Computer-Based Assessment*

Johannes Naumann


Vanessa R. Simmering, Lu Ou and Maria Bolsinova

*Combining Text Mining of Long Constructed Responses and Item-Based Measures: A Hybrid Test Design to Screen for Posttraumatic Stress Disorder (PTSD)*

Qiwei He, Bernard P. Veldkamp, Cees A. W. Glas and Stéphanie M. van den Berg

*Predictive Feature Generation and Selection Using Process Data From PISA Interactive Problem-Solving Items: An Application of Random Forests*

Zhuangzhuang Han, Qiwei He and Matthias von Davier

# Editorial: Advancements in Technology-Based Assessment: Emerging Item Formats, Test Designs, and Data Sources

#### Frank Goldhammer<sup>1,2</sup>\*, Ronny Scherer<sup>3</sup> and Samuel Greiff<sup>4</sup>

*<sup>1</sup> Educational Quality and Evaluation, DIPF - Leibniz Institute for Research and Information in Education, Frankfurt, Germany, <sup>2</sup> Centre for International Student Assessment (ZIB), Frankfurt, Germany, <sup>3</sup> Centre for Educational Measurement (CEMO), University of Oslo, Oslo, Norway, <sup>4</sup> Cognitive Science & Assessment, University of Luxembourg, Esch-sur-Alzette, Luxembourg*

Keywords: technology-based assessment, item design, test design, automatic scoring, process data, assessment of/for learning

#### **Editorial on the Research Topic**

#### **Advancements in Technology-Based Assessment: Emerging Item Formats, Test Designs, and Data Sources**

Technology has become an indispensable tool for educational and psychological assessment in today's world. Individual researchers and large-scale assessment programs alike are increasingly using digital technology (e.g., laptops, tablets, and smartphones) to collect behavioral data beyond the mere correctness of item responses. Along these lines, technology innovates and enhances assessments in terms of item and test design, methods of test delivery, data collection and analysis, and the reporting of test results.

The aim of this Research Topic is to present recent developments in technology-based assessment and in the advancements of knowledge associated with it. Our focus is on cognitive assessments, including the measurement of abilities, competences, knowledge, and skills, but also includes non-cognitive aspects of assessment (Rausch et al.; Simmering et al.). In the area of (cognitive) assessments, the innovations driven by technology are manifold, and the topics covered in this collection are, accordingly, wide and comprehensive: Digital assessments facilitate the creation of new types of stimuli and response formats that were out of reach for assessments using paper; for instance, interactive simulations may include multimedia elements, as well as virtual or augmented realities (Cipresso et al.; de-Juan-Ripoll et al.). These types of assessments also allow for the widening of the construct coverage in an assessment; for instance, through stimulating and making visible certain problem-solving strategies that represent new forms of problem solving (Han et al.; Kroeze et al.). Moreover, technology allows for the automated generation of items based on specific item models (Shin et al.). Such items can be assembled into tests in a more flexible way than what is possible in paper-and-pencil tests and can even be created on the fly; for instance, tailoring item difficulty to individual ability (adaptive testing) while assuring that multiple content constraints are met (Born et al.; Zhang et al.). As a requirement for adaptive testing, or to lower the burden of raters who code item responses manually, computers enable the automatic scoring of constructed responses; for instance, text responses can be coded automatically by using natural language processing and text mining (He et al.; Horbach and Zesch).

Technology-based assessments provide not only response data (e.g., correct vs. incorrect responses) but also process data (e.g., frequencies and sequences of test-taking strategies, including navigation behavior) that reflect the course of solving a test item and give information on the path toward the solution (Han et al.). Process data, among others, have been used successfully to evaluate and explain data quality (Lindner et al.), to define process-oriented latent variables (De Boeck and Scalise), to improve measurement precision, and to address substantive research questions (Naumann). Large-scale result and process data also call for data-driven computational approaches in addition to traditional psychometrics and new concepts for storing and managing data (von Davier et al.).

Edited and reviewed by:

*Yenchun Jim Wu, National Taiwan Normal University, Taiwan*

> \*Correspondence: *Frank Goldhammer goldhammer@dipf.de*

#### Specialty section:

*This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology*

Received: *19 December 2019* Accepted: *23 December 2019* Published: *20 January 2020*

#### Citation:

*Goldhammer F, Scherer R and Greiff S (2020) Editorial: Advancements in Technology-Based Assessment: Emerging Item Formats, Test Designs, and Data Sources. Front. Psychol. 10:3047. doi: 10.3389/fpsyg.2019.03047*

#### TABLE 1 | Overview of the papers.

The contributions of this Research Topic address how technology can further improve and enhance educational and psychological assessment from various perspectives. Regarding educational testing, not only is research presented on the assessment of learning, that is, the summative assessment of learning outcomes (Molnár and Csapó), but a number of studies on this topic also focus conceptually and empirically on the assessment for learning, that is, the formative assessment providing feedback to support the learning process (Arieli-Attali et al.; Blaauw et al.; Csapó and Molnár; den Ouden et al.; Kroeze et al.).

**Table 1** gives an overview of all the papers included in this Research Topic and summarizes them with respect to their key features. Reflecting the scope of the Research Topic, we used four major categories to classify the papers: (1) papers focusing on the use of new data types and sources, (2) innovative item designs, (3) innovative test designs, and (4) statistical approaches. Although papers could have been assigned to several categories, we assigned each paper to a single category based on its core contribution. The papers' key findings and advancements impressively represent the current state of the art in the field of technology-based assessment in (standardized) educational testing, and, as topic editors, we were happy to receive such a great collection of papers with various foci.

Regarding the future of technology-based assessment, we assume that inferences about the individual's or learner's knowledge, skills, or other attributes will increasingly be based on empirical (multimodal) data from less- or nonstandardized testing situations. Typical examples are stealth assessments in digital games (Shute and Ventura, 2013; Shute, 2015), digital learning environments (Nguyen et al., 2018), or online activities (Kosinski et al., 2013). Such new kinds of unobtrusive, continuous assessments will further extend the traditional assessment paradigm and enhance our understanding of what an item, a test, and the empirical evidence for inferring attributes can be (Mislevy, 2019). Major challenges lie in the identification and synthesis of evidence from the situations the individual encounters in these non-standardized settings, as well as in validating the interpretation of derived measures. This Research Topic provides much input for these questions. We hope that you will enjoy reading the contributions as much as we did.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### ACKNOWLEDGMENTS

This work was funded by the Centre for International Student Assessment (ZIB) in Germany. We thank all authors who have contributed to this Research Topic and the reviewers for their valuable feedback on the manuscript.

### REFERENCES

Shute, V. J., and Ventura, M. (2013). Stealth Assessment: Measuring and Supporting Learning in Video Games. Cambridge, MA: MIT Press.

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Goldhammer, Scherer and Greiff. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Assessment of Unilateral Spatial Neglect Using a Free Mobile Application for Italian Clinicians

Pietro Cipresso<sup>1,2†</sup>, Elisa Pedroli<sup>1</sup>\*<sup>†</sup>, Silvia Serino<sup>1,2</sup>, Michelle Semonella<sup>1</sup>, Cosimo Tuena<sup>1</sup>, Desirée Colombo<sup>3</sup>, Federica Pallavicini<sup>4</sup> and Giuseppe Riva<sup>1,2</sup>

<sup>1</sup> Applied Technology for Neuro-Psychology Lab, Istituto Auxologico Italiano, Milan, Italy, <sup>2</sup> Department of Psychology, Università Cattolica del Sacro Cuore, Milan, Italy, <sup>3</sup> Department of Basic Psychology, Clinic and Psychobiology, Universitat Jaume I, Castellón de la Plana, Spain, <sup>4</sup> "Riccardo Massa" Department of Human Sciences for Education, University of Milano-Bicocca, Milan, Italy

#### Edited by:

Ronny Scherer, University of Oslo, Norway

#### Reviewed by:

Timothy R. Brick, Pennsylvania State University, United States

Nelson Silva Filho, Universidade Estadual Paulista Júlio de Mesquita Filho (UNESP), Brazil

\*Correspondence:

Elisa Pedroli e.pedroli@auxologico.it

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 30 April 2018 Accepted: 29 October 2018 Published: 22 November 2018

#### Citation:

Cipresso P, Pedroli E, Serino S, Semonella M, Tuena C, Colombo D, Pallavicini F and Riva G (2018) Assessment of Unilateral Spatial Neglect Using a Free Mobile Application for Italian Clinicians. Front. Psychol. 9:2241. doi: 10.3389/fpsyg.2018.02241

Background: Unilateral Spatial Neglect (USN) is traditionally assessed with paper-and-pencil tests or computer-based tests. Thanks to the widespread availability of mobile devices and their extensive capabilities for handling complex content, it is now possible to provide clinicians with mobile tools for cognitive assessment. Contemporary 3D engines are generally able to deploy complex 3D environments for iOS, Android, and Windows Mobile, i.e., most mobile phone and tablet operating systems.

Results: This new scenario, together with pressing requests from professionals, pushed us to build an application for the assessment of USN. Our first step was to replicate the classic cognitive tests traditionally used for this purpose. Because ecological assessment is difficult in real scenarios, we also implemented virtual environments to assess patients' abilities in realistic situations. At the moment, the application is available for free only for iPad and iPhone, from the Apple Store, under the name "Neglect App." The App contains traditional tests (e.g., barrage with and without distractors) and ecological tests (e.g., distributing tea to people seated around a table). The scoring of each test is available to clinicians through a database of the executed ecological tasks, which are stored locally.

Conclusion: Neglect App is an advanced mobile platform for the assessment of neglect.

Keywords: neglect, psychometrics, computational psychometrics, ecological assessment, mobile virtual reality, mHealth, pervasive computing, mental health

### INTRODUCTION

Unilateral Spatial Neglect (USN), or neglect, manifests in about two-thirds of patients during the acute phase following a stroke. A stroke is an occurrence of cerebral vascular disease resulting in acute disruption of focal or generalized brain function. Every year, there are approximately 500,000 stroke patients in Europe. Stroke is the third leading cause of death in Western countries after cardiovascular diseases and malignancies (Sudlow and Warlow, 1997; Pendlebury et al., 2009; Roger et al., 2012; Go et al., 2014; Mozaffarian et al., 2015).


A stroke is a catastrophic and often unexpected event with a wide range of physical and psychological consequences in the long term for both patients and their families.

The long-term effects of stroke depend on the type, severity, and location of the occlusion: it is important to identify as soon as possible which part of the brain has been affected and how severely. In general, two basic categories of impairment or disability can be identified: cognitive disabilities, which include memory problems, difficulty with executive functions, and aphasia; and motor disabilities, which include the inability to walk, problems with coordination and balance (ataxia), mobility difficulties with the arms, hemiparesis or hemiplegia, spasticity, and contractures.

In particular, USN can be defined as a disorder in which the patient has difficulty exploring, attending to, perceiving, and acting within the space opposite the brain lesion. Often, there is also difficulty in elaborating mental images on the side opposite the damaged one. It is important to underline that the problems shown by patients are not caused by a primary sensory or motor deficit, although they are often associated with hemiplegia and hemianopia (Ducros, 2012; Vocat et al., 2013; Heilman, 2014; Saj et al., 2014). These problems occur mainly following damage to the right brain hemisphere, but there are patients in whom the syndrome arose after a left-sided lesion; right neglect is considered less severe and less enduring (Stone et al., 1991; Halligan and Robertson, 2014). Regardless of the side of the lesion, this disorder can be caused by damage to several areas; the most typical is the parietal lobe, specifically the inferior parietal lobule, followed by the frontal lobe and sub-cortical structures such as the thalamus and the basal ganglia (Moretti et al., 2012; Saj et al., 2012; Antal et al., 2014).

In the acute phase and in the more severe form, the patient appears with the head and the gaze turned to the ipsilesional side, insensitive to any stimulation coming from the contralesional side. Over time, symptoms may ease, although more and more studies are showing that the disorder can last even for years (Kerkhoff and Schenk, 2012).

Neglect can be accompanied by several phenomena:


(Treccani et al., 2012; Antoniello and Gottesman, 2013; Marshall et al., 2013; Bartolomeo, 2014).

The standard neuropsychological tests for the analysis of extrapersonal neglect can be divided into:


Real exploration of a space, such as the room where the patient is hospitalized or the one where tests are conducted, can be of great help to the clinical evaluation because it gives a more complete picture of the patient's spatial abilities. Unfortunately, such tests are difficult to carry out in a clinical setting because they require more time and human resources. They are also difficult to standardize because of the heterogeneity of the experimental situations.

The purpose of this App was to make the described tests available in a portable, electronic form with automated score recording, which can also help simplify the difficult process of neuropsychological assessment. On the one hand, several tests have been included, as shown in the following sections. On the other hand, we made the effort to include new paradigms and tests that are difficult to administer in paper-and-pencil mode. In particular, navigation tasks and ecological tests represent our effort to integrate current paradigms for neuropsychological assessment. At the moment, a number of possible features and indexes are probably still missing; however, the App represents the first effort to integrate many tests and tasks in a mobile application. This could be the first step toward future integrations.

### IMPLEMENTATION

Neglect App is the first application for mobile devices that makes use of the huge potential of virtual environments for the assessment of USN, for which an evaluation that is as effective and prompt as possible is crucial (Pallavicini et al., 2015; Pedroli et al., 2015a,b, 2016).

During the process of the design of the App, we also exploited the potential of 3D interactive applications for preventing and/or improving cognitive impairments related to USN, on the basis of a series of advantages amply documented by scientific literature:

### Neuroplasticity

The App allows the use of scenarios specifically designed according to the principles that regulate and facilitate neuroplasticity (the neurobiological process underlying the recovery of cognitive and motor functions), such as exercise intensity, exercise frequency, and "enriched stimulation" (Cheung et al., 2014; Ekman et al., 2018).

### Personalized Training


The App is based on highly automated mechanisms that require minimal input from clinical therapists, who can customize the intensity and difficulty of the training according to the specific needs of each patient.

### Engaging Tasks

In the App, the training exercises are built around tasks that retrain specific abilities (for example, with complexity increasing over time) while integrating recreational elements into the scenario to maintain a high level of engagement and compliance among older participants. Specifically, ecological simulations can be particularly engaging by supporting a process known as "transformation of flow," defined as a person's ability to exploit an optimal (flow) experience to identify and use new and unexpected psychological resources as sources of involvement (Riva et al., 2006; Pedroli et al., 2018). Presence is also key to engagement in the use of technology. Presence is usually defined as the "sense of being there" or the "feeling of being in a world that exists outside the self." The ability to interact actively with the environment greatly improves the possibility of experiencing presence (Riva et al., 2007; Villani et al., 2012b).

### Tracking and Objective/Quantitative Measure

A large quantity of data can be recorded and used to build performance indexes, so that the improvement in performance observed over the course of a possible rehabilitation process can be measured quantitatively and objectively.

### Transferring of the Training in Activity of Daily Living (ADL)

Many studies have suggested the potential of ecological tasks to transfer the re-learning of damaged cognitive and motor abilities to ADL. The positive impact of ecological tasks on ADL is well documented (Laver et al., 2015; Chiang et al., 2017).

A previous pilot study investigated the correlation between the Neglect App tests and classic tests in order to assess the usability and ecological validity of our app. Results showed that the cancelation tests of Neglect App were as effective as the traditional tests in screening for symptoms in patients with and without neglect. Moreover, the Neglect App Card Dealing task was more sensitive in detecting neglect symptoms than the traditional functional task (Pallavicini et al., 2015).

Neglect App contains a series of trials for neglect evaluation through classic tests and interactive virtual environments, with the double advantage of automating the evaluation and making it more ecological for patients with neglect, who show difficulty or inability in exploring, attending to, perceiving, and acting in the region of space opposite the brain lesion. With Neglect App, the patient's explorative behavior can be evaluated quickly and simply inside ecological environments, and all data from the performed sessions are stored in a database. Neglect App can be downloaded for free at: https://itunes.apple.com/it/app/neglectapp/id788480837?mt=8.

### RESULTS AND DISCUSSION

The evaluation is composed of nine exercises divided into two groups: ecological tasks and barrage tasks. The first group comprises ecological tasks, some of which were inspired by the ecological battery of Zoccolotti et al. (1994), while others were created from real-life situations and tasks that are used clinically but lack standardization. The app always reports all scores, divided into left, right, center, and total areas. Moreover, a screenshot of the results is always recorded and included in the report.

### Ecological Tasks

#### Serve Tea

The patient is required to distribute tea to himself/herself and to the people sitting at the table, using objects placed in the center of the table (**Figure 1**).

The task is commonly used by clinicians in real settings; here, it has been replicated on the tablet to make it more usable while preserving its ecological validity.

Patients use a finger to drag and drop the individual objects to complete the task as requested by the App.

The App already contains the instructions to be followed, so clinicians only have to hand the tablet to the patient and check that it is used correctly while the task is executed.

Clinicians are not required to take note of the performed actions, since the App records every significant action and calculates the standard scores, which can be used and integrated in a clinical protocol.

FIGURE 1 | Laying the table tapping on the iPad using the Neglect App.

The score is assigned on the basis of the correct and wrong objects placed to the right, in the center, and to the left. The time taken and any objects left untouched are also reported (**Figure 2**).

At any time, clinicians can access the patients' scores directly from the App and view the score for each assigned task.
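The per-side scoring described above can be illustrated with a minimal sketch. It assumes that each drag-and-drop action is logged with a side label and a correctness flag; the `Placement` record and the `score_by_side` function are hypothetical names introduced here for illustration and do not describe the App's actual implementation.

```python
from collections import Counter
from dataclasses import dataclass

SIDES = ("left", "center", "right")


@dataclass(frozen=True)
class Placement:
    """One recorded drag-and-drop action (hypothetical record layout)."""
    obj: str
    side: str      # "left", "center" or "right"
    correct: bool  # True if the object was placed where the task required


def score_by_side(placements, all_objects, elapsed_seconds):
    """Tally correct and wrong placements per side, plus untouched objects and time."""
    correct = Counter()
    wrong = Counter()
    for p in placements:
        if p.side not in SIDES:
            raise ValueError(f"unknown side: {p.side!r}")
        (correct if p.correct else wrong)[p.side] += 1
    touched = {p.obj for p in placements}
    return {
        "correct": {s: correct[s] for s in SIDES},
        "wrong": {s: wrong[s] for s in SIDES},
        # Objects the patient never moved ("unconsidered" in the text above).
        "unconsidered": sorted(set(all_objects) - touched),
        "time_s": elapsed_seconds,
    }
```

A record of this shape keeps the left, center, and right counts separate, so an asymmetry between sides remains visible in the report instead of being folded into a single total.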

#### Card Dealing

The second exercise requires the patient to hand out playing cards to himself/herself and to the people sitting at the table (**Figure 3**). The score is assigned on the basis of the correctly given cards, the omitted cards, and the wrong cards (i.e., those in excess) to the right, to the left, and in the middle, as well as the time taken to complete the exercise.

#### Controlling an Orders List

In this task, the patient is required to check an orders list to verify whether the dishes noted on it are on the shelves; if they are, he/she has to select the dish on the shelf and the corresponding note on the list (**Figure 4**).

The score is assigned on the basis of the dishes selected correctly, those selected wrongly, the correct dishes omitted, the correct selections and omissions on the list, and the time taken (**Figure 5**).

#### Exploration

Within this environment, the patient finds him/herself in a room in which he/she can move freely to the left or right, describing all the objects in the room and touching them accordingly (**Figure 6**). As the patient moves, the app automatically calculates whether the selected object was on the right or the left. The report indicates the selected objects on the left, those on the right, repetitions, time taken, and omitted elements.

FIGURE 3 | Distributing cards tapping on the iPad using the Neglect App.

#### Apples Pursuit

Within this environment, the patient finds him/herself in an office in which he/she can move freely to the left or right to identify and touch all the apples inside (**Figure 7**). As the patient moves, the app automatically calculates whether the selected apple was on the right or the left. The report indicates the selected apples on the left, those on the right, repetitions, time taken, and omitted apples.

### Barrage Tasks

Barrage tests take their cue from the classical cancelation tasks commonly used clinically (Zoccolotti et al., 1994) and comprise four exercises, described below.

#### Simple Barrage

Patient is required to select all objects (hammers) in the room. There are no distractors (**Figure 8**). The number of selected objects, repetitions, objects omitted on the left and on the right and time employed are considered (**Figure 9**).

#### Simple Barrage With Distractors

Patient is required to select all target objects (screwdrivers) in the room, which are mixed with distractors (**Figure 10**). The number of selected target objects, repetitions, target objects omitted and the distractors selected on the left and on the right and time employed are considered (**Figure 11**).

FIGURE 7 | Apples pursuit task.

FIGURE 8 | Simple barrage task.

#### Dynamic Barrage

Patient is required to select all objects (balloons) in the sky. There are no distractors. The peculiarity here is that the objects are moving (**Figure 12**). The number of selected objects, repetitions, objects omitted on the left and on the right and time employed are considered (**Figure 13**).

#### Dynamic Barrage With Distractors

Patient is required to select all target objects (kites) in the room, which are mixed with distractors (**Figure 14**). The number of selected target objects, repetitions, target objects omitted and the distractors selected on the left and on the right and time employed are considered (**Figure 15**).
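The scoring of a barrage trial with distractors can be sketched as follows. This is only an illustration under stated assumptions: the `Item` layout, the normalized horizontal coordinate `x`, the vertical midline at 0.5 used to split left from right, and the `score_barrage` function are all hypothetical and are not taken from the App.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """An on-screen object (hypothetical layout): x is in [0, 1], 0 = left edge."""
    item_id: int
    x: float
    is_target: bool


def side_of(item, midline=0.5):
    """Classify an item as 'left' or 'right' of an assumed vertical midline."""
    return "left" if item.x < midline else "right"


def score_barrage(items, taps, elapsed_seconds):
    """Score a cancelation trial from the ordered list of tapped item ids."""
    by_id = {it.item_id: it for it in items}
    seen = set()
    repetitions = 0
    for item_id in taps:
        if item_id not in by_id:
            raise KeyError(f"tap on unknown item id: {item_id}")
        if item_id in seen:
            repetitions += 1  # the same item was selected again
        seen.add(item_id)
    score = {
        "selected_targets": 0,
        "repetitions": repetitions,
        "omitted": {"left": 0, "right": 0},
        "distractors_selected": {"left": 0, "right": 0},
        "time_s": elapsed_seconds,
    }
    for it in items:
        tapped = it.item_id in seen
        if it.is_target and tapped:
            score["selected_targets"] += 1
        elif it.is_target:
            score["omitted"][side_of(it)] += 1
        elif tapped:
            score["distractors_selected"][side_of(it)] += 1
    return score
```

In this sketch, a trial without distractors is simply the case in which every item has `is_target=True`, so the same function would cover all four barrage variants.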

A qualitative analysis of the barrage tasks may give information about dysexecutive behaviors, because every single item can be selected multiple times and because the target of the simple version of each barrage task (simple and dynamic) is also present in the environment of the corresponding barrage-with-distractors task.
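As an illustration of how the report fields described for the barrage tasks could be derived from a raw list of selections, the following minimal Python sketch counts selected targets, repetitions, omissions by side, and selected distractors by side. The `SceneObject` structure, the `barrage_report` function, and all field names are hypothetical; they are assumptions made for this example and are not taken from the App's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """One object placed in the virtual room (hypothetical structure)."""
    object_id: str
    side: str        # "left" or "right", relative to the patient
    is_target: bool  # False for distractors


def barrage_report(objects, selections, elapsed_s):
    """Summarize one barrage trial.

    objects    -- every SceneObject present in the environment
    selections -- object ids in the order the patient touched them
    elapsed_s  -- time employed, in seconds
    """
    selected_once = set(selections)
    report = {
        "targets_selected": 0,
        # A repetition is any touch of an object that was already selected.
        "repetitions": len(selections) - len(selected_once),
        "omitted": {"left": 0, "right": 0},
        "distractors_selected": {"left": 0, "right": 0},
        "time_s": elapsed_s,
    }
    for obj in objects:
        picked = obj.object_id in selected_once
        if obj.is_target and picked:
            report["targets_selected"] += 1
        elif obj.is_target:
            report["omitted"][obj.side] += 1
        elif picked:
            report["distractors_selected"][obj.side] += 1
    return report
```

In this sketch, a pattern such as many omissions on the left combined with several repetitions on the right would show up directly in the `omitted` and `repetitions` fields, which is the kind of information the qualitative analysis relies on.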

### Data Management

All data can be downloaded in a single file by connecting the iPad to a PC or a Mac equipped with iTunes. Once downloaded, the file can be read with any client software able to interact with SQL databases (**Figure 16**). All data, including images, can be exported for statistical analysis.

FIGURE 10 | Simple barrage with distractors task.
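To make the idea of querying the exported data concrete, the sketch below uses Python's standard `sqlite3` module on a small in-memory database. It assumes, purely for illustration, that the exported file is an SQLite database; the table name `trial_objects`, its columns, and the task labels are hypothetical and do not describe the App's actual schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical schema and demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trial_objects ("
        "task TEXT, object_id TEXT, side TEXT, selected INTEGER)"
    )
    conn.executemany(
        "INSERT INTO trial_objects VALUES (?, ?, ?, ?)",
        [
            ("simple_barrage", "h1", "left", 0),
            ("simple_barrage", "h2", "left", 0),
            ("simple_barrage", "h3", "right", 1),
            ("simple_barrage", "h4", "right", 0),
            ("dynamic_barrage", "b1", "left", 1),
        ],
    )
    return conn


def omissions_by_side(conn, task):
    """Count objects that were never selected in a task, grouped by side."""
    rows = conn.execute(
        "SELECT side, COUNT(*) FROM trial_objects "
        "WHERE task = ? AND selected = 0 GROUP BY side ORDER BY side",
        (task,),
    ).fetchall()
    return dict(rows)
```

With a real export, `sqlite3.connect` would be pointed at the downloaded file instead of `":memory:"`, and the query would have to be adapted to whatever tables the file actually contains.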

FIGURE 14 | Dynamic barrage with distractors task (all the elements are in a continuous movement).

### CONCLUSION

Neglect may influence the patient's behavior in everyday activities: patients may constantly hit objects placed on their left, or fail to pay attention to the left side of the road when crossing. In severe cases, they may ignore the food in the left half of the plate. Neglect is therefore a serious condition that can prevent the patient from coping independently.

Functions such as memory, speech, or attention have traditionally been assessed in neuropsychological research through programs of standardized tests, which have clear psychometric advantages but often measure behaviors that are very different from those of everyday life (Chaytor and Schmitter-Edgecombe, 2003).

In recent years, there has been growing interest in developing tools that allow ecological and functional assessment, above all by using mobile devices (Villani et al., 2012a, 2013; Carbonaro et al., 2014; Pedroli et al., 2015b). The results of a meta-analytic review by Neguț et al. (2016) support the sensitivity of virtual reality tools in detecting cognitive deficits. One of the areas where this need emerges is the assessment of neglect. We decided to release the application in the Italian market, with the intention of later extending it worldwide through a possible English version. The Neglect App cycle runs from the patient's entry into the clinic, through continuous assessment at the patient's home, and back to the clinic, in a closed loop of continuous assessment (**Figure 17**).

Assessment using a mobile tool and virtual environments might open the way to very sophisticated methods, able to assess in ways that were previously unthinkable and are sometimes impossible in real settings. In particular, navigation tasks allow the system to identify whether an object in space is located on the left or the right side when selected. In real settings, by contrast, such a navigation task is too expensive, as it requires eye-tracking glasses. Moreover, a computational approach can easily be used to provide more feedback to patients and to model behaviors (Cipresso, 2015; Cipresso et al., 2015).

FIGURE 16 | Database management to browse and analyze the data collected.

One limitation of the App is the screen size, which does not provide any direct advantage over paper-and-pencil tests. This limitation has recently been overcome by the 12.9-inch iPad Pro, which is fully compatible with our App. Another limitation is the lack of normative data for a quantitative analysis of the results; for now, only a qualitative analysis is recommended. A future study could fill this gap.

At the moment, we have not implemented some additional information and indexes that could help clinicians better understand the characteristics of patients' explorative behaviors in order to plan more personalized rehabilitation. In particular, it could be interesting to report the starting point and the path of the exploration made by patients, or other indexes such as those reported by Chung et al. (2016).

Additionally, creating rehabilitation tasks could make our application more complete and more interesting. Providing exercises in a virtual environment could help patients and clinicians to improve clinical practice.

Future development will be directed at addressing these limitations by adding specific tasks for both assessment and rehabilitation. Manipulating the cognitive complexity of the barrage tasks according to the criteria proposed by Ricci et al. (2016) and Sarri et al. (2009) could make the assessment process more precise. To this end, the introduction of a 3D version of the line bisection task is also being considered, because some patients may show neglect symptoms in this kind of task and not in the barrage one.

Also, a new version that takes advantage of immersive technology could be designed in order to reach a higher degree of ecological validity.

After all these modifications, a validation study will be necessary to establish the validity of our system. A clinical trial of the rehabilitation sessions could also be conducted to test the usefulness of a computerized protocol.

Both convergent and discriminant validity need to be verified by comparison with current tools. For this purpose, current neuropsychological batteries and specific tests can be used, such as the barrage test, the frontal assessment battery, real tasks (e.g., laying the table in a real context), and so on.

We thus provide the scientific and clinical communities with a free, advanced tool that offers a practical and flexible means of assessment directly at the patient's location, as well as a brand-new way of assessing neglect.

### AUTHOR CONTRIBUTIONS

PC and EP wrote the manuscript. PC, EP, SS, MS, CT, and DC collected the literature materials. GR supervised the study. PC, EP, SS, FP, and GR conceived the idea of the study, established the software requirements, and supervised the technological, clinical, and scientific aspects. All authors read and approved the final manuscript.

### FUNDING

This study was partially supported by the Italian funded project "VRehab. Virtual Reality in the Assessment and TeleRehabilitation of Parkinson's Disease and Post-Stroke Disabilities" – RF-2009-1472190.

#### REFERENCES

the potentiality of mobile virtual reality. Technol. Health Care 23, 795–807. doi: 10.3233/THC-151039

cancer patients: a preliminary controlled study. Stud. Health Technol. Inform. 173, 524–528.

mapping. J. Neurol. Neurosurg. Psychiatry 82, 862–868. doi: 10.1136/jnnp.2010.224261

Zoccolotti, P., Pizzamiglio, L., Pittau, P., and Galati, G. (1994). Batteria di Test Per L'esame dell'Attenzione. Roma: Psytest.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Cipresso, Pedroli, Serino, Semonella, Tuena, Colombo, Pallavicini and Riva. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Virtual Reality as a New Approach for Risk Taking Assessment

Carla de-Juan-Ripoll\*, José L. Soler-Domínguez, Jaime Guixeres, Manuel Contero, Noemi Álvarez Gutiérrez and Mariano Alcañiz

Instituto de Investigación e Innovación en Bioingeniería (i3B), Universitat Politècnica de València, Valencia, Spain

Understanding how people behave when facing hazardous situations, how intrinsic and extrinsic factors influence the risk taking (RT) decision making process, and to what extent it is possible to modify their reactions externally are questions that have long interested academics and society in general. In the spheres, among others, of Occupational Safety and Health (OSH), the military, finance and sociology, this topic has multidisciplinary implications because we all constantly face RT situations. Researchers have hitherto assessed RT profiles by conducting questionnaires prior to and after the presentation of stimuli; however, this can lead to the production of biased, non-realistic RT profiles. This is due to the reflexive nature of choosing an answer in a questionnaire, which is remote from the reactive, emotional and impulsive decision making processes inherent to real, risky situations. One way to address this question is to exploit VR capabilities to generate immersive environments that recreate realistic-seeming but simulated hazardous situations. We propose VR as the next-generation tool to study RT processes, taking advantage of the big four families of metrics which can provide objective assessment methods with high ecological validity: the real-world risks approach (high presence VR environments triggering real-world reactions), embodied interactions (more natural interactions eliciting more natural behaviors), stealth assessment (unnoticed real-time assessments offering efficient behavioral metrics) and physiological real-time measurement (physiological signals avoiding subjective bias). Additionally, VR can provide an invaluable tool, after the assessment phase, to train in skills related to RT due to its transferability to real-world situations.

Keywords: virtual reality, risk taking, occupational risks, risk attitude, risk perception, stealth assessment, psychophysiological assessment, embodiment

#### Edited by:

Samuel Greiff, University of Luxembourg, Luxembourg

#### Reviewed by:

Pedro Gamito, Universidade Lusófona, Portugal

Valerie Shute, Florida State University, United States

\*Correspondence: Carla de-Juan-Ripoll cardejua@i3b.upv.es

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 01 September 2018 Accepted: 27 November 2018 Published: 12 December 2018

#### Citation:

de-Juan-Ripoll C, Soler-Domínguez JL, Guixeres J, Contero M, Álvarez Gutiérrez N and Alcañiz M (2018) Virtual Reality as a New Approach for Risk Taking Assessment. Front. Psychol. 9:2532. doi: 10.3389/fpsyg.2018.02532

## INTRODUCTION

Each year, deficient Occupational Safety and Health (OSH) practices cause a global cost of approximately 2680 billion euros (Elsler et al., 2017). Although OSH training has shown positive impacts in the workplace, its effectiveness is below expectations (Robson et al., 2012). It has been demonstrated that the natural differences between individuals can appreciably influence this low effectiveness at several levels (cognitive, motivational and functional, among others) (Motowildo et al., 1997). Risk propensity, defined as the "willingness to take risks" (MacCrimmon and Wehrung, 1990) and risk perception, defined as the individual's assessment of how risky a situation is (Baird and Thomas, 1985), have been shown to have a strong influence on risky decision making behaviors (Sitkin and Weingart, 1995). The measurement of risk taking (RT) attitudes is a recognized challenge for researchers and practitioners. Researchers have mostly employed self-report instruments to assess individual constructs based on theoretical psychological models (Brockhaus Sr, 1980; Ford et al., 1990; Gullone et al., 2000; Portell and Solé, 2001; Steinberg, 2004; Gardner and Steinberg, 2005; Sneddon et al., 2013; Rodríguez-Garzón et al., 2015). We have not found any single model that defines RT; thus, its measurement requires further investigation. Lejuez et al. (2002) developed and validated a laboratory-based behavioral measure of RT (Balloon Analog Risk Task – BART). While this is a validated tool that has been used in several studies, we believe that it is desirable to develop a more ecological system to measure RT. VR provides the capability of creating interactive environments in which users can perform while their behavioral responses are recorded (Parsons, 2015). Accordingly, we propose that virtual environment based assessments are tools that can enhance the ecological validity of the evaluation of the responses evoked (Parsey and Schmitter-Edgecombe, 2013).

In this article we focus on the measurement of RT using physiological and behavioral metrics, with VR being employed as a tool to create immersive situations. We propose to use VR to assess RT attitudes under the paradigm of stealth assessment. VR can provide engaging virtual worlds which will allow real time measurement of RT behaviors.

This paper is comprised of four sections. In the first we review the theoretical framework of RT in the previous literature. In the second we summarize the extant instruments for the measurement of RT behaviors and discuss the current issues that make us believe that there is a need to establish a new approach. In the third we propose VR as a step forward in the assessment of RT. The fourth section briefly discusses the substantial implications raised by the article and our proposals for future research in this field.

### RESEARCH INTO RISK TAKING

RT research can be said to have started with the nuclear debate of the sixties. It was focused on risk acceptance and dealt with factors such as benefits and voluntariness. Since then, several more factors have been proposed for the explanation of RT: trust, trustworthiness and trust propensity (Colquitt et al., 2007); supportive supervision, job autonomy and communication quality (Parker et al., 2001); problem framing and outcome history (Sitkin and Weingart, 1995); expected utility (Kahneman and Tversky, 1986); gender (Byrnes et al., 1999) and boredom (Schroeter et al., 2014).

While these factors have been demonstrated to influence RT, individual differences constitute a key element in decision making processes (see **Figure 1**). According to Rundmo (1996), a biased perception of risk, understood as the subjective evaluation of a risk, can lead to misjudgements of potentially hazardous risk sources. Therefore, if the subjective evaluation of a risk differs from the objective risk, this should be corrected (Risk Research Committee, 1980). Personality traits influence attitude toward risk, prompting risk seeking or risk aversion behaviors. This set of personal, innate, basic characteristics associated with risk was named Intrinsic Risk Attitude (IRA) by Schoemaker (1993) and has been shown to be consistent across various situations and contexts (Dohmen et al., 2011). Additionally, cognitive and affective states are also considered to be key influences on the decision making process. We highlight mood and cognitive load as two main representative factors in this category. Mood has a strong influence on RT. People in a positive mood tend to focus on the benefits of a risky situation much more than those in a neutral mood, making them more likely to undertake risky behaviors (Forgas, 1982, 1995; Forgas and Bower, 1987; Yuen and Lee, 2003). On the other hand, people in a negative mood overestimate risks and try to avoid potential loss and, therefore, think and act more carefully (Jorgensen, 1996). Cognitive load, the amount of mental activity involved in working memory, might also play a role in risk perception, since some kinds of decisions, based on utilitarian judgments, require additional cognitive resources (Greene et al., 2008).

### RISK TAKING MEASURES: CURRENT ISSUES

RT measurement is a non-deterministic and non-standardized process based on different perspectives. Traditionally, most theories of human behavior are based on a model of the human mind that assumes that humans can think and verbalize accurately about their attitudes, emotions and behaviors (Simon, 1976; Brief, 1998). To date, most of the theoretical constructs used in RT assessment are based on explicit measures such as self-reports. However, recent advances in neuroscience have demonstrated that most of the brain processes that regulate our emotions, attitudes and behaviors are not conscious. That is, they are implicit processes that, in contrast to explicit processes, humans cannot verbalize (Barsade et al., 2009; George, 2009; Becker et al., 2011).

Several explicit measures of RT, oriented to evaluate attitude to risk, deferred risk perception or expected risk behavior, have been proposed in the last fifty years. Some authors have employed self-report measures based on questionnaires on compliance with safety practices in the workplace (Parker et al., 2001; Mohamed et al., 2009; Seo et al., 2015), attitude toward risk and organizational commitment (Kivimäki and Kalimo, 1993) and in studies into decision making (Sitkin and Weingart, 1995). On the other hand, some works have drawn on theoretical multidimensional models based on psychological constructs, such as personality (Lejuez et al., 2002; Skeel et al., 2007), impulsivity (Lejuez et al., 2002), sensation seeking (Horvath and Zuckerman, 1993; Lejuez et al., 2002) and situational awareness (Lejuez et al., 2002).

However, as in many other disciplines, pre- and post-experiment questionnaires have an important intrinsic bias, since individuals' cognitive and psychological states when they answer the questionnaires will differ from those they were in when they actually underwent the experiences that the researchers wish to analyse (Kivikangas et al., 2011). As stated by Wang et al. (2015), this tendency is primarily due to "social desirability effects," which can lead to untrue accounts of behavior, attitudes and beliefs (Paulhus, 1991). In addition, there may be different interpretations of specific self-report items, resulting in unreliability and poorer validity (Lanyon and Goodstein, 1997). Lastly, some self-reporting questions require people to possess overt knowledge of their dispositions (Schmitt, 1994), and this is not always the case.

TABLE 1 | VR features and benefits of risk taking measurement.

To our knowledge, the BART (Lejuez et al., 2002) constitutes, to date, the only tool for RT measurement using implicit measures. The authors developed and validated a laboratory-based behavioral measure of risky behaviors. In this task, a balloon was presented in the middle of the screen. Subjects were asked to pump it as much as possible, knowing that it could explode at any time. Participants were told that the more they inflated the balloon without bursting it, the greater the financial reward they would obtain. Although the reliability of this tool has been retested (White et al., 2008), extensive investigations have demonstrated that the correspondence between performance in neuropsychological tests and real-life behaviors is very weak (Manchester et al., 2004; Sbordone, 2008; Bottari et al., 2009).
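The logic of a balloon-pumping task of this kind can be sketched in a few lines of Python. The sketch below is only an illustration of the trade-off between reward and the risk of bursting: the uniform burst point, the `max_pumps` and `reward_per_pump` parameters, and the "adjusted pumps" summary score are assumptions made for the example, not details given in this article.

```python
import random
from dataclasses import dataclass


@dataclass
class BalloonResult:
    pumps: int       # pumps made before stopping or bursting
    burst: bool      # True if the balloon burst
    earnings: float  # reward kept for this balloon


def run_balloon(target_pumps, max_pumps, reward_per_pump, rng):
    """Simulate one balloon for a participant who plans `target_pumps` pumps.

    The burst point is drawn uniformly from 1..max_pumps (an assumption).
    If the participant reaches the burst point, the reward is lost.
    """
    burst_point = rng.randint(1, max_pumps)
    if target_pumps >= burst_point:
        return BalloonResult(pumps=burst_point, burst=True, earnings=0.0)
    return BalloonResult(
        pumps=target_pumps,
        burst=False,
        earnings=target_pumps * reward_per_pump,
    )


def adjusted_pumps(results):
    """Mean number of pumps on balloons that did not burst."""
    kept = [r.pumps for r in results if not r.burst]
    return sum(kept) / len(kept) if kept else 0.0
```

A participant who plans more pumps per balloon earns more on each balloon that survives but loses more balloons, so a summary score based on unburst balloons could serve as a simple behavioral index of risk taking.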

In the BART validation study, researchers employed measures of impulsivity, sensation seeking and behavioral constraint. We consider this a good basis to build on, since each of these constructs has been investigated independently and associated with RT. Firstly, impulsivity has been associated with RT in terms of drug use, drink driving and seatbelt use (Stanford et al., 1996; de Wit, 2009). Some authors have also demonstrated its connection with emotional self-control, inhibition and, especially, the management of frustrating situations (Cooper et al., 2000; Boyer, 2006). In addition, researchers have studied the relationship between the sensation seeking trait and RT in several domains, such as recreation, health, career, finance, safety and social life (Nicholson et al., 2005). Donohew et al. (1999) concluded that sensation seeking is an important factor in sexual RT. According to Tellegen's (1985) model, behavioral constraint is one of the dimensions that compose personality. The behavioral constraint factor encompasses the facets of control, harm avoidance and traditionalism. In the same way, there is empirical evidence of the influence of personality traits on RT attitudes, in particular punishment avoidance (Paulus et al., 2003). An interesting study by Wills et al. (2006) supports this idea in the substance abuse field.

### LIMITATIONS OF CURRENT RISK TAKING MEASURES

As mentioned previously, to date the majority of RT assessment tools have been based on explicit measures and the use of questionnaires.

BART, with its multi-dimensional set of psycho-cognitive influences, represents the only alternative to explicit measures of RT behavior, but its design has some intrinsic limitations that current technologies could help to overcome.

In this regard, we believe that the existing measurement instruments do not reflect real situations, in which the subjects can perform as in real life, which leads to skewed results. In the laboratory, the controlled stimuli given to subjects often do not include variables that are present in real life situations. Thus, the ecological validity of these methodologies, such as BART, is quite limited. Furthermore, these measurement tools do not involve any strong physical interaction, but require only simple actions, such as clicking a mouse, ignoring the influence of the reactions of the rest of the body. In addition, when an individual is subjected to the currently available tests, (s)he is aware that (s)he is being assessed and can alter the outcomes; so we propose stealth assessment as a means of obtaining reliable results about real behaviors unnoticed by the subject. Lastly, we suggest that physiological processes must be considered as important measures of RT, as these measurements are uncontaminated by the participant's answering style, social desirability, interpretations of questionnaire item wording, the limits of his or her memory or by observer bias (Kivikangas et al., 2011). Thus, we propose an alternative measurement method which aims to advance in four specific aspects:


et al., 2012) that argues that taking a meaningful action enhances learning in comparison to passively perceiving that action. This idea has been strongly supported for decades by classical learning theorists such as Piaget and Cook (1952) and Vygotsky (1978). We propose to take advantage of the ideas underlying embodied learning theory and use high level cognitive experiences, involving sensing, acting and thinking, to measure and change attitudes in a deeper, more effective way.


### VIRTUAL REALITY AND RISK TAKING ASSESSMENT

Virtual Reality is a 3D synthetic environment able to simulate real experiences in which subjects can interact as if they were in the real world (Alcañiz et al., 2003). VR provides greater immersion and fidelity and a higher level of active user involvement than traditional methods of assessment and training (Hedberg and Alexander, 1994). In our view, VR constitutes a suitable tool for behavioral measurement, since it complies with the requirements (see **Table 1**) of the four specific aspects discussed in the previous section: (1) the real-world risks approach, (2) embodied learning, (3) stealth assessment and (4) physiological real-time measurement.

(1) According to Slater (2009), the result of immersion through technology is the psychological state of "being there," where the subject essentially forgets that (s)he is in a virtual reality setting. This produces a sense of presence and a "plausibility illusion" which evoke the perception that what is happening in the VR is actual and allows subjects to interact and behave as they might in real life. VR is being used increasingly for simulating natural phenomena and social interactions, since it has been demonstrated that neural mechanisms in humans immersed in a virtual environment are similar to those in real life (Alcañiz et al., 2009). When we talk about training and learning, failure is a necessary ingredient. There is evidence that people who have faced real hazards have a more cautious attitude toward OSH (Cavalcanti and Soares, 2012). Hazards in real life can involve serious danger. This is why VR emerges as a potential medium for RT assessment and training, allowing users to operate, without risks, in a quasi-real environment (Amokrane et al., 2008). VR allows the exposure of a person to a risky situation and the activation of high fidelity cognitive processes and behaviors due to the plausibility of the immersion.

(2) VR environments allow users to take part in an embodied learning experience, mainly through physical interactions (Kilteni et al., 2012). Going further with this concept (Dourish, 1999, unpublished), we consider a virtual interaction to be fully embodied when it is believable, in the sense of using our body coherently as we do in the real world. The dual-process theory of moral judgment, when it refers to moral dilemmas, makes a distinction between personal and impersonal dilemmas (Greene et al., 2001; Greene, 2009): personal dilemmas are conflicts in which the subject experiences the situation in the first person and actions are carried out physically (e.g., pushing). Conversely, impersonal dilemmas are seen from the outside, and the subjects do not take overt physical actions, but make only minor responses, such as pressing switches or levers. Based on this distinction, it has been demonstrated that when actions are based on the first person perspective and involve physical acts, the subjects tend to make more emotional decisions (Greene et al., 2001; Amit et al., 2014).

(3) Stealth assessment can also be defined as a performance-based method, in which what is evaluated is latent (Rupp et al., 2010). Under this paradigm, embedding assessments in immersive virtual worlds is an innovative approach (Shute and Spector, 2008) that, in our view, is an improvement from the standpoint of ecological validity.

(4) Regarding physiological real-time measurement, VR provides interactive and multimodal sensorial stimuli that provide unique advantages over other methodologies in neuroscientific investigation (Bohil et al., 2011). Thus, due to technological advances, researchers can now use accurate, affordable devices to obtain physiological measures which have been found to be more effective than self-reported measures as they (a) are not intrusive, (b) do not rely on participants' self-assessment of their emotional or cognitive experience, and (c) can detect changes in participants in real time. We have previous experience in combining VR technology with brain activity measures, and these results have shown that interactive virtual environments allow the measurement of emotional responses (Marín-Morales et al., 2018).

For these reasons, customizable, domain independent VR environments, in which individuals can, to a certain extent, act freely and react naturally to different risks or hazards, open up to researchers an uncharted field of information about RT attitudes and behaviors. Together, these requirements may result in an application that includes a virtual environment with a specific narrative that confronts users with risky situations. This should be designed following stealth assessment methodology, and would allow physiological and behavioral measurement to provide information about individual decision making in the field of RT. We now give an example of how this tool might work: the user could be in a virtual environment consisting of a path which (s)he must cover from start to finish in the shortest possible time. Suddenly, (s)he meets a bifurcation, where (s)he has to choose between a safe but long path (less risk, less potential benefit) and a dangerous but short path (higher risk, higher potential benefit). During this decision making process, we could take measures of galvanic skin response to assess emotional activation, and behavioral measures such as reaction time and the decision made by the user. As a result, we could obtain information about the specific weight of emotional processes in RT, and their influence on behavior.
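The measures mentioned in this example (reaction time, the decision made, and galvanic skin response) could be combined per decision as in the following minimal Python sketch. The `DecisionTrial` structure, the `summarize` function, the two-second baseline window, and the use of a simple mean difference as a proxy for emotional activation are all assumptions made for illustration; they do not describe an existing implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class DecisionTrial:
    """One encounter with the bifurcation (hypothetical structure)."""
    fork_shown_at: float            # seconds, when the bifurcation appears
    choice_made_at: float           # seconds, when the user commits to a path
    choice: str                     # "safe" or "risky"
    gsr: List[Tuple[float, float]]  # (timestamp in s, skin conductance)


def summarize(trial, baseline_window=2.0):
    """Return reaction time, the decision, and GSR change from baseline."""
    start = trial.fork_shown_at
    # Samples recorded just before the bifurcation serve as the baseline.
    baseline = [v for t, v in trial.gsr if start - baseline_window <= t < start]
    # Samples recorded while the user is deciding.
    during = [v for t, v in trial.gsr if start <= t <= trial.choice_made_at]
    gsr_change = mean(during) - mean(baseline) if baseline and during else None
    return {
        "reaction_time_s": trial.choice_made_at - start,
        "risky": trial.choice == "risky",
        "gsr_change": gsr_change,
    }
```

Because such summaries would be computed from events logged while the user simply plays through the scenario, no explicit question needs to be asked, which is in line with the stealth assessment approach.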

Our future research aims to study to what extent a VR tool is able to measure the cognitive and affective processes that influence RT. Furthermore, we will focus on how virtual interactions and narratives weigh on the decision making process.

### CONCLUSION

RT measurement is a major challenge for companies and researchers. Investigations into behavioral measurement are at a turning point as, due to the potential of technological advances, we can generate virtual worlds to evaluate and, going further, train people in certain skills and competences. We suggest that virtual reality is the most appropriate medium for assessing attitudes to risk and risk perception, conditioning factors in the RT process, due to its immersive capabilities. We propose to undertake future investigations into real-world risks, embodied interactions, stealth assessment and physiological real-time measurement as differentiating elements in RT assessment. If we can study and measure the real, unbiased reactions of people facing risky or hazardous situations, it will be possible to create customized training programs to fit their individual characteristics. This can be expected to contribute to the improvement of OSH training programs, reducing work-related incidents and, consequently, costs for companies.

### AUTHOR CONTRIBUTIONS

MA, CdJ, and NÁ were responsible for the general idea of the paper. CdJ and JS participated in drafting the work, while JG and MC revised it in-depth and provided new ideas thanks to their previous experience. MA supervised the entire work, revised the manuscript and approved the final version to be submitted. All authors made substantial contributions to the conception and development of the work.

### FUNDING

This work was supported by the Spanish Ministry of Economy, Industry and Competitiveness funded projects "Advanced Therapeutic Tools for Mental Health" (DPI2016-77396-R), and "Assessment and Training on Decision Making in Risk Environments" (RTC-2017-6523-6) (MINECO/AEI/FEDER,UE).




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 de-Juan-Ripoll, Soler-Domínguez, Guixeres, Contero, Álvarez Gutiérrez and Alcañiz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Automated Feedback Can Improve Hypothesis Quality

Karel A. Kroeze<sup>1,2</sup>\*, Stéphanie M. van den Berg<sup>2</sup>, Ard W. Lazonder<sup>3</sup>, Bernard P. Veldkamp<sup>2</sup> and Ton de Jong<sup>1</sup>

*<sup>1</sup> Department of Instructional Technology, University of Twente, Enschede, Netherlands, <sup>2</sup> Department of Research Methodology, Measurement and Data Analysis, University of Twente, Enschede, Netherlands, <sup>3</sup> Behavioural Science Institute, Radboud University, Nijmegen, Netherlands*

Stating a hypothesis is one of the central processes in inquiry learning, and often forms the starting point of the inquiry process. We designed, implemented, and evaluated an automated parsing and feedback system that informed students about the quality of hypotheses they had created in an online tool, the hypothesis scratchpad. In two pilot studies in different domains ("supply and demand" from economics and "electrical circuits" from physics) we determined the parser's accuracy by comparing its judgments with those of human experts. A satisfactory to high accuracy was reached. In the main study (in the "electrical circuits" domain), students were assigned to one of two conditions: no feedback (control) and automated feedback. We found that the subset of students in the experimental condition who asked for automated feedback on their hypotheses were much more likely to create a syntactically correct hypothesis than students in either condition who did not ask for feedback.

#### Edited by:

*Samuel Greiff, University of Luxembourg, Luxembourg*

#### Reviewed by:

*Trude Nilsen, University of Oslo, Norway*

*Jessica Andrews-Todd, Educational Testing Service, United States*

\*Correspondence:

*Karel A. Kroeze k.a.kroeze@utwente.nl*

#### Specialty section:

*This article was submitted to Educational Psychology, a section of the journal Frontiers in Education*

Received: *14 September 2018* Accepted: *11 December 2018* Published: *04 January 2019*

#### Citation:

*Kroeze KA, van den Berg SM, Lazonder AW, Veldkamp BP and de Jong T (2019) Automated Feedback Can Improve Hypothesis Quality. Front. Educ. 3:116. doi: 10.3389/feduc.2018.00116*

Keywords: automated feedback, hypotheses, inquiry learning, context-free grammars, online learning environment

### INTRODUCTION

Active forms of learning are seen as key to acquiring deep conceptual knowledge, especially in science domains (Hake, 1998; Freeman et al., 2014). One of the active forms of learning is inquiry learning. Inquiry learning has been defined in many different ways, but at its core the method starts from questions for which students need to find answers [see e.g., (Prince and Felder, 2007)]. In the current work, we focus on one of the ways inquiry is used in instruction, namely "learning science by doing science": students are expected to form and test hypotheses by performing experiments and analyzing data. In following an inquiry cycle, students learn both science content and the scientific method. In this study, we focus on the practice of the scientific method, and in particular on the creation of hypotheses.

Most models of inquiry-based learning encompass an orientation and conceptualization phase that enables students to familiarize themselves with the topic of investigation. Common activities during orientation are studying background information and conducting a few explorative experiments with the equipment at hand. The intended outcome of these initial explorations is the formation of theories and ideas, formalized in hypotheses (Pedaste et al., 2015). Hypotheses are integral to the inquiry cycle: they direct students' attention to specific aspects of the research problem and, hence, facilitate experimental design and data interpretation (Klahr and Dunbar, 1988; Zimmerman, 2007). In a classic study, Tschirgi (1980) found that both children and adults design more conclusive experiments when trying to test a hypothesis that contradicts prior evidence. Hypothesis testing also increases the amount of domain knowledge students gain from an inquiry (Burns and Vollmeyer, 2002; Brod et al., 2018), which is probably due to the fact that hypotheses, regardless of their specificity and truth value, provide direction to students' inquiry process (Lazonder et al., 2009).

The importance of hypothesizing nevertheless stands in marked contrast with its occurrence in high school science classes. Research has consistently shown that inquiry is a complex process in which students make mistakes (Mulder et al., 2010). Specifically, students of all ages have problems in formulating hypotheses, particularly when they are unfamiliar with the topic of inquiry (Gijlers and de Jong, 2005; Mulder et al., 2010), and when experimental data is anomalous (Lazonder, 2014). As a consequence, few students generate hypotheses of their own accord, and when they do, they often stick to a single hypothesis that is known to be true (i.e., confirmation bias) or formulate imprecise statements that cannot be tested in research. These natural tendencies demonstrate that unguided inquiry learning is likely to be ineffective (Mayer, 2004; Kirschner et al., 2006; de Jong and Lazonder, 2014). However, guided inquiry learning has been shown to compare favorably to both direct instruction (D'Angelo et al., 2014) and unguided inquiry learning (Furtak et al., 2012), and helps foster a deeper conceptual understanding (Alfieri et al., 2011).

Inspired by these positive findings we set out to design and evaluate a software scaffold that presented students with automatically generated feedback on the quality of their hypotheses.

### THEORETICAL FRAMEWORK

### Adaptive and Automated Scaffolding

Inquiry learning often takes place in virtual or remote laboratories and, to be successful, should be supplemented with guidance (de Jong and Lazonder, 2014). Furthermore, de Jong and Lazonder (2014) postulated that different types of students require different types of guidance. Recent work on differentiated guidance lends credence to this argument, finding a moderating effect of students' age (Lazonder and Harmsen, 2016) and prior knowledge (van Riesen et al., 2018) on learning activities and knowledge gains. Moreover, Furtak et al. (2012) showed teacher-led inquiry activities to be more effective than student-led inquiry, implying that teachers are effective suppliers of guidance. However, given that teachers' time is an increasingly valuable resource, several adaptive software agents have recently been developed that support teachers on specific tasks and adapt the guidance to students' characteristics. While Belland et al. (2016) found no added effect of limited adaptive scaffolding over static scaffolding, intelligent tutoring systems (Nye et al., 2014), adaptive environments (Durlach and Ray, 2011; Vandewaetere et al., 2011), and automated feedback (Gerard et al., 2015, 2016) have all shown promising results. The common-sense conclusion appears to be that the more guidance is adapted to the individual student, the better the guidance, and thus the student, performs. Indeed, Pedaste et al. (2015) recently identified the development of "virtual teacher assistants that analyse and respond to individual learners to create meaningful learning activities" as one of the main challenges in the field.

Although adaptive and automated elements are increasingly common in online learning environments (e.g., Aleven et al., 2010; Lukasenko et al., 2010; Vandewaetere et al., 2011; Gerard et al., 2015, 2016; Ryoo and Linn, 2016), they have typically been designed and implemented for a single learning activity in a specific domain. The reason for this is simple; even adaptive guidance for a single well-defined learning task generally requires years of research and development. Data must be gathered and coded, models have to be trained and fitted, appropriate feedback has to be fine-tuned and a digital environment has to be developed. Each of these steps involves the input of experts from different fields; teachers, statisticians, educational researchers, and computer scientists. As a result, scaffolds in multi-domain environments such as Go-Lab (de Jong et al., 2014) and WISE (Linn et al., 2003) generally do not adapt to the individual student, nor can they automatically assess products or provide context-sensitive feedback. The hypothesis scaffold we describe and test in this paper aims to fill this gap.

We have been unable to find any existing literature on the automated scoring of and feedback on free-text hypotheses. In contrast, a variety of increasingly sophisticated natural language processing (NLP) techniques have been employed for automated essay scoring. However, the techniques applied to scoring essays typically require a large amount of training data, and even when training data is available they are unlikely to provide the level of detail on the underlying structure of hypotheses required to give meaningful feedback. Training data is not readily available for hypotheses, and would be expensive to gather (Shermis and Burstein, 2013).

Anjewierden et al. (2015) noted that the "language" of hypotheses is a subset of natural language with a specific structure. They suggested using a domain-specific list of variables and categorical values (the lexicon), in conjunction with a grammar of hypotheses. Together, the lexicon and grammar could be used to create a hypothesis parser that is robust and can be adapted to different domains with relative ease. The work reported here attempts to implement such a context-free grammar.

### Feedback

The informative tutoring feedback model [ITF, (Narciss, 2006, 2008)] distinguishes between internal feedback and external feedback, and a wide variety of feedback types. Internal feedback is provided by individual cognitive monitoring processes (Ifenthaler, 2011), whereas external feedback can be provided by, for example, teachers, peers, or automated scaffolds. Both types of feedback may conflict with or reinforce an internal reference value. Careful feedback design can help students regulate their learning process, particularly when internal and external feedback conflict (Narciss, 2008).

The function of feedback may be cognitive, meta-cognitive, or motivational, and a distinction can be made between simple (e.g., knowledge of performance, correct result) and elaborated (e.g., knowledge about task constraints, mistakes, and concepts) forms of feedback. These components broadly overlap with outcome, corrective and explanatory feedback types (e.g., Johnson and Priest, 2014). In a second-order meta-analysis on the effects of feedback, Hattie and Timperley (2007) prescribed that good feedback should set clear goals (feed up), inform the student of their progress (feed back), and provide steps to improve (feed forward). Finally, immediate feedback has been shown to give larger benefits than delayed feedback (Van der Kleij et al., 2015).

### Research Goal and Context

This project is performed in the Go-Lab ecosystem (de Jong et al., 2014). Go-Lab is an online environment where teachers and authors can share online and remote laboratories (Labs) and scaffolding applications (Apps). Apps and Labs can, together with multimedia material, be combined to create Inquiry Learning Spaces (ILS), which can also be shared on the Go-Lab environment. **Figure 1** shows a screenshot of a typical ILS. This ILS is organized in six phases that follow an inquiry cycle (in this case: Orientation, Conceptualization, Investigation, Interpretation, Conclusion, and Discussion), and can be navigated freely.

The hypothesis scratchpad app [**Figure 2**; (Bollen and Sikken, 2018)] is used to support students with hypothesis generation. This study aimed to create an adaptive version of the hypothesis scratchpad that can scaffold the individual student in hypothesizing in any domain, with a minimum of set-up time for teachers. This new version will need to (1) identify mistakes in students' hypotheses, and (2) provide students with appropriate feedback to correct these mistakes. If the app achieves both of these goals, it will be a considerable step toward "empowering science teachers using technology-enhanced scaffolding to improve inquiry learning" (Pedaste et al., 2015).

### DESIGN

For this project the hypothesis scratchpad currently available in Go-Lab has been extended. An automated feedback system was developed that can identify flaws in students' hypotheses and provide tailored feedback that enables students to correct their mistakes. The aim is to improve the quality of students' hypotheses.

The following sections will (1) describe the main components of hypotheses and the criteria used to assess them, (2) introduce the process of parsing hypotheses and applying criteria, (3) present the feedback given to students, and (4) formalize the outcome measures and statistical analyses used.

### Criteria

Quinn and George (1975) were the first to formally define a set of criteria for evaluating hypotheses: (1) it makes sense; (2) it is empirical, a (partial) scientific relation; (3) it is adequate, a scientific relation between at least two variables; (4) it is precise, a qualified and/or quantified relation; and (5) it states a test, an explicit statement of a test. Subsequent research on hypothesis generation has broadly followed the same criteria, or a subset thereof. Van Joolingen and De Jong (1991, 1993) used a "syntax" and a "precision" measure that correspond roughly with the "it makes sense" and "precise" criteria of Quinn and George. Mulder et al. (2010) used a "specificity" scale, using criteria comparable to those of Quinn and George.

Based on the criteria used by Quinn and George, and the measures used by Van Joolingen and de Jong, we developed a set of criteria that could be implemented in automated feedback. **Table 1** lists these criteria, providing a short explanation and examples from the electrical circuits domain for each criterion. In the automated feedback, the first two criteria are straightforward in that they rely on the presence of certain words. The remaining criteria are established using a context-free grammar parser, which is described in the next section.
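
The two word-based criteria can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the word lists and function names are hypothetical, and counting distinct variables (rather than every mention) is an assumption.

```python
import re

# Hypothetical mini-lexicon for the electrical circuits domain (illustrative only).
VARIABLES = {"bulbs", "voltage", "brightness"}
MODIFIERS = {"increases", "decreases"}


def tokenize(hypothesis: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", hypothesis.lower())


def has_two_variables(hypothesis: str) -> bool:
    """Criterion 1 (assumed reading): at least two distinct lexicon variables."""
    return len(VARIABLES & set(tokenize(hypothesis))) >= 2


def has_modifier(hypothesis: str) -> bool:
    """Criterion 2: at least one modifier word from the lexicon."""
    return any(token in MODIFIERS for token in tokenize(hypothesis))
```

A check of this kind only establishes that the right words are present; it says nothing about whether they form a well-structured sentence, which is why the remaining criteria need a parser.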

### Parser

To detect mistakes, the automated system needs to interpret hypotheses on the criteria listed in **Table 1**. Given the observation that hypotheses are a relatively structured subset of natural language (Anjewierden et al., 2015), we can define a context-free grammar [CFG, (Chomsky, 1956)] that covers all well-structured hypotheses.

CFGs can be used to define natural languages, and are ideally suited to define heavily structured languages [e.g., programming languages, (Chomsky, 1956)]. A CFG is comprised of a set of production rules. All the sentences that can be produced by the repeated application of these rules are the formal language of that grammar.

The grammar that defines hypotheses looks something like the following<sup>1</sup>:

```
HYPOTHESIS -> if ACTION then ACTION
HYPOTHESIS -> ACTION if ACTION
ACTION -> VAR INTERACTOR VAR
ACTION -> VAR MODIFIER
ACTION -> MODIFIER VAR
ACTION -> ACTION and ACTION
VAR -> PROPERTY VAR
VAR -> bulbs
VAR -> voltage
VAR -> brightness
INTERACTOR -> is greater than
INTERACTOR -> is smaller than
INTERACTOR -> is equal to
MODIFIER -> increases
MODIFIER -> decreases
QUALIFIER -> series circuit
QUALIFIER -> parallel circuit
```
Each line is a production rule: the left-hand side of the rule can be replaced by the right-hand side. Uppercase words refer to further rules (they are non-terminal) and lowercase words refer to tokens (they are terminal). A token can be anything, but in our case, they are (sets of) words, e.g., "voltage" or "is greater than."
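
To make the idea of production rules concrete, the following Python sketch encodes a reduced version of the grammar above as a dictionary and checks whether a sentence belongs to its formal language. It is an illustration only: the `PROPERTY` and `QUALIFIER` rules are left out because the excerpt does not define or use them, multi-word tokens are split into single words, and the span-based recognizer is a simple stand-in rather than the Earley algorithm used by the actual parser. The names `GRAMMAR` and `recognizes` are hypothetical.

```python
from functools import lru_cache

# Reduced version of the example grammar; keys are non-terminals,
# every other symbol is treated as a terminal word.
GRAMMAR = {
    "HYPOTHESIS": [["if", "ACTION", "then", "ACTION"], ["ACTION", "if", "ACTION"]],
    "ACTION": [
        ["VAR", "INTERACTOR", "VAR"],
        ["VAR", "MODIFIER"],
        ["MODIFIER", "VAR"],
        ["ACTION", "and", "ACTION"],
    ],
    "VAR": [["bulbs"], ["voltage"], ["brightness"]],
    "INTERACTOR": [
        ["is", "greater", "than"],
        ["is", "smaller", "than"],
        ["is", "equal", "to"],
    ],
    "MODIFIER": [["increases"], ["decreases"]],
}


def recognizes(sentence: str, start: str = "HYPOTHESIS") -> bool:
    """Return True if the sentence can be produced from the start symbol."""
    tokens = tuple(sentence.lower().split())

    @lru_cache(maxsize=None)
    def derives(symbol: str, i: int, j: int) -> bool:
        """True if `symbol` can produce exactly tokens[i:j]."""
        if symbol not in GRAMMAR:  # terminal: must match a single word
            return j == i + 1 and tokens[i] == symbol
        return any(matches(tuple(rhs), i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def matches(rhs: tuple, i: int, j: int) -> bool:
        """True if the symbol sequence `rhs` can produce exactly tokens[i:j]."""
        if not rhs:
            return i == j
        head, rest = rhs[0], rhs[1:]
        # Each symbol consumes at least one word, so leave room for the rest.
        return any(
            derives(head, i, k) and matches(rest, k, j)
            for k in range(i + 1, j - len(rest) + 1)
        )

    return derives(start, 0, len(tokens))
```

Because every rule consumes at least one word, each recursive call works on a strictly shorter span, so the left-recursive rule `ACTION -> ACTION and ACTION` does not cause infinite recursion here.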

<sup>1</sup>For the complete grammar, see https://github.com/Karel-Kroeze/adaptivehypothesis-grammars.

Consider the following hypothesis: "if the number of bulbs in a series circuit increases, the brightness of the bulbs decreases." Applying our grammar, we can decompose this hypothesis as per **Figure 3**. Although this decomposition provides the structure of the hypothesis, it still does not contain the semantic information necessary to evaluate the criteria.

If we add semantic information to each of the tokens, and rules on how to unify this information to each of the production rules, we can extract all relevant information from the hypothesis (Knuth, 1968; Shieber, 2003). **Figure 4** shows an example of the final parse result<sup>2</sup> which contains all the information needed to evaluate the criteria discussed.
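
A small Python sketch may help to show what attaching semantic information to tokens and combining it in production rules can look like. Everything in it is an assumption made for illustration: the feature names, the `LEXICON` contents, the restriction to a single sentence pattern, and the choice to treat the if-clause as the independent variable. It does not reproduce the parse result of **Figure 4**.

```python
# Hypothetical token features; the real lexicon and feature names may differ.
LEXICON = {
    "bulbs": {"category": "VAR", "name": "bulbs"},
    "voltage": {"category": "VAR", "name": "voltage"},
    "brightness": {"category": "VAR", "name": "brightness"},
    "increases": {"category": "MODIFIER", "direction": "increase"},
    "decreases": {"category": "MODIFIER", "direction": "decrease"},
}


def lookup(token: str, category: str) -> dict:
    """Fetch a token's features and check that it has the expected category."""
    features = LEXICON.get(token)
    if features is None or features["category"] != category:
        raise ValueError(f"expected {category}, got {token!r}")
    return features


def action(var: dict, modifier: dict) -> dict:
    """Semantic rule for ACTION -> VAR MODIFIER: combine both feature sets."""
    return {"variable": var["name"], "direction": modifier["direction"]}


def parse_if_then(sentence: str) -> dict:
    """Handle only 'if VAR MODIFIER then VAR MODIFIER' (one production path)."""
    tokens = sentence.lower().split()
    if len(tokens) != 6 or tokens[0] != "if" or tokens[3] != "then":
        raise ValueError("sentence does not match the simplified pattern")
    condition = action(lookup(tokens[1], "VAR"), lookup(tokens[2], "MODIFIER"))
    consequence = action(lookup(tokens[4], "VAR"), lookup(tokens[5], "MODIFIER"))
    # Assumption: the if-clause holds the independent variable.
    return {"independent": condition, "dependent": consequence}
```

With a structure like this, a criterion such as "manipulates exactly one variable" becomes a simple inspection of the result instead of a search through the raw text.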

<sup>2</sup>The parser was created using the Nearley.js package (Hardmath123, 2017), which implements the Earley context-free parsing algorithm (Earley, 1970). The source code of the parser is available on GitHub: https://github.com/Karel-Kroeze/adaptive-hypothesis-utils/.

TABLE 1 | Scoring criteria.


### Feedback

The automated hypothesis scratchpad gives students the opportunity to request feedback. **Figure 5** shows an example of the automated hypothesis scratchpad, with the feedback button highlighted (the highlight is not part of the interface).

**Table 2** gives an overview of the feedback used. The feedback follows the guidelines set by Hattie and Timperley (2007) in that it informs students of their progress, is specific about the mistakes made, and—where relevant—suggests modes of improvement. The first three criteria from **Table 2** are required conditions; if a hypothesis does not have variables, a modifier or cannot be parsed, the other criteria are not shown. Conversely, if these criteria are met, feedback is presented only on the other relevant criteria.
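
The gating of feedback described above can be sketched in a few lines of Python. The criterion labels and the function name are hypothetical, and the sketch assumes that feedback is shown only for criteria that were not met; the actual tool may also report on criteria that were passed.

```python
# Hypothetical criterion labels; the first three act as required conditions.
REQUIRED = ("variables", "modifier", "syntax")
OTHER = ("manipulation", "cvs", "qualified")


def criteria_to_report(results: dict[str, bool]) -> list[str]:
    """Select the unmet criteria for which feedback would be shown."""
    failed_required = [c for c in REQUIRED if not results.get(c, False)]
    if failed_required:
        # A required condition failed: feedback on the other criteria is withheld.
        return failed_required
    return [c for c in OTHER if not results.get(c, False)]
```

The two-stage selection keeps feedback focused: a student whose hypothesis cannot be parsed is not also told about control of variables, which could not be assessed reliably in that case.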

Feedback was presented to the student in textual form in a pop-up window and was shown immediately after a student requested it by clicking the feedback button. Feedback was never presented automatically. After receiving feedback, students could revise their hypothesis, and ask for feedback again. No explicit limits were placed on the number of times students could ask for feedback.

### Measures

Three outcome measures are of interest: (1) do students use the feedback tool, (2) does the parser correctly classify mistakes, and (3) do students' hypotheses improve after receiving feedback.

All student actions within a Go-Lab inquiry learning space are logged to a database. Specifically, the history of all hypotheses is tracked, including requests for feedback, and the feedback received. Feedback counts can thus be readily determined from the log files. A snapshot of a hypothesis is made whenever a student asks for feedback, and of the final state of the hypothesis. The collection of snapshots for a hypothesis creates a "story" for that hypothesis, tracking it over time.

The validity of classifications made by the parser is evaluated by calculating an inter-rater reliability between the results of the parser and human coders. The human coders were instructed to code as a teacher, ignoring small mistakes in spelling and syntax if the intention of a hypothesis was clear. To train the human coders, a sample of snapshots was coded, and any disagreements were discussed. After reaching agreement, each coder independently coded the remaining snapshots. Agreement is calculated using Cohen's κ, and interpreted using the rules of thumb of Landis and Koch (1977).
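
For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's κ for two raters and maps it to the Landis and Koch (1977) labels. It is a plain-Python illustration with hypothetical function names, not the analysis script used in the study.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)


def landis_koch(kappa: float) -> str:
    """Verbal label for kappa following Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"
```

Here one "rater" would be the parser and the other a human coder, with one pass/fail code per snapshot and criterion.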

Each snapshot is given a score based on the number of criteria passed, resulting in a score in a 0−k range, where k is the number of criteria used (three in the first pilot, six in the second pilot and final experiment). Improvement of hypotheses is evaluated by comparing the score for a snapshot to the score for the previous snapshot. The quality of a hypothesis is the quality of the final snapshot of that hypothesis.
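
The scoring scheme can be summarized in a short Python sketch. The criterion labels and the representation of a snapshot as a dictionary of pass/fail values are assumptions made for illustration; the logged data may be structured differently.

```python
# Hypothetical labels for the six criteria used in the second pilot and main study.
CRITERIA = ("variables", "modifier", "syntax", "manipulation", "cvs", "qualified")


def score(snapshot: dict[str, bool]) -> int:
    """Number of criteria passed by one snapshot (range 0 to k)."""
    return sum(bool(snapshot.get(criterion, False)) for criterion in CRITERIA)


def improvements(story: list[dict[str, bool]]) -> list[int]:
    """Score change between each snapshot and the one before it."""
    scores = [score(snapshot) for snapshot in story]
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


def quality(story: list[dict[str, bool]]) -> int:
    """Quality of a hypothesis: the score of its final snapshot."""
    return score(story[-1])
```

A story with positive values in `improvements` corresponds to a hypothesis that got better after each feedback request.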

If feedback is effective, we expect to see that students who have feedback available create higher quality hypotheses, and that hypothesis quality increases after students ask for feedback: each consecutive snapshot should have a higher quality than the last.

During the study, it became apparent that the aggregate score does not follow a parametric distribution, and therefore could not be used as an outcome measure. The variables and modifier criteria were satisfied by almost all students in our samples. The syntax criterion was often indicative of success on the manipulation, CVS and qualified criteria. Thus, even though the variables, modifier and CVS criteria might be important from a science education perspective, the syntactically correct criterion was used as an indicator for hypothesis quality.

Multilevel logistic models (i.e., generalized linear mixed models) were used to account for the inherent group structure in the data, controlling for student and class effects where appropriate. The models comprised two levels, students and classes. All reported effects are on the student level. To fit the models, we used R (R Core Team, 2018) and the package "lme4" (Bates et al., 2015). The scripts used in analyses are deposited along with the raw and generated datasets at DANS (Kroeze, 2018).

### FIELD STUDIES

Three field studies were conducted. An initial pilot study was conducted with an early version of the hypothesis parser to assess the feasibility of automated parsing of hypotheses using a context-free grammar. Following that, a second pilot study was conducted with the complete version of the parser to identify any remaining issues with the parser and ILS before moving on to the final experiment. The final experiment used a quasi-experimental design to assess the benefit of the tool in improving students' hypotheses. Each of these studies is described in more detail in the following sections.

### First Pilot Study

#### Participants

Four classes of 13- to 14-year-old secondary education students (n = 99), spread over three HAVO classes (preparing for a university of applied science, n = 76) and one VMBO class (preparing for vocational education, n = 23) at a local high school participated in the pilot. Students had already studied the subject matter (supply and demand) as part of their regular curriculum and had previously participated in studies using Go-Lab ILSs and a version of the hypothesis scratchpad that did not provide feedback.

#### Materials and Procedure

The pilot revolved around a short ILS set in the supply & demand domain, where students were introduced to the interactions between price, supply, and demand. The ILS was created in collaboration with a participating economics teacher. Each class performed the study in a single 50-min session. At the beginning of a session, students were given an oral introduction detailing how to use the environment and refreshing them on what a hypothesis is. They were then asked to open the inquiry learning space, where they were first presented with information on the domain. They were then asked to create as many hypotheses about this domain as possible in the automated hypothesis scratchpad, and to use the feedback mechanism when they were stuck or wanted to check their hypothesis. An initial version of the parser was used that could detect the first three criteria: it has two variables, it has a modifier, and it is a syntactically correct sentence. Students were regularly encouraged and reminded to create as many hypotheses as possible<sup>3</sup>, but no attempt was made to force the creation of hypotheses or the use of the feedback tool. The session was concluded with a small user satisfaction questionnaire. During each session, the researcher and the classroom teacher monitored the class, answering process-related questions, and eliciting feedback if any out-of-the-ordinary situations or interactions were encountered.

#### TABLE 2 | Feedback for each criterion.


*[HYPOTHESIS], [INDEPENDENT], [DEPENDENT], and [VARIABLE] will be dynamically replaced with the actual hypothesis and variables used by the student and recognized by the parser. The feedback has been translated from the Dutch original used in the experiments.*

*<sup>a</sup>Used when a hypothesis starts valid but is incomplete (partial parse).*

*<sup>b</sup>Used when a hypothesis cannot be parsed (nonsense, or syntax error).*

*<sup>c</sup>Used when the independent variable is not manipulated.*

*<sup>d</sup>Used when multiple independent variables are manipulated.*

#### Results

A total of 979 hypotheses were collected from 96 students. Most students created three to five hypotheses and asked for feedback multiple times over the course of the experiment. One student asked for feedback 84 times and was removed as an outlier.

<sup>3</sup>Unfortunately, during one of the HAVO sessions the teacher instructed students to create 'at least 4' hypotheses, which was immediately interpreted as 'create 4 hypotheses'.

Inter-rater reliability between the parser and two human experts was almost perfect on all three criteria (Cohen's κ = 0.81 to 0.96), showing high parser accuracy. Hypotheses for which students requested and received feedback at least once were more likely to be correct on all criteria. This relation is visible in **Figure 6**, and statistically significant using a multilevel logistic model estimating the probability of a syntactically correct hypothesis by the number of feedback requests, corrected for student and class effects, gender, and age (β<sub>feedbackCount</sub> = 1.00, SE<sub>β</sub> = 0.17, CI<sub>OR</sub> = [1.93, 3.83], p < 0.001), where β<sub>feedbackCount</sub> is the effect of each additional feedback request, and CI<sub>OR</sub> the confidence interval of the odds ratio.
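
The link between the reported coefficient and the odds ratio can be made explicit with a short Python sketch. It assumes a 95% Wald interval, exp(β ± 1.96 × SE); the interval method actually used in the analysis is not stated in the text, so small differences from the reported bounds are to be expected, also because β and SE are rounded.

```python
import math


def odds_ratio_ci(beta: float, se: float, z: float = 1.96) -> tuple[float, float, float]:
    """Odds ratio and Wald confidence bounds for a logistic regression coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Coefficient and standard error as reported for the first pilot study.
odds_ratio, lower, upper = odds_ratio_ci(beta=1.00, se=0.17)
print(f"OR = {odds_ratio:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Read this way, a coefficient of 1.00 means that each additional feedback request multiplies the odds of a syntactically correct hypothesis by roughly 2.7.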

#### Discussion

The first pilot took place under test conditions; students were told to create as many hypotheses as possible, and the learning space was only there to provide a setting for hypotheses to be created. Such conditions are different from usual educational practice. Nevertheless, high parser accuracy and significantly increased quality of hypotheses showed that a parser is feasible, and that a hypothesis scratchpad enhanced with automated scoring and feedback is promising.

Therefore, a second pilot study was conducted using an expanded version of the context-free grammar that included all criteria listed in **Table 1**. In addition, the automated scratchpad was embedded in a full ILS, aligning much closer to how the tool is likely to be used in practice.

### Second Pilot Study

#### Participants

Participants came from one HAVO class of 13- to 14-year-old secondary education students (n = 27) at a local high school. The students had recently been introduced to electrical circuits as part of their regular curriculum but were familiar with neither Go-Lab environments nor the hypothesis scratchpad prior to the experiment.

#### Materials and Procedure

A short ILS in the electrical circuits domain that could be completed in a single 50-min session was created in collaboration with participating teachers. At the beginning of a session, students were given an oral introduction detailing how to use the tools in the ILS and refreshing them on what a hypothesis is. They were then asked to open the ILS, where they were presented with a short pre-test, followed by some information on the domain. To guide students' hypothesis construction, they were asked to enter two predictions about the change in brightness of lightbulbs in series and parallel circuits after adding another bulb. In the next steps, students were asked to turn these predictions into hypotheses in the automated hypothesis scratchpad, and design an experiment in the Experiment Design app [see e.g., (van Riesen et al., 2018)] to test their hypotheses. Finally, students were given time to create an experimental setup in the Circuit Lab virtual laboratory, test their hypotheses, and enter their conclusions.

All student actions took place in the ILS, which encompassed a full inquiry cycle, from orientation to conclusion. This created an environment more likely to occur in real educational settings. An expanded version of the automated hypothesis scratchpad was used, designed to be able to classify and give feedback on all the relevant criteria.

During the session, the researcher and the classroom teacher monitored the class, answering process-related questions and eliciting feedback if any out-of-the-ordinary situations or interactions were encountered.

#### Results

Both the researcher and the classroom teacher noticed that students had problems working with the ILS and staying on-task. These problems were process related (e.g., students got distracted, skipped steps) and tool related (i.e., students did not know how to work with the tool). Attempts to provide instructions during the experiment were largely ineffective because students were at different stages of the ILS (making group instructions difficult), and there were too many students to provide individual instructions.

In addition, some of the written instructions were too long. For example, upon seeing the instructions, one student immediately uttered: "too long, won't read." It seems likely that his sentiments were shared by other students, highlighting the need for verbal (or at least more interactive) instructions.

A total of 50 hypotheses were collected from 27 students. The plurality (13) of students created two hypotheses each; 7 students did not create any hypotheses. Most (16) students asked for feedback at least once; 11 students did not ask for feedback. One student asked for feedback 23 times and was removed as an outlier.

Parser accuracy was below expectations, achieving a Cohen's κ of 0.91, 0.90, and 0.40 on the *contains at least two variables*, *contains a modifier*, and *is a syntactically correct sentence* criteria, respectively. Accuracy for the *manipulates exactly one variable* and *is qualified* criteria is not reported, as the parser failed to recognize 30 out of 46 syntactically correct snapshots, leaving only 16 parsed snapshots.

Although there does appear to be a positive effect of feedback on hypothesis quality (see **Figure 7**), this effect was not statistically significant, as shown by a multilevel logistic model estimating the probability of a syntactically correct hypothesis by the number of feedback requests, correcting for student effects, gender and age (β<sub>feedbackCount</sub> = 0.46, SE<sub>β</sub> = 0.24, CI<sub>OR</sub> = [0.98, 2.57], p = 0.058).

#### Discussion

The number of collected hypotheses per student was lower than in the first pilot. In part, that was by design: the first pilot was specifically set up to encourage students to create as many hypotheses as possible, whereas in this pilot students were guided to create two hypotheses. The participants in this pilot also had less experience working in an ILS, which caused several process-related issues during the session that likely influenced the number of hypotheses created. A more structured lesson plan where students start and end each step in the inquiry cycle at the same time will allow for verbal instructions to be given before starting each section.

Many students failed to distinguish between series and parallel circuits in their hypotheses, even when their predictions did show they understood the differences between the types of circuits. This does seem to indicate the need for supporting the creation of hypotheses while at the same time highlighting that the currently implemented support is insufficient.

Poor parser accuracy can be attributed to students' difficulties in working with the ILS, additional criteria introducing more complexity to the grammar, and a lack of training data for the Electrical Circuits domain in the target language (Dutch) to calibrate the parser. Using the data gathered in the pilot, we were able to make improvements to the grammar used by the parser. When applying this new grammar to the gathered hypotheses, inter-rater agreement on the syntax criterion was raised to moderate (Cohen's κ = 0.53).

### Main Study

#### Participants

Six classes of 13- to 15-year-old secondary education students (n = 132), from two local high schools participated in the study. Six students used incorrect login credentials and were left out of the analyses. The remaining participants came from 4 HAVO classes (n = 78), and 2 VWO classes (n = 48). Students were randomly assigned to one of two conditions. Students in the experimental condition (n = 68) used the automated hypothesis scratchpad, while those in the control condition (n = 58) used a version of the hypothesis scratchpad that did not provide feedback. No significant differences were present in the distribution of age, gender, and current physics grade across conditions (**Table 3**).

#### Materials and Procedure

A single 50-min session was used, covering the same material as that of the second pilot study. The ILS used in the second pilot study was used again, with some minor changes to ameliorate some of the process-related issues students encountered. In particular, written descriptions and instructions were shortened. Instead, at the outset of the session and each phase, students were given a short oral introduction.

Students received a link to a randomizer<sup>4</sup> that assigned each student to one of two conditions and redirected them to the corresponding ILS. Students were instructed not to move to the next phase until told to do so.

At pre-set intervals during the sessions, the researcher gave an oral introduction to the next phase of the inquiry cycle, and the corresponding tools in the ILS. Students were then encouraged to start with that phase. In each session, the researcher and the class teacher monitored the students, answering process-related questions, and eliciting feedback if any extraordinary situations or interactions were encountered.

#### Results

Most students were already familiar with the GoLab environment and its tools and encountered no significant difficulties. Based on observations during the sessions, oral introductions prior to each phase of the ILS appeared to keep most students on task, most of the time.

Students in the experimental condition created 201 hypotheses, for 56 of which feedback was requested. Of the 68 students in the experimental condition, exactly half never asked for feedback.

Parser accuracy was moderate to almost perfect, achieving a Cohen's κ of 0.84, 0.70, and 0.59 on the contains at least two variables, contains a modifier, and is a syntactically correct sentence criteria, respectively, and > 0.80 for the manipulates exactly one variable and is qualified criteria.

**Figure 8** appears to show that on average the hypotheses generated in the experimental condition scored higher on all criteria. In addition, **Figure 9** suggests a positive relation between the number of feedback requests and the quality of hypotheses. In particular, hypotheses for which feedback was requested at least once appear to be of higher quality.

To test the effect of our tool on hypothesis quality, we fitted a multilevel logistic model, controlling for student and class effects, as well as gender, age, physics grade, and academic level. We found no significant effect from being assigned to the experimental condition (β<sub>condition</sub> = 0.25, SE<sub>β</sub> = 0.34, CI<sub>OR</sub> = 0.66–2.50, p = 0.472). Given that half of all participants in the experimental group never requested feedback, this outcome was not unexpected.

However, when we split the experimental group in two, based on whether students requested feedback or not (n = 34 in both groups, **Figure 10**), and contrast those who requested feedback against those who did not or could not, controlling for student and class effects, as well as gender, age, physics grade and academic level, the effect of requesting feedback is significant (β<sub>feedbackCount</sub> = 1.47, SE<sub>β</sub> = 0.42, CI<sub>OR</sub> = 1.92–9.89, p < 0.001).

It could be argued that students who did not request feedback when it was made available to them are less proficient students. However, a contrast analysis comparing students in the control condition (who could not ask for feedback) and those in the experimental condition who did not request feedback found no significant difference between the two groups on the syntactically correct criterion (β<sub>condition</sub> = −0.30, SE<sub>β</sub> = 0.39, CI<sub>OR</sub> = 0.34–1.60, p = 0.445). We thus found no evidence to suggest that there was a difference between students who could have asked for feedback but did not do so, and students who did not have the option to ask for feedback.
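The odds-ratio intervals reported with each coefficient can be approximately reproduced from the coefficient and its standard error. The sketch below assumes a Wald-type 95% interval, exp(β ± 1.96 · SE); this is an assumption, since the text does not state how the intervals were computed, and small deviations from the reported bounds are expected because β and SE are rounded to two decimals. The function name `odds_ratio_ci` is ours.

```python
import math


def odds_ratio_ci(beta: float, se: float, z: float = 1.96):
    """Return (odds ratio, lower bound, upper bound) for a log-odds coefficient.

    Assumes a Wald interval on the log-odds scale, exponentiated afterwards.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Coefficients and standard errors as reported in the text.
reported = {
    "assigned to experimental condition": (0.25, 0.34),
    "requested feedback": (1.47, 0.42),
    "control vs. no feedback requested": (-0.30, 0.39),
}

for label, (beta, se) in reported.items():
    odds_ratio, low, high = odds_ratio_ci(beta, se)
    print(f"{label}: OR = {odds_ratio:.2f}, 95% CI = {low:.2f} to {high:.2f}")
```

An interval that contains 1 corresponds to a non-significant effect on the odds scale, which is the pattern for the first and third contrast but not for the second.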

#### GENERAL DISCUSSION

The creation of hypotheses is a critical step in the inquiry cycle (Zimmerman, 2007), yet students of all ages experience difficulties creating informative hypotheses (Mulder et al., 2010). Automated scaffolds can help students create informative hypotheses, but their implementation in the regular curriculum is often cost-prohibitive, especially since they can typically only be used in one specific domain and language. This study set out to create a hypothesis scratchpad that can automatically evaluate and score hypotheses and provide students with immediate feedback. We use a flexible Context-Free Grammar approach that can relatively easily be adapted and extended for other languages and domains. We described the development process of this tool over two pilot studies and evaluated its instructional effectiveness in a controlled experiment.

Across three studies, we showed that a hypothesis parser based on a context-free grammar is feasible, attaining moderate to almost perfect levels of agreement with human coders. The required complexity of the parser is directly linked to the syntactical complexity of the domain. For example, the electrical circuits domain requires a more complex parser than the supply and demand domain. Further development of the context-free grammar used in the parser will contribute to higher reliability and may extend it to other languages and domains.

The second pilot study illustrated that a lack of familiarity of students with the online environment and the tools used can have a negative effect on their performance. Students were distracted by technical and process-related issues, and had difficulty remaining on-task. In the final experiment, we used a largely identical learning environment, but students were verbally introduced to each phase. These introductions allowed students to focus on the content of the learning environment, rather than on how to use the learning environment itself.

Nevertheless, when using the automated hypothesis scratchpad in a "typical" ILS, students often did not request feedback. Timmers et al. (2015) found a relation between gender and the willingness to ask for feedback, but such a relation was not present in our sample. In fact, none of the background variables collected (age, gender, physics grade and educational level) were significantly related to feedback requests or the quality of hypotheses.

<sup>4</sup>A separate ILS was created for each condition. The randomizer forwarded the student's browser to one of these conditions. Randomization was weighted to ensure a roughly equal distribution across conditions in each session.

#### TABLE 3 | Participant characteristics, by condition.

If the goal was to obtain as many hypotheses as possible and assess the performance of the parser alone, we would have been better off following the approach taken in the first pilot. However, we deliberately chose to embed the automated hypothesis scratchpad in a typical ILS in the second pilot and main study, with the aim of replicating "real-world" conditions. In doing so, we can draw conclusions that are likely to be applicable to educational practice, rather than in laboratory conditions alone.

In the first pilot, the number of feedback requests was significantly related to the quality of hypotheses. This result was confirmed in a controlled experiment, where students who requested feedback were significantly more likely to create syntactically valid hypotheses than those who did not. The effects of feedback were immediate; hypotheses for which feedback was requested once were more likely to be correct.

To the best of our knowledge, no other tool exists that can reliably score hypotheses, can easily be adapted to different domains, and that allows students to create free-text hypotheses. The automated hypothesis scratchpad we present here can provide a clear and immediate benefit in science learning, provided students request feedback. By increasing the quality of students' hypotheses, we may assume that students are able to engage in more targeted inquiries, positively impacting their learning outcomes. How students can best be encouraged to request (and use) feedback is an open problem, and out of scope for this project. The automated hypothesis scratchpad could also be adapted to be a monitoring tool, highlighting students that may have difficulties creating hypotheses, allowing teachers to intervene directly.

The ability to reliably score hypotheses presents possibilities besides giving feedback. For example, hypothesis scores could serve as an indicator of inquiry skill. As such, they can be part of student models in adaptive inquiry learning environments. Crucially, obtaining an estimate from students' inquiry products is less obtrusive than doing so with a pre-test, and likely to be more reliable than estimates obtained from students' inquiry processes.

The aggregate hypothesis score computed for students did not have a known parametric distribution. This represents a serious limitation, as the score could not be used in statistical analyses. As a result, we chose to only test statistical significance based on the syntax criterion. Investigating alternative modeling techniques to arrive at a statistically valid conclusion based on multiple interdependent criteria will be part of our future work.

An automated hypothesis scratchpad providing students with immediate feedback on the quality of their hypotheses was implemented using context-free grammars. The automated scratchpad was shown to be effective; students who used its feedback function created better hypotheses than those who did not. The use of context-free grammars makes it relatively straightforward to separate the basic syntax of hypotheses, language-specific constructs, and domain-specific implementations. This separation allows for the quick adaptation of the tool to new languages and domains, allowing configuration by teachers, and inclusion in a broad range of inquiry environments.
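The separation of basic syntax, language-specific constructs, and domain-specific vocabulary can be illustrated with a toy recognizer. The sketch below is not the grammar or parser used in the study; the rule names, the English keywords, and the two small vocabularies are hypothetical and chosen only to show how swapping one layer adapts the recognizer to another domain. It only works for grammars without left recursion.

```python
from typing import Dict, Iterator, List

Grammar = Dict[str, List[List[str]]]

# Layer 1: basic hypothesis syntax (hypothetical, shared across languages and domains).
SYNTAX: Grammar = {
    "HYPOTHESIS": [["IF", "CLAUSE", "THEN", "CLAUSE"]],
    "CLAUSE": [["VARIABLE", "MODIFIER"]],
}
# Layer 2: language-specific constructs (English in this sketch).
ENGLISH: Grammar = {
    "IF": [["if"]],
    "THEN": [["then"]],
    "MODIFIER": [["increases"], ["decreases"], ["stays", "the", "same"]],
}
# Layer 3: domain-specific vocabulary (illustrative only).
CIRCUITS: Grammar = {"VARIABLE": [["the", "voltage"], ["the", "current"]]}
ECONOMICS: Grammar = {"VARIABLE": [["the", "price"], ["the", "demand"]]}


def derive(grammar: Grammar, symbol: str, tokens: List[str], pos: int) -> Iterator[int]:
    """Yield each position at which a derivation of `symbol` starting at `pos` can end."""
    if symbol not in grammar:  # terminal: must match the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for production in grammar[symbol]:
        ends = [pos]
        for part in production:
            ends = [e for start in ends for e in derive(grammar, part, tokens, start)]
        yield from ends


def is_valid_hypothesis(sentence: str, *layers: Grammar) -> bool:
    """Check whether the whole sentence can be derived from HYPOTHESIS."""
    grammar: Grammar = {}
    for layer in layers:
        grammar.update(layer)
    tokens = sentence.lower().replace(",", " ").split()
    return len(tokens) in derive(grammar, "HYPOTHESIS", tokens, 0)


print(is_valid_hypothesis("If the voltage increases, then the current increases",
                          SYNTAX, ENGLISH, CIRCUITS))
print(is_valid_hypothesis("If the price increases, then the demand decreases",
                          SYNTAX, ENGLISH, ECONOMICS))
```

Replacing `CIRCUITS` with `ECONOMICS` changes the accepted vocabulary without touching the syntax or language layers, which is the kind of adaptation the separation is meant to make cheap.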

### ETHICAL STATEMENT

All participating schools have obtained written and informed consent from students' parents to perform research activities that fall within the regular curriculum. Parents were not asked to give consent for this study specifically. The experiments we performed were embedded in the students' curriculum, and the collected data was limited to learning processes and outcomes. Students were briefed that their activities in the online learning environment would be logged, and that this data would be used in anonymized form. Both the research protocol and consent procedures followed were approved by the ethical board of the faculty of Behavioural, Management and Social Sciences of the University of Twente (ref # 17029).

### AUTHOR CONTRIBUTIONS

KK, TdJ, AL, and SvdB designed the experiment. KK and AL designed the intervention. KK, SvdB, and BV performed statistical analyses, TdJ and AL helped put experimental results into context. KK wrote the manuscript, aided by TdJ, SvdB, AL, and BV.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Kroeze, van den Berg, Lazonder, Veldkamp and de Jong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Influence of Variance in Learner Answers on Automatic Content Scoring

Andrea Horbach\* and Torsten Zesch

Language Technology Lab, University Duisburg-Essen, Duisburg, Germany

Automatic content scoring is an important application in the area of automatic educational assessment. Short texts written by learners are scored based on their content while spelling and grammar mistakes are usually ignored. The difficulty of automatically scoring such texts varies according to the variance within the learner answers. In this paper, we first discuss factors that influence variance in learner answers, so that practitioners can better estimate if automatic scoring might be applicable to their usage scenario. We then compare the two main paradigms in content scoring: (i) similarity-based and (ii) instance-based methods, and discuss how well they can deal with each of the variance-inducing factors described before.

#### Edited by:

Ronny Scherer, Department of Teacher Education and School Research, Faculty of Educational Sciences, University of Oslo, Norway

#### Reviewed by:

Mark Gierl, University of Alberta, Canada

Dirk Ifenthaler, Universität Mannheim, Germany

> \*Correspondence: Andrea Horbach andrea.horbach@uni-due.de

#### Specialty section:

This article was submitted to Educational Psychology, a section of the journal Frontiers in Education

Received: 07 December 2018 Accepted: 12 March 2019 Published: 04 April 2019

#### Citation:

Horbach A and Zesch T (2019) The Influence of Variance in Learner Answers on Automatic Content Scoring. Front. Educ. 4:28. doi: 10.3389/feduc.2019.00028

Keywords: automatic content scoring, short-answer questions, natural language processing, linguistic variance, machine learning

## 1. INTRODUCTION

Automatic content scoring is a task from the field of educational natural language processing (NLP). In this task, a free-text answer written by a student should be automatically assigned a score or correctness label in the same way as a human teacher would do. Content scoring tasks have been a popular exercise type for a variety of subjects and educational scenarios, such as listening or reading comprehension (in language learning) or definition questions (in science education). In a traditional classroom setting, answers to such exercises are manually scored by a teacher, but in recent years, their automatic scoring has received growing attention as well (for an overview, see e.g., Ziai et al., 2012 and Burrows et al., 2014). Automatic content scoring may decrease the manual scoring workload (Burstein et al., 2001) as well as offer more consistency in scoring (Haley et al., 2007). Additionally, automatic scoring provides the advantage that evaluation can happen in the absence of a teacher so that students may receive feedback immediately without having to wait for human scoring. With the increasing popularity of MOOCs and other online learning platforms, automatic scoring has become a topic of growing importance for educators in general.

In this paper, we restrict ourselves to short-answer questions as one instance of free-form assessment. While other test types, such as multiple choice items, are much easier to score, free-text items have an advantage from a testing perspective. They require active formulation instead of just selecting the correct answer from a set of alternatives, i.e., they test production instead of recognition.

Answers to short-answer questions have a typical length between a single phrase and two to three sentences. This places them in length between gap-filling exercises, which often ask for single words, and essays, where learners write longer texts. We do not cover automatic essay scoring in this article, even if it is related to short-answer scoring, and to some extent even the same methods might be applied. The main reason is that scoring essays usually takes into consideration the form of the essay (style, grammar, spelling, etc.) in addition to content (Burstein et al., 2013), which introduces many additional factors of influence that are beyond our scope.

**Figure 1** shows examples from three different content scoring datasets (ASAP, POWERGRADING and SEMEVAL) and highlights the main components of a content scoring scenario: a prompt, a set of learner answers with scoring labels, and (one or several) reference answers.


Numeric or binary scoring labels, as we see in ASAP and POWERGRADING, can be easily summed up and compared. They are thus often used in summative feedback, where the goal is to inform teachers, e.g., about the performance of students in a homework assignment. For formative feedback, which is directed toward the learner, in contrast, a more informative categorical label might be preferable, e.g., to inform a student of their learning progress. The SEMEVAL data is an example of scoring labels that aim in that direction.

• In addition to learner answers, datasets often include teacher-specified **reference answers** for each label. A reference answer showcases a representative answer for a given score and can be used for (human or automatic) comparison with a learner answer. Alternatively, scoring guidelines describing properties of answers with a certain score can be provided. This is often the case when answers are so complex that just providing a small number of reference answers does not nearly cover the conceptual range of possible correct answers and misconceptions. This is for example the case for the ASAP dataset. When reference answers are given, many datasets only provide reference answers for correct answers and not for incorrect ones, e.g., POWERGRADING and SEMEVAL.

The content scoring scenario with its interrelated textual components – a prompt, learner answers, and a reference answer – renders automatic content scoring a challenging application of Natural Language Processing which bears strong resemblances to various core NLP fields like paraphrasing (Bhagat and Hovy, 2013), textual entailment (Dagan et al., 2013), and textual similarity (Bär et al., 2012). In all those fields, the semantic relation between two texts is assessed, a method that directly transfers to the comparison between learner and reference answers, as we will see later.

During recent years, many approaches for automatic content scoring have been published on various datasets (see Burrows et al. (2014) for an overview). A practitioner who is considering using automatic scoring for their own educational data might easily feel overwhelmed. They might find it hard to compare approaches and draw conclusions about their applicability to their specific scoring scenario. In particular, approaches often apply various machine learning methods with a variety of features and are trained and evaluated using different datasets. Thus, comparing any two approaches from the literature can be difficult.

This paper aims to shed light on the individual factors influencing automatic content scoring and identifies the variance in the answers as one key factor that makes scoring difficult. We start in section 2 by discussing the nature of this variance, followed by a discussion of datasets and their parameters that influence variance. We discuss in section 3 properties of automatic scoring methods and review existing approaches, especially with respect to whether they score answers based on features extracted from the answers themselves or based on a comparison with a reference answer. We then discuss in section 4 how these factors can be isolated in scoring experiments. We either provide our own experiments, discuss relevant studies from the literature, or formulate requirements for datasets that would make currently infeasible experiments possible.

### 2. VARIANCE IN LEARNER ANSWERS

Variance is the reason why automatic scoring has to go beyond simply matching learner answers to reference answers. The more variance we find in the learner answers, the more complex the scoring model has to be and therefore the harder the content scoring task is (Padó, 2016). Thus, in this section, we discuss why variance increases the difficulty of automatic scoring and analyze publicly available datasets with respect to the variance-inducing properties.

### 2.1. Sources of Variance

From an NLP perspective, automating content scoring of free-text prompts is a challenging task, mainly due to the textual variance of answers given by the learner. Variance can occur on several levels, as highlighted in **Figure 2**. It can occur on the conceptual level as well as on the realization level, where variance in realization can mean variance of linguistic expression as well as orthographic variance.

### 2.1.1. Conceptual Variance

Conceptual variance occurs when a prompt asks for multiple aspects or has more than one correct solution. For example, in the prompt Name one state that borders Mexico from the POWERGRADING dataset, there are four different correct solutions: California, Arizona, New Mexico, and Texas. A scoring method needs to take all of them into account. However, conceptually different correct solutions are not the main problem, as their number is usually rather small. The much bigger problem is variance within incorrect answers, as there are usually many ways for a learner to get an answer wrong so that incorrect answers often correspond to several misconceptions. For the POWERGRADING example prompt in **Table 1** (asking What is the economic system in the United States?), frequent misconceptions center around democracy or US dollar, but there also is a long tail of infrequent other misconceptions.

FIGURE 1 | Exemplary content scoring prompts from three different datasets with reference answers (if available) as well as several learner answers with their scoring labels.

### 2.1.2. Variance in Realization

In contrast to the conceptual variance we have just discussed, which covers different ways of conceptually answering a question, variance in realization means different ways of formulating the same conceptual answer. We consider variance in linguistic expression as well as variance on the orthographic level.

### **2.1.2.1. Variance of linguistic expression**

#### TABLE 1 | Dataset statistics.

Tokens per answer are counted individually across all answers for one prompt, and the minimum, median<sup>1</sup>, and maximum of these values are reported. For example, the prompt with the shortest answers in ASAP has on average 26.5 tokens.

This refers to the fact that natural language provides many possibilities to express roughly the same meaning (Meecham and Rees-Miller, 2005; Bhagat and Hovy, 2013). This variance of expression makes it in most cases impossible to preemptively enumerate all correct solutions to a prompt and score new learner answers by string comparison alone. For example, consider the following three sentences. They all come from the SEMEVAL prompt in **Figure 1**. The first is a reference answer, while the other two are learner answers.


While the first learner answer in the example above shares many words with the reference answer, the second learner answer has much lower overlap. The term difference is replaced by the related term measurement. For such cases of lexical variance, we need some form of external knowledge to decide that difference and measurement are similar.
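The notion of word overlap used here can be made concrete with a simple set-based measure. The sketch below uses the Jaccard coefficient over lowercased tokens, which is only one of several possible overlap measures. The three sentences are invented for illustration and are not the SEMEVAL answers; only the contrast between difference and measurement is taken from the text.

```python
def token_set(text: str) -> set:
    """Lowercase a text and split it into a set of whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard_overlap(a: str, b: str) -> float:
    """Share of distinct tokens that two texts have in common."""
    tokens_a, tokens_b = token_set(a), token_set(b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Hypothetical sentences, invented for this sketch.
reference = "voltage is the difference in electrical states between two terminals"
answer_close = "voltage is the difference in electrical states between terminals"
answer_far = "it is a measurement between the terminals"

print(jaccard_overlap(reference, answer_close))
print(jaccard_overlap(reference, answer_far))
```

The second answer scores low although it may express a related idea, because a purely lexical measure has no way of knowing that difference and measurement are similar.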

#### **2.1.2.2. Orthographic variance**

A property of (especially non-native) learner data that also contributes toward high realization variance in the data is the orthographic variability and occurrence of linguistic deviations from the standard (Ellis and Barkhuizen, 2005), which can also make it hard for humans to understand what was intended (Reznicek et al., 2013). For example, in the learner answer L<sub>1</sub> above, the learner misspelled electrical state as electrial stat. The number of spelling errors – and thus how pronounced this deviation is – depends on a number of factors, such as whether answers have been written by language learners or native speakers or whether answers refer to a text visually available to the learner at the time of writing the answer or not.
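One common way to reduce orthographic variance before scoring is to map each token to the closest word in a known vocabulary using edit distance. The sketch below is an illustration under that assumption and is not a method proposed in this section; the vocabulary, the threshold `max_distance`, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalize(word: str, vocabulary: list, max_distance: int = 2) -> str:
    """Return the closest vocabulary word, or the word itself if none is close enough."""
    best = min(vocabulary, key=lambda candidate: edit_distance(word, candidate))
    return best if edit_distance(word, best) <= max_distance else word


# Hypothetical vocabulary for the electrical circuits example.
vocabulary = ["electrical", "state", "voltage", "terminal"]
print([normalize(token, vocabulary) for token in "electrial stat".split()])
```

Such normalization can also go wrong, for example when a misspelling is closer to an unrelated vocabulary word than to the intended one, so the threshold needs to be chosen with care.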

### 2.2. Content Scoring Datasets

In the following, we introduce publicly available datasets for content scoring. Afterwards, we categorize all datasets in **Tables 1**, **2** according to various factors that influence variance. The datasets come from different research contexts; we present them here in alphabetical order:


<sup>1</sup>For the median, we report the lower median if there is an even number of items, so that the value corresponds to the average number of tokens per answer of a specific prompt.

<sup>2</sup>https://www.kaggle.com/c/asap-sas

#### TABLE 2 | Overview of content scoring datasets.


### 2.3. Dataset Properties Influencing Variance

We now discuss dataset-inherent properties that can help us to estimate the amount of variance to be expected in data.

#### 2.3.1. Prompt Type

The type of prompt has a strong influence on the expected answer variance. Imagine, for example, a factual question like Where was Mozart born? and a reading comprehension question such as What conclusion can you draw from the text? For the first question, there is no variance in the correct answers (Salzburg) and probably only little variance in the misconceptions (Vienna). For the second question, a very high variance is to be expected. In general, the more open-ended a question is, the harder it will be to automatize its scoring.

Different answer taxonomies have been proposed to classify questions in the classroom according to the cognitive processes involved for the student, and they also provide clues about the ease of automatic scoring. Anderson et al. (2001) provide a classification scheme according to the cognitive skills that are involved in solving an exercise: remembering, understanding, applying, analyzing, evaluating, and creating in ascending order of difficulty for the student. This taxonomy could of course also be applied to content scoring prompts. Padó (2017) annotates questions in the CSSAG dataset according to this taxonomy and finds that questions from the lower categories are not only easier for students, but also produce less variance and need less elaborate methods for automatic scoring. She also finds that the instructional context of a question needs to be considered when assigning a level (e.g., to differentiate between a real analyzing question and one that is actually a remembering question because the analysis has been explicitly made in the course). Therefore it is hard to apply such a taxonomy to a dataset where the instructional context is unknown.

A taxonomy specifically for reading comprehension questions has been developed by Day and Park (2005). It classifies questions by comprehension as literal, reorganization, inference, prediction, evaluation, and personal response (again ordered from easy to hard). Literal questions are the easiest because their answers can be found verbatim in the text. Such questions tend to have lower variance, especially when given to low-proficiency learners, as they often lift their answers from the text. Also for this taxonomy, it has been found that reading comprehension prompts for language learners focus on the lower comprehension types (Meurers et al., 2011b) and that, among these, literal questions are easier to score than reorganization and inference questions. We argue that questions with comprehension types higher in the taxonomy contain so much variance that they are difficult to handle automatically. An example of a personal response question from Day and Park (2005) is What do you like or dislike about this article? We argue that answers to such questions go beyond content-based evaluation and rather touch the area of essay scoring, as how an opinion is expressed might be more important than its actual content.

The modality of a prompt also plays a role. By modality, we mean whether a question refers to a written or a spoken text. Especially for non-native speakers, listening comprehension exercises will yield a much higher variance as learners cannot copy material from the text based on the written form, but mostly write what they think they understood auditorily. This leads especially to a high orthographic variance and makes scoring harder compared to a similar prompt administered as a reading comprehension exercise.

**Table 2** shows that existing datasets cover very diverse prompts, from reading comprehension for language learning to science questions to biology and literature questions, but that they do not nearly cover all possible prompt types.

#### 2.3.2. Answer Length

Answer length is, of course, strongly related to the type of question asked. Where or when questions usually require only a phrasal answer, whereas why questions are often answered with complete sentences. Shorter answers consisting of only a few words often correspond only to a single concept mentioned in the answer (see the example from the POWERGRADING dataset in **Table 1**), whereas longer answers (as we saw in the ASAP example) also tend to be conceptually more complex. It seems intuitive that this conceptual complexity is accompanied by a higher variance in the data. In a longer answer, there are more options for phrasing and ordering ideas in different ways.

Answer length is a measure that can be easily determined for a new dataset once the learner answers are collected, so it can serve as a quick indicator for the ease of scoring. In general, shorter answers can be scored better than longer answers. Of course, datasets with answers of the same length can also display different types of complexity and variance. Nevertheless, we consider answer length a good and at the same time cheap indicator.

**Table 1** presents some core answer length statistics for each dataset. A dataset usually consists of several individual prompts and different prompts in a dataset might differ more or less from each other. To characterize the variance between prompts in a dataset better, we give the average answer length in tokens, as well as the minimum, median, and maximum value across the different prompts. **Figure 3** visualizes for each dataset the distribution of the average answer length per prompt. We see that the individual datasets span a wide range of lengths from very short phrasal answers in POWERGRADING to long answers almost resembling short essays in ASAP. We also see that the number of different prompts and individual learner answers and thus also the number of learner answers for each prompt varies considerably, from datasets with only a very restricted number of answers for each question, such as in CREE and CREG, to several thousand answers per prompt in ASAP.
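The length statistics described above are cheap to compute for a new dataset. The following sketch assumes whitespace tokenization and a dictionary mapping each prompt to its learner answers; both are simplifying assumptions, and the prompts and answers are invented. It reports the lower median, following the convention stated in the footnote to Table 1.

```python
from statistics import mean, median_low


def average_length(answers: list) -> float:
    """Average number of whitespace-separated tokens per answer for one prompt."""
    return mean(len(answer.split()) for answer in answers)


def length_summary(answers_by_prompt: dict) -> dict:
    """Minimum, lower median, and maximum of the per-prompt average lengths."""
    per_prompt = sorted(average_length(a) for a in answers_by_prompt.values())
    return {
        "min": per_prompt[0],
        "median": median_low(per_prompt),  # lower median for an even number of prompts
        "max": per_prompt[-1],
    }


# Hypothetical toy dataset with four prompts.
toy_dataset = {
    "prompt_1": ["Texas", "New Mexico"],
    "prompt_2": ["capitalism", "a market economy"],
    "prompt_3": ["the bulb is in a closed path", "there is a gap"],
    "prompt_4": ["because the terminals are not connected to each other",
                 "no closed circuit"],
}
print(length_summary(toy_dataset))
```

Because the lower median is always one of the observed per-prompt averages, it can be read as the average answer length of one specific prompt.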

#### 2.3.3. Language

The language that is used to answer a prompt, such as English, German, or Chinese, is also an important factor influencing the answer variance. Methods that work well for one language may not be directly transferable to other languages. This is due both to the linguistic properties of individual languages and to the availability of language-specific NLP resources used for scoring. By linguistic properties we mean especially the morphological richness of a language and the restrictiveness of word order. If an answer given in English talks about a red apple, it might be sufficient to look for the term red apple, while in German, depending on the grammatical context, terms such as (ein) roter Apfel, (der) rote Apfel, (einen) roten Apfel, or (des) roten Apfels might occur. Thus, a scoring approach based on token n-grams usually needs fewer training instances in English compared to German, as an English n-gram often corresponds to several German n-grams. For morphologically richer languages such as Finnish or Turkish, approaches developed for English might completely fail.
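The effect of inflection on token n-grams can be shown with the red apple example. In the sketch below, the German phrases are the ones given in the text; the English counterparts are our own rough translations and serve only as an illustration. The function extracts the final bigram of each phrase, which in these examples is the adjective followed by the noun.

```python
def final_bigram(phrase: str) -> tuple:
    """Return the last two lowercased tokens of a phrase as a bigram."""
    tokens = phrase.lower().split()
    return tuple(tokens[-2:])


# German noun phrases from the text (nominative, accusative, genitive contexts).
german = ["ein roter Apfel", "der rote Apfel", "einen roten Apfel", "des roten Apfels"]
# Rough English counterparts (our assumption, for illustration only).
english = ["a red apple", "the red apple", "a red apple", "of the red apple"]

german_bigrams = {final_bigram(p) for p in german}
english_bigrams = {final_bigram(p) for p in english}

print(len(english_bigrams), sorted(english_bigrams))
print(len(german_bigrams), sorted(german_bigrams))
```

A model that treats each distinct bigram as a separate feature therefore has to see each German variant in the training data, whereas a single English feature covers all four contexts. Lemmatization is one way to collapse the variants again, at the cost of requiring a language-specific tool.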

Freeness of word order is related to morphological richness. Highly inflected languages, such as German, usually have a less restricted word order than English. Thus, n-gram models work well for the mainly linear grammatical structures in English, but less so for German with freer word order and more long-distance dependencies (Andresen and Zinsmeister, 2017).

As for language resources used in content scoring methods, there are two main areas which have to be considered: linguistic processing tools as well as external resources. Many scoring methods rely on some sort of linguistic processing. The automatic detection of word and sentence boundaries (tokenization) is a minimal requirement necessary for almost all approaches, while some methods additionally use for example lemmatization (detecting the base form of a word), part-of-speech-tagging (labeling words as nouns, verbs or adjectives), or parsing sentences into syntax trees, which represent the internal linguistic structure of a sentence. External resources can be, for example, dictionaries used for spellchecking, but also resources providing information about the similarity between words in a language. Coming back to the example above, to know that measurement and difference are related, one would either need an ontology crafted by an expert, such as WordNet (Fellbaum, 1998), or would need similarity information derived from large corpora, based on the core observation in distributional semantics that words are similar if they often appear in similar contexts (Firth, 1957). The availability of such tools and resources has to be taken into consideration when planning automatic scoring for a new language.
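The distributional idea that words are similar if they appear in similar contexts can be sketched with co-occurrence counts and cosine similarity. The three-sentence corpus below is invented and far too small for real use; practical systems derive such vectors from large corpora. Counting co-occurrence within a whole sentence is also a simplifying assumption.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences: list) -> dict:
    """Map each word to counts of the other words it shares a sentence with."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus.
corpus = [
    "the difference between the terminals is large",
    "the measurement between the terminals is large",
    "the bulb is bright",
]
vectors = context_vectors(corpus)
print(cosine(vectors["difference"], vectors["measurement"]))
print(cosine(vectors["difference"], vectors["bulb"]))
```

In this toy corpus, difference and measurement occur in identical contexts and therefore receive a higher similarity than difference and bulb, without any hand-crafted ontology.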

#### 2.3.4. Learner Population and Language Proficiency

The learner population is another important factor to consider, as it defines the language proficiency of the learners, i.e., whether they are beginning foreign language learners or highly proficient native speakers. Language proficiency can have two effects that seem contradictory at first glance. A low language proficiency might lead to a high variance in terms of orthography, because beginners are more likely to make spelling or grammatical errors. At the same time, low proficiency can equally reduce variance, but on the lexical and syntactic level. This is because such a learner will have a more restricted vocabulary and will have acquired fewer grammatical constructions than a native speaker. Moreover, low-proficiency learners might stay closer to the formulations in the prompt, especially when dealing with reading comprehension exercises, where the process of re-using material from the text for an answer is known as "lifting."

Beginning language learners and fully proficient students are of course only the far points of the scale, while students from different grades in school would rank somewhere in between. **Table 2** shows that the discussed datasets indeed cover a wide range of language proficiencies.

The homogeneity of the learner population also plays a role: Learners from a homogeneous population can be expected to produce more homogeneous answers. It has, for example, been shown that the native language of a language learner influences the errors a learner makes (Ringbom and Jarvis, 2009). A German learner of English might be more inclined to misspell the word marmalade as marmelade because of the German cognate Marmelade (Beinborn et al., 2016). An automatic scoring engine trained on learner answers given by German learners might thus encounter the misspelling marmelade often enough to learn that an answer containing this word is as good as an answer containing the right spelling. However, a model trained on answers by learners from many different countries might not be able to learn (partially overlapping) error patterns for each individual first language of the learners. In a slightly different way, this also applies to native speakers. Consider, e.g., answers by students from one university who all attended the same lecture and used the same slides and textbooks for studying (low variance) vs. answers by students from different universities using different learning materials (high variance).

#### 2.3.5. Other Factors

The following factors do not directly influence the variance found in the data, but are other data-inherent factors that influence the difficulty of automatic scoring.

#### **2.3.5.1. Dataset size**

When using machine learning models to perform content scoring, as do all the approaches we discuss in this article, the availability of already-scored answers from which the scoring method can learn is an important parameter (Heilman and Madnani, 2015). The more answers there are to learn from, the better we can usually model what a correct or incorrect answer looks like. The number of available answers ranges from fewer than 10 answers per prompt (as, for example, in the CREG dataset, where most approaches dealing with this dataset learn one model across individual questions) to over 3,000 answers per prompt in the ASAP dataset.

In many practical settings, only a small part of the available data is manually scored and used for training. It has been shown that the choice of training data heavily influences scoring performance and that the variance within the instances selected for training is a major influencing factor (Zesch et al., 2015a; Horbach and Palmer, 2016).

#### **2.3.5.2. Label set**

Different label sets have been proposed for different content scoring datasets. The educational purpose of the scoring scenario is the main determining factor for this choice. Some datasets such as CREG and SRA have even more than one label set so that different usage scenarios can be addressed. This purpose can either be to generate summative or formative feedback (Scriven, 1967). The recipient of summative feedback is the teacher who wants to get an overview of the performance of a number of learners, for example in a placement test or exam situation. In this case, it is important that scores are comparable and can be aggregated so that there is an overall result for a test consisting of several prompts. Binary or numeric scores fit this purpose well. Formative feedback in contrast, as given through the categorical labels in SRA, CREG, and CREE, is directed toward the learner and meant to inform learners about their progress and the problems they might have had with answering a question. This type of feedback in content scoring is, for example, used in automatic tutoring systems. For a learner, the information that she scored 3.5 out of 5 points might be not as informative as a more meaningful feedback message stating that she missed an important concept required in a correct answer. Thus, datasets meant for formative feedback often use categorical labels rather than numeric ones.

The kind of label that is to be predicted obviously influences the scoring difficulty. In general, the more fine-grained the labels, the harder they are to predict given the same overall amount of training data. Also the conceptual spread covered by the labels can make the task more or less difficult. If the labels intend to make very subtle distinctions between similar concepts, the task is more complex than a scoring scheme that differentiates between coarser categories and considers everything as correct that is somewhat related to the correct answer.

#### **2.3.5.3. Difficulty of the scoring task for humans**

All machine learning algorithms learn from a gold standard produced by having human experts (such as teachers) label the data. If the scoring task is difficult, humans will make errors and label data inconsistently. This noise in the data impedes the performance of a machine learning algorithm. If the gold-standard dataset is constructed from two trained human annotators, the inter-annotator agreement between these two is considered to be an upper bound of the performance that can be expected from a machine. If two teachers agree on only 90% of the scores they assign for the same task, 90% agreement with the gold standard is also considered the best possible result obtainable by automatic scoring (Gale et al., 1992; Resnik and Lin, 2010). The same argument can be applied to self-consistency. If a teacher labels the same data twice and can reproduce their own scores for only 90% of all answers, we can consider this 90% an upper bound for machine learning. This influence parameter obviously depends on most of the others and cannot be considered in isolation, but it helps to estimate which level of performance is to be expected for a particular prompt.
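As a minimal sketch of how such an upper bound could be estimated, the following code computes the raw percentage agreement between two annotators and, as a chance-corrected complement, Cohen's kappa. The two label lists are invented toy data, and the function names are hypothetical.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n_items = len(labels_a)
    p_observed = observed_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled independently
    # according to their own label distributions.
    p_expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n_items * n_items)
    if p_expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented toy scores (1 = correct, 0 = incorrect) from two hypothetical teachers.
teacher_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
teacher_2 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]

print(observed_agreement(teacher_1, teacher_2))
print(cohen_kappa(teacher_1, teacher_2))
```

Raw agreement is the quantity used in the 90% example above; kappa is shown only because it discounts agreement that would be expected by chance.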

### 2.4. Summary

In this section, we have discussed several factors that influence the variance found in learner answers: the prompt type, answer length, language, and learner population. We also introduced dataset size, the label set, and the scoring difficulty for human scorers as additional parameters that influence the suitability of a dataset for automatic scoring. In the next section, we first give an overview of content scoring methods and then present a set of experiments that show the influence of some of the discussed factors on content scoring.

## 3. AUTOMATIC CONTENT SCORING

As explained in the introduction, the overall aim of content scoring is to mimic a teacher's scoring behavior by assigning labels to learner answers indicating how good each answer is content-wise.

A very large number of automatic content scoring methods have been proposed (see Burrows et al., 2014 for an overview), but we argue that most existing methods can be categorized into two main paradigms: similarity-based and instance-based scoring. Hence, instead of analyzing the properties of single scoring methods, we can draw interesting conclusions by comparing the two paradigms.

### 3.1. Similarity-Based Approaches

**Figure 4** gives a schematic overview of similarity-based scoring. The learner answer is compared with a reference answer (or a high-scoring learner answer) based on a similarity metric. If the similarity surpasses a certain threshold (exemplified by 0.7 in **Figure 4**), the learner answer is considered as correct. Note that reference answers are always examples for correct answers. In the datasets discussed in section 2.2, there are no samples for incorrect answers, although we have seen earlier that also incorrect answers might form groups of answers expressing the same content.

An important factor in the performance of such similarity-based approaches is how the similarity between answers is computed. In the simplest form, it can be computed based on surface overlap, such as token overlap, where the number of words or characters shared between answers is measured, or edit distance, where the number of editing steps necessary to transform one answer into another is counted. These methods work well when different correct answers can be expected to mainly employ the same lexical material. However, when paraphrases are expected to be lexically diverse, surface-based methods might not be optimal. Consider the hypothetical sentence pair Paul presented his mother with a book - Mary received a novel from her son as a gift. In such a case, the overlap between the two sentences on the surface is low, while it is clear to human readers that the two sentences convey a very similar meaning. To retrieve the information that present and gift from the above example are highly similar, semantic similarity methods make use of ontologies like WordNet (Fellbaum, 1998) or large background corpora [e.g., latent semantic analysis (Landauer and Dumais, 1997)].

In the content scoring literature, all these kinds of similarities are used. While Meurers et al. (2011c) mainly rely on similarity on the surface level for different linguistic units (tokens, chunks, dependency triples), methods such as Mohler and Mihalcea (2009) rely on external knowledge about semantic similarity between words.
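The two surface measures described in this section, together with threshold-based scoring, can be sketched as follows. This is only an illustration under stated assumptions: Jaccard overlap of whitespace tokens stands in for "token overlap," the edit distance is character-level Levenshtein distance, and the default threshold of 0.7 simply mirrors the value used as an example in Figure 4. None of the function names come from an existing toolkit.

```python
def token_overlap(answer, reference):
    """Jaccard overlap between the lowercase token sets of two answers."""
    a = set(answer.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def edit_distance(source, target):
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def is_correct(answer, reference, threshold=0.7):
    """Label an answer as correct if its overlap with the reference surpasses the threshold."""
    return token_overlap(answer, reference) > threshold


paraphrase_a = "Paul presented his mother with a book"
paraphrase_b = "Mary received a novel from her son as a gift"

print(token_overlap(paraphrase_a, paraphrase_b))
print(is_correct(paraphrase_a, paraphrase_b))
print(edit_distance("marmelade", "marmalade"))
```

For the paraphrase pair from the text, the two sentences share only one token, so a purely surface-based scorer would treat the second sentence as dissimilar from the first. This is the gap that semantic similarity resources are meant to close.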

### 3.2. Instance-Based Approaches

In instance-based approaches, lexical properties of correct answers (words, phrases, or even parts of words) are learned from other learner answers labeled as correct, while commonalities between incorrect answers inform the classifier about common misconceptions in learner answers. One would, for example, as depicted in **Figure 5**, learn that certain n-grams, such as electrical states, are indicators for correct answers while others, such as battery, are indicators for incorrect answers. For the scoring process, learner answers are then represented as feature vectors where each feature represents the occurrence of one such n-gram. The information about good n-grams is prompt-specific. For a different prompt, such as one asking for the power source in a certain experiment, battery might indicate a good answer, while answers containing the bigram electrical states would likely be wrong.

As the knowledge used for classification usually comes from the dataset itself and, in many approaches, no external knowledge is used in the scoring process (in contrast to similarity-based scoring), instance-based methods tend to need more training data and do not generalize as well across prompts. Instance-based methods have been used, for example, for various work on the ASAP dataset (Higgins et al., 2014; Zesch et al., 2015b), including all the top-performing systems from the ASAP scoring competition (Conort, 2012; Jesensky, 2012; Tandalla, 2012; Zbontar, 2012), as well as in commercially used systems.
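The representation of answers as n-gram occurrence vectors, as described above, could look roughly like the following sketch. The two training answers are invented, tokenization is naive whitespace splitting, and the helper names are hypothetical; a real system would pass such vectors to a classifier.

```python
def extract_ngrams(text, max_n=2):
    """Return the set of lowercase token n-grams of length 1..max_n."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def build_vocabulary(answers, max_n=2):
    """Collect all n-grams seen in the training answers, in a fixed (sorted) order."""
    vocabulary = set()
    for answer in answers:
        vocabulary |= extract_ngrams(answer, max_n)
    return sorted(vocabulary)


def to_feature_vector(answer, vocabulary, max_n=2):
    """Binary vector: 1 if the n-gram at that position occurs in the answer, else 0."""
    present = extract_ngrams(answer, max_n)
    return [1 if ngram in present else 0 for ngram in vocabulary]


# Invented training answers for a single hypothetical prompt.
training_answers = [
    "the terminals have different electrical states",  # would be labeled correct
    "the battery is connected",                        # would be labeled incorrect
]
vocabulary = build_vocabulary(training_answers)
vector = to_feature_vector("different electrical states", vocabulary)

print(vector[vocabulary.index("electrical states")])
print(vector[vocabulary.index("battery")])
```

Because the vocabulary is built from the answers to one prompt, the resulting vectors are only meaningful for that prompt, which is one way to see why such models transfer poorly.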

### 3.3. Comparison

FIGURE 5 | Schematic overview of instance-based scoring.

We presented two conceptually different ways of content scoring, one relying on the similarity with a reference answer (similarity-based) and the other on information about lexical material in the learner answers (instance-based). While we have presented the paradigmatic case for each side, there are of course less clear-cut cases. For example, an instance-based k-nearest-neighbor classifier scores new unlabeled answers by assigning them the label of the closest labeled learner answer. By doing so, the classifier inherently exploits similarities between answers.
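The k-nearest-neighbor case can be made concrete with a minimal sketch for k = 1. It assumes Jaccard token overlap as the similarity measure and uses invented labeled answers; both choices are illustrative only.

```python
def jaccard(text_a, text_b):
    """Jaccard overlap between the lowercase token sets of two texts."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_neighbor_label(answer, labeled_answers):
    """Return the label of the most similar labeled answer (1-nearest-neighbor)."""
    _, best_label = max(labeled_answers, key=lambda pair: jaccard(answer, pair[0]))
    return best_label


# Invented labeled learner answers for one hypothetical prompt.
labeled_answers = [
    ("the terminals have different electrical states", "correct"),
    ("the battery is empty", "incorrect"),
]

print(nearest_neighbor_label("different electrical states at the terminals", labeled_answers))
print(nearest_neighbor_label("battery is dead", labeled_answers))
```

The classifier is instance-based in that it learns only from labeled learner answers, yet each decision is a similarity comparison, which is why it sits between the two paradigms.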

#### 3.3.1. Associated Machine Learning Approaches

Classical supervised machine learning approaches have been associated with both types of scoring paradigms. Instance-based approaches often work on feature vectors representing lexical items, while similarity-based approaches (Meurers et al., 2011c; Mohler et al., 2011) use various overlap measures as features or rely on just one similarity metric (Mohler and Mihalcea, 2009). Deep learning methods have been applied for instance-based scoring (Riordan et al., 2017) as well as similarity-based scoring (Patil and Agrawal, 2018). As content scoring datasets are often rather small, the performance gain from using deep learning methods has so far not been as large as in other NLP areas, if there was a reported gain at all.

#### 3.3.2. Source of Knowledge

In general, instance-based approaches mainly use lexical material present in the answers while similarity-based methods often leverage external knowledge resources like WordNet or distributional semantics to bridge the vocabulary gap between differently phrased answers. Deep learning approaches usually also make use of external knowledge in the form of embeddings that also encode similarity between words.

#### 3.3.3. Prompt Transfer

Another aspect to consider when comparing scoring paradigms is the transferability of models to new prompts. As similarity-based methods learn about a relation between two texts rather than the occurrence of certain words or word combinations, such a model can also be transferred to new prompts for which it has not been trained. For instance-based approaches, a particular word combination indicating a good answer for one prompt might not have the same importance for another prompt. We can therefore generally expect that similarity-based models transfer more easily to new prompts.

## 4. EXPERIMENTS AND DISCUSSION

In the previous sections, we have introduced (i) the factors influencing the variance of learner answers and the overall difficulty of the scoring task, and (ii) the two major paradigms in automatic content scoring: similarity-based and instance-based scoring. In this section, we bring both together. In the few cases where empirical evidence already exists, we direct the reader to experiments in the literature that address these influences. We design and conduct a set of experiments to explore those sources of variance that have not been experimentally examined yet. However, for some dimensions of variance we have no empirical basis, as evaluation datasets are sparse and do not cover the full range of necessary properties. In these cases, we instead describe desiderata for datasets that would be needed to investigate such influences. The discussion in this section is aimed at providing guidance for matching paradigms with use-cases in order to allow practitioners to choose a setup according to the needs of their automatic scoring scenario.

### 4.1. Experimental Setup

Our experiments (instance-based as well as similarity-based) build on the Escrito scoring toolkit (Zesch and Horbach, 2018) (in version 0.0.1) that is implemented based on DKPro TC (Daxenberger et al., 2014) (in version 1.0.1). For preprocessing, we use DKPro Core.<sup>3</sup> We apply sentence splitting, tokenization, POS-tagging and lemmatization. We did not spellcheck the data, as Horbach et al. (2017) found that the amount of spelling errors in the ASAP data did not impede scoring performance in an experimental setup similar to ours.

We use a standard machine learning setup, variants of which have been used widely. We extract token and character n-gram features, using uni- to trigrams for tokens and bi- to four-grams for characters. We train a support vector machine using the Weka SVM classifier with SMO optimization in its standard configuration, i.e., without parameter tuning.
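The feature ranges named above (token uni- to trigrams, character bi- to four-grams) can be illustrated with a short sketch. This is not the Escrito or DKPro TC implementation, which is Java-based; lowercasing, whitespace tokenization, counting across word boundaries, and the `tok:`/`chr:` prefixes are all assumptions made for this example.

```python
from collections import Counter


def token_ngram_features(text, min_n=1, max_n=3):
    """Count token n-grams of length min_n..max_n (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(
        "tok:" + " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )


def char_ngram_features(text, min_n=2, max_n=4):
    """Count character n-grams of length min_n..max_n over the whole answer string."""
    chars = text.lower()
    return Counter(
        "chr:" + chars[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(chars) - n + 1)
    )


def extract_features(text):
    """Combine both feature groups; the prefixes keep the two name spaces apart."""
    return token_ngram_features(text) + char_ngram_features(text)


features = extract_features("red apple")
print(sorted(name for name in features if name.startswith("tok:")))
```

Character n-grams give the classifier sub-word evidence, so a misspelled word still shares most of its features with the correctly spelled form.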

#### 4.1.1. Datasets

We select datasets from those discussed above (see section 2.2). The main selection criterion is that a dataset contains a high number of learner answers per prompt, so that we can investigate the influence of training data size in prompt-specific models. To meet this criterion, we use POWERGRADING, ASAP, and SEMEVAL.

#### 4.1.2. Evaluation Metric

One common type of evaluation measure applicable to all label sets in short answer scoring is accuracy, i.e., the percentage of correctly classified items. This often goes together with a per-class evaluation of precision, recall, and F-score. Kappa values, which take into account the chance agreement between the machine learning outcome and the gold standard, are also quite popular. This holds especially for Quadratically Weighted Kappa (QWK) for numeric scores, as it considers not only whether an answer is correctly classified, but also how far off an incorrect prediction is. As QWK became a quasi-standard through its usage in the Kaggle ASAP challenge, we use it for our experiments as well.
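To make the distance-sensitivity of QWK concrete, here is a minimal sketch of the standard formulation, in which a disagreement between scores i and j is weighted by (i - j)² / (k - 1)² for k score levels. The function name, the integer score range, and the toy data are assumptions; this is not the evaluation code used in our experiments.

```python
def quadratic_weighted_kappa(gold, predicted, min_score, max_score):
    """Quadratically weighted kappa between two lists of integer scores."""
    n_labels = max_score - min_score + 1
    if n_labels < 2 or len(gold) != len(predicted) or not gold:
        raise ValueError("Need at least two score levels and equally long, non-empty lists.")
    n_items = len(gold)

    # Confusion matrix: rows are gold scores, columns are predicted scores.
    observed = [[0] * n_labels for _ in range(n_labels)]
    for g, p in zip(gold, predicted):
        observed[g - min_score][p - min_score] += 1

    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_labels)) for j in range(n_labels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_labels):
        for j in range(n_labels):
            weight = (i - j) ** 2 / (n_labels - 1) ** 2
            # Expected count if gold and predicted scores were independent.
            expected = gold_hist[i] * pred_hist[j] / n_items
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:  # all scores fall on one identical level
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


gold_scores = [0, 1, 2, 3]
near_miss = [0, 1, 2, 2]  # last answer is off by one score point
far_miss = [0, 1, 2, 0]   # last answer is off by three score points

print(quadratic_weighted_kappa(gold_scores, near_miss, 0, 3))
print(quadratic_weighted_kappa(gold_scores, far_miss, 0, 3))
```

Both predictions contain exactly one error and thus have the same accuracy, but the far miss receives a much lower QWK than the near miss.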

#### 4.1.3. Learning Curves

We listed the amount of available training data as one important influence factor for scoring performance. We can simulate datasets of different sizes by using random subsamples of a dataset. By doing this repeatedly for several amounts of training data, we obtain a learning curve. If a classifier learns from more data, results usually improve until the learning curve approximates a flat line. When we provide learning curve experiments, we always sample 100 times for each amount of training data and average over the results.
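The sampling procedure can be sketched as follows. The callback `train_and_evaluate` is a hypothetical stand-in for training a scoring model on the sampled answers and evaluating it on held-out data; the toy callback and toy data below exist only to make the sketch runnable and do not reflect a real scoring model.

```python
import random
from statistics import mean


def learning_curve(train_items, sizes, train_and_evaluate, repetitions=100, seed=0):
    """Average the evaluation result over repeated random subsamples per training size."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        scores = [
            train_and_evaluate(rng.sample(train_items, size))
            for _ in range(repetitions)
        ]
        curve[size] = mean(scores)
    return curve


def toy_train_and_evaluate(sample):
    """Hypothetical stand-in: the fraction of the two labels covered by the sample."""
    return len({label for _, label in sample}) / 2


# Invented toy data: (answer, label) pairs with a skewed label distribution.
toy_items = [(f"answer {i}", "correct" if i < 8 else "incorrect") for i in range(10)]

curve = learning_curve(toy_items, sizes=[1, 2, 5, 10], train_and_evaluate=toy_train_and_evaluate)
for size, score in curve.items():
    print(size, round(score, 3))
```

Fixing the random seed keeps the curve reproducible, while averaging over many samples smooths out the effect of individual lucky or unlucky draws.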

### 4.2. Answer Length

As, to our knowledge, answer length has not been examined as an influencing factor so far, we test the hypothesis that shorter answers are easier to score, as they should have less variance in general. For this purpose, we conduct experiments with increasing amounts of training data and plot the resulting learning curves. Prompts from datasets with shorter answers should converge faster and at a higher kappa than prompts with longer answers. Note that we restrict ourselves to instance-based experiments here, as there is an insufficient number of datasets providing the necessary reference answers. However, we expect the general results to also hold for similarity-based methods, as the similarity of longer answers is harder to compute than that of shorter answers.

**Figure 6** shows the results for instance-based scoring for a number of prompts covering a wide variety of average lengths, selected from POWERGRADING (short answers), SRA (medium-length answers), and ASAP (long answers, split into two prompts with, on average, about 25 tokens per answer and eight prompts with more than 45 tokens per answer). We observe that (as expected) prompts with shorter answers are easier to score, but the results for individual prompts (thin lines) within a dataset vary considerably. Thus, we also present the average over all prompts from each dataset (thick line), which clearly supports the hypothesis.

These experiments also tell us something about the influence of the amount of training data. An obvious finding is that more data yields, for most prompts, better results. A more interesting observation is that the curves for the SRA answers level off earlier than those for the ASAP and POWERGRADING datasets. This means we could not learn much more given the current machine learning algorithm, parameter settings, and feature set even if we had more training data. The ASAP and POWERGRADING curves, in contrast, are still rising: if we had more training data available, we could expect a better scoring performance.

### 4.3. Prompt Type

In our experiments regarding answer length, we cannot fully isolate effects originating from the length of the answers from other effects like the prompt type (as some prompts require longer answers than others) and learner population (as certain prompts are suitable only for a certain learner population). Therefore, we now try to isolate the effect of the prompt type by choosing prompts with answers of the same length and coming from the same dataset, thus from the same learner population and language.

<sup>3</sup>https://dkpro.github.io/dkpro-core/

We select four different prompts from POWERGRADING with a mean length between 3.3 and 4.8 tokens per answer and three different prompts from the ASAP dataset with an average length between 45 and 53 tokens and show the resulting learning curves for an instance-based setup in **Figure 7**. We observe that these prompts behave very differently despite a comparable length of the answers. Especially for the POWERGRADING data, performance with very few training instances varies considerably, showing that factors other than length contribute to the performance. We assume that for these prompts (with often repetitive answers) the label distribution plays a role, as performance with few training instances suffers because chances are high that only members of the majority class are selected for training. For the ASAP prompts, those differences are less pronounced.
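The assumption about the label distribution can be checked with a small calculation: the probability that a random training sample, drawn without replacement, contains only majority-class answers. The numbers below (100 answers, 90 of them in the majority class) are hypothetical and not taken from any of the datasets.

```python
from math import comb


def prob_only_majority(n_answers, n_majority, sample_size):
    """Probability that a sample drawn without replacement holds only majority-class items."""
    if not 0 <= n_majority <= n_answers:
        raise ValueError("n_majority must lie between 0 and n_answers.")
    if not 0 <= sample_size <= n_answers:
        raise ValueError("sample_size must lie between 0 and n_answers.")
    # Number of all-majority samples divided by the number of all possible samples.
    return comb(n_majority, sample_size) / comb(n_answers, sample_size)


# Hypothetical prompt: 100 answers, 90 of which carry the majority label.
for size in (2, 5, 10, 20):
    print(size, round(prob_only_majority(100, 90, size), 3))
```

With such a skewed distribution, very small training samples frequently contain no minority-class example at all, in which case a classifier has nothing from which to learn the second class.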

With the currently available data, we cannot make any claims about the influence of the prompt type itself, e.g., regarding the domain (such as whether biology prompts are easier than literature prompts) or the modality of the prompt (as this would require comparable prompts administered, for example, as both listening and reading comprehension tasks).

### 4.4. Language

In order to compare approaches solely based on the language involved, one would need the same prompts administered to comparable learner populations but in different languages. The only such available datasets we know about are ASAP and ASAP-DE. ASAP-DE uses a subset of the prompts of ASAP translated to German and provides answers from German-speaking crowdworkers (Horbach et al., 2018). These answers were annotated according to the same annotation guidelines. So, while trying to be as comparable as possible, the datasets still differ in the learner population, in addition to the language. Horbach et al. (2018) compared instance-based automatic scoring on the two datasets and found results to be in a similar range with a slight performance benefit for the German data. However, they also reported differences in the nature of the data, potentially resulting from the different learner populations, such as a different label distribution and considerably shorter answers for German, which they attribute to crowdworkers being potentially less motivated than school students in an assessment situation. Therefore, it is unclear whether any of those differences can be attributed to the language difference or to the difference in learner population. More controlled data collections would be needed to get results that are specific to the language difference only. One such data collection with answers from students from different countries and thus various language backgrounds is the data from the PISA studies.<sup>4</sup> Such data would be an ideal testbed to compare learner populations with different native languages on the same prompt administered in various languages.

### 4.5. Learner Population

The results mentioned above for the different languages might equally be used as a potential example for the influence of different learner populations. In order to fully isolate the effect of learner population, one would need to collect the same dataset from two different learner groups such as native speakers vs. language learners or high-school vs. university students. To the best of our knowledge, such data is currently not available.

However, one aspect in which learner populations differ is their tendency to make spelling errors. In experiments on the ASAP dataset, Horbach et al. (2017) found that the amount of spelling errors present in the data did not negatively influence content scoring performance. Scoring performance decreased only if the amount of spelling errors per answer was artificially increased, especially if errors followed a random pattern (unlikely to occur in real data) and if scoring methods relied on the occurrence of certain words and ignored sub-word information (i.e., certain character combinations).

### 4.6. Label Set

When discussing influence factors, we assumed that a dataset with more individual labels is harder to score than a dataset with binary labels. The influence of different label sets was already tested in previous work, especially in the SemEval Shared Task "The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge" (Dzikovska et al., 2013). The SRA dataset used for this challenge is annotated with three label sets of different granularity: two, three, or five labels providing increasing levels of feedback to the learner. The 2-way task just informs learners whether their answer was correct or not. The 3-way task additionally distinguishes between contradictory answers (contradicting the reference answer) and other incorrect answers. In the 5-way task, answers classified as incorrect in the 3-way task are classified in an even more fine-grained manner as "partially correct, but incomplete," "irrelevant for the question," or "not in the domain" (such as I don't know.).

<sup>4</sup>http://www.oecd.org/pisa/

Seven out of nine systems participating in the SemEval Shared Task reported results for each of these label sets. For all of them, performance was best for the 2-way task (with a mean weighted F-score of 0.720 for the best performing system) and worst for the 5-way task (0.547 mean weighted F-score, again for the best performing system, which was a different one than for the 2-way result). This clearly shows the expected effect of more fine-grained label sets being more difficult to score automatically.

## 5. CONCLUSION AND FUTURE WORK

In this paper, we discussed the different influence factors that determine how much variance we see in the learner answers toward a specific prompt and how this variance influences automatic scoring performance. These factors include the type of prompt, the language in the data, the average length of answers as well as the number of training instances that are available. Of course, these factors are interdependent and influence each other. It is thus hard to decide based on purely theoretical speculations whether, for example, medium length answers to a factoid question given by German native speakers annotated with binary scoring labels and with a large number of training instances are easier or harder to score than shorter answers in non-native English with numeric labels and a smaller set of training instances. Such questions can only be answered empirically, but the available datasets do not nearly cover the available parameter space exhaustively, so that such experiments are not possible in a straightforward manner. That makes it hard to compare different approaches in the literature and it is also a challenge to estimate the performance on new data. Therefore, we presented experiments that show the influence of some of the discussed factors on content scoring.

Our findings give researchers as well as educational practitioners hints about whether content scoring might work for a certain new dataset. At the same time, our paper also highlights the demand for more systematic research, both in terms of dataset creation and automatic scoring. For a number of influence factors, we were not able to clearly assess their influence because data that would allow a single influence parameter to be investigated in isolation does not exist. It would thus be desirable for the automatic scoring community to systematically collect new datasets varying only in specific dimensions, for example by administering the same prompt to different learner populations and in different languages, in order to further broaden our knowledge about the full contribution of these factors.

## DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: https://www.kaggle.com/c/asap-sas, https://www.microsoft.com/en-us/download/details.aspx?id=52397 and https://www.cs.york.ac.uk/semeval-2013/task7/index.php%3Fid=data.html.

## AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

## FUNDING

This work has been funded by the German Federal Ministry of Education and Research under grant no. FKZ 01PL16075.

## REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Horbach and Zesch. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multiple-Choice Item Distractor Development Using Topic Modeling Approaches

Jinnie Shin\*, Qi Guo and Mark J. Gierl

Centre for Research in Applied Measurement and Evaluation, Department of Educational Psychology, University of Alberta, Edmonton, AB, Canada

Writing a high-quality, multiple-choice test item is a complex process. Creating plausible but incorrect options for each item poses significant challenges for the content specialist because this task is often undertaken without implementing a systematic method. In the current study, we describe and demonstrate a systematic method for creating plausible but incorrect options, also called distractors, based on students' misconceptions. These misconceptions are extracted from labeled written responses. One thousand five hundred and fifteen written responses from Grade 10 students to an existing constructed-response item in Biology were used to demonstrate the method. Using latent Dirichlet allocation, a topic modeling procedure commonly used in machine learning and natural language processing, 22 plausible misconceptions were identified from students' written responses and used to produce a list of plausible distractors. These distractors, in turn, were used as part of new multiple-choice items. Implications for item development are discussed.

#### Edited by:

Frank Goldhammer, German Institute for International Educational Research (LG), Germany

#### Reviewed by:

Qiwei He, Educational Testing Service, United States Torsten Zesch, University of Duisburg-Essen, Germany

\*Correspondence:

Jinnie Shin jinnie.shin@ualberta.ca; eshin1@ualberta.ca

#### Specialty section:

This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology

Received: 06 November 2018 Accepted: 27 March 2019 Published: 25 April 2019

#### Citation:

Shin J, Guo Q and Gierl MJ (2019) Multiple-Choice Item Distractor Development Using Topic Modeling Approaches. Front. Psychol. 10:825. doi: 10.3389/fpsyg.2019.00825

Keywords: multiple-choice items, distractors, misconceptions, distractor generation, latent dirichlet allocation

### INTRODUCTION

Multiple-choice testing is one of the most enduring and successful forms of educational assessment that remains in practice today. Multiple-choice items are used in educational testing because they permit the measurement of diverse types of knowledge, skills, and competencies (Haladyna, 2004; Downing, 2006; Popham, 2008). Multiple-choice items are efficient to administer; they are easy to score objectively; they can be used to sample a wide range of content; they require a relatively short time to administer (Haladyna, 2004; Haladyna and Rodriguez, 2013; Rodriguez, 2016). Downing (2006, p. 288), in his seminal chapter in the Handbook of Test Development, claimed that selected-response items, like multiple choice, are the most appropriate item format for measuring cognitive achievement or ability, especially higher-order cognitive skills, such as problem solving, synthesis, and evaluation. He also stated that this item format is both useful and appropriate for creating exams intended to measure a broad range of knowledge, ability, or cognitive skills across many domains.

Because of these important benefits, multiple-choice items continue to have broad appeal and, hence, application in education, despite some potential disadvantages, such as guessing effects and unintentionally exposing students to wrong information. North American students take hundreds of multiple-choice tests and answer thousands of multiple-choice items as part of their educational experience. Chingos (2012) reported that one-third of U.S. states use multiple-choice items exclusively for assessing 4th grade and 8th grade students' math and reading skills. In higher education, the multiple-choice test is a common and widely used assessment format for measuring students' knowledge, especially in introductory courses with a large group of students. Multiple-choice testing is also used extensively for international assessments. In the 2015 administration of the Trends in International Mathematics and Science Study (TIMSS), for example, half of the mathematics and science items used the multiple-choice format (Mullis et al., 2016). In the 2015 administration of the Program for International Student Assessment (PISA), two-thirds of the items in the reading, mathematics, and science assessments were multiple choice (OECD, 2016).

A multiple-choice item consists of a stem, options, and auxiliary information. The stem contains context, content, and/or the question the student is required to answer. The options include a set of alternative answers with one correct option and one or more incorrect options or distractors. Auxiliary information includes any additional content, in either the stem or option, required to create an item, including text, images, tables, graphs, diagrams, audio, and/or video. To answer a multiple-choice item, the student is presented with a stem and two or more options that differ in their relative correctness. Students are required to make a distinction among response options, several of which may be partially correct, in order to select the best or most correct option. Hence, the student must use her or his knowledge and problem-solving skills to identify the relationship between the content in the stem and the correct option. The incorrect options are called distractors because their plausibility makes them "distracting" to students with partial knowledge, who may mistake them for the correct option.

Creating multiple-choice items is a challenging task, particularly when it comes to distractor development, because of the sheer volume of work that is required. For example, to create 100 multiple-choice items that each consist of one correct option and four incorrect options, a content specialist has to create 100 stems and 100 correct options. The content specialist also needs to create 400 plausible but incorrect options. This challenge of distractor development is both daunting and, oftentimes, unsuccessful. Haladyna and Downing (1993) evaluated the distractors from four standardized multiple-choice tests. They evaluated the quality and plausibility of distractors based on the attractiveness of distractors. More specifically, they emphasized that plausible distractors should be able to attract more than 5% of the low-performing students, who failed to identify a correct answer. Based on such criteria, they found that only 8% of the items contained effective distractors.

To overcome the challenge of creating large numbers of effective distractors, researchers and practitioners have explored and implemented different strategies. The most common strategy focuses on a list of plausible but incorrect alternatives linked to common misconceptions or errors in thinking, reasoning, and problem solving (Haladyna and Downing, 1989; Case and Swanson, 2001; Vacc et al., 2001; Collins, 2006; Moreno et al., 2006, 2015; de la Torre, 2009; Tarrant et al., 2009; Rodriguez, 2011, 2016). Haladyna and Rodriguez (2013), in their textbook Developing and Validating Test Items, claim that the most effective way to develop plausible distractors using misconceptions is to identify "common errors" elicited by a particular stem in the item prompt. These common errors serve as candidates for plausible distractors. Haladyna and Rodriguez state that common errors can be identified in two ways. First, they can be identified using the judgments of content specialists who have a good understanding of teaching and learning within a specific content area and who can specify the common errors and misconceptions that arise when students learn a new topic or concept. Second, they can be identified by evaluating student answers to constructed-response items (i.e., items that contain a stem but no options) where errors in reasoning, thinking, and problem solving are documented in the students' responses. The second approach, extracting student responses from constructed-response items, is the preferred strategy for identifying common errors because it is based on the actual response processes from students rather than the expected response processes inferred from the judgment of content specialists about how students respond to test items. However, identifying and extracting common errors and misconceptions from the actual response processes is a daunting task because large amounts of response data must be processed and these data, in turn, must be classified accurately in order to identify outcomes that could be used as distractors.

The purpose of this study is to introduce an augmented intelligence approach for systematically identifying and classifying misconceptions from the students' written responses that are pre-labeled for the purpose of creating distractors that can be used for multiple-choice items. Augmented intelligence is an area within artificial intelligence that deals with how computer systems can emulate and extend human cognitive abilities thereby helping to improve human task performance and to enhance human problem solving (Zheng et al., 2017). It requires the interaction between a human and a computer system in order for the system to produce an output or solution. Augmented intelligence combines the human capacity for judgment with the ability of modern computing using computational analysis and data storage to solve complex and, typically, unstructured problems. Augmented intelligence can therefore be used to characterize any process or system that improves the human capacity for solving complex problems by relying on a partnership between a human and a machine (Pan, 2016; Popenici and Kerr, 2017).

We introduce and demonstrate an augmented intelligence method that can be used for distractor development using latent Dirichlet allocation (LDA; Blei et al., 2003). LDA is a statistical model used in machine learning and natural language processing which identifies specific topics and concepts within written texts. Specific words are expected to appear in a written text more or less frequently given a particular topic. LDA captures this expectation in a mathematical framework by focusing on the number of times words appear in written text for different topics. Using LDA, content specialists can identify actual misconceptions based on students' response processes in order to create lists of plausible distractors.

### Traditional Approach for Distractor Development


Distractors are one of the key components that affect the overall quality of multiple-choice items as well as the item's statistical characteristics (Gierl et al., 2017). Distractors are intended to distinguish students who have not yet acquired the knowledge necessary to answer the item correctly from those who understand the content. Therefore, distractors in a multiple-choice item are designed to contain plausible but incorrect answers based on students' common errors or misconceptions so that the option can measure students' level of mastery in a specific content area (e.g., Case and Swanson, 2001; Ascalon et al., 2007; Hoshino, 2013; Towns, 2014; Lai et al., 2016). Creating distractors using common errors and misconceptions results in multiple-choice items with increased diagnostic value as well as higher item quality (Haladyna and Downing, 1989; Case and Swanson, 2001; Briggs et al., 2006; Moreno et al., 2006, 2015; de la Torre, 2009; Tarrant et al., 2009; Rodriguez, 2011, 2016).

Haladyna and Rodriguez (2013) claimed that common errors and misconceptions could be identified using two different approaches. In the first approach, content specialists create individual distractors by hand that contain these common errors and misconceptions. Collins (2006) recommended that content specialists mimic students' problem-solving processes by answering questions such as, "what is a common error for solving this problem?" and "what do students usually confuse this concept or idea with?" in order to identify plausible distractors. The most appealing aspect of this method lies in its practicality and ease of implementation. The distractors are created by content specialists familiar with the students and the content area to mimic the typical and common problems that are most likely to occur. While this approach is feasible, it is also based on three assumptions. First, plausible algorithms, rules, or sources of information can be specified by content specialists. Second, plausible but incorrect distractors can be produced using these sources. Third, the misconceptions identified by the content specialists from these sources are, in fact, the same misconceptions held by the students. Proper alignment of the assumptions is critical for creating distractors that measure students' actual errors and misconceptions. Moreover, the alignment must occur for each distractor across every multiple-choice item. Using our earlier example, if a content specialist writes 100 multiple-choice items and each item contains five options (i.e., one correct option and four distractors), then the content specialist must identify 400 plausible but incorrect alternatives that satisfy these three assumptions.

In the second approach, students' responses from existing constructed-response items are evaluated to identify common errors and misconceptions. That is, content specialists review students' responses from constructed-response items to identify mistakes, errors, and misunderstandings and then classify these outcomes to create a compiled list of plausible distractors (e.g., Bekkink et al., 2016). This approach addresses the inferential problem associated with the previous approach because it is based on actual student response data rather than judgments about expected response processes. In other words, approach two is data driven. Common errors and misconceptions identified using approach two come from the algorithms, rules, or sources of information used by students to produce incorrect answers. Unfortunately, the second approach is neither practical nor easy to implement. As it is currently implemented, approach two is daunting because it entails a comprehensive review of students' written responses using a manual process with the goal of identifying common errors and misconceptions that occur consistently and systematically. It is also a process fraught with interpretive problems because identifying common errors and misconceptions that occur systematically can be a subjective task (e.g., what are the characteristics of a systematic misconception?). And, despite the potential benefits of using a data-driven approach, practicality also dictates that the item development process should be relatively quick and efficient, even when large numbers of multiple-choice items are required. This requirement is challenging to address using the second approach, especially when large amounts of written text are available from a constructed-response item.

To date, limited research has been conducted to investigate the application of augmented intelligence for the purpose of distractor development. Researchers have explored the significance of using students' misconceptions and common errors to create distractors. The approach used in these studies was based on identifying misconceptions using students' written or verbal responses that, in turn, were manually categorized by content specialists to identify common errors and misconceptions (e.g., Vacc et al., 2001; Haladyna and Rodriguez, 2013; Moreno et al., 2015; Bekkink et al., 2016; Rodriguez, 2016). As noted earlier, a data-driven approach using students' responses is inherently beneficial for identifying the actual errors and misconceptions that students use when they produce incorrect answers. But it is also inherently limited because it is excessively time consuming and labor intensive to identify and classify errors from written text using a manual review process. To overcome this limitation, we introduce and illustrate a data-driven method for creating distractors based on students' common errors and misconceptions using LDA.

### Topic Modeling and Latent Dirichlet Allocation

Locating keywords and topics to understand text is a simple and effective way for humans to classify textual information. To gather information about certain topics, for example, we often start by generating one or two key words to locate relevant documents that share common topics. Unfortunately, this approach quickly becomes unmanageable for humans when the amount of textual information begins to increase. For example, having content specialists manually review thousands of students' responses to identify and then categorize common errors would be a time-consuming and inefficient classification exercise.

To overcome this clustering challenge, topic modeling has been developed and used with machine learning and natural language processing algorithms to uncover the hidden topics in a document (Blei, 2012). These hidden topics can be identified without any pre-labeling, which means that topic models do not require pre-categorized or topic-labeled documents. In machine learning, such problems are described as unsupervised learning: the targets or outputs are unknown, and hence the primary focus of learning is to understand the structure of the data. Therefore, in topic modeling, we attempt to identify the hidden or unobserved targets (topics) using the fully observed information (words).

If we assume that a sequence of words in a document is governed by the same unobserved topic, then we could simply compute the likelihood that a document represents a certain topic to determine the underlying topic of the document in an unsupervised setting. To find the common topics, topic modeling uses word occurrence information, where certain words are expected to appear in a document more or less frequently depending on a particular topic. LDA is a generative probabilistic topic modeling algorithm (Blei et al., 2003), where each document is perceived as a mixture of several topics. Generative models take the information of how observed data were generated into account to build a model. Suppose, for instance, we have documents that were generated by complex procedures that are unknown.

Latent Dirichlet allocation attempts to synthesize an approximated generation procedure and observed information (i.e., words) to uncover hidden topics, without any labels. Moreover, unlike other topic modeling approaches, LDA not only produces interpretable topics but can also assign topics to unseen documents. The generative process of LDA consists of three layers: sampling a topic distribution, sampling topics, and sampling words over topics. For example, after the number of words (or document length) and the number of topics are decided, a topic distribution is specified (e.g., 40% biology, 30% kinetics, and 30% psychology). Next, a topic is picked based on the topic mixture distribution and a word is picked based on the distribution over words corresponding to the topic. This process is then repeated until all the words are generated for each document. **Figure 1** presents a graphical representation of the generative process of LDA.
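The three sampling layers described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the topic names follow the example in the text, while the vocabulary, the word probabilities, the symmetric Dirichlet parameter `alpha`, and the function names are all hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw a probability vector from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, n_words, alpha, rng):
    """Generate one document with the three sampling layers of LDA."""
    topics = list(topic_word)
    # Layer 1: sample this document's topic distribution.
    mixture = sample_dirichlet([alpha] * len(topics), rng)
    document = []
    for _ in range(n_words):
        # Layer 2: sample a topic from the document's topic mixture.
        topic = rng.choices(topics, weights=mixture)[0]
        # Layer 3: sample a word from that topic's distribution over words.
        words = list(topic_word[topic])
        weights = [topic_word[topic][w] for w in words]
        document.append(rng.choices(words, weights=weights)[0])
    return mixture, document

# Hypothetical topic-word distributions (illustrative values only).
TOPIC_WORD = {
    "biology": {"cell": 0.5, "membrane": 0.3, "protein": 0.2},
    "kinetics": {"rate": 0.6, "energy": 0.4},
    "psychology": {"memory": 0.7, "behavior": 0.3},
}

mixture, document = generate_document(
    TOPIC_WORD, n_words=10, alpha=1.0, rng=random.Random(0)
)
```

Fitting LDA runs this story in reverse: only `document` is observed, and the topic mixture and the topic-word distributions have to be inferred.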

Given this process, LDA attempts to explore the hidden topics in a document by computing a posterior distribution of the hidden variables given a document. Because of the large number of possible topic structures, the probability of certain words under a specific topic (i.e., the distribution over words corresponding to the topic) cannot be computed exactly. To address this problem, LDA uses a method called Gibbs sampling (Porteous et al., 2008), where each word in the document is first randomly assigned to one of the topics, which provides the initial guess of the word-topic and word-document distributions. The sampler then assumes that all topic assignments except for the current word in question are correct, and updates the assignment of the current word. This process is repeated to improve the assignment until a steady state is reached. Once the final assignment is identified, it is used to estimate the topic mixtures of each document.
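The update loop described above can be illustrated with a small collapsed Gibbs sampler. This is a textbook-style sketch under stated assumptions, not the accelerated sampler of Porteous et al. (2008) and not the code used in the study; the smoothing hyperparameters `alpha` and `beta`, the function name, and the toy documents are hypothetical.

```python
import random

def gibbs_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # words per topic in each doc
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics                          # words per topic overall
    assignments = []

    # Initial guess: assign every word to a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word's assignment; treat all others as correct.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Estimate each document's topic mixture from the final assignment.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word

toy_docs = [
    ["cell", "membrane", "cell", "protein"],
    ["rate", "energy", "rate"],
    ["cell", "energy", "membrane"],
]
theta, topic_word = gibbs_lda(toy_docs, n_topics=2, n_iter=50)
```

Each row of `theta` is the estimated topic mixture for one document, and `topic_word` holds the word counts per topic after the last sweep.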

### Model Evaluation and Augmented Intelligence

While topic models can be used to extract meaningful and interpretable topic assignments, evaluating the final assignment is challenging using an unsupervised approach (Chang et al., 2009). Unsupervised learning tasks do not include pre-labeled targets. Instead, human judgment is required to evaluate the practicality and usefulness of the topic modeling performance (Konrad, 2017). For example, the practicality of the topic model could be evaluated using the "human-in-the-loop" augmented intelligence approach, where humans are asked to locate a randomly substituted word or topic (Chang et al., 2009). If the human can reliably tell which one is a random intruder, then we can say that the trained model yields a coherent and discernible topic (Chang et al., 2009). In addition, intrinsic measures (i.e., statistical measures) should also be considered for model evaluation. Such measures help evaluate how well the model fits the observed data.

Log-likelihood evaluates the probability of the observed data, given the model (Griffiths and Steyvers, 2004). Thus, we can locate the best model by attempting to produce the highest log-likelihood measure. The Kullback-Leibler (KL) divergence measure focuses on measuring the divergence among the topic distributions. KL divergence explicitly focuses on evaluating how much information we lose when we choose a certain model, by computing the symmetric KL divergence between the distribution of variance in the topic-word distribution and the marginal topic distribution (Cao et al., 2008; Arun et al., 2010). Thus, the best model can be determined by locating the point where the KL divergence measure reaches the lowest value (Arun et al., 2010).
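As a small illustration of the building block behind this measure, the symmetric KL divergence between two discrete distributions can be computed as follows. The sketch covers only the divergence itself; deriving the two distributions from a fitted model, as in Arun et al. (2010), is not shown, and the function names and example vectors are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical distributions over three topics.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
divergence = symmetric_kl(p, q)
```

The value is zero when the two distributions are identical and grows as they diverge, which is why the candidate model with the lowest value is preferred.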

Previous research has been conducted to demonstrate the usefulness of LDA for different types of topic modeling assignments. In education, for example, LDA has been used to uncover topics for essay scoring purposes (Meisner, 2018), implementing course recommendation systems (Apaza et al., 2014), and evaluating teachers (Moretti et al., 2015). However, to our knowledge, LDA has never been used to identify students' errors and misconceptions for the purpose of creating distractors that could be used to create multiple-choice items. Therefore, the purpose of the study is to describe a method for creating distractors by identifying students' misconceptions using the LDA topic modeling approach. Unlike the traditional approach, where content specialists are responsible for using their judgment to analyze and evaluate students' responses in order to identify plausible misconceptions for distractor development, the current study provides a systematic and data-driven method to cluster students' written responses with similar underlying concepts in order to locate common mistakes. Once clustered, these responses become the basis for creating plausible distractors.

### MATERIALS AND METHODS

#### Data

An open source data set collected and released from the short-answer scoring competition called Automated Student Assessment Prize (ASAP) was used in the study<sup>1</sup>. As the data set is publicly available, ethical approval was not sought in the study. ASAP was held in 2012. The competition was designed to promote the capabilities of effective scoring systems using automated essay scoring frameworks and to provide efficient classroom essay scoring tools for practitioners. The competition included two phases. The first phase focused on developing robust automated scoring frameworks for relatively long responses (up to 650 words). The second phase focused on scoring short responses (up to 50 words). Both competitions significantly contributed to promoting open and rigorous model development for automated essay scoring (Shermis, 2014, 2015).

For the short-essay scoring competition, 10 data sets were released and each data set was generated from a single prompt. The responses were produced by students in grade 10. Each data set was based on a unique prompt in different disciplines, such as Language Arts, Biology, and Science. All the responses were pre-labeled, having been scored by two human raters. The current study used data set six from Biology to demonstrate the proposed method. This data set was chosen for three reasons. First, the current method requires a pre-labeled data set, and data set six included the resolved score (or final score) based on the agreement of the two human raters. Second, the prompt required students to respond using multiple answers, thereby producing a variety of diverse responses from a single prompt. Third, the original constructed-response prompt could be easily reformatted into a multiple-choice stem.

More specifically, we used 1,515 responses from the original training set, where students were asked to list and describe three processes used by cells to control the movement of substances across the cell membrane (see **Appendix A**). These training responses were selected based on the score assigned by two independent human raters. The final score corresponded to the number of correctly identified answers, and we only selected the responses where students failed to identify any correct answer (i.e., score 0), as the focus of this study is on extracting common errors and misconceptions.

#### Distractor Development Stage 1: Data Preparation

To achieve clear and interpretable clusters of topics, preprocessing is required. First, all of the misspelled words were corrected. Second, words were converted into lowercase and lemmatized using the Python NLTK library (Bird et al., 2009). Lemmatization is the process of grouping words together so they can be analyzed as a single item based on their dictionary form. For example, the words 'studies' and 'studying' would be lemmatized into 'study.' Third, digits, non-alphabetic words (e.g., #, %, &, @), and stop words (e.g., a, and, but, how) were removed and all punctuation was specified as a separate word. Fourth, responses were separated into sentences, allowing each sentence to be denoted as a separate topic.
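The second through fourth steps can be sketched as below. This is a self-contained illustration rather than the study's pipeline: the tiny `LEMMAS` dictionary and `STOP_WORDS` set are hypothetical stand-ins for the NLTK lemmatizer and stop-word list, the spelling-correction step is omitted, and treating sentence-level punctuation marks as tokens to keep is one reading of the description above.

```python
import re

# Hypothetical stand-ins for the NLTK stop-word list and lemmatizer.
STOP_WORDS = {"a", "and", "but", "how", "the", "to", "of"}
LEMMAS = {"studies": "study", "studying": "study", "moves": "move"}
PUNCTUATION = set(".,;:!?")

def preprocess(response):
    """Lowercase, split into sentences, filter tokens, and lemmatize."""
    text = response.lower().strip()
    # Split into sentences after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    processed = []
    for sentence in sentences:
        # Words, digit runs, and single symbols each become one token.
        tokens = re.findall(r"[a-z]+|\d+|[^\w\s]", sentence)
        kept = []
        for token in tokens:
            if token.isdigit():
                continue  # drop digits
            if not token.isalpha() and token not in PUNCTUATION:
                continue  # drop symbols such as #, %, &, @
            if token in STOP_WORDS:
                continue  # drop stop words
            kept.append(LEMMAS.get(token, token))
        processed.append(kept)
    return processed

example = preprocess("The cell studies 3 proteins & moves them. Studying osmosis!")
```

Each inner list of `example` holds the cleaned tokens of one sentence, which is the unit that is later clustered.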

Pre-processing also involved spelling correction using a combination of several approaches. We used a word embedding-based model for spelling correction. Word embedding-based models use the semantic similarities of words to determine the best candidate for a misspelled word (Nagata et al., 2017, see **Appendix C**). We used the list of words provided in the pre-trained GloVe embedding (Pennington et al., 2014), which was trained on six billion words from Wikipedia 2014 and Gigaword 5. We attempted to locate the best candidate for an incorrect word from the GloVe embedding word list based on a cosine-similarity score. Using the embedding-based spell correction, we successfully corrected more than 95% of the misspelled words, while the remaining misspelled words that could not be fixed with these methods were corrected manually. This approach was chosen after existing spell checkers in Python produced lower correction rates than expected (e.g., NLTK edit-distance with 78% correction). Such cases often included words that were significantly malformed and thus bore very limited resemblance to the correct form.

<sup>1</sup>www.kaggle.com/c/asap-sas

#### Distractor Development Stage 2: Topic Clustering and Cluster Evaluation

The LDA model was constructed using the Python library lda 1.0.5. To generate clear and interpretable clusters of topics, model training and evaluation took place simultaneously. To enable flexible and robust learning, it is necessary to identify the ranges of several model parameters so the model with the optimum values can be identified. For example, the number of topic groups must be specified before training begins. The number of Gibbs sampling iterations must also be specified to train the model. To begin, the number of topics ranged from 1 to 50, and up to 800 sampling iterations were run. These ranges were selected so that we could extract as many potential misconceptions as possible with a stable estimation. We set the upper limit of the number of topics to a relatively large number, 50, so that the model could conduct a comprehensive categorization of common errors and misconceptions. In terms of the number of iterations, we evaluated the negative log-likelihood of the model at every 10 iterations and inspected whether a significant decrease or increase in log-likelihood occurred. The significance was evaluated based on a chosen tolerance value of 0.5. The results indicated that the log-likelihood stabilized around 800 iterations. The performance of our initial model was evaluated using the perplexity measure. Perplexity is a commonly used topic-model measure that is computed from the negative log-likelihood normalized by the number of words (see **Appendix C**). As the name suggests, perplexity provides the degree of 'uncertainty' or 'confusion' the model has in assigning probabilities to text. Therefore, we could determine the optimal number of topics by locating the model with the lowest perplexity.
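The bookkeeping around this search can be sketched as follows. The sketch uses the conventional per-word definition of perplexity (the exponential of the negative log-likelihood divided by the number of words), which may differ in detail from the formula in Appendix C. Reading the tolerance of 0.5 as an absolute change between successive checks is an assumption, and the function names and all numeric values are hypothetical.

```python
import math

def perplexity(log_likelihood, n_words):
    """Per-word perplexity: exp of the negative log-likelihood per word."""
    return math.exp(-log_likelihood / n_words)

def has_stabilized(log_likelihoods, tolerance=0.5):
    """True when the last two recorded log-likelihoods differ by less than tolerance.

    Assumes the tolerance is an absolute difference between successive checks.
    """
    if len(log_likelihoods) < 2:
        return False
    return abs(log_likelihoods[-1] - log_likelihoods[-2]) < tolerance

def best_topic_count(perplexity_by_topics):
    """Return the number of topics whose model has the lowest perplexity."""
    return min(perplexity_by_topics, key=perplexity_by_topics.get)

# Hypothetical log-likelihoods recorded every 10 iterations.
trace = [-5200.0, -4810.5, -4790.2, -4790.0]
# Hypothetical perplexities for three candidate models (not results from the study).
candidates = {5: 61.0, 15: 48.5, 25: 52.3}
chosen = best_topic_count(candidates)
```

In practice, `trace` and `candidates` would be filled by fitting one model per candidate number of topics and recording its log-likelihood during training.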

Then, the topic clusters were visualized to evaluate the clustering. Topic clusters were projected in a two-dimensional space by computing the distance between topics using t-distributed stochastic neighbor embedding (t-SNE). t-SNE is a dimensionality reduction algorithm for high-dimensional data visualization. The idea of t-SNE is to find a probability distribution that is a function of the smallest number of coordinates and to create a similar distribution function to reduce the dimensionality. Assume that we want to calculate the probability of finding two points $i$ and $j$ at the squared Euclidean distance between the points, $\lVert x_i - x_j \rVert^2$. t-SNE attempts to match the distribution using a Student's t-distribution, while attempting to learn the coordinates of the points (i.e., $y_i$ and $y_j$) in the lower dimension. If the visualized clusters are significantly overlapping and malformed, then the number of topics should be adjusted. In addition, the KL divergence was used as an evaluation criterion for the visualization because it helps determine the similarity of the two distributions. The learning algorithm attempts to create a clear visualization of distinctive topic clusters while minimizing KL divergence to locate the optimal model. To do so, several adjustments were necessary to determine the number of iterations, the learning rate, and the perplexity rate. While the number of iterations and the learning rate determine the efficiency and accuracy of model learning through controlling for the weight adjustments, the perplexity rate controls for the effective number of cluster neighbors. Finally, interpretability of the clusters was evaluated by summarizing the clustered sentences using the Python library gensim summarization. Gensim summarization conducts a text rank-based summarization using a variation of the TextRank algorithm (Barrios et al., 2016). TextRank attempts to construct a graph from a document, where sentences (or nodes) are connected with each other via edges. Edges represent the similarity between the sentences, which is often computed based on the word overlap between the two sentences. TextRank hypothesizes that the most important sentence in a text is the one that is the most frequently connected in a graph. We chose this approach as previous studies have demonstrated relatively good performance using the method, while it does not require any manual annotation (Mihalcea and Tarau, 2004). The summaries were created so that content specialists could effectively evaluate the plausibility of the extracted common errors and misconceptions.
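The graph-based ranking described above can be sketched in plain Python. This is an illustration of the original word-overlap TextRank of Mihalcea and Tarau (2004), not the variation implemented in Gensim. The `gensim.summarization` module was removed in Gensim 4.0, so reproducing this step requires an older release or a separate implementation. The damping factor, iteration count, function names, and toy sentences are hypothetical choices.

```python
import math

def sentence_similarity(a, b):
    """Word overlap normalized by the log lengths of the two tokenized sentences."""
    denominator = math.log(len(a)) + math.log(len(b))
    if denominator <= 0:
        return 0.0  # single-word sentences cannot be normalized
    return len(set(a) & set(b)) / denominator

def textrank(sentences, damping=0.85, n_iter=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    n = len(sentences)
    # Edge weights: similarity between every pair of distinct sentences.
    weights = [
        [sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(n_iter):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores

# Hypothetical tokenized sentences from one cluster.
cluster = [
    ["cell", "membrane", "control", "movement"],
    ["membrane", "control", "protein"],
    ["cell", "movement", "energy"],
    ["dna", "carry", "message"],
]
scores = textrank(cluster)
summary_index = scores.index(max(scores))
```

The highest-scoring sentence is the one most strongly connected to the rest of the cluster and would serve as its one-sentence summary.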

In the study, we refer to content specialists as the experts who are experienced in item writing in particular subjects. With this type of content expertise, validating the plausibility of summarized common errors and misconceptions could improve the quality of distractors which are generated from each topic cluster. To do so, content specialists could discuss and attempt to identify where each misconception originated. For example, if the content of a cluster includes words that are morphologically or phonetically similar to correct answers, the specialists could conclude that the misconception originated from confusion in recalling certain terminologies or associating a term with a correct definition. Also, content specialists could be encouraged to answer more concrete questions to evaluate the quality of clusters. Such questions could include, "How many of the clusters do you find meaningful?" and "Is the cluster describing a commonly well-identified misconception regarding the topic?" This would help content specialists to evaluate distractors thoroughly, while providing important information to evaluate the capacity of the current system.

#### Distractor Development Stage 3: Item and Distractor Formation

In stage 3, content specialists formulate distractors using the common errors and misconception clusters identified in the previous stage. We propose several methods that could promote more systematic distractor development using students' misconceptions. The distractor generation process can be distinguished based on the question type (or stem) that content specialists pose regarding a topic. First, the content specialists could decide to change the format of the original question from the constructed-response item to a multiple-choice item format, while attempting to measure the same construct of interest (e.g., which of the following procedures is correct about cell movement?). In this case, we could use the cluster summarizations and the key words and phrases directly. In stage 2, we explored how each misconception cluster can be represented using key words and summarization. Thus, key words or summarized sentences used as distractors should effectively attract students with different levels of understanding. Alternatively, content specialists could develop a question that focuses on specific sub-concepts of a topic. Active or passive transport could be good examples of sub-concepts to evaluate that are closely associated with the original question. In this case, distractors could be directly located based on students' responses from the cluster, where students appeared to have trouble understanding the concepts of active and passive transport. We will present how the two methods can be utilized more thoroughly using examples in the next section.

Generating distractors using students' misconceptions has been identified as one of the most effective ways of developing multiple-choice items (Haladyna and Rodriguez, 2013). However, with our augmented intelligence approach, which requires content specialists' judgment in the evaluation process, we believe the effectiveness of distractors could still significantly depend on the content specialists' judgments. Therefore, while we encourage further studies on the effectiveness of the distractors generated using the proposed methods, it was outside the scope of our research to provide empirical results on the behavior of distractors in a real test setting. We will discuss such concerns more thoroughly in the limitation section with several suggestions for future research.

### RESULTS

### Topic Clustering and Cluster Evaluation Results

In the original constructed-response item, students were asked to provide three correct responses to the following item: "List and describe three processes used by cells to control the movement of substances across the cell membrane." The results indicated that the optimal LDA model identified 22 common misconceptions. The number of topic clusters was selected based on the log-likelihood measure as well as the KL divergence. The model achieved a perplexity of 34.76 after 800 iterations and the lowest KL divergence of 40.50 with 22 topics. As discussed earlier, the log-likelihood measure provides the probability of the observed data given the model (Griffiths and Steyvers, 2004).
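The two selection measures named above can be illustrated with a minimal sketch. It assumes the standard definition of perplexity as the exponential of the negative average per-token log-likelihood and the standard KL divergence between two discrete distributions. The function names and the toy log-likelihoods are hypothetical, and the exact KL-based criterion used in the study is not specified in this section, so this should not be read as the authors' implementation.

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Perplexity = exp(-log-likelihood / number of tokens); lower is better."""
    return math.exp(-log_likelihood / n_tokens)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pick_num_topics(scores):
    """Return the candidate number of topics with the lowest score."""
    return min(scores, key=scores.get)

# Hypothetical log-likelihoods for three candidate models, each evaluated
# on the same 1,000 tokens (toy values, not data from the study).
log_liks = {10: -3900.0, 22: -3550.0, 30: -3700.0}
perplexities = {k: perplexity(ll, 1000) for k, ll in log_liks.items()}
best_k = pick_num_topics(perplexities)
```

In practice the same comparison would be run over a wider grid of candidate topic numbers, and a second criterion such as a KL-based measure could be inspected alongside perplexity before settling on one value.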

In addition, the interpretability and plausibility of each topic cluster were evaluated using extracted key words and summaries. A full list of topic key words and summaries can be found in **Appendix B**. Six to eight topic key words were used for each topic cluster. They were chosen based on their strength of association with the topic cluster, where strength was measured by the weight assigned to each word. In addition, summaries were generated for each cluster to increase their interpretability. This information was designed to help the content specialists interpret students' common errors and misconceptions and evaluate the representativeness of the clusters to form plausible distractors. For example, topic 20 included several key words, such as 'mRNA,' 'RNA,' 'tRNA,' 'DNA,' 'information,' 'translation,' 'transcription,' and 'messages.' Content specialists formed their initial impression of each misconception based on these key words. In addition, by reading the summary, which states "mRNA carries messages from the nucleus to other organs tRNA transports DNA to places with in the cell rRNA," content specialists can understand the specific contexts and associations among the key words more thoroughly, so they can make more informed decisions about whether the cluster could be used to create a plausible distractor that represents a common error or misconception.
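Selecting key words by weight can be sketched as a simple ranking. The function name and the weights below are hypothetical toy values, not the estimates reported in Appendix B.

```python
def top_key_words(word_weights, n=8):
    """Return the n words with the largest weights for one topic cluster."""
    ranked = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Hypothetical word weights for a single topic (toy values only).
topic_weights = {"mrna": 0.21, "rna": 0.17, "trna": 0.12, "dna": 0.11,
                 "information": 0.06, "translation": 0.05, "cell": 0.01}
key_words = top_key_words(topic_weights, n=6)
```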

### Item and Distractor Formation Results

A set of distractors was generated using the evaluated clusters of students' common errors and misconceptions. In addition to creating distractors for the originally proposed item, where students were required to describe three processes used by cells to control the movement of substances across the cell membrane, we explored the capacity of the current method to generate distractors for additional cluster-specific items. The following examples provide a step-by-step breakdown of the distractor generation procedures.

#### Example 1: Generating Distractors for the Original Prompt

As shown in **Figure 2**, a multiple-choice item was created from the original constructed-response item. Reflecting the original prompt, the stem was changed to "What are the three processes used by cells to control the movement of substance across the cell membrane?" To generate distractors that each reflect a different common error or misconception, the list of options was created by locating, within each misconception topic cluster, students' responses that contained key words from the stem, such as 'processes,' 'movement,' or 'substances.' More specifically, option g represents cluster 13 (see **Appendix B**), where students describe the movement of the flagellum as part of the movement of substances across the cell membrane. In this example, the correct answer is i, while the other options were produced to represent students' misconceptions.
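The step of locating responses that contain key words from the stem can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the rule of keeping the response with the most key-word matches, and the toy responses are all hypothetical, and in the described workflow content specialists would still review every candidate.

```python
import re

def tokens(text):
    """Lowercased alphabetic tokens of a response, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def candidate_distractors(clusters, stem_key_words):
    """For each cluster, pick the response sharing the most stem key words."""
    wanted = set(stem_key_words)
    picks = {}
    for cluster_id, responses in clusters.items():
        best = max(responses, key=lambda r: len(tokens(r) & wanted))
        if tokens(best) & wanted:        # skip clusters with no match at all
            picks[cluster_id] = best
    return picks

# Toy responses grouped by cluster id (hypothetical, not study data).
clusters = {
    13: ["the flagellum controls movement of the cell", "cells have a tail"],
    20: ["mrna carries messages from the nucleus"],
}
options = candidate_distractors(clusters, ["processes", "movement", "substances"])
```

Restricting `clusters` to a single cluster id gives the cluster-specific variant of the same search.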

#### Example 2: Generating Distractors Using Additional Prompts

FIGURE 2 | An example question and distractors generated for the original prompt.

As shown in **Figure 3**, the proposed method could be extended to generate distractors for cluster-specific items. Cluster-specific items are items generated to further evaluate students' understanding in relation to the misconceptions captured in a particular content cluster. For example, **Figure 3** introduces two cluster-specific items, which were posed based on students' responses in cluster 2 (see **Appendix B**). In cluster 2, students had trouble correctly explaining and distinguishing between the two concepts of active and passive transport. Therefore, to evaluate students' understanding of active and passive transport, two additional multiple-choice stems were created: "Which of the following is true about active transport?" and "Which of the following is true about the passive transport?" To generate distractors for the cluster-specific items, we implemented the same process, where the key words and phrases (i.e., active transport, passive transport) were used to locate students' responses that included these key terms. Unlike the first example, the distractors were only located among the responses in cluster 2, as the items were created based on cluster 2. The correct options are a and b, respectively.

FIGURE 3 | Example questions and distractors generated for the sub-topics of the original prompt.

### DISCUSSION

The recent introduction of different applications of augmented intelligence in educational assessment has brought about dramatic changes in the field by promoting efficient new test development and administration procedures (Popenici and Kerr, 2017). Augmented intelligence, which is a branch of artificial intelligence, helps content experts broaden their capabilities and make more informed decisions in a timely manner with appropriate technological support. For instance, with a machine-aided scoring system, experts can score essays more efficiently because the machine can be used to help distinguish problematic essays that fail to map onto a scoring rubric from more coherent essays. Currently, little research has been conducted to investigate the application of augmented intelligence in item development, especially as it relates to creating distractors. Effective distractors attract students with a partial understanding; in other words, they discriminate students who have not yet reached the mastery level of comprehension regarding the concept. Thus, generating effective distractors is directly associated with increasing the quality of an item and its characteristics (i.e., item difficulty and discrimination; DiBattista and Kurzawa, 2011). Studies have been conducted to explore the significance of using students' misconceptions and common errors to create distractors (e.g., Vacc et al., 2001; Moreno et al., 2015; Rodriguez, 2016). Misconceptions are typically gathered using students' written or verbal responses on similar or connected topics, and content experts manually categorize and identify plausible misconceptions using the written response evidence (Bekkink et al., 2016). In other cases, content experts attempt to mimic students' thought processes in order to identify plausible errors (Haladyna and Rodriguez, 2013). However, these approaches are unfeasible when large numbers of items must be created. To overcome this limitation, we introduced and illustrated a data-driven method for generating distractors based on misconceptions from students' written responses using the workflow presented in **Figure 4**.

It is important to acknowledge that the current methods attempt to incorporate both machine- or data-driven and expert-driven approaches harmoniously in every stage. While the data-driven approach provides prominent benefits in facilitating a systematic and effective distractor generation process, we believe that intervention from experts could help improve the system by acting as a gatekeeper for quality assurance of the final product, the distractors. Especially in educational assessments, content experts' decisions are often considered a reference or gold standard in making the ultimate high-stakes decisions. The steps in the **Figure 4** workflow were used to identify 22 distinct clusters of common errors and misconceptions using students' written responses from a constructed-response item in Biology. In the first data processing stage, we primarily used the data-driven approach to pre-process the responses (e.g., lemmatization, tokenization, and removal of punctuation and non-alphabetic words). Also, while we corrected the majority of misspelled words using the embedding-based approach, a few manual corrections were still required. In the response analysis stage, clusters were created automatically using a topic-modeling approach. Content experts were then required to evaluate the interpretability and plausibility of the extracted clusters. This information was used to generate a list of 22 plausible distractors that, in turn, helped create a parallel multiple-choice item. A parallel multiple-choice item refers to an item originally presented as a constructed-response task that has been reformatted into a selected-response task. The quality of the generated distractors can be further evaluated empirically by pilot testing in a classroom evaluation setting, and we will discuss the evaluation of item characteristics in more detail in the next section.
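The pre-processing steps listed above (tokenization, removal of punctuation and non-alphabetic words, and lemmatization) can be sketched with the standard library alone. The tiny lemma table is a hypothetical stand-in for a full lemmatizer, the embedding-based spelling correction is omitted, and none of this is the authors' actual pipeline.

```python
import re

# Tiny stand-in lemma table; a real pipeline would use a full lemmatizer.
LEMMAS = {"cells": "cell", "moves": "move", "membranes": "membrane",
          "substances": "substance"}

def preprocess(response):
    """Lowercase, tokenize, drop non-alphabetic tokens, then lemmatize."""
    raw_tokens = response.lower().split()
    cleaned = [re.sub(r"[^\w]", "", t) for t in raw_tokens]  # strip punctuation
    alphabetic = [t for t in cleaned if t.isalpha()]          # drops e.g. "3"
    return [LEMMAS.get(t, t) for t in alphabetic]

example = preprocess("Cells move 3 substances, across membranes!")
```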

### Implications for Future Research

The current study has implications for distractor writing practices, specifically, and item development, more generally. Topic modeling allows content experts to use student responses in a more adaptive and productive way. Written responses represent an enormous source of valuable information about students' understanding, which is not only related to the construct of interest, but also to misconceptions about that construct. To date, little effort has been spent exploring the use of machine learning methods for gathering and using information about misconceptions that can be found in students' constructed responses. Using the method described and illustrated in this study, researchers and practitioners can now use the written responses gathered in assignments and tests to plan future lessons and to create more student-adapted learning activities and assessments. The method can also be used to provide evidence for students' developmental level of understanding about certain concepts. For example, by analyzing the responses from the higher-ability group and comparing the misconception clusters with the ones from the lower-ability group, more in-depth information can be gathered to create a comprehensive picture of how students' level of understanding develops on specific concepts and within specific content areas.
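One simple way to compare misconception clusters across ability groups is to contrast the share of responses that falls into each cluster. The sketch below assumes each response has already been assigned to a single cluster; the function names and the toy assignments are hypothetical.

```python
from collections import Counter

def cluster_shares(assignments):
    """Proportion of responses assigned to each cluster id."""
    counts = Counter(assignments)
    total = len(assignments)
    return {c: n / total for c, n in counts.items()}

def share_gaps(lower_group, higher_group):
    """Lower-ability share minus higher-ability share, per cluster."""
    lo, hi = cluster_shares(lower_group), cluster_shares(higher_group)
    return {c: lo.get(c, 0.0) - hi.get(c, 0.0) for c in set(lo) | set(hi)}

# Toy cluster assignments for two groups of four responses each.
gaps = share_gaps([2, 2, 2, 13], [2, 20, 20, 20])
```

A positive gap marks a misconception that is relatively more common in the lower-ability group.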

### Distractor Development and Item Generation

Potentially the most important future application of this method resides in its application to automatic item generation (AIG; Irvine and Kyllonen, 2002; Gierl and Haladyna, 2013). AIG is a relatively new but rapidly evolving research area where cognitive and psychometric modeling practices guide the production of tests that include items generated with the aid of computer technology. Gierl and Lai (2013, 2016) developed a three-step process for AIG. In step 1, content specialists create a cognitive model for AIG.

Currently, distractor development poses a unique and consequential problem in AIG in the step 2 item modeling stage. For the selected-response format, items must not only include a stem with a corresponding correct option, but also include a set of distractors. Distractors in AIG are typically designed from a list of plausible but incorrect alternatives linked to misconceptions identified by content specialists. Because AIG produces hundreds of items, strategies are needed to create a correspondingly large number of plausible but erroneous distractors. Distractor development for AIG is now guided by the distractor pool method with random selection (Gierl and Lai, 2016). To identify the content for the distractors, content specialists identify a list of plausible but incorrect options that are appropriate for all possible items generated with a given item model. Then, distractors are randomly selected from this pool of plausible but erroneous content and added to each generated item. This method is based on the assumption that a pool of plausible distractors can be created. A sample of these plausible distractors is selected at random to complete the item generation process. The strength of this method is its simplicity. This method can yield large numbers of distractors. The weakness of this method resides in the strong assumption that all pooled distractors are equally plausible and appropriate for all generated items. Equal plausibility and appropriateness is a strong and, in many cases, restrictive assumption. Also, there is little reasoning to guide how distractors are paired with the correct option because pairing is achieved with random assignment.
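The distractor pool method with random selection can be sketched in a few lines. This is an illustration of the idea as described above, not the implementation of Gierl and Lai; the function name, the seed parameter, and the toy pool are hypothetical, and the sketch assumes the correct option never appears in the pool.

```python
import random

def generate_items(stems_with_keys, distractor_pool, n_distractors=3, seed=0):
    """Attach randomly sampled pool distractors to every generated stem."""
    rng = random.Random(seed)            # seeded so that forms are reproducible
    items = []
    for stem, key in stems_with_keys:
        distractors = rng.sample(distractor_pool, n_distractors)
        options = distractors + [key]
        rng.shuffle(options)             # avoid a fixed position for the key
        items.append({"stem": stem, "key": key, "options": options})
    return items

# Toy pool and stems (hypothetical content).
pool = ["flagellum movement", "mRNA transport", "cell division",
        "protein synthesis", "photosynthesis"]
stems = [("Which process moves water across the membrane?", "osmosis"),
         ("Which process requires energy?", "active transport")]
generated = generate_items(stems, pool)
```

Because sampling ignores the stem, nothing prevents an implausible pairing, which is the weakness of the method noted above.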

To improve the plausibility and appropriateness of the distractors, rules and rationales that yield errors or misconceptions can be used to create distractors. Distractor rationales are short descriptions that specify the reasoning that underlies each option. These rationales are currently provided by content specialists. However, the rules can also be created using the method presented in our study to produce distractors that conform to specific, empirically based student misconceptions. Hence, distractors can be created systematically so that each distractor matches a rationale. This proposed approach could be called the systematic generation with rationales method. It would be based on the assumption that algorithms, rules, and procedures can first be articulated by content specialists and then used to create plausible but incorrect alternatives linked to students' actual misconceptions or errors in thinking, reasoning, and problem-solving. The strength of this method is that the distractors are much more specific and, hence, plausible and appropriate, especially when compared to the distractor pool method with random assignment. Hence, integrating the outcomes from the topic modeling methods presented in this paper with new developments in AIG should be considered an important area of future research.
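A minimal sketch of what systematic generation with rationales might look like is given below. The `Distractor` structure, the `concept` field used to match distractors to items, and the toy bank are assumptions made for illustration; the passage proposes the approach but does not specify a data structure or matching rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Distractor:
    text: str
    rationale: str   # short description of the underlying reasoning error
    concept: str     # sub-concept the misconception belongs to (assumed field)

def distractors_for(item_concept, bank, n=3):
    """Select distractors whose rationale targets the item's sub-concept."""
    matched = [d for d in bank if d.concept == item_concept]
    return matched[:n]

# Toy bank (hypothetical content).
bank = [
    Distractor("It never uses proteins.", "confuses carriers with pumps", "active transport"),
    Distractor("It moves with the gradient.", "reverses gradient direction", "active transport"),
    Distractor("It always requires ATP.", "confuses passive with active", "passive transport"),
]
chosen = distractors_for("active transport", bank)
```

Unlike random pooling, every selected option carries a stated rationale, so the pairing with an item can be inspected and justified.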

### Limitation and Future Research

Even though the study was carefully designed and structured to minimize potential errors in the results and their interpretation, we found three key limitations that should be addressed and carefully considered in future research. First, the main purpose of our study was to introduce a novel method of identifying students' misconceptions in a systematic manner to encourage efficient distractor generation for multiple-choice item development. Thus, our study could not investigate item behavior with the generated distractors in a real test setting. Investigating item behavior in relation to distractor quality would help us further understand the importance of item development with well-performing distractors. For example, DiBattista and Kurzawa (2011) demonstrated how the plausibility of distractors significantly affects item characteristics (e.g., item discrimination) in classroom assessment. Therefore, we encourage future researchers to evaluate the plausibility and effectiveness of the generated distractors to explore the significance of our proposed method thoroughly. Second, our current method required labeled responses to identify students' responses with incorrect answers. Scoring students' responses manually can be a very expensive and tedious procedure, especially in a large-scale assessment. However, as the current method attempts to extract students' misconceptions from their incorrect responses, it is necessary to score the responses or use a pre-labeled data set to properly implement the proposed method. This could somewhat limit the usability of the proposed method, as locating domain-specific and pre-labeled data can be a daunting challenge. However, we believe such limitations can be readily overcome by using automated essay scoring systems (see **Appendix C**) to generate labeled responses in advance to implement the current method. Last, the augmented intelligence approach of our method aims to create a systematic method of distractor development that supports content experts in making informed decisions using misconception clusters. Therefore, it is important to investigate whether content specialists do, indeed, feel supported in making informed decisions when creating distractors. We encourage future research to carefully evaluate the affective factors of content experts in using this method to fully evaluate the capacity of the current method.

#### AUTHOR CONTRIBUTIONS

JS, QG, and MG contributed to the conceptualization and formalization of the research ideas of the study. JS located and organized the data. JS and QG performed the analysis. JS and MG wrote the first draft of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.00825/full#supplementary-material

#### REFERENCES

Popham, W. J. (2008). Transformative Assessment. Alexandria, VA: ASCD.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Shin, Guo and Gierl. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## APPENDIX A

**Prompt—cell membrane item**


List and describe three processes used by cells to control the movement of substances across the cell membrane.

#### **Rubric for cell membrane**

Key elements:


Rubric:

| Score | Criterion |
|---|---|
| 3 points | Three key elements |
| 2 points | Two key elements |
| 1 point | One key element |
| 0 points | Other |

## APPENDIX B

Representative key words of topic clusters



# The Expanded Evidence-Centered Design (e-ECD) for Learning and Assessment Systems: A Framework for Incorporating Learning Goals and Processes Within Assessment Design

*Meirav Arieli-Attali1,2\*, Sue Ward3, Jay Thomas3, Benjamin Deonovic1 and Alina A. von Davier1*

#### *Edited by:*

*Frank Goldhammer, German Institute for International Educational Research (LG), Germany*

#### *Reviewed by:*

*Russell G. Almond, Florida State University, United States Gabriels Nagy, Christian-Albrechts-Universität zu Kiel, Germany*

> *\*Correspondence: Meirav Arieli-Attali meirav.attali@act.org*

#### *Specialty section:*

*This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology*

*Received: 10 December 2018 Accepted: 01 April 2019 Published: 26 April 2019*

#### *Citation:*

*Arieli-Attali M, Ward S, Thomas J, Deonovic B and von Davier AA (2019) The Expanded Evidence-Centered Design (e-ECD) for Learning and Assessment Systems: A Framework for Incorporating Learning Goals and Processes Within Assessment Design. Front. Psychol. 10:853. doi: 10.3389/fpsyg.2019.00853*

*1 ACTNext, ACT Inc., Iowa City, IA, United States, 2 Fordham University, New York City, NY, United States, 3 ACT Inc., Iowa City, IA, United States*

Evidence-centered design (ECD) is a framework for the design and development of assessments that ensures consideration and collection of validity evidence from the onset of the test design. Blending learning and assessment requires integrating aspects of learning at the same level of rigor as aspects of testing. In this paper, we describe an expansion to the ECD framework (termed e-ECD) such that it includes the specifications of the relevant aspects of learning at each of the three core models in the ECD, as well as making room for specifying the relationship between learning and assessment within the system. The framework proposed here does not assume a specific learning theory or particular learning goals, rather it allows for their inclusion within an assessment framework, such that they can be articulated by researchers or assessment developers that wish to focus on learning.

Keywords: task design, technology-based assessment, blended assessment and learning, development framework, Evidence model

## INTRODUCTION

There is a growing need for the development of assessments that are connected and relevant to learning and teaching, and several attempts have been made in recent years to focus on this topic in conferences and journals. For example, Mark Wilson's June and September 2016 presidential messages in the National Council for Measurement in Education's newsletter addressed Classroom Assessment, and this topic was also the conference theme for the following 2 years, 2017 and 2018. The journal *Assessment in Education: Principles, Policy & Practice* recently devoted a special issue to the link between assessment and learning (volume 24, issue 3, 2017). The issue focused on the developments in the two disciplines which, despite mutual influences, have taken distinctly separate paths over time. In recent years, systems that blend learning and assessment have been proposed all over the world (e.g., Razzaq et al., 2005; Shute et al., 2008; Feng et al., 2009b; Attali and Arieli-Attali, 2014; Straatemeier, 2014). While there are established standards and frameworks within the educational measurement field for the development of reliable and valid assessments, those rarely take learning aspects into account. As part of our own effort to develop a blended learning and assessment system, we identified a need for a formal framework of development that includes aspects of learning at the same level of detail and rigor as aspects of testing. This paper describes our general approach to expanding an assessment framework, with some examples from our system to better illustrate the abstract concepts.

Our approach to expanding a principled assessment design is primarily concerned with the inclusion of three dimensions: *aspects of learning*, such as the ability to incorporate the change over time in the skills to be measured at the conceptual level; *aspects of interactive and digital instructional content*, such as simulations, games, practice items, feedback, scaffolds, videos, and their associated affordances for the data collection in rich logfiles; and *measurement models for learning* that synthesize the complexities of the digital instruction and data features.

The expanded framework proposed here allows for the design of systems for learning that are principled, valid, and focused on the learner. Systems designed in this framework are intrinsically connected with the assessment of the skills over the time of instruction, as well as at the end, as summative tests, if so desired. This type of system has an embedded efficacy structure, so that additional tests can be incorporated within. Learning and assessment developers, as well as researchers, can benefit from such a framework, as it requires articulating both the intended assessment and learning goals at the start of the development process, and it then guides the process to ensure validity of the end product. The framework proposed here does not assume a specific learning theory or particular learning goals; rather, it allows for their inclusion within the assessment framework. The measurement perspective, combined with the learning sciences perspective in the development of content, provides a new and significant shift in the modern development of learning and assessment systems.

We chose to expand the well-known evidence-centered design framework (ECD; Mislevy et al., 1999, 2003, 2006). The ECD formulates the process of test development to ensure consideration and collection of validity evidence from the onset of the test design. The ECD is built on the premise that a test is a measurement instrument with which specific claims about the test scores are associated, and that a good test is a good match of the test items and the test takers' skills. The ECD framework defines several interconnected models, three of which form the core of the framework and are relevant to our discussion: the Student model(s), Evidence model(s), and Task model(s) (the combination of the three models is also called the Conceptual Assessment Framework; CAF; see **Figure 1**). Note that in more recent publications of the ECD, the Student model is termed a Proficiency model (e.g., Almond et al., 2015).

The Student or the Proficiency model(s) specifies the knowledge, skills, and ability (KSA; which are *latent* competencies) that are the target of the test. This model can be as simple as defining one skill (e.g., the ability θ) or a map of interconnected subskills (e.g., fraction addition, subtraction, multiplication, and division are interconnected subskills that form the map of knowing fractions). The latent competencies that are articulated and defined in this model establish the conceptual basis of the system, and they are often based on a theory or previous findings related to the goal of the assessment.

Since we cannot tap directly into the latent competencies, we need to design tasks/test items such that they will elicit behaviors that reflect or indicate the latent competencies. This is the role of the Task model(s). The Task model specifies the *task features* that are supposed to elicit the observables, and only them, so as to allow inferences about the latent competencies. For example, if the assessment is intended to measure "knowledge of operating with fractions," the tasks should be designed with care such that reading ability is not an obstacle to performing well on the task and expressing one's fractions knowledge.

The Evidence models then make the connection between the latent competencies [specified by the Student/Proficiency model(s)] and the observables [behaviors elicited by the Task model(s)]. In other words, the Evidence models are the connecting link. The Evidence models include the measurement model, comprised of the rubrics, the scoring method, and the statistical method for obtaining a total score(s). See **Figure 1** for a diagram of the ECD and specifically the three CAF models (note that latent competencies are symbolized as circles, while observables as squares; and the connection between the circles and squares are shown in the Evidence models).
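The relationship among the three core models can be made concrete with a small data-structure sketch. All class and field names are hypothetical, and the proportion-correct summary stands in for whatever statistical model an actual Evidence model would use; the ECD itself prescribes none of these details.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class StudentModel:
    """Latent competencies (KSAs) targeted by the test."""
    skills: List[str]

@dataclass
class TaskModel:
    """Task features intended to elicit observable behavior."""
    task_id: str
    target_skill: str
    features: Dict[str, str] = field(default_factory=dict)

@dataclass
class EvidenceModel:
    """Links observables to latent competencies via a scoring rule
    and a deliberately simple statistical summary."""
    scoring_rule: Callable[[str, str], int]   # (response, key) -> score

    def estimate(self, student, tasks, responses, keys):
        scores = {s: [] for s in student.skills}
        for task in tasks:
            score = self.scoring_rule(responses[task.task_id], keys[task.task_id])
            scores[task.target_skill].append(score)
        # Proportion correct per skill; None when no evidence was collected.
        return {s: (sum(v) / len(v) if v else None) for s, v in scores.items()}

student = StudentModel(skills=["fraction addition", "fraction division"])
tasks = [TaskModel("t1", "fraction addition"), TaskModel("t2", "fraction addition")]
evidence = EvidenceModel(scoring_rule=lambda response, key: int(response == key))
estimates = evidence.estimate(student, tasks,
                              responses={"t1": "3/4", "t2": "1/2"},
                              keys={"t1": "3/4", "t2": "2/3"})
```

The `None` for fraction division shows why task coverage matters: a competency with no eliciting task yields no evidence.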

Two important additional models are the Assembly model and the Presentation model (see **Figure 1**). The Assembly model defines how the three models in the CAF (the Student/Proficiency, Task, and Evidence models) work together and specifically determines the conditions for reliability and validity of the system. As part of the Assembly model, the developers determine the number of items/tasks and their mix ("constraints") such that they provide the necessary evidence and are balanced to properly reflect the breadth and diversity of the domain being assessed. The Presentation models are concerned with different ways to present the assessment, whether it is a paper-and-pencil test, a computer-based test, a hands-on activity, etc. We will elaborate on and delve deeper into each of the models as part of the expansion description below; for more details on the original ECD, see Mislevy et al. (2003, 2006).
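Assembly constraints on the number and mix of tasks can be expressed as a simple check on a candidate test form. The function name, the form representation as (task id, skill) pairs, and the toy constraints are hypothetical.

```python
from collections import Counter

def meets_constraints(form, min_per_skill, total_items):
    """True if the form has the required length and enough tasks per skill."""
    counts = Counter(skill for _, skill in form)
    enough = all(counts.get(s, 0) >= n for s, n in min_per_skill.items())
    return len(form) == total_items and enough

# Toy form: three tasks over two fraction subskills.
form = [("t1", "addition"), ("t2", "addition"), ("t3", "division")]
balanced = meets_constraints(form, {"addition": 1, "division": 1}, total_items=3)
unbalanced = meets_constraints(form, {"addition": 1, "division": 2}, total_items=3)
```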

There are other alternative frameworks for the design and development of assessments that follow a principled approach, such as the Cognitive Design System (Embretson, 1998), the Assessment Engineering framework (Luecht, 2013), the Principled Design for Efficacy framework (Nichols et al., 2015), or the Principled Assessment Design framework (Nichols et al., 2016). These frameworks may be perceived as alternatives to the ECD, and any of them could be a candidate for an expansion similar to the one we demonstrate for the ECD in this paper. The reason several assessment frameworks were developed over the years stems from the need to ensure the validity of assessment tools. Although traditional assessments were developed for about half a century without a principled approach (i.e., by following an assessment manual and specifications) and validity was verified after development, the advantage of following a principled framework such as the ECD or others is particularly evident when the goal is to assess *complex competencies* (e.g., problem solving, reasoning, collaborative work) and/or when using *complex performance tasks* (e.g., multidimensional tasks such as performance assessment, simulations or games on computer or otherwise). In these cases, it is important to explicitly identify the relevant competencies and behaviors and how they are connected, because the complexity of the focal competencies and/or the rich data that the tasks provide might pose difficulties in making inferences from behaviors to competencies. ECD has also been successfully applied to address the challenges of simulation- and game-based assessment (Rupp et al., 2010a; Mislevy, 2013; Kim et al., 2016).

### MOTIVATION FOR A PRINCIPLED APPROACH TO THE DESIGN AND DEVELOPMENT OF A LEARNING AND ASSESSMENT SYSTEM

Learning and assessment, although both relate to the process of determining whether or not a student has a particular knowledge, skill, or ability (KSA), differ substantially in the way they treat KSAs. The main difference between an assessment tool and a learning tool is in the *assumption* about the focal KSA, whether it is fixed or dynamic at the time of interacting with the tool. The Student/Proficiency model in the ECD describes a map of competencies (KSAs), and as in most psychometric models for testing, the assumption is of a latent trait, which is "fixed" at the time of taking the test. The purpose of an assessment is thus to "detect" or "diagnose" that fixed latent KSA at a certain point in time, similar to any measurement tool (e.g., a scale measuring a person's weight at a particular point in time). On the other hand, the main purpose of a learning tool, such as a computer tutoring system, is to "move" the learner from one state of knowledge to another – that is, the concern is first and foremost with the *change* in KSAs over time, or the *transition*. Of course, an assessment tool *per se* cannot drive the desired change unless deliberate efforts are implemented in the design of the system (similar to a scale which will not help with weight loss unless other actions are taken). Thus, systems that aim at blending assessment and learning cannot implement ECD as is, since ECD is inherently a framework to develop assessments and not learning.
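The contrast between a fixed and a changing KSA can be illustrated with a small update rule in the style of Bayesian knowledge tracing, a model that this passage does not itself name and that is used here only as an assumed example. The parameter names and default values are hypothetical. With `learn=0.0` the function only updates the belief about a fixed state (the assessment view); with `learn > 0` it also models a transition toward mastery (the learning view).

```python
def update_mastery(p_known, correct, slip=0.1, guess=0.2, learn=0.0):
    """Update the probability that a skill is known after one response."""
    if correct:
        evidence = p_known * (1 - slip)
        posterior = evidence / (evidence + (1 - p_known) * guess)
    else:
        evidence = p_known * slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - guess))
    # Transition step: an unknown skill may become known with probability `learn`.
    return posterior + (1 - posterior) * learn

fixed_view = update_mastery(0.5, correct=True)                # no change in KSA assumed
learning_view = update_mastery(0.5, correct=True, learn=0.3)  # KSA may change
```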

Moreover, the availability of rich data collected *via* technology-enhanced learning and assessment systems (e.g., trial and error as part of the learning process, hint usage) poses challenges, as well as promises, for assessment design and the decision process of which actions to allow and what to record, either to promote efficient learning or to enable the reliable assessment of the learning in order to make valid inferences about KSAs. Computational Psychometrics (von Davier, 2017), an emerging discipline, blends theory-based methods and data-driven algorithms (e.g., data mining and machine learning) for measuring latent KSAs. Computational Psychometrics is a framework for analyzing large and often unstructured data, collected during the learning or performance process, on a theoretical learning and psychometric basis. We also combine aspects of Computational Psychometrics in our expanded design framework, similar to previous accounts that integrated data mining into ECD (e.g., Mislevy et al., 2012; Ventura and Shute, 2013). Combining data-driven algorithms into ECD allows knowledge discovery and models' update from data, thereby informing the theory-based Student/Proficiency model and enriching the Evidence model.

Attempts to develop innovative assessments within games or as part of complex skills assessment and learning also brought about variations or expansions to ECD (e.g., Feng et al., 2009a; Conrad et al., 2014; Grover et al., 2017). One characteristic of ECD variants focuses on the task and its connection to the Evidence model. Since game-play and the rich data from complex assessments often result in sequences of actions, not all of which are relevant to the target competencies, researchers may follow an ECD approach and expand it with respect to the action data, specifying which actions are relevant and should be included in the Evidence model and in what way (i.e., expansion on the scoring rules or both scoring and Task model). Such an attempt was made by Grover et al. (2017). Grover and her colleagues expanded on the scoring rules by employing data-driven techniques (e.g., clustering, pattern recognition) in addition to theory-based hypotheses, to guide the definition of the scoring rules. Another interesting variation is the experiment-centered design by Conrad et al. (2014), which illustrated an expansion on the scoring and the Task model. This approach uses an ECD-like process to simultaneously encode actions of players in one way for game design and another way for assessment design. Because the game design dictates feedback on actions, and subsequent game options may depend on the student's actions, the game designer needs to encode the actions differently than a researcher or an assessment designer, who is primarily interested in estimating whether a student possesses the focal skill. In this procedure, the model is first postulated around the task (experiment), and then applied separately as two models (versions), one for the game designer, and one for the researcher, each focused on a different encoding of student actions. 
However, there is only one Evidence model for inferring KSAs, derived from the researcher's version of the task encoding (the assessment variant scoring rule). In this way, the adaptation of the ECD allowed adding the assessment as a "layer" on top of the game design (stealth assessment), while ensuring coordination between these two layers.

Work by Feng et al. (2009a) is particularly relevant in this context. The authors examine an adaptation of the ECD for learning data (ECDL), applied retroactively to the ASSISTments data (Heffernan and Heffernan, 2014). The ECDL is an ECD with an augmented *pedagogical model*, which has links to all three models of the CAF (Proficiency, Evidence, and Task). The pedagogical model refers to the learning and learners' characteristics, including learning effectiveness and efficiency (e.g., reducing cognitive load, increasing difficulty gradually during presentation, adapting the presentation of content, and decomposing multistep problems to sub-steps), as well as learner engagement factors. Since ASSISTments was initially developed without ECD in mind, the analysis retroactively checks which claims can support a validity argument that an item with its hints and scaffolds serves the learning goal. This is done by identifying (within each item) the KSAs required to answer it correctly, tagging each as "focal" or "unfocal." The focal KSAs are the ones which the hints/scaffolds should address. The relation between the focal and unfocal also serves as an indication of the system's efficacy [a system with a high proportion of unfocal KSAs is less efficient than a system with a low proportion, because this reflects the proportion of KSAs not taught (scaffolded)]. In sum, Feng and his colleagues demonstrated how an existing learning product can be analyzed (and potentially improved) using an ECDL framework.
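The focal/unfocal tagging can be made concrete with a small computation. The following is a minimal sketch, not the procedure used by Feng et al.: the item records, the KSA names, and the function `unfocal_proportion` are hypothetical, and the proportion is taken over all KSA tags pooled across items, which is only one of several reasonable definitions.

```python
from typing import Dict, List

# Hypothetical tagging: each item lists the KSAs needed to answer it,
# marked "focal" (addressed by hints/scaffolds) or "unfocal" (not addressed).
Item = Dict[str, str]  # KSA name -> "focal" | "unfocal"


def unfocal_proportion(items: List[Item]) -> float:
    """Share of all tagged KSAs, pooled across items, that are unfocal."""
    tags = [tag for item in items for tag in item.values()]
    if not tags:
        raise ValueError("no KSA tags supplied")
    return sum(tag == "unfocal" for tag in tags) / len(tags)


# Illustrative data only; these items and tags are invented.
items = [
    {"graph reading": "focal", "reading comprehension": "unfocal"},
    {"slope": "focal", "arithmetic": "focal"},
]
proportion = unfocal_proportion(items)
```

Under this definition, a lower value would indicate that more of the required KSAs are scaffolded.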

Common to the various adaptations of ECD is that they were task driven. First came the tasks; then came the ECD analysis, which resulted in adapting the ECD to address the complexity and intuition that were built into the tasks, expressed as an expansion on one of the three models in the CAF. While in the first two examples of Conrad et al. (2014) and Grover et al. (2017), the revised ECD focused on how to encode the task data to feed into the Evidence model, Feng et al.'s (2009a) study goes further, suggesting a pedagogical model that is feeding and being fed by all three CAF models – Proficiency, Evidence, and Task. However, this pedagogical model seems somewhat like a "black box" that retroactively includes the intuitions that specified the product design (e.g., how hints and scaffolds were determined). Additionally, it neither specifies the nature of the connections with the original ECD models nor does it inform how to design a learning product from scratch (i.e., a principled approach to development).

We offer a comprehensive expansion of the ECD framework, such that learning aspects are specified for each of the three models in the CAF and are determined *a priori* to the system design. We describe the expanded full CAF first, followed by a focus on each expanded model with examples. We then discuss the Assembly model, which allows for the specification of the relationship between assessment and learning. We conclude with ramifications of the expanded framework for the development of *adaptive* systems. We include examples to better illustrate the general ideas, along with directions for alternative decisions, to emphasize the generalizability of the expanded framework.

### THE EXPANDED ECD MODEL

In our expanded ECD framework (e-ECD), we find it necessary to expand all three models: Student/Proficiency, Evidence, and Task. We do so by adding a *learning layer*, in parallel to the assessment layer. This learning layer can be viewed as a breakdown of a pedagogical model (Feng et al., 2009a) into three components: the conceptual (student/proficiency), behavioral (task), and statistical (evidence) components. Thus, each original ECD model now has an additional paired learning model, culminating in six models. We call each assessment-learning pair an expanded model (e-model), i.e., the e-Proficiency model, the e-Task model, and the e-Evidence model (see **Figure 2**). Note that we refer to the original Proficiency model as the KSA model (Knowledge, Skills, and Ability), which is now part of the e-Proficiency model.

Within each e-model, we denote an "observational" layer for the assessment aspect (these are the original ECD models with slight title change; the KSA model, Task model, and Observational-Evidence model) and a "transitional" layer for the learning aspect (these are the new models that address learning). The three new learning models include the following: (1) at the conceptual latent level and part of the e-Proficiency model – the transitional layer specifies *learning processes* as the latent competency that the system targets. We denote it as the KSA-change model; (2) at the behavioral level and part of the e-Task model – the transitional layer specifies principles and features of *learning support* that guide the design of tasks (customized feedback, scaffolds, hints, solved examples, solution, or guidance to digital instructional content such as animation, simulation, games, and videos). We denote it as the Task-support model; and (3) at the statistical level and part of the e-Evidence model – the transitional layer specifies the links between the learner's support usage and the target learning processes, to allow inferring from behaviors to latent learning (e.g., the efficiency of the support used in achieving learning). We denote it as the Transitional-Evidence model. The data could be large process data and may reveal behavior patterns that were not identified by the human expert in the original e-Proficiency model. In this framework, the e-Proficiency model and the e-Evidence model are supposed to "learn" in real time (be updated) with the new knowledge inferred from the data.

We also include an expansion of the Assembly model, denoted the e-Assembly model. In addition to determining the number and mix of tasks, the e-Assembly model specifies the relationship between the assessment component and the learning component of the system and determines how they all work together. In other words, the e-Assembly model determines the "structure" of the system, e.g., when and how learning materials appear and when and how assessment materials appear, and the rules for switching between the two.
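To illustrate what a switching rule between assessment and learning materials might look like, here is a minimal sketch. The class `AssemblyRule`, its cap on supports per item, and the three step labels are all hypothetical; the framework itself does not prescribe any particular rule.

```python
from dataclasses import dataclass


@dataclass
class AssemblyRule:
    """Hypothetical switching rule between assessment and learning content."""

    max_supports_per_item: int = 2  # assumed cap, not taken from the source

    def next_step(self, last_correct: bool, supports_used: int) -> str:
        # A correct response keeps the student in the assessment flow.
        if last_correct:
            return "next_assessment_item"
        # An incorrect response triggers learning support until the cap is hit.
        if supports_used < self.max_supports_per_item:
            return "offer_learning_support"
        # After the cap, fall back to instructional content for the KSA.
        return "show_instructional_content"


rule = AssemblyRule()
step = rule.next_step(last_correct=False, supports_used=0)
```

A real system would likely condition the rule on more than correctness, for example on the specific KSA node or on the student's history.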

Consider the following situation: a student is using a system for learning and assessment to learn and practice scientific reasoning skills. At some point, the student gets an item wrong. In a regular assessment system, another item will follow (often without any feedback about the correctness of the response) – and if the system is an adaptive testing system, the student will receive an easier item, but not necessarily with the same content as the item with the incorrect response. In a blended learning and assessment system, the approach is different. *Detecting a "weakness" in knowledge is a trigger to foster learning*. How should the system aim at facilitating learning? There are several different options, from providing customized feedback and hints on how to answer that specific item, presenting scaffolds for the steps required or eliciting prior knowledge that is needed to answer that item, addressing specific misconceptions that are known to be prevalent for that specific node of KSA, up to re-teaching the topic and showing worked examples, and/or presenting similar items to practice the skill. In many learning products today, this process of defining the learning options is conducted using content experts according to implicit or explicit learning goals. Using a principled approach to development will dictate that the definition of the options for learning should be *explicitly* articulated at the level of the Task-support model, and these features are to be in line with the explicit conceptual learning/pedagogical model that describes how to make that shift in knowledge, i.e., the KSA-change model. The links between the supports and the conceptual KSA-change are defined in the Transitional-Evidence model *via* statistical models, which provide the validity learning argument for the system.

In the development of an assessment system that blends learning, we wish to help students learn, and to validate the claim that learning occurred, or that the system indeed helped with the learning *as intended*. The KSA-change specifies the type of changes (learning/transitions) the system is targeting, and based on that, the tasks and the task supports are defined. In other words, the first step is to define the "learning shifts" or how to "move" in the KSA model from one level/node to the next. The next step is to define the observables that need to be elicited and the connections between the learning shifts and the observables. We elaborate on each of the expanded models below.

Our expanded framework shows how to incorporate a learning theory or learning principles into the ECD and can be applied using different learning approaches. We illustrate this process by using examples from Knowledge-Learning-Instruction (Koedinger et al., 2012) among others, but this process can be applied using other learning approaches (and we provide some directions).

### Expanded Proficiency Model

In the ECD framework, the Student/Proficiency model defines the Knowledge, Skills, and Ability (KSA) that the assessment is targeting. Although in early publications of the ECD, it is called a Student model, in recent contexts, it is called a "Proficiency model" (e.g., Feng et al., 2009a; Almond et al., 2015), or referred to as a "Competency model" (e.g., Arieli-Attali and Cayton-Hodges, 2014; Kim et al., 2016), and it can also be perceived as a "Construct map" (Wilson, 2009). A similar notion in the field of Intelligent Tutoring Systems is a "Domain model" (Quintana et al., 2000), a "Knowledge model" (Koedinger et al., 2012; Pelánek, 2017), or a "Cognitive model" (Anderson et al., 1995). In the Intelligent Tutoring Systems literature, the term "Student model" is reserved for a specific map of skills as estimated for a *particular student* – which is an overlay on the domain model (aka the expert model). Within ECD, the Student/Proficiency model includes both the desired skills (that an expert would possess) and the updated level of skills for each particular student following responses on assessment items. To avoid confusion, within our expanded ECD, we refer to it by the general name of a KSA model.

The KSAs are assumed to be latent, and the goal of the assessment is to make inferences about them from examinees' responses to test items. When the assessment tool is also intended to facilitate learning (i.e., the system provides supports when the student does not know the correct answer), the assumption is that the student's level of KSA is *changing* (presumably becoming higher as a result of learning). In the e-ECD, we define a "KSA-change model" that together with the original KSA model creates the expanded-Proficiency model (e-Proficiency model). The KSA-change model specifies the latent *learning processes* that need to occur in order to achieve specific nodes in the KSA model. Each node in the KSA model should have a corresponding *learning-model* in the KSA-change model, which may include prerequisite knowledge and misconceptions, and/or a progression of skills leading up to that KSA node, with the pedagogical knowledge of how to make the required knowledge-shift. Some examples of learning models are learning progressions (National Research Council (NRC), 2007; e.g., Arieli-Attali et al., 2012), a Dynamic Learning Map (Kingston et al., 2017), or learning models based on the body of work on Pedagogical Content Knowledge (Posner et al., 1982; Koehler and Mishra, 2009; Furtak et al., 2012). The importance of Pedagogical Content Knowledge is in considering the interactions of *content information*, *pedagogy*, and *learning theory*. Another approach from the learning sciences and artificial intelligence is the Knowledge-Learning-Instruction framework (KLI; Koedinger et al., 2012), which provides a taxonomy to connect knowledge components, learning processes, and teaching options. We will illustrate our KSA-change model specification using the KLI framework, but we will define the e-Proficiency model in a general way such that any other learning theory can be applied instead.

Specifying and explicitly articulating the latent learning processes and progressions that are the target of the learning is a crucial step, since this is what will guide the specification of both the e-Task model and the e-Evidence model. In the following sections, we elaborate and illustrate the KSA and KSA-change models that constitute the e-Proficiency Model.

#### The Assessment Layer of the e-Proficiency Model – The KSA Model

A KSA model includes *variables* that are the features or attributes of competence that the assessment is targeting. The number of variables and their grain size are determined by the potential use of the assessment, and it can range from 1 (e.g., the θ in college admission tests such as the GRE, SAT, and ACT) to several subskills arranged in a map or a net (e.g., a net example, see Mislevy et al., 1999; a math competency map, see Arieli-Attali and Cayton-Hodges, 2014; two versions of a game-based physics competency model, see Kim et al., 2016). These variables can be derived by conducting a *cognitive task analysis* of the skill by experts, analyzing the content domain, or relying on a theory of knowledge and research findings. The variables and their interconnections create a map in which each variable is a *node* connected by a *link* with other nodes (variables). Following analysis of data from student responses (and using the statistical models), values on these variables define the level of mastery or the probability that a particular student possesses those particular sub-skills (nodes), i.e., a value will be attached to each node.
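The node-and-link structure described above, with a value attached to each node, can be sketched as a small data structure. This is an illustrative encoding only: the class `KSAModel`, its method names, the placeholder node names, and the mastery value are assumptions, and a real system would estimate the values with the statistical models rather than set them by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KSAModel:
    """Nodes are KSA variables; links connect related nodes."""

    nodes: List[str]
    links: List[Tuple[str, str]]  # (from_node, to_node)
    # Per-student value attached to each node, here a mastery probability.
    mastery: Dict[str, float] = field(default_factory=dict)

    def set_mastery(self, node: str, probability: float) -> None:
        if node not in self.nodes:
            raise KeyError(f"unknown node: {node}")
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be in [0, 1]")
        self.mastery[node] = probability

    def neighbors(self, node: str) -> List[str]:
        """Nodes reachable from `node` by one outgoing link."""
        return [dst for src, dst in self.links if src == node]


# Placeholder node names and an invented value, for illustration only.
model = KSAModel(nodes=["skill_a", "skill_b"], links=[("skill_a", "skill_b")])
model.set_mastery("skill_a", 0.8)
```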

As part of our development of a learning and assessment system, called the Holistic Educational Resources & Assessment (HERA) system for scientific thinking skills, we developed a KSA model for the *data interpretation* skill. **Figure 3** depicts part of the model. Specifically, we distinguish three main skills of data interpretation depending on the data representation (*Table Reading*, *Graph Reading*, and the skill of interpreting data from *both tables and graphs*), and each skill is then divided into several subskills. For example, in the *Table Reading* skill, we distinguish between *locating data points*, *manipulating data*, *identifying trend*, and *interpolation and extrapolation*. Note that these same subskills (albeit in a different order) appear also under the *Graph Reading* skill, but they entail different cognitive abilities. The skill of *Tables and Graphs* includes *comparing*, *combining*, and *translating* information from two or more different representations.

Although KSA models often specify the links between nodes, and may even order the skills in a semi-progression (from basic to more sophisticated skills) as in the example of the HERA model in **Figure 3**, a knowledge model often does not specify *how to move* from one node to the next, nor does it explicitly define learning processes. To that end we add the learning layer in the e-Proficiency model – the KSA-change model.

#### The Learning Layer in the e-Proficiency Model – The KSA-Change Model

Defining a learning layer within the e-Proficiency model makes room for explicit articulation of the learning processes targeted by the learning and assessment system. The idea is for these specifications to be the result of purposeful planning, rather than a coincidental outcome of system creation. In the Intelligent Tutoring literature, developers consider what they call the "Learner model" (Pelánek, 2017) or the "Educational model" (Quintana et al., 2000) or more generally, processes for knowledge acquisition (Koedinger et al., 2012). This model can also be viewed as the "pedagogical model" and can apply principles of Pedagogical Content Knowledge (Koehler and Mishra, 2009; Furtak et al., 2012). We call this model the "KSA-change Model" for generalizability and to keep the connection with the original KSA model, with the emphasis on the *change* in KSA. Using the title "change" makes room also for negative change (aka "forgetting"), which, albeit not desirable, is possible.

A KSA-change model is the place to incorporate the specific learning theory or learning principles (or goals) that are at the basis of the systems. Similar to the way a KSA map is created, the KSA-change map should specify the learning aspects of the particular skills. Here we provide a general outline for how to specify a KSA-change model, but in each system this process may take a different shape.

A KSA-change model may include variables of two types:

1. Sequences of knowledge components, features, or attributes
2. Learning processes within each sequence

These two types of variables define the learning *sequences* and *processes* that are needed to facilitate learning. The KSA-change variables are derived directly from the KSA model, such that each node/skill in the KSA model has a reference in the KSA-change model in the form of how to "move" students to learn that skill.

Given a specific skill (node in the map), this may be done in two steps. (1) The first step is to define the (linear) *sequence* of pre-requisites or precursors needed to learn that target skill (node). For example, Kingston and his colleagues (Kingston et al., 2017) developed Dynamic Learning Maps in which each of the target competencies is preceded by three levels of precursor pieces of knowledge (initial precursor, distal precursor, and proximal precursor) and succeeded by a successor piece of knowledge, together creating what they called "Linkage levels." When defining the sequence of precursors, attention should be given to the grain size, as well as to specific features or attributes of these precursors. In KLI terminology (Koedinger et al., 2012), this would mean characterizing the *Knowledge Components* of the subskills. Knowledge Components include fact, association, category, concept, rule, principle, plan, schema, model, and production; each can further be described as verbal or non-verbal, declarative or procedural, or as integrative knowledge. (2) The second step is to characterize the learning sequence by which kind of learning *process* is required to achieve the learning. For example, applying the KLI taxonomy (Koedinger et al., 2012), we can assign to each precursor (knowledge component) a specific learning process that is presumed to make the desired knowledge shift. The KLI framework characterizes three kinds of learning processes: *memory and fluency building*, *induction and refinement*, and *understanding and sense-making*. Specifying which kind of process is needed in the particular learning sequence is necessary for subsequent decisions about the supports to be provided. For example, if the focal learning process is *fluency building*, this implies that the learning system should provide practice opportunities for that KSA. 
In contrast, if the focal learning process for a different KSA is *understanding and sense making*, then the learning system should provide explanations and examples. **Figure 4** illustrates a general e-Proficiency model with an artificial example of adding-on the learning processes to a knowledge sequence built off of three prerequisites and a successor piece.
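The link from a focal learning process to the kind of support it calls for can be written down as a simple lookup. The sketch below is an illustrative encoding, not part of the KLI framework itself: the enum `LearningProcess`, the dictionary `SUPPORTS`, and the function `supports_for` are hypothetical names, and the pairing for induction (worked examples and comparisons) follows the one the text gives for the HERA example.

```python
from enum import Enum
from typing import Dict, List


class LearningProcess(Enum):
    """The three KLI learning-process categories named in the text."""

    MEMORY_AND_FLUENCY = "memory and fluency building"
    INDUCTION_AND_REFINEMENT = "induction and refinement"
    UNDERSTANDING_AND_SENSE_MAKING = "understanding and sense-making"


# Process -> candidate supports. The pairings come from the text;
# the dictionary itself is only an illustrative encoding.
SUPPORTS: Dict[LearningProcess, List[str]] = {
    LearningProcess.MEMORY_AND_FLUENCY: ["practice opportunities"],
    LearningProcess.INDUCTION_AND_REFINEMENT: ["worked examples", "comparisons"],
    LearningProcess.UNDERSTANDING_AND_SENSE_MAKING: ["explanations", "examples"],
}


def supports_for(process: LearningProcess) -> List[str]:
    """Return a copy of the candidate supports for a learning process."""
    return list(SUPPORTS[process])


fluency_supports = supports_for(LearningProcess.MEMORY_AND_FLUENCY)
```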

Applying the above approach to the HERA learning and assessment system, let us focus on the subskill of *interpolation and extrapolation from data in a graph* (the last red circle in the progression of *Graph Reading* skill in **Figure 3**). Based on our guidelines above, the first step would be to determine a sequence of subskills/precursors and to characterize them, and then as a second step to specify the cognitive process(es) that would make the transition from one subskill to the next. **Figure 5** presents one section of the KSA-change of the HERA system for the subskill of *interpolation and extrapolation in a graph*. The model specifies the proximal, distal, and initial precursors as follows: the proximal precursor = *identifying the rate of change in the dependent variable (y-variable) as the independent variable (x-variable) changes*; distal precursor = *being able to locate the y-value for a certain x-value point on a graph, and find adjacent points and compare the relative values*; initial precursor = *understanding that the two variables in a graph are co-related.* Now applying the KLI knowledge components characterization, the proximal precursor (identifying rate of change) may be characterized as "rule"; the distal precursor (locate points and compare) as "schema"; and the initial precursor (two variables are co-related) as a "concept."

Next, we determine the cognitive processes that foster the transition from one subskill to the next. For example, given an understanding of the co-variation of *x* and *y* (the initial subskill) students need to practice finding the y-points for different x-points to create the mental schema and build *fluency* with locating points and particularly two adjacent points. However, to "jump" to the next step of identifying the trend and the rate of change requires *induction and refinement* to derive the rule. The last transition from identifying rate of change to performing interpolation and extrapolation requires *sense making and deduction* – deducing from the rule to the new situation. Given the specific learning processes, we can later define which learning supports would be most appropriate (e.g., practice for fluency building, worked example and comparisons for induction, and explanation for sense making and deduction). The model in **Figure 5** shows the different learning processes as the transitions (arrows) required between the subskills in the sequence. This is the learning model for the specific skill in focus, and is usually derived based on expert analysis. The model in **Figure 5** also specifies particular misconceptions that students often exhibit at each level. Specifying misconceptions may also help determine which feedback and/or learning aid to provide to students. We show in the next section how to define Task and Task-support models based on this example.
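The HERA sequence for *interpolation and extrapolation in a graph* can be encoded as an ordered list of subskills with a learning process attached to each transition. The sketch below is a simplified illustration under stated assumptions: the shortened subskill labels, the names `SEQUENCE`, `TRANSITIONS`, and `next_transition`, and the rule of picking the first unmastered subskill are hypothetical, and misconceptions are left out.

```python
from typing import List, Optional, Set, Tuple

# (subskill, KLI knowledge-component label) in learning order, per the text.
# "target" is a placeholder label for the target node, not a KLI category.
SEQUENCE: List[Tuple[str, str]] = [
    ("two variables are co-related", "concept"),    # initial precursor
    ("locate points and compare", "schema"),        # distal precursor
    ("identify rate of change", "rule"),            # proximal precursor
    ("interpolation and extrapolation", "target"),  # target node
]

# TRANSITIONS[i] is the learning process for moving INTO SEQUENCE[i + 1].
TRANSITIONS: List[str] = [
    "fluency building",
    "induction and refinement",
    "sense making and deduction",
]


def next_transition(mastered: Set[str]) -> Optional[Tuple[str, str]]:
    """Return (next subskill, process needed), or None if all are mastered."""
    for i, (subskill, _component) in enumerate(SEQUENCE):
        if subskill not in mastered:
            # The first subskill has no incoming transition in this sketch.
            process = TRANSITIONS[i - 1] if i > 0 else "none specified"
            return subskill, process
    return None


step = next_transition({"two variables are co-related"})
```

Combined with a mapping from processes to supports, the returned process could then drive the choice of learning support for that student.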

There are several decisions that are taken as part of the model specifications. One of them is the grain-size of each precursor. An alternative KSA-change model can be determined with smaller or larger grain size subskills. Another decision is whether to adopt a three-level precursor skill structure, or alternatively focus on only one precursor and the different misconceptions students may have. Researchers and developers are encouraged to try different approaches.

We propose to derive the KSA-change variables by conducting a *learning process analysis* by experts, i.e., an analysis of the pedagogical practices in the content domain or relying on a theory of learning in that domain, similar to the way we illustrated above (by using the KLI taxonomy). This is also parallel to the way a KSA model is derived based on *cognitive task analysis* or domain analysis. The KSA-change model constitutes a *collection* of sequences (and their processes), each addressing one node in the KSA model (as illustrated in **Figures 4, 5**). This can also be viewed as a two-dimensional map, with the sequences as the second dimension for each node.

Similar to updating the KSA model for a student, here too, following analysis of data from student responses and student behaviors in using the learning supports, values on the KSA-change variables indicate level or probability that a particular student has gone through a particular learning process (or that a particular knowledge shift was due to the learning support used). We will discuss this in more detail in the e-Evidence model section.
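One way to picture such a probability update is a Bayesian Knowledge Tracing (BKT) step whose learning rate depends on whether a support was used. This is a swapped-in, well-known technique used purely for illustration; it is not the statistical model specified by the framework, and every parameter value and name below (`update_mastery`, `p_slip`, `p_guess`, `p_learn_support`, `p_learn_no_support`) is an assumption.

```python
def update_mastery(
    p_mastery: float,
    correct: bool,
    used_support: bool,
    p_slip: float = 0.1,              # assumed: P(wrong answer | mastered)
    p_guess: float = 0.2,             # assumed: P(right answer | not mastered)
    p_learn_support: float = 0.3,     # assumed transition prob. with support
    p_learn_no_support: float = 0.1,  # assumed transition prob. without support
) -> float:
    """One BKT-style step: Bayesian update on the response, then a transition."""
    if correct:
        numerator = p_mastery * (1 - p_slip)
        denominator = numerator + (1 - p_mastery) * p_guess
    else:
        numerator = p_mastery * p_slip
        denominator = numerator + (1 - p_mastery) * (1 - p_guess)
    posterior = numerator / denominator
    # The chance of moving to the mastered state depends on support usage.
    p_learn = p_learn_support if used_support else p_learn_no_support
    return posterior + (1 - posterior) * p_learn


after_support = update_mastery(0.5, correct=False, used_support=True)
after_no_support = update_mastery(0.5, correct=False, used_support=False)
```

In this sketch the difference between the two learning rates plays the role of the support's contribution to the knowledge shift; in practice such parameters would have to be estimated from data.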

### Expanded Task Model

In the original ECD framework, the Task model specifies the features of tasks that are presumed to elicit observables to allow inference on the target KSA. An important distinction introduced in ECD is between a task model design based on a Proficiency model and a task-centered design (Mislevy et al., 1999). While in task-centered design, the primary emphasis is on creating the task with the target of inference defined only implicitly, as the tendency to do well on those tasks, in defining a task model based on a Proficiency (and Evidence) model, we make the connections and possible inferences *explicit from the start*, making the design easier to communicate, easier to modify, and better suited to principled generation of tasks (Mislevy et al., 1999, p. 23). Moreover, basing a task model on Proficiency and Evidence models allows us to consider reliability and validity aspects of task features, and particularly the cognitively or empirically based relevance of the task features. In other words, considerations of item reliability and validity guide the development of items to elicit the target observables and *only them* (minimizing added "noise"). This means that at the development stage of a task, all features of the task should stand up to scrutiny regarding relevance to the latent KSA. As mentioned above, if reading ability is not relevant as part of the mathematics KSA, items or tasks that may impede students with lower reading skills should be avoided. Thus, defining a task model based on a Proficiency model resembles the relationship between the latent trait and its manifestation in observable behavior. The more the task relates to the target KSA, the better the inference from the observable to the latent KSA.

For assessment precision purposes *per se*, there is no need to provide feedback to students; on the contrary, feedback can be viewed as interference in the process of assessment, and likewise scaffolds and hints introduce noise or interference to a single-point-in-time measurement. However, when the assessment tool is also intended for learning, the goal is to support learners when a weakness is identified, in order to help them gain the "missing" KSA. In the e-ECD we define a "Task-support model" that together with the original Task model creates the expanded-Task model (e-Task model). The Task-support model specifies the learning supports that are necessary and should be provided to learners in order to achieve KSA change. Similar to basing the Task model on the KSA model, the Task-support model is based on the KSA-change model. The supports may include customized feedback, hints and scaffolds, practice options, worked examples, explanations, or guidance to further tailored instruction derived from the latent learning processes specified in the KSA-change model. In other words, the supports are determined according to the focal knowledge *change*. We elaborate on and illustrate the Task and Task-support models below.

#### The Assessment Layer Within the e-Task Model – The Task Model

The Task model provides a framework for describing *the situation* in which examinees are given the opportunity to exhibit their KSAs, and includes the specifications of the stimulus *materials*, *conditions* and *affordances*, as well as specifications for the *work product* (Mislevy et al., 1999, p. 19). The characteristics of the tasks are determined by the nature of the behaviors that provide evidence for the KSAs. Constructing a Task model from the latent KSA model involves considering the *cognitive aspect of task behavior*, including specifying the features of the situation, the internal representation of these features, and the connection between these representations and the problem-solving behavior the task targets. In this context, variables that affect task *difficulty* are essential to take into account. In addition, the Task model also includes features of task *management* and *presentation*.

Although the Task model is built off of the Proficiency model (or the KSA model in our notation), multiple Task models are possible in a given assessment, because each Task model may be employed to provide evidence in a different form, use different representational formats, or focus evidence on different aspects of proficiency. Similarly, the same Task model and work product can produce different evidence; i.e., different rules could be applied to the same work product, to allow inferences on different KSAs. Thus, it is necessary to define within each Task model the specific variables to be considered in the evidence rules (i.e., scoring rules; we elaborate on this in the next section).

Consider the abovementioned KSA from the HERA model: "*Perform an extrapolation using data from a graph*." As part of a scientific reasoning skills assessment, this skill is defined in a network of other skills related to understanding data representations, as seen in **Figure 5**. One possible Task model can be: "Given a graph with a defined range for the *x*-axis variable [*a,b*] and *y* values corresponding to all *x* values in the range, find the *y*-value for an *x*-value outside the range." That is, we present the learner with a graph (defined by its *x*- and *y*-axes) and a function or paired coordinates (*x, y*) for a limited domain. The question then asks learners to predict the *y*-value of an *x* point which is outside the domain presented in the graph. Because extrapolation assumes the continuation of the trend based on the relationship between variables, a required characteristic of the question is to include this assumption, explicitly or implicitly *via* the context (e.g., stating other variables do not change, or the same experimental procedure was used for a new value). Articulating the assumption is part of the Task model. Another option for an extrapolation Task model could be: "Given a graph with two levels of the dependent variable, both showing a linear relationship with the x-variable (i.e., same relationship trend) but with different slopes, find the y-value for a third level of the dependent variable." That is, we present the learner with a graph with two linear relationships (two line-graphs), one for level *a* and one for level *b* (for example, *a, b* are levels of weight of different carts, and the linear relationship is between speed and time). The question then asks learners to predict the *y*-value for level *c* (*c > a, b;* a heavier cart) for an *x*-point for which we know the *y*-values of level *a* and *b*; that is, extrapolation beyond the data presented. 
This Task model is more sophisticated than the first one, due to the complexity of the data representation, and thus is tapping into a higher level of the skill.
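To make the first Task model concrete, the sketch below computes the expected answer for one hypothetical item instance under the stated assumption that the trend continues. The function name, the data points, and the choice of a linear trend estimated from the last two points are illustrative assumptions, not part of the HERA specification.

```python
def linear_extrapolate(points, x_new):
    """Predict y at x_new by continuing the linear trend of the last two points.

    points: list of (x, y) pairs sorted by x, covering the presented range.
    Assumes the relationship stays linear beyond the presented range.
    """
    (x0, y0), (x1, y1) = points[-2], points[-1]
    slope = (y1 - y0) / (x1 - x0)
    return y1 + slope * (x_new - x1)

# Hypothetical item: the graph shows x in [1, 3]; the question asks about x = 5.
graph_points = [(1, 2.0), (2, 4.0), (3, 6.0)]
answer_key = linear_extrapolate(graph_points, 5)
```

Making the trend assumption explicit in code mirrors the requirement that the item itself articulate it, explicitly or through context.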

Another aspect is the operationalization of the Task model in a particular item. Given a Task model, the question can take the form of a direct non-contextualized (what we may also call a "naked") question (e.g., asking about a value of *y* given a specific *x*), or it can be contextualized (or "dressed") within the context and terminology of the graph (e.g., "suppose the researcher decided to examine the speed of a new cart that has greater weight, and suppose the trend of the results observed is maintained, what would you expect the new result to be?"). The "naked" and "dressed" versions of the question may differ in difficulty; however, this difference needs to be examined to determine whether it is construct-relevant or construct-irrelevant. If it is construct-relevant, then it should be included in the Task model as part of the specifications. Other factors may affect the difficulty as well – the type of graphic (bar-graph, line-graph, multiple lines, scatter plot) and the complexity of the relationships between variables (linear, quadratic, logarithmic, increasing, decreasing, one y-variable or more), the familiarity of the context of the task (whether this is a phenomenon in electricity, projectile motion, genetics, etc.), the complexity of the context (commonly understood, or fraught with misconceptions), the response options (multiple choice, or open-ended), the quality of the graph and its presentation (easy or hard to read, presented on a computer, smartphone, or paper, presented as a static graph or an interactive one where learners can plot points), etc. These factors and others need to be considered when specifying the Task model, and their relevance to the construct should be clearly articulated.

#### The Learning Layer Within the e-Task Model – The Task-Support Model

Tasks for assessment and tasks for learning differ in the availability of options that support learning. When we design tasks for learning, we need to consider the type of "help" or "teaching" that the task affords, with the same level of rigor that we put into the design of the task itself. The Task-support model thus specifies the learning supports that might be necessary and should be provided to students in order to achieve the desired KSA-change (i.e., increase in KSA). Similar to basing the task model on the KSA model, the Task-support model is based on the KSA-change model.

Making room for the specification of the task support *in connection* to the learning processes/goals (the focal KSA-change) is the innovative core of the proposed e-ECD and its significant contribution to the design of learning and assessment systems. Many learning systems include scaffolds or hints to accompany items and tasks, often determined by content experts or teacher experience and/or practices. These hints and scaffolds help answer the particular item they accompany, and may also provide "teaching," if transfer occurs to subsequent similar items. However, in the design process of the hints and scaffolds, often no explicit articulation is made regarding the intended effect of hints and scaffolds *beyond* the particular question, or in connection to the general learning goals. Often, the hints or scaffolds are task-specific; a breakdown of the task into smaller steps, thus decreasing the difficulty of the task. This is also reflected in the approach to assigning partial credit for an item that was answered correctly with hints, contributing less to the ability estimate (as evidence of lower ability; e.g., Wang et al., 2010). Specifying a Task-support model per each Task model dictates a *standardization* of the scaffolds and hints (and other supports) provided for a given task. How do we specify task supports connected to the focal KSA-change?

If, for example, we define (as part of the KSA-change model) a learning model similar to the one depicted in **Figure 5**, we may provide as a task support a "pointer" to the precursors, in the form of a hint or a scaffold. Thus, the scaffolds are not a breakdown of the question into sub-steps, but rather each scaffold points to one of the precursor pieces of knowledge (initial, distal, or proximal precursor). In addition, since we defined the kind of knowledge change between each precursor, we can provide the corresponding support for each desired change. If the knowledge change is related to memory and fluency-building, we may provide more practice examples instead of the scaffold. Similarly, if the knowledge change is related to understanding and sense-making, we may provide an explanation or reasoning, or ask the student to provide the explanation or reasoning (self-explanation was found to be beneficial in some cases; Koedinger et al., 2012). It may very well be the case that explicating a Task-support model following an e-ECD yields scaffolds similar to those produced without it; however, in following this procedure, the design decisions are explicit and easy to communicate, justify, modify, replicate, and apply in a principled development of scaffolds.

Similarly, other features of task support, such as feedback, visuals, and links to a video or wiki page, can be supported by the articulation of the KSA-change and the connection between the two.

Let us illustrate specifying a Task-support model for the example item from HERA described in the previous section. Recall that the item targeted the latent KSA "*Perform an extrapolation using data from a graph*," and the task materials included a graph with a specified function, asking students to extrapolate a point beyond the given range (i.e., predict the value of *y* for a new *x-value*). Also, recall **Figure 5** that depicts the KSA-change model for this particular subskill. Given the proximal, distal, and initial precursors, we can now specify each scaffold to address each of these three precursor skills. Alternatively, we can decide to address only the closest precursor (the proximal) as a scaffold, and if that does not help with answering the question correctly, then refer the student to "learn" the more basic material (e.g., in a different section of the system, or by presenting items/content that target the initial and distal precursor skills). These decisions depend on the system design (e-Assembly model) and may vary from system to system.

As part of our development of the HERA system for scientific thinking skills, we developed an item model that can be used to collect evidence for both assessment and learning, termed an Assessment and Learning Personalized Interactive item (AL-PI). This item looks like a regular assessment item, and only after an incorrect response are the learners given "learning options" to choose from. We offer three types of learning supports: (1) Rephrase – rewording of the question; (2) Break-it-down – providing the first step out of the multiple steps required to answer the question; and (3) Teach-me – providing a text and/or video explanation of the background of the question. **Figure 6** presents a screenshot of an AL-PI item from a task about height-restitution of a dropped ball, targeting the skill of extrapolation.

Using the terminology above, the Rephrase-option provides the learner with another attempt at the question, with the potential of removing the construct irrelevance that may stem from the item-phrasing (for learners who did not understand what the question is asking them, due to difficulty with the wording). In this example, a Rephrase of the question is: "The question asks you to find the "Height attained" (the *y*-value) for a new *x*-value that does not appear on the graph" (see **Figure 6** upper panel). Note that the Rephrase is practically "undressing" (decontextualizing) the question, pointing out the "naked" form, or making the connection between the context and the decontextualized skill.

The second learning support is Break-it-down, which takes the form of providing the first step to answer the question. In the example in **Figure 6** the Break-it-down states: "The first step to answer this question is to evaluate the rate of change in *y* as a function of a change in the *x*-variable" with additional marks and arrows on the graph to draw the learner's attention to where to look. The Break-it-down option may look like a hint, signaling to learners where to focus, and in our terminology, it refers to the proximal precursor (recall: proximal precursor = *identifying the rate of the change in the dependent variable as the independent variable changes*).

The third type of support that we offer in an AL-PI item is Teach-me. The Teach-me option in this case includes the following components: (1) a general statement about the skill; i.e., *a graph presents data for a limited number of values, yet we can estimate or predict about new values based on the trend in the data presented*; (2) an explanation of how to identify the trend in a graph, i.e., locating adjacent points; and (3) an illustration of how once the trend was identified, we can perform extrapolation.

In our system we provide an illustration on a different value than the one in the question, in order to avoid revealing the correct answer and to leave room for the learner to put mental effort into applying the method taught. In the Task-support model terminology and in relation to the KSA-change model, the Teach-me option addresses all three precursors.

Specifying the task support based on the learning goal and the desired change in KSA gives direction but does not limit the options. On the contrary, it enriches the space of the decision and opens-up new options. In addition, constructing task support by following the e-ECD framework gives rise to the hypothesis that this way of structuring scaffolds may enhance transfer, because the scaffolds do not address the particular question, but rather address the latent skill and its precursor skills. Empirical evidence of transfer is of course needed to examine this hypothesis.

### Expanded Evidence Model

The links made between the e-Proficiency model and the e-Task model need explication of the statistical models that allow inferences from the work products on the tasks to the latent KSAs. In the ECD framework, the Evidence model specifies the links between the task's observables (e.g., student work product) and the latent KSAs targeted by that task (termed here as Observational-Evidence model). The Observational-Evidence model includes the *evidence rules* (scoring rubrics) and the *statistical models*. The Evidence model is the heart of the ECD, because it provides the "credible argument for how students' behaviors constitute evidence about targeted aspects of proficiency" (Mislevy et al., 1999, p. 2).

In a system designed for learning, data other than the work product is produced, i.e., the data produced out of the task support (e.g., hints and scaffolds usage), which may be called *process data*. The task support materials are created to foster learning; thus, learning systems should have a *credible argument* that these supports indeed promote learning. Partial evidence for that can be achieved by inferences about knowledge or what students know and can do from their work product in the system, *following* and as *a result of* the use of the supports, and this can be obtained by the statistical models within the Evidence model. However, the efficacy of the task supports themselves (i.e., which support helps the most in which case), and drawing inferences from scaffolds and hint usage about "learning behavior" or "learning processes" (as defined in the KSA-change model) may need new kinds of models and evidence. The Transitional-Evidence model within the e-Evidence model addresses the data produced from the task support.

### The Assessment Layer Within the Evidence Model – The Observational-Evidence Model

In the original ECD, the Observational-Evidence model addresses the question of how to operationalize the conceptual target competencies defined by the Proficiency model, which are essentially latent, in order to be able to validly infer from overt behaviors about those latent competencies. The Observational-Evidence model includes two parts. The first contains the scoring rules, which are ways to extract a "score" or an *observable variable* from student actions. In some cases, the scoring rule is simple, as in a multiple-choice item, in which a score of 1 or 0 is obtained corresponding to a correct or incorrect response. In other cases, the scoring rule might be more complex, as in performance assessment where student responses produce what we call "process data" (i.e., a log file of recorded actions on the task). A scoring rule for process data can take the form of grouping a sequence of actions into a "cluster" that may indicate a desired strategy, or a level on a learning progression that the test is targeting. In such an example, a scoring rule can be defined such that a score of 1 or 0 is assigned corresponding to the respective strategy employed, or the learning progression level achieved. Of course, scoring rules are not confined to dichotomous scores; they can also define continuous scores between 0 and 1 (particularly when the scoring rule relies on response time) or ordered categories of 1-to-*m*, for *m* categories (polytomous scores).
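The three kinds of scoring rules described above can be sketched in a few lines. This is a minimal illustration only: the function names, the action labels in the log, and the mapping from completed steps to categories are hypothetical and would be replaced by the evidence rules of an actual Task model.

```python
from typing import Sequence

def score_multiple_choice(response: str, key: str) -> int:
    """Dichotomous rule: 1 for a correct response, 0 otherwise."""
    return int(response == key)

def score_strategy(actions: Sequence[str], strategy: Sequence[str]) -> int:
    """Process-data rule: 1 if the logged actions contain the target strategy
    as an ordered (not necessarily contiguous) subsequence, 0 otherwise."""
    remaining = iter(actions)
    # `step in remaining` consumes the iterator up to the first match,
    # so the steps must occur in the given order.
    return int(all(step in remaining for step in strategy))

def score_polytomous(steps_completed: int, m: int) -> int:
    """Polytomous rule: map the number of completed steps onto ordered
    categories 1..m (hypothetical mapping, capped at m)."""
    return max(1, min(m, steps_completed + 1))

# Hypothetical log file of actions on an extrapolation task.
log = ["open_graph", "read_axis", "mark_point", "compute_slope", "answer"]
target = ["read_axis", "compute_slope"]
strategy_score = score_strategy(log, target)
```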

The second part of the Observational-Evidence model contains the statistical model. The statistical model expresses how the scores (as defined by the scoring rules) depend, probabilistically, on the latent competencies (the KSAs). This dependency is probabilistic, that is, the statistical model defines the probability of certain "scores" (observables) given specific latent competencies (combination of values on the KSAs). In other words, at the point in time at which the student is working within the system, that student is in a "latent state" of knowledge, and given that latent state, there is a certain probability for the observable variables, which if observed, are evidence for the latent ability. However, all we have are the student observable variables, and what we need is a psychometric model that allows us to do the *reverse inference* from the given observables to the latent competencies.

There are various statistical models that can be used here. Since we are talking about an assessment and learning system, let us consider a multi-dimensional latent competency, i.e., multiple skills are targeted by the system both for assessment and learning. If we assume the latent competencies to be continuous, we can use multi-dimensional Item Response Theory models (e.g., MIRT; Reckase, 2009) or Bayes-net models (Pearl, 1988, 2014; Martin and VanLehn, 1995; Chang et al., 2006; Almond et al., 2015). In the case where the latent competencies are treated as categorical with several increasing categories of proficiency in each (e.g., low-, medium-, and high-level proficiency, or mastery/non-mastery levels), we can use diagnostic classification models (DCM; Rupp et al., 2010b). What these models enable is to "describe" (or model) the relationship between the latent traits and the observables in a probabilistic way, such that the probability of a certain observable, given a certain latent trait, is defined, which in turn allows us to make the *reverse* inference – to estimate the probability of a certain level of a latent trait given the observable.
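The reverse inference can be illustrated with the simplest possible case: one skill with two latent states (mastery and non-mastery) and one dichotomous observable. The sketch below applies Bayes' rule; the function name and all numerical values are assumptions chosen for illustration, not estimates from any data set.

```python
def posterior_mastery(prior: float,
                      p_correct_master: float,
                      p_correct_nonmaster: float,
                      correct: bool) -> float:
    """Return P(mastery | observed response) via Bayes' rule.

    prior: P(mastery) before seeing the response.
    p_correct_master / p_correct_nonmaster: P(correct | latent state),
    i.e., the forward (latent -> observable) part of the statistical model.
    """
    like_m = p_correct_master if correct else 1.0 - p_correct_master
    like_n = p_correct_nonmaster if correct else 1.0 - p_correct_nonmaster
    evidence = prior * like_m + (1.0 - prior) * like_n
    return prior * like_m / evidence

# Hypothetical values: masters answer correctly 90% of the time,
# non-masters 20% of the time (e.g., by guessing).
after_correct = posterior_mastery(0.5, 0.9, 0.2, correct=True)
after_incorrect = posterior_mastery(0.5, 0.9, 0.2, correct=False)
```

MIRT, Bayes-net, and DCM models generalize this same logic to several skills and many items.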

In order to make the link between the items/tasks (the stimuli to collect observables) and the latent KSAs, we can use what is called a Q-matrix (Tatsuoka, 1983). A Q-matrix is a matrix of <items × skills> (items in the rows; skills in the columns), defining for each item which skills it is targeting. The Q-matrix plays a role in the particular psychometric model, to determine the probability of answering an item correctly given the combination of skills (and whether all skills are needed, or some skill can compensate for others; non-compensatory or compensatory model, respectively). The Q-matrix is usually determined by content experts, but it can also be learned from the data (e.g., Liu et al., 2012).
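A small sketch may help show how a Q-matrix enters the response probability. The non-compensatory function below follows the general form of the DINA model (slip and guess parameters); the compensatory function is a deliberately simple linear stand-in rather than a specific published model. The skill names, Q-matrix entries, and parameter values are hypothetical, and every item is assumed to require at least one skill.

```python
SKILLS = ["read_axes", "identify_trend", "extrapolate"]

# Rows are items, columns are skills; 1 means the item targets the skill.
Q = [
    [1, 0, 0],  # item 0
    [1, 1, 0],  # item 1
    [0, 1, 1],  # item 2
]

def p_correct_conjunctive(q_row, alpha, slip=0.1, guess=0.2):
    """Non-compensatory (DINA-style): all targeted skills are needed."""
    has_all = all(a >= q for q, a in zip(q_row, alpha))
    return 1.0 - slip if has_all else guess

def p_correct_compensatory(q_row, alpha, slip=0.1, guess=0.2):
    """Compensatory (illustrative): the probability rises linearly with
    the share of targeted skills that the learner has mastered."""
    required = sum(q_row)
    mastered = sum(q * a for q, a in zip(q_row, alpha))
    return guess + (1.0 - slip - guess) * mastered / required

# A learner who has mastered only the first skill.
alpha = [1, 0, 0]
p_conj = p_correct_conjunctive(Q[1], alpha)
p_comp = p_correct_compensatory(Q[1], alpha)
```

For item 1, the conjunctive rule leaves this learner at the guessing level, whereas the compensatory rule gives partial benefit for the one mastered skill.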

Recent developments in the field of psychometrics have expanded the modeling approach to also include models that are data driven but informed by theory, an approach referred to as Computational Psychometrics (von Davier, 2017). Computational Psychometrics is a framework that includes complex models such as MIRT, Bayes-net and DCM, which allow us to make inferences about latent competencies; however, these models may not define *a priori* the scoring rules, but rather allow for a combination of the expert-based scoring rules with those that are learned from the data. In particular, supervised algorithms – methodologies used in machine learning (ML) – can be useful for identifying patterns in the complex logfile data. These algorithms classify the patterns by skills using a training data set that contains the correct or theory-based classification. The word supervised here means that the "correct responses" were defined by subject-matter experts and that the classification algorithm learns from these correctly classified data to extrapolate to new data points.
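As a toy illustration of the supervised idea, the sketch below trains a nearest-centroid classifier on expert-labeled feature vectors and then classifies a new log. The nearest-centroid method is used here only as a simple stand-in for any supervised algorithm; the features (counts of two kinds of logged actions) and the strategy labels are hypothetical.

```python
from collections import defaultdict

def train_centroids(examples):
    """examples: list of (feature_vector, label) pairs labeled by experts.
    Returns the mean feature vector (centroid) for each label."""
    sums, counts = {}, defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def classify(features, centroids):
    """Assign the label whose centroid is closest in squared distance."""
    def sq_dist(center):
        return sum((f - c) ** 2 for f, c in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical features: (number of slope computations, number of random guesses).
training = [
    ([5, 1], "systematic"),
    ([6, 0], "systematic"),
    ([1, 4], "trial_and_error"),
    ([0, 6], "trial_and_error"),
]
centroids = train_centroids(training)
predicted = classify([4, 1], centroids)
```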

In a learning and assessment system, the Observational-Evidence model may also take into account the scaffolds and hints usage to infer about the KSA model. Since the scaffolds and hints reduce the difficulty of the items/tasks, they also change the evidentiary value of the observables. This can be handled either by using only responses without hint usage to model KSA, or by applying a partial credit scoring rule for items that were answered correctly with hints, thus assigning them less credit as a reflection of their evidentiary value (e.g., Wang et al., 2010; Bolsinova et al., 2019a,b).
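One simple way to express such a partial credit rule is sketched below. The linear penalty per hint and its default value are assumptions made for illustration; they are not the rules used in the cited studies.

```python
def partial_credit(correct: bool, hints_used: int, penalty: float = 0.25) -> float:
    """Score in [0, 1]: full credit for an unaided correct response,
    reduced by a fixed (hypothetical) penalty for each hint used."""
    if not correct:
        return 0.0
    return max(0.0, 1.0 - penalty * hints_used)

unaided = partial_credit(True, 0)
with_two_hints = partial_credit(True, 2)
```

The alternative mentioned above, using only responses without hint usage, corresponds to filtering out every response with `hints_used > 0` before fitting the KSA model.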

To summarize, any and all statistical models that allow us to define the connection between overt observables and latent competencies can be used in the Observational-Evidence model.

#### The Learning Layer Within the Evidence Model – The Transitional-Evidence Model

Similar to the way the Observational-Evidence model connects the Task model back to the KSA model, the Transitional-Evidence model uses the task supports data to infer about learning, and to link back to the KSA-change model. Recall that the KSA-change model includes pedagogical principles which are reflected in the task supports. Similar to the assessment layer of the Evidence model, the Transitional-Evidence model also includes two parts: the scoring rules and the statistical models.

The scoring rules define the *observable variables* of the Transitional-Evidence model. If task supports are available by choice, student choice behavior can be modeled to make inferences about their learning strategies. The data from the task supports usage (hints, scaffolds, videos, simulations, animations, etc.) as well as number of attempts or response time, should first be coded (according to a scoring or evidence rule) to define which of them should count and in what way. As before, scoring rules can be defined by human experts or can be learned from the data.

The statistical models in the Transitional-Evidence model need to be selected, such that they allow us to infer about *change* based on observables over time. A popular stochastic model for characterizing a changing system is a Markov model (cf. Norris, 1998). In a Markov model, transition to the next state depends only on the current state. Because the focus here is on latent competencies, the appropriate model is then a hidden Markov model (HMM; e.g., Visser et al., 2002; Visser, 2011), and specifically an input-output HMM (Bengio and Frasconi, 1995). An HMM would allow us to infer about the efficacy of the learning supports in making a change in the *latent* state (proficiency level). In addition, the input-output HMM will allow us to make the association between learning materials (as input) and the change in KSA (latent) based on the observables (output), to estimate the contribution (efficacy) of each particular support to the desired change in proficiency (i.e., learning). **Figure 7** illustrates this model for a single latent skill (KSA at time t1 and t2), a single observation (O at time t1 and t2) and a single learning support (l at time t1 and t2). The observation dependency on the skill (i.e., O given KSA; the arrow/link from KSA to O) is modeled by the Observational-Evidence model (the model from the original ECD), while the skill dependency on the learning support (i.e., KSA given l; the arrow/link from l to KSA) is modeled by the Transitional-Evidence model.

Working with the above example, let us assume a student does not know how to identify a data trend from a graph, and thus cannot extrapolate a new data point (incorrectly answers a question that requires extrapolation). Suppose a task support is provided, such that it draws the student's attention to the pattern and trend in the data. We now want to estimate the contribution of this support in helping the student learn (and compare this contribution to other task supports). We have the following observables: the student's incorrect answer in the first attempt, the student's use of the particular task support, and the student's revised answer in the second attempt (whether correct or not). Using an input-output HMM will allow us to estimate the probability of transitioning from the incorrect to the correct latent state (or in other cases from low proficiency to high proficiency), given the use of the task support. Of course, the model will be applied across questions and students in order to infer about the latent state.
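The scenario above can be sketched as forward filtering in a two-state input-output HMM, where the support used acts as the input that selects the transition matrix. All parameter values and support names below are hypothetical and assumed known; in practice they would be estimated from data across questions and students, which this sketch does not do.

```python
# Latent states: index 0 = skill not yet learned, index 1 = skill learned.
# TRANSITION[support][i][j] = P(state j after the support | state i before).
TRANSITION = {
    "rephrase": [[0.80, 0.20], [0.0, 1.0]],
    "teach_me": [[0.50, 0.50], [0.0, 1.0]],
}
P_CORRECT = [0.2, 0.9]  # P(correct response | latent state)

def observe(belief, correct):
    """Update the belief over latent states after seeing a response."""
    like = [p if correct else 1.0 - p for p in P_CORRECT]
    joint = [b * l for b, l in zip(belief, like)]
    total = sum(joint)
    return [j / total for j in joint]

def transition(belief, support):
    """Propagate the belief through the support-specific transition matrix."""
    matrix = TRANSITION[support]
    return [sum(belief[i] * matrix[i][j] for i in range(2)) for j in range(2)]

def filter_attempts(prior, attempts):
    """attempts: list of (support used before the attempt or None, correct)."""
    belief = list(prior)
    for support, correct in attempts:
        if support is not None:
            belief = transition(belief, support)
        belief = observe(belief, correct)
    return belief

# Incorrect first attempt, then a support, then a correct second attempt.
prior = [0.7, 0.3]
after_teach = filter_attempts(prior, [(None, False), ("teach_me", True)])
after_rephrase = filter_attempts(prior, [(None, False), ("rephrase", True)])
```

With these assumed parameters, the same pair of responses leads to a higher belief in the learned state after Teach-me than after Rephrase, because the former is given a larger transition probability.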

The above example of a single latent skill can be extended to a map of interconnected skills using a dynamic Bayesian network (DBN; Murphy and Russell, 2002). A DBN generalizes the HMM by allowing the state space to be represented in a factored form instead of as a single discrete variable. A DBN extends Bayesian networks (BN) to deal with changing situations.

How do we link the learning materials (defined in the Task-support model) to the learning processes/goals (defined in the KSA-change model)? Similar to the Q-matrix in the Observational-Evidence model, here too we need a matrix that links the learning materials (task supports) with the associated skills-change. We can use an S-matrix (Chen et al., 2018), which is a matrix of <supports × skills> (supports in the rows; skills in the columns), defining for each support which skills/processes it can improve. In that sense, and similar to the Q-matrix, an S-matrix is a collection of "evidence" that explicates the connection between the supports and the desired learning shifts. For example, *providing a worked example* is a learning support that may be connected to several knowledge shifts (corresponding to subskills in the learning models), and providing opportunities for practice is another learning support that may be connected to different desired knowledge shifts (corresponding to different subskills). The S-matrix will specify these connections. The S-matrix will then play a role in the HMM, to determine the probability that a particular knowledge shift (learning process) occurred given the particular learning supports. Similar to the Q-matrix, the S-matrix should be determined by content experts, and/or learned or updated from the data.
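In code, an S-matrix can be represented in the same way as a Q-matrix, with supports instead of items in the rows. The sketch below uses hypothetical support names, skill names, and entries, and shows only the lookup of candidate supports for a skill; it does not show how the matrix enters the HMM.

```python
SKILLS = ["read_axes", "identify_trend", "extrapolate"]

# Rows are supports, columns follow SKILLS; 1 means the support is
# expected to improve that skill (hypothetical entries).
S = {
    "worked_example":    [0, 1, 1],
    "extra_practice":    [1, 0, 0],
    "video_explanation": [1, 1, 1],
}

def supports_for(skill):
    """Return the supports linked to the given skill in the S-matrix."""
    column = SKILLS.index(skill)
    return [name for name, row in S.items() if row[column] == 1]

candidates = supports_for("identify_trend")
```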

### THE e-ASSEMBLY MODEL

In the original ECD, the Assembly model determines how to put it all together and specifies the conditions needed for obtaining the desired reliability and validity for the assessment. In other words, it determines the *structure* of the test, the number and the mix of the desired items/tasks. The Assembly model is directly derived from the Proficiency model, such that it ensures, for example, the appropriate representation of all skills in the map. Going back to the HERA example and the KSA-model in **Figure 3**, if we were to build an assessment with those target skills, we would have to ensure that we sample items/tasks for each of the skills and subskills specified on the map, and the Assembly model will specify how much of each.

For the expanded ECD, we do not create a parallel model to the Assembly model as we did for the three core models, because in a blended learning and assessment system we do not assemble the assessment separately and the learning separately. Rather, in the process of developing a system, after we specified the six core models of the e-ECD, we assemble it all together in what we call the e-Assembly model.

The role of the e-Assembly model is to specify how to put it all together. It will include the specifications of the number and mix of items/tasks, but it will also include how and when to present the learning support materials. This can be seen as determining how to switch between the "assessment" mode of the system and the "learning" mode of the system.

The e-Assembly model provides an opportunity to take into account additional pedagogical principles that are relevant to the *combination* of items and tasks, such as the importance of reducing cognitive load for learning; focusing on one skill at a time; gradual increased difficulty presentation; adaptive presentation of content, among others. Conditions to ensure the validity of the system may also specify pedagogical principles such as learning *via* real-world authentic tasks or learning by doing, as well as learner engagement factors, as relevant. Pedagogical Content Knowledge principles that include knowledge of student *misconceptions* regarding specific phenomena, if articulated as part of the KSA and KSA-change model, should be also considered here in selecting and designing tasks, such that the misconceptions are either accounted for or avoided so the KSAs can be validly addressed.

The e-Assembly model is also the place to take into account considerations from other relevant approaches, such as the learner-centered design approach (LCD; Soloway et al., 1994; Quintana et al., 2000), which argues that student engagement and constructivist theories of learning should be at the core of a *computerized* learning system. Adopting such an approach will affect the combination and/or navigation through the system. For example, the system may guide students to be more *active* in trying out options and *making choices* regarding their navigation in the system.

An important aspect of systems for learning and assessment is whether they are adaptive to student performance and in what way. This aspect within the e-Assembly model ties directly to the e-Evidence model. The statistical models in the Evidence model are also good candidates for determining the adaptive algorithm in adaptive assessments. For example, if a 2PL IRT model is used to estimate ability, this model can also be used to select the items in a Computer Adaptive Test (CAT), as is often done in large-scale standardized tests that are adaptive (e.g., the old version of the GRE). Similarly, if a Bayes-net is used to estimate the map of KSAs, then the selection of items or tasks can be done based on the Bayes-net estimates of skills. Similarly, we can use the DCM to identify weakness in a particular skill and thus determine the next item that targets that particular weakness. This is true for any other model, also including data-driven models, because the purpose of the models is to provide a valid way to estimate KSAs, and once this is done, adaptivity within the system can be determined accordingly.
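As an illustration of how the same model can drive adaptivity, the sketch below selects the next item by maximum Fisher information under a 2PL model, which is one common CAT selection rule and not necessarily the one used by any particular testing program. The item bank, its parameter values (discrimination `a`, difficulty `b`), and the function names are hypothetical.

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL response probability for ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_correct_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta_hat, bank, administered):
    """Pick the unused item that is most informative at the current estimate."""
    candidates = [item for item in bank if item not in administered]
    return max(candidates,
               key=lambda item: item_information(theta_hat, *bank[item]))

# Hypothetical bank: item -> (a, b).
BANK = {"item1": (1.0, -1.0), "item2": (1.5, 0.0), "item3": (0.8, 1.2)}

first_pick = select_next_item(0.0, BANK, administered=set())
second_pick = select_next_item(0.0, BANK, administered={"item2"})
```

After each response, the ability estimate would be updated and the selection repeated until a stopping rule is met.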

The learning aspect of the system is motivated by the goal to maximize learners' gain and thus needs a more comprehensive adaptivity, or what is often called a "recommendation model." A recommendation model does not only determine the next item to be presented but also determines which instructional or training material to recommend or present to the learner. A good recommendation model makes full use of all available information about both the learner and the instructional materials to maximize the KSA gain for the learner. If we have a way to estimate (measure) the gain for the learner, we can feed this information to the recommendation engine to determine the adaptivity in the form of the next task support and/or training and instructional material needed. Thus, the additional layer of an evidence model for the learning materials (i.e., the statistical models for estimating the efficacy of the task supports) provides a good candidate model for the recommendation engine. Which materials were already used by the learner (which ones were chosen/preferred), which supports are found more effective for that particular learner, which skill is currently in focus, and which supports are most effective for that particular skill (e.g., practice, explained example, video lecture, simulation demonstration, providing instructional material for a prior/prerequisite skill, etc.) are some of the questions a recommendation engine needs to resolve, and its decisions rely on the statistical models that were used to evaluate and provide evidence for the efficacy of the task support and instructional materials.
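A very simple recommendation rule along these lines is sketched below: for each support not yet used, it sums over skills the probability that the skill is not yet mastered times the estimated probability that the support produces the transition to mastery, and recommends the support with the largest expected gain. The rule itself, the function name, and all numbers are assumptions for illustration; the efficacy values stand in for estimates that a Transitional-Evidence model would supply.

```python
def recommend(p_mastery, efficacy, already_used=frozenset()):
    """Return (support, expected_gain) with the largest expected gain.

    p_mastery: skill -> current P(mastered) for this learner.
    efficacy: support -> {skill: estimated P(not mastered -> mastered)}.
    already_used: supports to exclude from the recommendation.
    """
    best, best_gain = None, 0.0
    for support, per_skill in efficacy.items():
        if support in already_used:
            continue
        gain = sum((1.0 - p_mastery[skill]) * p
                   for skill, p in per_skill.items())
        if gain > best_gain:
            best, best_gain = support, gain
    return best, best_gain

# Hypothetical learner state and support efficacies.
p_mastery = {"identify_trend": 0.9, "extrapolate": 0.2}
efficacy = {
    "extra_practice": {"identify_trend": 0.6},
    "worked_example": {"extrapolate": 0.3},
    "teach_me": {"identify_trend": 0.2, "extrapolate": 0.25},
}
choice, gain = recommend(p_mastery, efficacy)
```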

### CONCLUSION AND FUTURE STEPS

In this paper, we propose a new way to fuse learning and assessment at the design stage. Specifically, we propose an expanded framework we developed to aid with the creation of a system for blended assessment and learning. We chose the ECD framework as a starting point because this is a comprehensive and rigorous framework for the development of assessments and underlies the development of tests for most testing organizations. Incorporating learning aspects, both learning goals and learning processes, in the ECD framework is challenging, because of fundamental differences in the assumptions and approaches of learning and assessment. Nevertheless, we showed that the unique structure of Proficiency, Task, and Evidence models lends itself to creating parallel models for consideration of the corresponding aspect of learning within each model.

We are currently applying this framework in our work. In future work, we hope to show examples of the learning and assessment system that we build following the e-ECD framework. We are also working to incorporate other elements into the framework, primarily the consideration of motivation, meta-cognition, and other non-cognitive skills. Since learners' engagement is a crucial element in a learning system, we can think of a way to incorporate elements that enhance engagement as part of the assembly of the system, by using a reward system or gamification in the form of points, coins, badges, etc. Gamification and other engagement-enhancing elements do not currently have a designated model within the e-ECD. We are working to find a way to incorporate these elements into the framework.

### AUTHOR CONTRIBUTIONS

MA-A and AAvD contributed to the conception of the framework. MA-A contributed to the conception and specifications of the new models, and AAvD contributed to the CP component. SW and JT contributed to the e-Task model. BD contributed to the e-Evidence model. The authors would like to thank the reviewers for substantial contribution.

### FUNDING

This work has been done as part of a research initiative at ACTNext, by ACT, Inc. No external funds or grants supported this study.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Arieli-Attali, Ward, Thomas, Deonovic and von Davier. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Development of a Computerized Adaptive Testing for Internet Addiction

#### Yong Zhang, Daxun Wang\*, Xuliang Gao\*, Yan Cai\* and Dongbo Tu\*

School of Psychology, Jiangxi Normal University, Nanchang, China

Internet addiction disorder has become one of the most common forms of addiction in psychological and behavioral areas, and measuring it is increasingly important in practice. This study aimed to develop a computerized adaptive test to measure and assess internet addiction (CAT-IA) efficiently. Four standardized scales were used to build the original item bank. A total of 59 polytomously scored items were finally chosen after excluding 42 items that failed the psychometric evaluation. For the final 59-item bank of CAT-IA, two simulation studies were conducted to investigate the psychometric properties, efficiency, reliability, concurrent validity, and predictive validity of CAT-IA under different stopping rules. The results showed that (1) the final 59 items met IRT assumptions, had high discrimination, showed good item-model fit, and were without DIF; and (2) the CAT-IA had not only high measurement accuracy in psychometric properties but also sufficient efficiency, reliability, concurrent validity, and predictive validity. The impact and limitations of CAT-IA are discussed, and several suggestions for future research are provided.

#### Keywords: internet addiction, computer adaptive testing, item response theory, questionnaire, CAT-IA

#### Edited by:

Samuel Greiff, University of Luxembourg, Luxembourg

#### Reviewed by:

Ioannis Tsaousis, University of Crete, Greece; Roger Ho, National University of Singapore, Singapore

#### \*Correspondence:

Daxun Wang, 447951689@qq.com; Xuliang Gao, gaoxuliang8817@qq.com; Yan Cai, cy1979123@aliyun.com; Dongbo Tu, tudongbo@aliyun.com

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 30 October 2018; Accepted: 16 April 2019; Published: 07 May 2019

#### Citation:

Zhang Y, Wang D, Gao X, Cai Y and Tu D (2019) Development of a Computerized Adaptive Testing for Internet Addiction. Front. Psychol. 10:1010. doi: 10.3389/fpsyg.2019.01010

## INTRODUCTION

Internet addiction (IA) disorder is now recognized as one of the most common forms of addiction in psychological and behavioral areas. According to a report released by the International Telecommunication Union (2016), with the rapid development of advanced mobile networks, the number of internet users has climbed over the last 3 years to nearly four billion people, which is equivalent to 47% of the global population. Although the internet brings many benefits, excessive use can lead to IA. A recent meta-analysis reported that the global prevalence of IA is 30.1% among university students pursuing a professional degree (Zhang et al., 2018). In Asia, the prevalence of IA ranged from 6.2% in Japanese adolescents to 21% in Filipino adolescents (Mak et al., 2014b). IA is associated with sleep disturbance (Zhang et al., 2017), poor quality of life (Tran et al., 2017a), and other psychiatric illnesses (Ho et al., 2014). Therefore, the assessment and prevention of IA are particularly important in practice.

IA symptoms have been evaluated primarily by questionnaires developed on the basis of classical test theory. Commonly used questionnaires include the Internet Addiction Test (IAT; Young, 1998), Generalized Problematic Internet Use Scale (GPIUS; Caplan, 2002), Gaming Addiction Scale (GAS; Lemmens et al., 2009), and Revised Chen Internet Addiction Scale (CIAS-R; Mak et al., 2014a). The current questionnaires classify IA symptoms into loss of control or of time management (Tran et al., 2017b), craving, and social problems (Lai et al., 2013). Although these questionnaires are frequently used in practice, they have certain weaknesses. One of the most notable drawbacks is that participants must finish all of the questionnaire items, yet many items may be "off target" for a given test taker (Fliege et al., 2005). For participants with high ability levels, easy items contribute little to measuring their actual ability level and may therefore be redundant or unnecessary. Meanwhile, for participants with low ability levels, having to respond to difficult items makes it harder to measure their actual ability level. Therefore, a more effective method for evaluating IA is needed.

One way to deal with the above issues is computerized adaptive testing (CAT), a form of testing that uses item response theory (IRT) to establish an item bank, automatically selects items according to each participant's current θ estimate, and finally estimates the ability of each test taker (Almond and Mislevy, 1999). In CAT, the test taker continues to answer items until his/her estimated θ reaches a predefined level of precision, as indicated by its standard error. Compared with a linear test, CAT can not only present items, record answers, and score automatically by computer but also select the most appropriate items for each respondent according to his/her answers to previous items, and thus reach the most appropriate estimate of ability.

Many studies have shown that a CAT program has several advantages over paper-and-pencil questionnaires. Flens et al. (2016) revealed that, compared with paper-and-pencil questionnaires, CAT procedures reduce the number of items used by 26–44%. Linacre (2000) pointed out that CAT programs can improve validation, reduce individuals' burden, and achieve greater measurement precision. In addition, because items are selected on the basis of a respondent's current θ estimate, floor and ceiling effects can be reduced in CAT procedures (Revicki and Cella, 1997). Further, the development of CAT procedures improves clinical assessment. However, CAT also has a number of disadvantages: high costs of research and development, complex technical requirements, and the need for timely maintenance of the item bank to prevent items from leaking in advance (Tan et al., 2018). Nonetheless, the advantages of a CAT program clearly outweigh these drawbacks.

Initially, the development and applications of CAT programs mainly occurred in intelligence and ability testing (e.g., Tinsley, 1972; Ireland, 1977; Young, 1990). In recent years, many researchers have turned their attention to the field of mental health. For example, Flens et al. (2017) used an IRT model to assess the Dutch-Flemish version of a depression measure. Smits et al. (2011) established and evaluated CAT procedures for depression based on the Center for Epidemiologic Studies-Depression scale. Walter et al. (2007) developed a German version of an Anxiety CAT within the IRT framework. However, to the best of our knowledge, CAT has not yet been applied to IA, a common disorder.

This study aimed to develop a CAT to assess IA (CAT-IA) without loss of measurement precision. More specifically, this work addressed the following. First, a calibrated item bank with high psychometric quality was developed. Second, under different stopping rules, we evaluated the psychometric properties, efficiency, reliability, and validity of CAT-IA via two CAT simulation studies. Third, we sought to extend the applications of CAT in the field of mental health and to introduce IRT and CAT to readers who want to understand and apply adaptive testing.

### MATERIALS AND METHODS

### Participants

The total sample consisted of 1,368 participants. All of the participants were surveyed at different schools in China from June to September 2017. **Table 1** presents the characteristics of the participants. The sample included 687 females (50.2%) and 681 males (49.8%). Their average age was 18.72 years (SD = 2.19; range, 12 to 28 years). The participants came from two regions: rural (58.9%) and urban (41.1%).

This study was conducted at the Research Center of Mental Health, Jiangxi Normal University, following the recommendations of psychometrics studies on mental health. It was approved by the Research Center of Mental Health, Jiangxi Normal University and the Ethics Committee of the Department of Psychology at Jiangxi Normal University. Written informed consent was obtained from all of the participants in accordance with the Declaration of Helsinki. Parental consent was also obtained for all participants under the age of 16 years.

### Measures and the Initial Item Pool

The initial item pool of CAT-IA consisted of 101 items (see **Table 2**). These items were selected from four standardized scales: IAT (Young, 1998), GPIUS (Caplan, 2002), GAS (Lemmens et al., 2009), and Chinese Internet Addiction Test (CIAT; Huang et al., 2007). All of them used five-point Likert-type item scores (never, rarely, sometimes, often, always; scored 1, 2, 3, 4, and 5, respectively). A higher total score across all of the items represented more severe symptoms of IA. Based on previous studies, the 101 items from the four selected standardized scales could be classified into seven domains (Young, 1998; Caplan, 2002; Huang et al., 2007; Lemmens et al., 2009): salience, tolerance, mood modification, relapse, withdrawal, negative outcomes, and benefits (i.e., individuals are more likely to engage in social behavior online than offline, and surfing the internet can reduce negative emotions).


#### TABLE 2 | Items from four scales.

fpsyg-10-01010 May 4, 2019 Time: 16:20 # 3


IAT, Internet Addiction Test; GPIUS, Generalized Problematic Internet Use Scale; GAS, Gaming Addiction Scale; CIAT, Chinese Internet Addiction Test.

### Item Bank Construction of CAT-IA

To obtain a high-quality item bank, psychometric evaluations were performed on the individuals' actual data as follows.

**Step 1:** Test the unidimensional assumption of the item pool.

Unidimensionality means that the test measures only one main latent trait; that is, responses to each item are affected by one main latent trait of the participants (Embretson and Reise, 2013). Both exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) were used to assess the unidimensionality assumption. In EFA, the assumption is deemed to be met when the first factor explains at least 20% of the variance (Reckase, 1979) and the ratio of the variance explained by the first factor to that explained by the second factor is greater than 4 (Reeve et al., 2007). A single-factor CFA was also used to assess the assumption. We adopted two indicators: factor loadings and the root mean square error of approximation (RMSEA), estimated by the weighted least squares mean and variance adjusted (WLSMV) method in Mplus 7.0 (Muthén and Muthén, 2012). According to the rule of thumb of Browne and Cudeck (1993), the model fit is close, fair or acceptable, mediocre, or poor if the RMSEA value is below 0.05, between 0.06 and 0.08, between 0.09 and 0.10, or above 0.10, respectively. We excluded items with factor loadings smaller than 0.4 because factor loadings below 0.4 could easily be over-interpreted (Nunnally, 1978).
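The two EFA criteria above can be illustrated with a rough eigenvalue screen. The sketch below is not the authors' procedure: it assumes Pearson correlations and uses the eigenvalues of the correlation matrix as a stand-in for the variance explained by each factor, and the function name `first_factor_check` and the toy data are hypothetical.

```python
import numpy as np

def first_factor_check(responses, min_share=0.20, min_ratio=4.0):
    """Eigenvalue screen for a dominant first factor (illustrative only).

    responses: (n_persons, n_items) array of item scores.
    Returns (share of variance of the first eigenvalue,
             ratio of first to second eigenvalue, whether both criteria hold).
    """
    corr = np.corrcoef(responses, rowvar=False)         # item-by-item correlations
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]   # eigenvalues, descending
    share = eigvals[0] / eigvals.sum()                  # Reckase (1979): at least 20%
    ratio = eigvals[0] / eigvals[1]                     # Reeve et al. (2007): above 4
    return share, ratio, bool(share >= min_share and ratio > min_ratio)

# Toy data (an assumption): 10 continuous indicators driven by one latent trait.
rng = np.random.default_rng(0)
theta = rng.normal(size=500)
items = theta[:, None] + rng.normal(scale=1.0, size=(500, 10))
share, ratio, ok = first_factor_check(items)
print(f"first-factor share = {share:.2f}, ratio = {ratio:.2f}, criteria met = {ok}")
```

A full analysis of Likert items would normally rely on a dedicated factor-analysis routine and possibly polychoric correlations; the sketch only shows how the two thresholds are applied.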

**Step 2:** Select the appropriate IRT model according to the test-level model-fit indices.

Selecting the appropriate model is one of the most important steps in making valid inferences. In this study, four commonly used polytomous IRT models were considered: Graded Response Model (GRM; Samejima, 1969), Generalized Partial Credit Model (GPCM; Muraki, 1992), Graded Ratings Scale Model (GRSM; Andrich, 1978), and Nominal Response Model (NRM; Bock, 1972). Test-level model-fit indices were used to compare and select IRT models, including Akaike's information criterion (AIC; Akaike, 1974), Bayesian information criterion (BIC; Schwarz, 1978), and −2Log-Likelihood (−2LL; Spiegelhalter et al., 1998). Smaller values of these indices indicate better model fit; therefore, the model with the smallest test-level fit indices was selected for further analysis. Model selection analysis was done in R package mirt (Version 1.10; Chalmers, 2012).
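The relation between the three indices can be sketched with the standard definitions AIC = −2LL + 2k and BIC = −2LL + k·ln(N), where k is the number of free parameters and N the number of respondents. The −2LL values and parameter counts below are hypothetical and chosen only to show how a more heavily parameterized model can have a smaller −2LL yet larger AIC and BIC.

```python
import math

def information_criteria(neg2ll, n_params, n_persons):
    """Return (AIC, BIC) from -2 log-likelihood, parameter count, and sample size."""
    aic = neg2ll + 2 * n_params
    bic = neg2ll + n_params * math.log(n_persons)
    return aic, bic

# Hypothetical values (assumptions, not taken from Table 4):
# a simpler model with 5 parameters per item and a richer one with 8 per item.
n_persons, n_items = 1368, 96
aic_simple, bic_simple = information_criteria(300000.0, 5 * n_items, n_persons)
aic_rich, bic_rich = information_criteria(299800.0, 8 * n_items, n_persons)

print(aic_simple, aic_rich)   # the richer model fits closer but is penalized more
print(round(bic_simple, 1), round(bic_rich, 1))
```

Because both criteria add a penalty that grows with k, a model can reduce −2LL and still be ranked lower on AIC and BIC.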

**Step 3:** Assess the local independence of the remaining items in the item pool.

Local independence includes two aspects: one is that the response of the same participants (or participants of similar level) to any one item is not affected by any other item on the same test; the other is that the responses of different participants (or participants of different levels) to the same item do not affect each other (Embretson and Reise, 2013). Currently, the Q3 statistic (Yen, 1993) is commonly used to check for dependence between items. We calculated the Q3 values of all pairs of items in the item pool under the IRT model selected in Step 2, via R package mirt (Version 1.10; Chalmers, 2012). As suggested by Cohen (2013), Q3 values below 0.36 represented local independence. Hence, for each pair of items with Q3 > 0.36, one item was removed.
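As a minimal sketch of the idea behind Q3, the statistic for a pair of items is the correlation, across persons, of the residuals between observed item scores and the scores expected under the fitted model. The code assumes the expected scores are already available from a fitted IRT model; the function names and toy data are hypothetical.

```python
import numpy as np

def q3_matrix(observed, expected):
    """Pairwise correlations of model residuals (observed minus expected scores)."""
    resid = np.asarray(observed, float) - np.asarray(expected, float)
    return np.corrcoef(resid, rowvar=False)

def flag_dependent_pairs(q3, cutoff=0.36):
    """Item pairs (i, j), i < j, whose Q3 exceeds the cutoff."""
    n = q3.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if q3[i, j] > cutoff]

# Toy data (an assumption): four items; items 0 and 1 share an extra nuisance factor.
rng = np.random.default_rng(0)
n_persons = 1000
expected = np.full((n_persons, 4), 3.0)           # placeholder model-implied scores
resid = rng.normal(size=(n_persons, 4))
shared = rng.normal(size=n_persons)
resid[:, 0] += shared
resid[:, 1] += shared
observed = expected + resid

flagged = flag_dependent_pairs(q3_matrix(observed, expected))
print(flagged)
```

In practice one item from each flagged pair would then be dropped and the remaining pool rechecked, as described above.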

**Step 4:** Assess the monotonicity of the remaining items in the item pool.

Monotonicity, meaning that a person with a higher latent trait level has a higher probability of obtaining a higher score on an item, was assessed by scalability coefficients for the item pool and for individual items via R package mokken (Version 2.7.7; van der Ark, 2007). According to Mokken (1971), a scale or item has high quality if its scalability coefficient is above 0.3. Items with scalability coefficients below 0.3 were thus eliminated until all of the scalability coefficients exceeded 0.3.

**Step 5:** Analyze the psychometric characteristics of the remaining items in item pool.

After items were excluded in the above four steps, psychometric characteristics (i.e., item fit, differential item functioning [DIF], and discrimination) were evaluated for the remaining items. First, the S-X<sup>2</sup> statistic (Orlando and Thissen, 2003) was used to examine item fit using R package mirt (Version 1.10; Chalmers, 2012). Second, ordinal logistic regression, a flexible method for detecting DIF, was used to test DIF for gender (male and female), age (under 18 years, and 18 and above), and region groups (rural and urban), respectively, via R package lordif (Version 0.2-2; Choi et al., 2011). DIF was assessed by means of the change in McFadden's R<sup>2</sup> between different groups; items with an R<sup>2</sup> change greater than 0.02 were flagged for DIF (Choi et al., 2011). Third, the item parameters, namely, the discrimination (a) and difficulty (b) parameters, were estimated under the selected model.
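For orientation, McFadden's pseudo-R<sup>2</sup> compares the log-likelihood of a fitted model with that of an intercept-only model. The expression below is a hedged sketch of how an R<sup>2</sup> change could be formed from two nested ordinal logistic regressions; the labels M<sub>0</sub> (intercept only), M<sub>1</sub> (trait estimate only), and M<sub>2</sub> (trait estimate plus group membership) are notation assumed here for illustration and do not appear in the original text.

$$R^2_{\text{McFadden}}(M) = 1 - \frac{\ln L(M)}{\ln L(M_0)}, \qquad \Delta R^2 = R^2_{\text{McFadden}}(M_2) - R^2_{\text{McFadden}}(M_1)$$

Under this reading, an item would be flagged when the change exceeds the 0.02 threshold stated above.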

**Step 6:** Choose high-quality items to develop the final item bank of CAT-IA.

According to the psychometric characteristics in Step 5, items with poor model fit (p < 0.01), DIF, or low discrimination (a < 1.00) were excluded. This procedure was repeated until no further item was excluded.

### CAT Simulation

To evaluate the psychometric properties, efficiency, reliability, concurrent validity, and predictive validity of CAT-IA, two CAT simulation studies were carried out. A CAT study is generally composed of six parts: the item bank, the item response model, the method for selecting the initial item, the method for estimating the latent trait, the item selection method, and the stopping rule (Weiss and Kingsbury, 1984). First, the 59-item bank of CAT-IA was established, and the item parameters were estimated under the selected IRT model. Second, an item from the 59-item bank was randomly selected as the initial item to control the exposure rate.

Ability estimation methods in CAT procedures mainly include maximum likelihood estimation (MLE), weighted likelihood estimation (WLE), maximum a posteriori estimation (MAP), and expected a posteriori estimation (EAP) (e.g., Chen et al., 1998; Wang and Vispoel, 1998; Gorin et al., 2005). The MLE, WLE, and MAP methods take the maximum point of the likelihood function (or posterior distribution) as the estimated ability value, which may result in multiple extreme points at the beginning of a test (Magis and Raîche, 2010). In contrast, the EAP algorithm uses the mean of the whole posterior distribution. Thus, the information provided by the entire posterior distribution is used effectively, and the EAP algorithm is more stable than the other three methods. Because the mean is computed directly, EAP needs no iteration, and its calculation is simpler. EAP also has drawbacks: compared with the MLE and WLE methods, it has a larger bias and is a biased estimator (Wang et al., 1999), and compared with EAP, MAP requires fewer items in a variable-length test, which makes the test more efficient (Wang and Vispoel, 1998). However, the advantages of the EAP algorithm clearly outweigh its drawbacks. Its simplicity and stability make it an optimal method for CAT simulations (e.g., Warm, 1989; Chen et al., 1998; Bulut and Kan, 2012).

Further, the maximum information criterion (MIC; Lord, 1980) is the most widely used item selection strategy in CAT programs because it is relatively simple to implement. The purpose of this strategy is to improve the accuracy of measurement (Brunel and Nadal, 1998), but it can easily lead to uneven exposure of items in the item bank and reduced test security (Barrada et al., 2008). Unlike an examination, however, a Likert-type scale has no correct answers and asks participants to respond as they usually would, which greatly reduces the test security problem. Therefore, MIC was selected as the item selection method in the CAT-IA simulation study. Finally, several stopping rules with different SE thresholds were applied: None (i.e., the entire item bank was used), SE ≤ 0.2, SE ≤ 0.3, SE ≤ 0.4, and SE ≤ 0.5.
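The components described above (a graded response model, a random first item, EAP estimation, maximum-information item selection, and an SE-based stopping rule) can be put together in a short loop. The sketch below is a minimal illustration under stated assumptions, not the catR or mirtCAT implementation used in the study: the item parameters are randomly generated rather than taken from Table 5, response categories are coded 0 to 4 instead of 1 to 5, the prior is standard normal on a fixed grid, and all function names are hypothetical.

```python
import numpy as np

GRID = np.linspace(-4, 4, 81)        # quadrature points for theta
PRIOR = np.exp(-0.5 * GRID**2)       # standard normal prior (unnormalized)

def grm_probs(theta, a, b):
    """GRM category probabilities; theta (Q,), b ordered thresholds (K-1,) -> (Q, K)."""
    theta = np.atleast_1d(theta)
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))   # P(X >= k)
    cum = np.hstack([np.ones((len(theta), 1)), cum, np.zeros((len(theta), 1))])
    return cum[:, :-1] - cum[:, 1:]

def grm_info(theta, a, b):
    """Fisher information of one GRM item at a scalar theta."""
    cum = np.concatenate([[1.0], 1.0 / (1.0 + np.exp(-a * (theta - b))), [0.0]])
    slope = a * cum * (1.0 - cum)                 # derivative of each cumulative curve
    probs = cum[:-1] - cum[1:]
    return np.sum((slope[:-1] - slope[1:]) ** 2 / probs)

def eap(items, answers, a, b):
    """Posterior mean and posterior SD of theta given the answers so far."""
    post = PRIOR.copy()
    for j, x in zip(items, answers):
        post *= grm_probs(GRID, a[j], b[j])[:, x]
    post /= post.sum()
    est = np.sum(GRID * post)
    return est, np.sqrt(np.sum((GRID - est) ** 2 * post))

def run_cat(respond, a, b, se_stop, rng):
    """Administer items until SE <= se_stop or the bank is exhausted."""
    remaining = list(range(len(a)))
    items, answers = [], []
    nxt = int(rng.choice(remaining))              # random initial item
    while True:
        remaining.remove(nxt)
        items.append(nxt)
        answers.append(respond(nxt))
        est, se = eap(items, answers, a, b)
        if se <= se_stop or not remaining:
            return est, se, items
        # maximum information criterion at the current theta estimate
        nxt = max(remaining, key=lambda j: grm_info(est, a[j], b[j]))

# Hypothetical 59-item bank and one simulated respondent (assumptions).
rng = np.random.default_rng(1)
n_items = 59
a = rng.uniform(1.0, 2.2, n_items)
b = np.sort(rng.normal(0.5, 1.2, size=(n_items, 4)), axis=1)
true_theta = 0.5

def respond(j):
    return int(rng.choice(5, p=grm_probs(true_theta, a[j], b[j])[0]))

est, se, used = run_cat(respond, a, b, se_stop=0.3, rng=rng)
print(f"estimate = {est:.2f}, SE = {se:.2f}, items used = {len(used)}")
```

Setting `se_stop=0.0` reproduces the "None" rule in this sketch, because the loop then runs until the bank is exhausted.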

#### Simulation Study 1: Psychometric Properties of CAT-IA

Once a CAT-IA program has been established, its psychometric properties should be evaluated, especially its measurement accuracy, because the results of CAT-IA may carry high-stakes consequences similar to those of an entrance exam. Therefore, a Monte Carlo (MC) simulation was used to evaluate the performance of CAT-IA. First, the ability values of 1,000 virtual persons were generated randomly from the standard normal distribution (mean = 0, SD = 1); these were regarded as the true ability values. Second, the item parameters of the final 59-item bank and the selected IRT model were used to conduct the CAT-IA simulation study. Third, the MC method was used to estimate the ability value of each simulated person according to the true θ values, the selected IRT model, and the item parameters. These abilities were the estimated values of the 1,000 simulated persons. In addition, the CAT-IA performance was evaluated via several statistical indices, including conditional bias (CBIAS), conditional mean absolute error (CMAE), conditional root mean square error (CRMSE), and conditional standard error of estimation (CSEE) across all θ areas (Han, 2018). Simulation study 1 was done in the R package catR (Version 3.12; Magis and Barrada, 2017). These statistical indices for every participant were plotted under different stopping rules using SPSS (Version 23.0; George, 2016).
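One way to compute such conditional indices is to bin simulated persons by their true θ and summarize the estimation errors within each bin. The sketch below assumes the usual definitions (bias as the mean error, MAE as the mean absolute error, RMSE as the root mean squared error, and CSEE as the mean reported SE); the exact formulas in Han (2018) may differ in detail, and the function name, bin edges, and data are hypothetical.

```python
import numpy as np

def conditional_indices(true_theta, est_theta, se, edges):
    """CBIAS, CMAE, CRMSE, and CSEE within bins of the true theta."""
    true_theta, est_theta, se = (np.asarray(v, float) for v in (true_theta, est_theta, se))
    err = est_theta - true_theta
    bin_id = np.digitize(true_theta, edges)       # bin index for each person
    out = {}
    for k in np.unique(bin_id):
        e, s = err[bin_id == k], se[bin_id == k]
        out[int(k)] = {
            "n": int(e.size),
            "CBIAS": float(e.mean()),
            "CMAE": float(np.abs(e).mean()),
            "CRMSE": float(np.sqrt((e ** 2).mean())),
            "CSEE": float(s.mean()),
        }
    return out

# Tiny hand-checkable example (assumed values): two persons per theta bin.
result = conditional_indices(
    true_theta=[0.0, 0.0, 1.0, 1.0],
    est_theta=[0.1, -0.1, 1.2, 1.0],
    se=[0.3, 0.3, 0.2, 0.4],
    edges=[0.5],
)
for k, stats in result.items():
    print(k, stats)
```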

#### Simulation Study 2: Efficiency, Reliability, and Validity of CAT-IA

#### **Efficiency and reliability of CAT-IA**

To evaluate the efficiency and reliability of CAT-IA, a simulation based on the actual data was carried out via the R package mirtCAT (Version 0.5; Chalmers, 2015). In simulation study 2, the real responses to items were used instead of virtual responses generated by the MC method; the process of simulation study 2 was the same as that in simulation study 1. For each responder, the SE could be calculated in simulation study 2. Green et al. (1984) pointed out that a unitless reliability index is necessary for a CAT, even if this index is somewhat contrived. The index of marginal reliability was proposed by Green et al. (1984) to evaluate effectively the reliability of a CAT under different stopping rules. Marginal reliability is a relatively convenient way to monitor dynamically the reliability of a CAT, and can also be used to evaluate the stability of a CAT (Green et al., 1984). In general, marginal reliability is a function of standard error of measurement (SEM), as shown in formulas (1) and (2). The bigger the marginal reliability is, the smaller the SEM is. Therefore, marginal reliability is crucial for the assessment of SEM and the reliability of measurement in CAT. Marginal reliability is equal to the mean reliability under each stopping rule for all participants (Wainer et al., 2000b). The formula of marginal reliability is defined as:

$$MR = 1 - SE^2 \tag{1}$$

$$SE = \frac{\sum_{i=1}^{N} SE(\theta_i)}{N} \tag{2}$$

where N is the number of participants, SE(θ<sub>i</sub>) is the standard error of examinee i at the final estimated θ, and SE in formula (1) is the mean standard error defined in formula (2). Several statistics were investigated to examine the efficiency and reliability of CAT-IA, including the mean and standard deviation of the number of items used, the mean SE, the marginal reliability, and Pearson's correlations between the θ estimated under the stopping rule of None and the θ estimated under the remaining stopping rules. The number of items used and the reliability for every participant were plotted under different stopping rules using the R package ggplot2 (Version 2.2.1; Wickham, 2011).
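Formulas (1) and (2) amount to averaging the individual standard errors and subtracting the squared mean from one. The helper below is a minimal sketch with a hypothetical function name; as a check, a mean SE of 0.454 (the largest mean SE reported in the Results) gives a marginal reliability of about 0.794.

```python
import numpy as np

def marginal_reliability(se_values):
    """Marginal reliability: 1 minus the squared mean of the individual SEs."""
    mean_se = float(np.mean(se_values))     # formula (2)
    return 1.0 - mean_se ** 2               # formula (1)

mr = marginal_reliability([0.454])
print(round(mr, 3))
```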

#### **Concurrent validity and predictive validity of CAT-IA**

CAT-IA is useful only when its estimation results are closely similar to the results of the existing widely used scales. In other words, a person who is diagnosed with IA by a questionnaire should have a higher estimated latent trait in CAT-IA than those without a diagnosis of IA. These similarities were evaluated through the concurrent validity and predictive validity of CAT-IA using SPSS (Version 23.0; George, 2016), based on the original responses that were used to establish the item bank of IA. The concurrent validity was evaluated by the Pearson's correlations between the estimated θ of CAT-IA and the aggregate scores of each scale. Based on previous studies, only two scales (IAT and GAS) possess definite diagnostic criteria for IA (Young, 1998; Caplan, 2002; Huang et al., 2007; Lemmens et al., 2009). Individuals whose IAT sum scores exceed 39 are considered as having problematic network usage (Young, 1998). GAS includes seven diagnostic items (Lemmens et al., 2009); individuals with at least four items scored 4 or 5 are considered to be addicted. The diagnostic results of IAT and GAS were compared with the estimated results of CAT-IA. Then, the AUC (the area under the ROC curve) index was employed to investigate the predictive effect of CAT-IA. According to the rule of Rice and Harris (2005), AUC values below 0.50 represent a small predictive effect; values between 0.51 and 0.70, a moderate predictive effect; and values higher than 0.71, a large predictive effect. On the ROC curve, the critical points were determined by the maximal Youden Index (YI = sensitivity + specificity − 1) (Schisterman et al., 2005). Sensitivity is the probability that a patient is diagnosed as a patient, and specificity is the probability that a person without the symptoms is diagnosed as a normal person. Sensitivity and specificity are two important reference indicators for the accuracy of critical values; both range from 0 to 1, with larger values representing better predictive validity.
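The AUC and the Youden-optimal cut-off can be computed directly from the θ estimates and a binary diagnosis. The sketch below is an illustration rather than the SPSS procedure used in the study: it computes the AUC as the proportion of diagnosed/non-diagnosed pairs in which the diagnosed person has the higher θ (ties count one half) and scans every observed θ as a candidate cut-off; the function name and the six data points are hypothetical.

```python
import numpy as np

def auc_and_youden(theta, diagnosed):
    """Return AUC and (Youden index, cut-off, sensitivity, specificity) at the optimum."""
    theta = np.asarray(theta, float)
    diagnosed = np.asarray(diagnosed, bool)
    pos, neg = theta[diagnosed], theta[~diagnosed]
    diff = pos[:, None] - neg[None, :]                     # all positive-negative pairs
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
    best = None
    for c in np.unique(theta):                             # candidate cut-offs
        pred = theta >= c                                  # classified as addicted
        sens = float(np.mean(pred[diagnosed]))
        spec = float(np.mean(~pred[~diagnosed]))
        yi = sens + spec - 1.0
        if best is None or yi > best[0]:
            best = (yi, float(c), sens, spec)
    return float(auc), best

# Hypothetical theta estimates and diagnoses (assumptions for illustration).
theta = [-1.0, -0.5, 0.0, 0.2, 0.8, 1.5]
diagnosed = [0, 0, 0, 1, 0, 1]
auc, (yi, cutoff, sens, spec) = auc_and_youden(theta, diagnosed)
print(auc, yi, cutoff, sens, spec)
```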

### RESULTS

### Item Bank Construction of CAT-IA

#### Unidimensionality

In EFA, the first factor explained 32.44% of the variance, higher than the critical standard of 20% (Reckase, 1979), and the ratio of the variance explained by the first factor to that explained by the second factor was 5.89, higher than the critical standard of 4 (Reeve et al., 2007). In the single-factor CFA, five items were removed (see **Table 3**) owing to factor loadings below 0.4 (Nunnally, 1978). Both the EFA and the single-factor CFA were then repeated on the remaining 96 items. The EFA results showed that the first factor explained 33.87% of the variance, and the ratio of the variance explained by the first and second factors was 6.14. The single-factor CFA yielded an RMSEA value of 0.08, indicating that the single-factor model was fair or acceptable; all factor loadings were above 0.4. These results showed that the 96 items remaining after the deletion of five items basically met the unidimensionality assumption.

#### TABLE 3 | Reasons for stepwise exclusion of the items.

DIF, differential item functioning; the abbreviated content of each item can be seen in Table 5.

#### Model Selection

**Table 4** documents the model-fit indices, including −2LL, AIC, and BIC, for the four IRT models. Compared with the other three IRT models, the GRSM fitted the worst in that it had the largest −2LL, AIC, and BIC values. Of the remaining three models, the GPCM model had the worst fitting indices. Although the −2LL value of NRM was smaller than that of GRM, the AIC and BIC values of NRM were both higher compared with the GRM. The GRM model overall fitted the remaining 96-item bank best compared with other three. Therefore, GRM was selected for later analysis.

#### Local Independence

A total of 23 pairs of items showed local dependence: their Q3 values were above 0.36 (Cohen, 2013). Thus, 26 items were excluded owing to local dependence, including 2 IAT items, 11 GPIUS items, 10 GAS items, and 3 CIAT items (see **Table 3**). Then, the Q3 values of the remaining 70-item bank were reassessed, and the results showed all Q3 values were below 0.36.

#### Monotonicity

The scalability coefficient for the remaining 70-item bank was 0.4, which was higher than the requirement of 0.3 (Mokken, 1971). However, among the 70 individual items, six items (see **Table 3**) still had scalability coefficients below 0.3. After excluding these items, we reevaluated the scalability coefficients; the scalability coefficient of the 64-item bank was 0.39, and the scalability coefficients of all 64 items were above 0.3.

#### DIF

For the region and age groups, no DIF was found for any of the 64 items; the mean changes in McFadden's R<sup>2</sup> between different groups were below the threshold of 0.02 (Choi et al., 2011). However, for the gender groups, four items (see **Table 3**), all belonging to GAS, were flagged for DIF. Therefore, we excluded these items and reassessed DIF for the remaining 60 items. The results showed that the mean changes in McFadden's R<sup>2</sup> were all below 0.02 for the region, age, and gender groups.

GRM, Graded Response Model; GPCM, Generalized Partial Credit Model; GRSM, Graded Ratings Scale Model; NRM, Nominal Response Model; −2LL, −2Log-Likelihood; AIC, Akaike's information criterion; BIC, Bayesian information criterion.

#### TABLE 5 | Item parameters for 59-item bank with GRM.

a, discrimination parameter; b, difficulty parameter.

#### TABLE 6 | The psychometric properties of CAT-IA using CBIAS, CMAE, CRMSE, and CSEE indices across all θ areas.

None, all item bank was used; CBIAS, conditional bias; CMAE, conditional mean absolute error; CRMSE, conditional root mean square error; CSEE, conditional standard error of estimation.

#### Item-Fit

Only one item (IAT-2) failed to fit the GRM, with an S-X<sup>2</sup> p-value of less than 0.01. After removing this item, the remaining 59 items were reevaluated, and the results showed that the S-X<sup>2</sup> p-values of all 59 items were above 0.01.

#### Discrimination

The Graded Response Model was used again to calibrate the remaining 59 items. The item parameters are listed in **Table 5**. The discrimination parameters of the 59 items were all above the value of 1 with mean of 1.627 (SD = 14.5), which indicated the final item bank was of a high quality.

After the above steps, the final item bank of CAT-IA included 59 items with high discrimination, good item-fit, no DIF, and meeting the assumptions of IRT. The eighth column in **Table 5** shows the domains of the 59 items: 6 items measured salience, 9 items measured tolerance, 6 items measured mood modification, 7 items measured relapse, 10 items measured withdrawal, 16 items measured negative outcomes, and 6 items measured benefits.

#### Psychometric Properties of CAT-IA

In **Table 6**, the values of CBIAS, CMAE, CRMSE, and CSEE across all θ areas are displayed under several stopping rules. The second column documents the CSEE values across all θ areas, which ranged from 0.154 to 0.464. The CSEE values that fell below the corresponding precision threshold decreased as the precision requirement was made stricter. The third column shows the values of CBIAS across all θ areas, which ranged from −0.016 to 0.008. Except for the stopping rule of SE (θ) ≤ 0.5, with a CBIAS of −0.016 across all θ areas, the values of CBIAS decreased when the measurement precision was made stricter. The last two columns of **Table 6** indicate that the CMAE and CRMSE values across all θ areas varied from 0.125 to 0.359 and from 0.160 to 0.456, respectively. Both decreased as the measurement precision was made stricter. All these results indicated that the CAT-IA had high measurement accuracy in psychometric properties. The values of CBIAS, CMAE, CRMSE, and CSEE in each θ area under the stopping rule SE (θ) ≤ 0.3 are displayed in **Figures 1**–**4**. As shown in **Figure 1**, the CSEE values were kept below 0.3 in the area θ ≥ −2. As shown in **Figure 2**, CBIAS was inversely related to θ: its values gradually decreased as ability increased. The trends of CMAE and CRMSE were approximately consistent across all θ areas, as shown in **Figures 3**, **4**. These results were consistent for all stopping rules.

### Efficiency, Reliability, and Validity of CAT-IA

#### Efficiency and Reliability of CAT-IA

In **Table 7**, the CAT-IA simulation results are displayed under five measurement precision standards. As shown in the second column, the mean and SD of the items used both increased when the measurement precision was made stricter. In the third column, the mean SE of the latent traits for each stopping rule varied from 0.159 to 0.454. Except for the stopping rule of SE (θ) ≤ 0.2, the mean SEs were less than their corresponding measurement precision. Marginal reliability ranged from 0.794 to 0.973 with an average of 0.90, as shown in the fourth column. Evidently, marginal reliability increased as the measurement precision was made stricter. The last column in **Table 7** shows the Pearson's correlation between the estimated θ with stopping rule of None and the remaining stopping rules. The values of Pearson's correlation ranged from 0.898 to 1 and were all significant at the 0.01 level (two-tailed), which showed that under different stopping rules, the algorithm of CAT-IA was effective. **Table 7** also shows that the CAT-IA could greatly save item usage without loss of measurement precision. Under the stopping rule of SE (θ) ≤ 0.2, the Pearson's correlation between the estimated theta by CAT-IA and the estimated theta by all of the items in the item bank reached 0.990; CAT-IA only used about half of the items (27.655 items) in the item bank. In brief, the CAT-IA saved 53.1% in item usage without loss of measurement precision. Under the two stopping rules of SE (θ) ≤ 0.3 and SE (θ) ≤ 0.4, the Pearson's correlations were both above 0.90; CAT-IA thus saved 80.7 and 89.9% of item usage, respectively. All these results indicated that the CAT-IA had high efficiency and marginal reliability.
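The reported savings follow directly from the mean number of items used relative to the 59-item bank. As a worked check, using the mean of 27.655 items under SE (θ) ≤ 0.2 and the mean of about 11.38 items reported for SE (θ) ≤ 0.3:

$$1 - \frac{27.655}{59} \approx 0.531, \qquad 1 - \frac{11.38}{59} \approx 0.807$$

These values correspond to the 53.1% and 80.7% savings stated above.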

The reliability and number of used items in CAT-IA on levels of the latent trait under stopping rule SE (θ) ≤ 0.3 are displayed in **Figure 5**. We noted a remarkable connection between the number of used items and reliability. Despite only using about 11.38 items, the CAT-IA obtained high reliability (above 0.9) and


None, the whole item bank was used; SD, standard deviation; SE, standard error; r, Pearson's correlations. ∗∗ representing significant correlation at the 0.01 level (two-tailed).

TABLE 8 | Pearson's correlations between the estimated θ of CAT-IA and the sum scores of four IA scales under different stopping rules.


None, the whole item bank was used; IAT, Internet Addiction Test; GPIUS, Generalized Problematic Internet Use Scale; GAS, Gaming Addiction Scale; CIAT, Chinese Internet Addiction Test; ∗∗ representing significant correlation at the 0.01 level (two-tailed).

high measurement precision for a large number of individuals (estimated theta ranged from −2 to +4). Conversely, when the reliability was below 0.9, more items were used. This result was consistent for all stopping rules.

#### Concurrent Validity and Predictive Validity of CAT-IA

The Pearson's correlations between the estimated θ of CAT-IA and the aggregate scores of IAT, GPIUS, GAS, and CIAT are documented in **Table 8**. The values of Pearson's correlations varied from 0.646 to 0.944 and were all significant at the 0.01 level (two-tailed), which revealed that the CAT-IA had high concurrent validity. In addition, compared with the other scales, the correlation coefficient of CIAT was the highest under each stopping rule, whereas that of GAS was the lowest.

The results of the predictive validity of CAT-IA are displayed in **Table 9**. All AUC values (with 95% confidence intervals) were above 0.71, indicating that CAT-IA had a large predictive effect (Rice and Harris, 2005). According to the large predictive effect, the cut-off point of IA was determined under each stopping rule for IAT and GAS, based on the values of sensitivity and specificity. For example, under the stopping rule of SE (θ) ≤ 0.2 in the diagnostic criteria of GAS, if the cut-off point of the 59-item bank was set to 0.801, the sensitivity and specificity of CAT-IA reached 0.922 and 0.862, respectively. These results showed that the CAT-IA had high predictive validity and had strong discrimination between individuals with IA disorder and healthy individuals.
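The relation between a cut-off point, sensitivity, specificity, and the AUC can be illustrated with a short sketch. The data below are invented toy values, not the study's data, and the selection criterion (maximizing Youden's J = Se + Sp − 1) is an assumption: the text only says that cut-off points were chosen based on sensitivity and specificity.

```python
def sensitivity_specificity(theta, has_ia, cutoff):
    """Classify theta >= cutoff as a case and return (sensitivity, specificity)."""
    pairs = list(zip(theta, has_ia))
    tp = sum(1 for t, y in pairs if y and t >= cutoff)
    fn = sum(1 for t, y in pairs if y and t < cutoff)
    tn = sum(1 for t, y in pairs if not y and t < cutoff)
    fp = sum(1 for t, y in pairs if not y and t >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(theta, has_ia):
    """Cut-off maximizing Youden's J = Se + Sp - 1 (assumed criterion)."""
    return max(sorted(set(theta)),
               key=lambda c: sum(sensitivity_specificity(theta, has_ia, c)) - 1)


def auc(theta, has_ia):
    """AUC as the probability that a random case outscores a random non-case."""
    pos = [t for t, y in zip(theta, has_ia) if y]
    neg = [t for t, y in zip(theta, has_ia) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical theta estimates and diagnoses (1 = classified as IA by a criterion scale).
theta = [-1.2, -0.5, 0.1, 0.4, 0.9, 1.3, 1.8, 2.4]
has_ia = [0, 0, 0, 1, 0, 1, 1, 1]

cut = best_cutoff(theta, has_ia)
print(cut, sensitivity_specificity(theta, has_ia, cut), auc(theta, has_ia))
```

In practice, the choice of criterion depends on the relative cost of false positives and false negatives, so a different rule than Youden's J may be more appropriate for screening.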

### DISCUSSION

CAT studies have focused on depression or anxiety for clinical individuals in the field of mental health (e.g., Fliege et al., 2005; Flens et al., 2016, 2017). However, to the best of our knowledge, there are no CAT studies on IA. In this research, we developed a CAT-IA to provide a new and effective assessment of IA. The original item bank of IA was subjected to psychometric evaluation; items were excluded until all of the remaining items in the item bank satisfied the requirements of psychometric evaluation. Subsequently, the efficiency, reliability, and validity of the final item bank of the CAT-IA were assessed under different stopping rules. The results showed that the final 59-item CAT-IA item bank met the three IRT assumptions, and possessed high discrimination, good item-model fit, and no DIF. Moreover, the CAT-IA could significantly save testing items and effectively reduce the test burden of participants, while also having high reliability, concurrent validity, and predictive validity.

Kocalevent et al. (2009) demonstrated that simulation and actual results of CAT tend to show high similarity. Nevertheless, there are three reasons to implement actual CAT studies under different stopping rules. First, the same participants are used not only to estimate item parameters but also to simulate CAT studies, which could result in overfitting and more optimistic results (Friedman et al., 2010). Second, marginal reliability and predictive validity might be overestimated because the data of CAT simulation

TABLE 9 | Area under the curve (AUC) statistics for the IAT and GAS scales under different stopping rules, with 95% confidence intervals.


None, the whole item bank was used; IAT, Internet Addiction Test; GAS, Gaming Addiction Scale. Se, sensitivity; Sp, specificity.

studies come from the original database. Third, De Beurs et al. (2012) indicated that the results of a test are affected by the measurement tools. The original CAT study was done on a computer, but now it is conducted as a paper-and-pencil survey, which may lead to different outcomes.

When CAT-IA is applied in clinical practice or research, its reliability may differ across individuals; that is, individuals of different abilities are provided with different amounts of information. For example, in the present study, under the stopping rule SE (θ) ≤ 0.3, reliability was very low and a large number of items were used when individuals had very high or very low abilities, indicating that small differences between two participants with either very high or very low abilities may not be detected, which was similar to the findings of Reise and Waller (2009). To prevent the emergence of test bias, the reliability provided by the CAT-IA was set as similar and high for all test-takers. Nonetheless, we recognized the impact of the difficulty parameter distribution under the GRM. For example, in this study, there were no items to match persons whose abilities were below −1.968 because the minimum value of the difficulty parameters was b1 = −1.968. Therefore, the CAT-IA provided these people with scarce information, and the measurement accuracy and reliability for them were very low despite the use of a large number of items of the 59-item bank. In future studies, researchers can increase the number of items with high or low difficulty parameters to make the distribution of difficulty parameters more reasonable, which could not only provide high measurement accuracy and reliability for each participant but also greatly reduce the number of selected items for each person.
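Why persons far below the lowest difficulty parameter receive little information can be seen from the item information function of the graded response model (GRM). The sketch below uses the standard GRM formulation. Only the lowest threshold (−1.968) is taken from the text; the discrimination value and the remaining thresholds are hypothetical and chosen purely for illustration.

```python
import math


def grm_item_information(theta, a, thresholds):
    """Fisher information of one GRM item at trait level theta."""
    # Cumulative probabilities P*(X >= k), padded with 1 and 0 at the ends.
    cum = [1.0] + [1 / (1 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    info = 0.0
    for upper, lower in zip(cum[:-1], cum[1:]):
        p = upper - lower  # probability of this response category
        dp = a * (upper * (1 - upper) - lower * (1 - lower))  # derivative w.r.t. theta
        info += dp ** 2 / p
    return info


def standard_error(theta, items):
    """SE(theta) = 1 / sqrt(sum of item information) for a list of (a, thresholds)."""
    return 1 / math.sqrt(sum(grm_item_information(theta, a, b) for a, b in items))


# Hypothetical item: a = 1.5; thresholds other than -1.968 are invented.
item = (1.5, [-1.968, -0.5, 1.0])

for theta in (-4.0, -1.968, 0.0):
    print(theta, round(grm_item_information(theta, *item), 3))
```

With these illustrative parameters, the information at θ = −4 is far smaller than within the range covered by the thresholds, so many more items are needed to reach the same SE, which mirrors the pattern described above.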

The standard IRT model is generally based on assumptions of unidimensionality and local independence. However, the single-dimensional and locally independent assumptions in real life may not be completely satisfied. For example, many researchers believe that the factor structure of IA should be multidimensional rather than unidimensional (e.g., Thatcher and Goolam, 2005; Lemmens et al., 2009; Caplan, 2010). Based on local dependency, Wainer et al. (2000a) proposed a widely used 3PL testlet model, in which dependent items did not need to be excluded when the testlet model was used in a CAT. According to these results, future studies can extend the unidimensional CAT into the multidimensional CAT and use the testlet model to solve local dependency between items.

In addition, concurrent validity in the present study was evaluated by Pearson's correlations between the estimated θ of CAT-IA and the aggregate scores of each scale. This method can result in item overlap that may overestimate the concurrent validity. Future studies should utilize other external scales to investigate concurrent validity. Further, De Beurs et al. (2012) proved that the same test applied in different situations may lead to changes in the measurement characteristics. Therefore, factorial invariance should be considered in future research. Lastly, although there are many methods for the selection of initial items, with respect to the estimation of latent trait, item selection, and exposure rate, this study failed to address enough methods (such as different parameter estimation and item selection methods), which should be fully considered in future studies.

### ETHICS STATEMENT

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all individual participants included in the study. The current study was conducted in conformity to the recommendations of psychometrics studies on mental health at the Research Center of Mental Health, Jiangxi Normal University and approved by the Research Center of Mental Health, Jiangxi Normal University and the Ethics Committee of Psychology Department in Jiangxi Normal University. The written informed consent was obtained from all participants in accordance with the Declaration of Helsinki. All participants gave their written informed consent. The parental consent was also obtained for all participants under the age of 16.

### AUTHOR CONTRIBUTIONS

YZ wrote the manuscript. YC and DT guided the manuscript writing and data processing. DW and XG processed the data.

### FUNDING

This study was funded by the National Natural Science Foundation of China (Grant Nos. 31760288 and 31660278).




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhang, Wang, Gao, Cai and Tu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Validation of Embedded Experience Sampling (EES) for Measuring Non-cognitive Facets of Problem-Solving Competence in Scenario-Based Assessments

#### Andreas Rausch<sup>1</sup> \*, Kristina Kögler<sup>2</sup> and Jürgen Seifried<sup>3</sup>

<sup>1</sup> Economic and Business Education – Workplace Learning, Business School, University of Mannheim, Mannheim, Germany, <sup>2</sup> Business Education, University of Hohenheim, Stuttgart, Germany, <sup>3</sup> Economic and Business Education – Professional Teaching and Learning, Business School, University of Mannheim, Mannheim, Germany

#### Edited by:

Ronny Scherer, University of Oslo, Norway

#### Reviewed by:

Samy A. Azer, King Saud University, Saudi Arabia Christian Wandeler, California State University, Fresno, United States

> \*Correspondence: Andreas Rausch rausch@uni-mannheim.de

#### Specialty section:

This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology

Received: 07 December 2018 Accepted: 07 May 2019 Published: 24 May 2019

#### Citation:

Rausch A, Kögler K and Seifried J (2019) Validation of Embedded Experience Sampling (EES) for Measuring Non-cognitive Facets of Problem-Solving Competence in Scenario-Based Assessments. Front. Psychol. 10:1200. doi: 10.3389/fpsyg.2019.01200

To measure non-cognitive facets of competence, we developed and tested a new method that we refer to as Embedded Experience Sampling (EES). Domain-specific problem-solving competence is a multi-faceted construct that is not limited to cognitive facets such as domain knowledge or problem-solving strategies but also comprises non-cognitive facets in the sense of domain-specific emotional and motivational dispositions such as interest and self-concept. However, in empirical studies non-cognitive facets are usually either neglected or measured by generalized self-report questionnaires that are detached from the performance assessment. To enable an integrated measurement, we developed the EES method to collect data on non-cognitive facets during scenario-based low-stakes assessments. Test-takers are requested to stop at certain times and spontaneously answer short items (EES items) regarding their actual experience of the problem situation. These EES items are embedded in an EES event that resembles typical social interactions with non-player characters. To evaluate the feasibility and validity of the method, we implemented EES in a series of three studies in the context of commercial vocational education and training (VET): a feasibility study with 77 trainees, a pilot study with 20 trainees, and the main study with 780 trainees who worked on three complex problem scenarios in a computer-based office simulation. In the present paper, we investigate how test-takers perceived the EES events, whether social desirability biased their answers, the internal structure of the data, and the relationship between EES data and data from several other sources. Interview data and survey data indicated no biases due to social desirability and no additional burden for the test-takers due to the EES events. A correlation analysis following the multitrait-multimethod approach as well as the calibration of a multidimensional model based on Item Response Theory (IRT) also supported the construct validity. Furthermore, EES data show substantial correlations with test motivation but almost zero correlations with data from generalized retrospective self-report questionnaires on non-cognitive facets. Altogether, EES offers an alternative approach to measuring non-cognitive facets of competence under certain conditions. For instance, EES is also based on self-reporting and thus might not be suitable for high-stakes testing.

Keywords: embedded experience sampling, competence assessment, non-cognitive facets, problem solving, computer-based assessment, scenario-based assessment, business simulation

### INTRODUCTION


Problem-solving competence has gained increasing attention in educational science as well as in vocational education and training (VET) and professional development. In vocational and professional contexts, problem-solving competence is important because of a general trend toward higher-order skills owing to the ongoing automatization and outsourcing of routine tasks that not only affect blue-collar work in production lines but also white-collar work (e.g., Brynjolfsson and McAfee, 2014; Frey and Osborne, 2017). Problem solving is considered to be an orchestration of cognitive, metacognitive, and non-cognitive processes in order to find an initially unknown way of bridging the gap between an actual state and a desired state (Dörner and Funke, 2017). Hence, unlike routine action, problem solving is by definition strenuous and problems usually evoke negative emotions that have to be dealt with. Altogether, problem solving is enhanced by motivation, excitement, perseverance, frustration tolerance, emotion regulation, (mild) positive affect, self-confidence, and so forth (Sembill, 1992; Frensch and Funke, 1995; Sugrue, 1995; Isen, 2008; Hannula, 2015; Schoppek and Fischer, 2015). Consequently, problem-solving competence also comprises non-cognitive dispositions which are also seen to be part of competence in general and work competence more specifically (Weinert, 2001; Rychen and Salganik, 2003; Kanfer and Ackerman, 2005). Nevertheless, the assessment of competencies is usually limited to cognitive aspects such as the reproduction or application of domain knowledge. We argue that a more holistic assessment of problem-solving competence should result in a competence profile that also comprises non-cognitive facets (Sembill et al., 2013; Rausch and Wuttke, 2016). 
The lack of holistic measurement approaches has led us to develop an experience sampling procedure which builds on the integration of emotional and motivational self-reports into computer-based competence assessments. It is referred to as Embedded Experience Sampling (EES) and has been created to capture the non-cognitive dimension of problem solving in situ. This contribution outlines the characteristics and implementation of EES and presents findings concerning its validity gained by conducting three empirical studies throughout the developmental process.

### Non-cognitive Facets of Problem-Solving Competence

In his seminal report, Weinert (2001) developed a broad definition of action competence as a combination of "intellectual abilities, content-specific knowledge, cognitive skills, domain-specific strategies, routines and subroutines, motivational tendencies, volitional control systems, personal value orientations, and social behaviors" (Weinert, 2001, p. 51). He pointed out that "performance in specific situations depends on more than cognitive prerequisites" (Weinert, 1999, p. 19). Similarly, Kanfer and Ackerman (2005) consider knowledge, skills, abilities, motivation, personality, and self-concept as components of work competence. Furthermore, within research on problem solving, there is a broad consensus that besides the significance of domain-specific knowledge, problem solving is also enhanced by ". . . some non-cognitive factors such as self-confidence, perseverance, motivation, and enjoyment" (Frensch and Funke, 1995, p. 21). Within the framework of problem solving introduced by the National Center for Research on Evaluation, Standards, and Student Testing (CRESST), problem-solving competence comprises motivation (further divided into effort and self-efficacy) along with cognitive facets (Herl et al., 1999). Similar definitions are found in research on mathematical problem solving (Verschaffel et al., 2012; Schoenfeld, 2013). There is no universally accepted definition of the term "noncognitive" (Duckworth and Yeager, 2015) just as there is no such definition of "cognition" (Neisser, 1967). Any attempt to distinguish cognitive from non-cognitive constructs remains artificial, but facilitates the understanding and analysis of their interdependence (Weinert, 1999).

When solely focusing on the assessment of cognitive facets of competence, it is implicitly assumed that test-takers invest maximum effort to perform as well as possible. Test performance is interpreted as maximum performance in the sense of Cronbach (1960) and thus varying test motivation threatens the validity of the assessment. It is well-known that in testing for intelligence and in international large-scale studies, test motivation exerts an influence on achievement (Butler and Adams, 2007; Duckworth et al., 2011). Eklöf (2010) points out that an achievement test score is a function of "skill and will." Correspondingly, including non-cognitive facets in the definition and modeling of competence moves the construct to be measured from "can do" to "will do" (Kanfer and Ackerman, 2005; Cortina and Luchman, 2012); or, respectively, from maximum performance to typical performance in the sense of Cronbach (1960). Consequently, emotions and motivation no longer represent construct-irrelevant variance, but are a manifest result of latent non-cognitive facets of competence which has to be considered in the measurement. Regarding convergent validity, data of non-cognitive facets of competence should be correlated with measures of test motivation.

Based on a literature review, we developed a competence model that distinguishes knowledge application, action regulation, self-concept, and interest as components of

TABLE 1 | Model of domain-specific problem-solving competence (Rausch and Wuttke, 2016, p. 177).


domain-specific problem-solving competence (**Table 1**). We further defined several facets within each of the components. These facets are arranged alongside an ideal problem-solving process and are intended to guide the measurement of problem-solving competence (Rausch and Wuttke, 2016).

The non-cognitive components (self-concept and interest) mirror the expectancy-value theory of achievement motivation (Wigfield and Eccles, 2000) and the control and value appraisals of achievement motivation (Pekrun, 2006), respectively. Confidence in one's own competence when confronted with a domain-specific problem, tolerating ambiguity and uncertainty, and having confidence in one's own solutions concerning domain-specific problems are defined as facets of a domain-specific self-concept. Being interested in the context of a domain-specific problem, maintaining positive and active emotional states while working on a domain-specific problem, and being interested in the progress of and learning from these problems are defined as facets of domain-specific interest.

### Modeling and Measuring Non-cognitive Facets of Competence

Based on a multidimensional understanding of competence, a crucial question is how non-cognitive facets are measured. Two basic options in dealing with the multidimensionality of the construct can be distinguished (Sembill et al., 2013).

#### Multifaceted Competence Model With Fragmented Measurement

Following this very common approach, non-cognitive facets are part of a multifaceted construct of competence but are measured separately, usually by administering retrospective self-report questionnaires. Those self-reports remain detached from the actual performance. In general, self-reports are considered face-valid (Debus, 2000) but there is plenty of research that stresses several threats and biases regarding the validity of decontextualized retrospective self-reports on emotion and motivation (van Reekum and Scherer, 1997; Robinson and Clore, 2002; Novak and Johnson, 2012; Schwarz, 2012). Furthermore, in their investigation of the empirical relation between intelligence and problem solving, Wittmann and Süß (1999) point to the "Brunswik asymmetry" named after Brunswik (1956) in order to explain the poor prediction of problem solving via intelligence. This poor relation is due to an asymmetry in the content and breadth of the predictor (intelligence) and the criterion (problem solving), because the former is a very broad construct, while the latter is derived from a contextualized performance task. The same argument holds true for the relation of problem solving and non-cognitive facets if non-cognitive facets are measured through general self-report questionnaires which are detached from problem solving (Rausch et al., 2016; Rausch, 2017). This approach may lead to an underestimation of the importance of non-cognitive competence facets (Dermitzaki et al., 2009; Sembill et al., 2013).

#### Multifaceted Competence Model With an Integrated Measurement

Following an integrated approach, the measurement of non-cognitive facets is integrated into the performance assessment. Regarding the differentiation of state and trait, recurrent situational emotional states are interpreted as the dispositional core of a trait emotion (Diener and Lucas, 2000). Just as the assessment of cognitive facets of competence is based on the repeated measurement of manifest performance, the suggested in situ assessment of non-cognitive facets is based on the repeated measurement of emotional states in the context of different problem scenarios. A multitrait-multimethod approach (MTMM; Campbell and Fiske, 1959) can be applied to investigate the internal or construct validity of such an approach. The multiple problem scenarios constitute different methods and the various non-cognitive facets (see **Table 1**) constitute different traits. According to MTMM (Podsakoff et al., 2003), higher correlations between the same traits across different scenarios (monotrait-heteromethod) than between different traits within one scenario (heterotrait-monomethod) indicate internal or construct validity.
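The MTMM comparison described above can be sketched in a few lines. The example below is a simplified illustration, not the authors' analysis: it averages the monotrait-heteromethod and the heterotrait-monomethod correlations for two traits measured in two scenarios. The trait and scenario labels and all scores are invented.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def mtmm_summary(scores):
    """scores maps (trait, method) to a list of person scores.

    Returns (mean monotrait-heteromethod r, mean heterotrait-monomethod r).
    """
    mono_hetero, hetero_mono = [], []
    for (k1, v1), (k2, v2) in combinations(scores.items(), 2):
        same_trait, same_method = k1[0] == k2[0], k1[1] == k2[1]
        if same_trait and not same_method:
            mono_hetero.append(pearson(v1, v2))
        elif same_method and not same_trait:
            hetero_mono.append(pearson(v1, v2))
    return mean(mono_hetero), mean(hetero_mono)


# Hypothetical ratings of five test-takers (traits x scenarios).
scores = {
    ("interest", "scenario_1"): [1, 2, 3, 4, 5],
    ("interest", "scenario_2"): [2, 2, 3, 5, 5],
    ("self_concept", "scenario_1"): [3, 1, 4, 2, 5],
    ("self_concept", "scenario_2"): [3, 2, 4, 1, 5],
}

same_trait_r, same_scenario_r = mtmm_summary(scores)
print(same_trait_r > same_scenario_r)  # True supports construct validity in this toy case
```

A full MTMM analysis would inspect the complete correlation matrix rather than two averages; the averages are used here only to keep the comparison visible.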

### Embedded Experience Sampling to Measure Non-cognitive Facets of Competence

Our empirical approach to measuring non-cognitive facets of competence is inspired by the Experience Sampling Method (ESM) which was introduced by Csikszentmihalyi and Larson (1987, p. 526) as "an attempt to provide a valid instrument to describe variations in self-reports of mental processes." In ESM, participants are repeatedly requested to report their emotional states over a period of time. Different types of ESM have been established (Scollon et al., 2003, p. 7ff.): Signal-contingent sampling requires participants to complete self-reports when prompted by a randomly-timed signal (e.g., twice a day). Event-contingent sampling requires participants to complete self-reports whenever a predefined event occurs (e.g., in case of problems). Interval-contingent sampling uses constant time intervals. The Continuous State Sampling Method (CSSM) is a special case of such time-sampling ESM with very short intervals of only 5–10 min. CSSM has been developed and applied in the context of classroom research (Sembill et al., 2008; Conrad and Schumann, 2017; Kärner et al., 2017; Kögler and Göllner, 2018). CSSM is also used for validating our own approach.
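The difference between interval-contingent and signal-contingent sampling comes down to how prompt times are generated. The following sketch illustrates this under simple assumptions: times are whole minutes from the start of a session, and the function names and parameters are hypothetical rather than taken from any ESM software.

```python
import random


def interval_contingent(duration_min, interval_min):
    """Prompt times at constant intervals, e.g., CSSM-like steps of 5-10 min."""
    return list(range(interval_min, duration_min + 1, interval_min))


def signal_contingent(duration_min, n_signals, seed=None):
    """Prompt times drawn at random (without repetition) within the session."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, duration_min + 1), n_signals))


print(interval_contingent(50, 10))      # [10, 20, 30, 40, 50]
print(signal_contingent(50, 3, seed=1))  # three random minutes between 1 and 50
```

Event-contingent sampling has no schedule of this kind, because the prompt is triggered by the occurrence of a predefined event rather than by the clock.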

Our development of Embedded Experience Sampling (EES) builds on traditional ESM. In order to measure the non-cognitive facets in computer-based tests on problem-solving competence, EES aims at collecting self-report data on non-cognitive facets in situ and furthermore integrates these self-reports into the storyline of authentic problem scenarios. Test-takers are briefly interrupted during the test and requested to answer short questions (EES items) regarding their momentary experience. These EES items are embedded into the test situation in authentic EES events that resemble ordinary social interaction at the workplace (e.g., a colleague asks how one is doing). Closed-ended questions were used in order to spare the test-takers the time they would need to write down their answers. Furthermore, they improve the comparability of the answers and facilitate the implementation of EES in large-scale assessments regarding psychometric scaling. EES items focus on difficult-to-monitor non-cognitive competence facets such as interest, attitudes, commitment, and self-concept.

A similar approach was applied in PISA 2006 as an "embedded science interest assessment". Directly after working on selected test items regarding science competence, the participants were requested to rate their situational interest in the prior item context. The data were calibrated in Item Response Theory (IRT) models to assess trait interest (Drechsel et al., 2011). However, few such approaches are so far known to the authors. Furthermore, the EES approach differs from the PISA approach because in PISA the items were not embedded into the "storyline" of the assessment. A further example for integrating experience sampling into a complex assessment is the "affect self-report device" applied to the game-based learning environment "Crystal Island." During their interaction with the learning environment, test-takers received an in-game prompt asking them to report on their cognitive and emotional states. These status updates were described as part of an in-game social network (Sabourin and Lester, 2014). The "affect self-report device" is embedded in the sense of EES, but it was not designed to measure non-cognitive traits as part of a competence assessment.

Any sampling of self-reported experiences in situ faces limitations: for instance, social desirability may affect individuals' responses and possibly lead to a bias in the psychometric data in terms of construct-irrelevant variance (Messick, 1994). In this context, the criteria of cognitive validity (Pellegrino et al., 2016) or construct validity (Messick, 1994), respectively, require that participants do not consciously deliberate about whether a particular answer would be more socially desirable but only answer according to their actual situational experience. Following the argument of Reis (2012), measuring non-cognitive facets within the problem-solving process promotes ecological validity, given that the problem scenarios and the EES events are representative of daily work. Furthermore, biases due to social desirability might decrease in EES compared to retrospective self-reports, due to the concurrent cognitive load and time pressure during the problem-solving process (Stodel, 2015). However, the repeated sampling of subjective states may also cause reactivity and reactance, for better or worse, because on the one hand it constitutes a disruption and on the other hand it may also trigger reflection (Csikszentmihalyi and Larson, 1987; Scollon et al., 2003; Novak and Johnson, 2012).

### Research Questions and Hypotheses

We implemented EES into test situations in three field studies and collected EES data to investigate


**Table 2** gives an overview of the research questions and corresponding hypotheses of the field studies.

The studies were part of the research project 'modeling and measuring domain-specific problem-solving competence of industrial clerks' (DomPL-IK), which was funded by the German Federal Ministry of Education and Research (Grant No. 01DB081119–01DB1123). The apprenticeship program to become an industrial clerk is the fifth most frequent of nearly 330 state-recognized apprenticeship programs in the well-respected German dual system of vocational education and training (VET). Apprenticeship programs usually require 3 years to complete and are characterized by a combination of workplace learning in the training company and classroom-based learning in state-run vocational schools. Certified industrial clerks usually work in back-office departments of industrial or service companies. A general description of the research project and selected results have been published in Rausch et al. (2016).

In the present article, we focus on the development and validation of the EES approach by analyzing EES data from two pilot studies and the main study. In a first feasibility study, we investigated how participants perceived the EES events, whether social desirability played a role, whether the EES data met the requirements of the MTMM approach, and how EES data were correlated to retrospective measures of interest and self-concept. The aim of the second pilot study was to test the computer-based office simulation that, for the first time, also included a computer-based implementation of EES events. Additional data were collected to investigate the subjective experience of the EES, social desirability in EES responses, and the relation to CSSM data and test motivation. Finally, the computer-based assessment of domain-specific problem-solving competence was implemented in a large-scale study with almost 800 participants in vocational schools in six federal German states. The resulting EES data were calibrated in a psychometric model based on Item Response Theory (IRT). Parts of this final step of the test development are published in


TABLE 2 | Overview of the studies, the research questions, and the hypotheses.

Rausch et al. (2016). The studies within the research project have been approved by the responsible ministries of education and the responsible commissions of data protection of the respective German Federal States as well as by the Ethics Committee of the Otto-Friedrich-University of Bamberg (Otto-Friedrich-University Bamberg, Bamberg, Germany).

### STUDY 1: FEASIBILITY STUDY OF IMPLEMENTING EES EVENTS INTO AUTHENTIC PROBLEM SCENARIOS

## Materials and Methods

#### Participants

The feasibility of implementing EES in the assessment of domain-specific problem-solving competence was investigated in a pilot study with N = 77 students in vocational education and training (VET) of two vocational business schools in Germany. All participants were enrolled in a 3-year apprenticeship program to become industrial clerks and were nearing the end of their 2nd year of the apprenticeship. The sample included 28 male and 49 female participants who showed a typical age distribution (M = 21.8; SD = 1.56; min = 18; max = 26). Participation was voluntary and all participants provided written informed consent.

#### Procedure

Data were collected in computer-equipped classrooms. At the beginning of the data collection sessions the researchers introduced themselves, the project, and the agenda. First, the participants completed several self-report questionnaires including scales on vocational interest and work-related self-efficacy. Next, they worked on three authentic, computer-based business problems including the completion of several EES items (for further information see Rausch, 2017). The session ended with group discussions or individual interviews about the problem scenarios and the experience of EES.

The three computer-based problem scenarios required a cost deviation analysis (30 min), a supplier selection (40 min), and a make-or-buy decision (50 min). Each scenario started with an email from a supervisor which included a problem and a variety of documents of varying relevance, transparency, and credibility. All scenarios required participants to go through multiple processes of information seeking, processing, and interpreting. To complete a scenario, the participants had to reply to the initial email with a well-founded proposed solution. The test environment provided "open book" conditions, meaning that participants could look up technical terms, formulae, legal regulations etc. in a large reference work. However, they were not allowed to consult any other sources such as the internet. The participants used Microsoft Excel® to work on several spreadsheet files and Microsoft Word® documents to write their email reply and make notes. The problem environment was open in the sense that there was no further structure provided during the given time frame for each problem scenario. Editable documents were analyzed for each participant to assess the cognitive facets of problem-solving competence (see **Table 1**). For further information on the analysis of the cognitive facets see Rausch et al. (2016), Rausch (2017), and Seifried et al. (unpublished).

#### Measures

#### **Embedded experience sampling (EES)**

In this feasibility study, four EES events were implemented into each of the above problem scenarios. **Table 3** lists the EES events, the related competence facets, and the EES items that were used. In this first application of the method, no events and items had been designed for the competence facets C2 "ambiguity/uncertainty tolerance" and D1 "personal interest in the problem context/content".

TABLE 3 | Overview of EES events, competence facets, and EES items in Study 1.

EES events and EES items were the same for all three problem scenarios; (−) indicates inverse items. Abbreviations after the competence facets refer to the competence model in Table 1; facets D1 and C2 were not yet measured in this first study.

In this early stage of the project, EES events were paper-based and came in separate envelopes that were numbered consecutively and placed on each participant's desk (see **Appendix Figure A1** for an example). Female and male participants were provided with a gender-specific version of the EES events. At predefined times during the test, participants were asked to open a particular envelope, to immediately complete the items, and to put the paper sheet back into the envelope. Altogether, 1,845 such envelopes were prepared for this study. Clearly, test efficiency was questionable in this paper-based implementation of EES.

The data of the two EES items concerning the competence facet "confidence in one's competence" (C1) were condensed into one scale for each scenario. The internal consistencies were not satisfactory (0.57 < Cronbach's alpha < 0.59). "Situational confidence in one's solution" (C3) was measured with a single item (see **Table 3**). The data of the four EES items on the competence facet "positive and active emotional state" (D2) were condensed into one scale for each scenario. Inverse items were re-coded and a mean score was calculated for each scenario. Again, the internal consistencies were not satisfactory (0.56 < Cronbach's alpha < 0.61). The four dichotomous EES items on the competence facet "interest in the progress of the problem" (D3) were condensed into one scale for each scenario by sum score. Thus, the scores for each non-cognitive facet ranged from 1 to 4.
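The scale construction described above (re-coding inverse items, then checking internal consistency) can be illustrated with a minimal sketch. The function names `recode_inverse` and `cronbach_alpha` and the example ratings are hypothetical and are not taken from the study; the original analyses were run in IBM SPSS 24, and the sketch assumes complete data on a four-point response format.

```python
from statistics import variance


def recode_inverse(score, low=1, high=4):
    """Reverse-code one rating on a scale running from `low` to `high`."""
    return low + high - score


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-participant item-score lists.

    Every inner list holds one participant's scores on the same k items.
    Sample variances (n - 1 in the denominator) are used throughout.
    """
    k = len(rows[0])
    items = list(zip(*rows))  # one tuple of scores per item
    item_variance_sum = sum(variance(column) for column in items)
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical ratings of four participants on two items (assumed data).
ratings = [[1, 1], [2, 3], [3, 2], [4, 4]]
alpha = cronbach_alpha(ratings)
```

With only two items per scale, as for facet C1, alpha depends entirely on the single inter-item correlation, which may partly explain the modest values reported here.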

#### **Generalized self-reports of work-related self-efficacy and work-related interest**

We administered a scale designed to measure work-related self-efficacy (Abele et al., 2000). The scale consisted of six statements that were rated on a five-point Likert scale ranging from 1 = disagree to 5 = agree (e.g., "I do not worry about work-related challenges because I can always trust my abilities."). The internal consistency of the scale was satisfactory (Cronbach's alpha = 0.73). An adapted and shortened version of a scale originally developed to measure dispositional interests in students (Schiefele et al., 1993) was administered. The scale consisted of six statements rated on a four-point Likert scale ranging from 1 = disagree to 4 = agree. The items assessed general interest in the current apprenticeship program (e.g., "I am sure that I have chosen an apprenticeship program which reflects my personal interests."). The internal consistency of the scale was satisfactory (Cronbach's alpha = 0.76).

#### **Subjective experience of EES**

To investigate how the participants experienced the EES, two group discussions in class (with approximately 20 participants each) and 11 individual interviews were conducted. Participants were asked how they experienced the procedure (the additional questions that came in the envelopes). They were asked whether they had deliberated about alternative responses and whether answering these questions had caused additional stress during their work on the problem situations.

#### Data Analysis

Following a multitrait-multimethod (MTMM) approach, the various facets of competence are multiple traits and the three scenarios are multiple methods. Although the variables were not normally distributed (Shapiro–Wilk tests), parametric Pearson correlations were calculated since this method is considered robust (Norman, 2010). In correlation tables, indications of significance are omitted in favor of legibility. Following Cohen (1988), correlation coefficients of 0.10 < r < 0.30 indicate small effects, 0.30 < r < 0.50 indicate medium effects, and r > 0.50 indicates large effects. The interview data were categorized with regard to social desirability and the additional burden of answering the EES items while working on the problem scenarios. The data were analyzed using IBM SPSS 24.
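As an illustration of the effect-size convention cited above, the following sketch computes a Pearson correlation and assigns a Cohen (1988) label. The helper names `pearson_r` and `cohen_label` are hypothetical. Two choices are assumptions of this sketch rather than statements from the text: the absolute value of r is classified, and the boundary values 0.10, 0.30, and 0.50 are assigned to the higher category.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cohen_label(r):
    """Label the size of a correlation following Cohen's (1988) convention.

    Assumption: boundary values fall into the higher category.
    """
    size = abs(r)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "negligible"
```

For example, `cohen_label(0.33)` returns `"medium"` and `cohen_label(0.05)` returns `"negligible"`.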

### Results

#### Descriptive Statistics

The mean values for the EES variables range between 1.71 and 3.19 on a four-step scale (see **Appendix Table A1**). The variable D2 (maintaining positive and active emotional states) shows high values, consistently above the value of 2.3, while variable D3 (interest in the progress of/in learning from the problem) shows much lower values. Here, the mean values only reach a value above 2.0 in scenario 2. Finally, the decrease of the mean values over time for variable C3 (situational confidence in one's solution) is noteworthy. The mean value drops from 2.97 in scenario 2 to 1.71 in scenario 3. This finding is in line with the difficulty of the scenarios (determined by the solution rates)—scenario 2 was evaluated as the easiest one while scenario 3 showed the lowest solution rate, as expected with regard to the complexity of the scenario.

#### Test-Takers' Perception and Social Desirability (RQ1, RQ2)

To investigate participants' subjective experience of the EES, individual interviews and group discussions were conducted. In both group discussions the participants reacted positively to the way in which social interaction was implemented via the paper-based questionnaires and stated that such interruptions were quite realistic. Two of the 11 individually interviewed participants made similar statements when asked how they experienced these short questionnaires and added that it was an entertaining addition to the test scenarios. None of the participants reported adverse experiences. In one group discussion, a participant cautiously indicated that one could have thought about how some of the responses would appear to others. All of the 11 individually interviewed participants indicated that they answered spontaneously according to their actual experience and did not deliberate about "good answers". Only one out of 11 participants stated that answering the EES items caused an additional burden. Altogether, the participants' responses gave no reasons to assume biases from social desirability or any additional burden and thus they support H1a and H2a (see **Table 2**).

#### Multitrait-Multimethod Analyses (RQ3)

In the next step, we analyzed the structure of the data by applying a multitrait-multimethod approach. High heterotrait-monomethod correlations between different non-cognitive competence facets (traits) within a scenario (method) argue for situational influences of the scenario, while high monotrait-heteromethod correlations between the same competence facets (traits) measured in different scenarios (method) argue for trait influences. **Table 4** shows the results of the MTMM analysis.

The mean correlation of all 18 heterotrait-monomethod combinations is r = 0.28 while the mean correlation of all 12 monotrait-heteromethod combinations is r = 0.33, which is consistent with the MTMM assumption. Heterotrait-monomethod correlations different from zero are plausible because the theoretical constructs are not assumed to be fully independent of each other. The monotrait-heteromethod correlations are higher, which supports the assumption of internal validity and thus supports H3a (see **Table 2**). However, they are not much higher than the heterotrait-monomethod correlations. Internal consistency across all three scenarios and across both EES variables of self-concept was Cronbach's alpha = 0.66 (6 variables); the respective internal consistency across all three scenarios and across both EES variables of interest was Cronbach's alpha = 0.71 (6 variables).
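The counts of 18 and 12 combinations follow from four traits (C1, C3, D2, D3) measured in three scenarios. A minimal sketch of how the pairs can be enumerated and averaged is given below. The function names and the dictionary layout for the correlations are hypothetical, and no values from Table 4 are reproduced.

```python
from itertools import combinations
from statistics import mean

TRAITS = ["C1", "C3", "D2", "D3"]  # facets measured in Study 1
SCENARIOS = [1, 2, 3]              # the three scenarios act as methods


def mtmm_pairs(traits, methods):
    """Split all variable pairs into the two MTMM blocks of interest."""
    variables = [(trait, method) for trait in traits for method in methods]
    hetero_mono, mono_hetero = [], []
    for a, b in combinations(variables, 2):
        same_trait = a[0] == b[0]
        same_method = a[1] == b[1]
        if same_method and not same_trait:
            hetero_mono.append((a, b))   # different traits, same scenario
        elif same_trait and not same_method:
            mono_hetero.append((a, b))   # same trait, different scenarios
    return hetero_mono, mono_hetero


def mtmm_summary(corr, traits, methods):
    """Mean correlation per block; `corr` maps frozenset({a, b}) to r."""
    hetero_mono, mono_hetero = mtmm_pairs(traits, methods)
    return (
        mean(corr[frozenset(pair)] for pair in hetero_mono),
        mean(corr[frozenset(pair)] for pair in mono_hetero),
    )


hetero_mono, mono_hetero = mtmm_pairs(TRAITS, SCENARIOS)
```

Here `len(hetero_mono)` is 18 (six trait pairs in each of three scenarios) and `len(mono_hetero)` is 12 (three scenario pairs for each of four traits), matching the counts in the text.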

#### Relations Between EES and Generalized Retrospective Self-Reports (RQ4)

Finally, by calculating mean scores across the EES variables, we obtained two EES-based scales, one for self-concept and one for interest. The correlations between EES-based scales and scales from generalized self-reports of work-related self-efficacy and vocational interest were close to zero and not significant (r = 0.05, p = 0.66 for self-concept; r = 0.04, p = 0.69 for interest). We hypothesized small correlations (H4c) even though the theoretical constructs are quite similar.

### STUDY 2: VALIDATION STUDY OF RESPONSES TO COMPUTER-BASED EES EVENTS

### Materials and Methods

#### Participants

To test the computer-based implementation of EES events and the subjective experience of the EES, 21 VET students participated voluntarily in this pilot study and provided written informed consent. Eight participants were male and 13 were female; the participants were 20.3 years old on average (SD = 1.93; min = 18; max = 24).

#### Procedure

Data were collected in a computer-equipped classroom. At the beginning of the sessions the researchers introduced themselves, the project, and the agenda. The participants worked on one authentic, computer-based problem scenario including the completion of several EES items. In contrast to the feasibility study, the scenario in this pilot study was presented and completed in an integrated custom-built office simulation that comprised typical features of an office workplace, such as an email client, a spreadsheet application, a folder structure, a file viewer, a notepad, a calculator and so forth. **Figure 1** shows a screenshot of the office simulation.



C1, situational confidence in one's competence; C3, situational confidence in one's solution; D2, maintaining positive and active emotional states; D3, interest in the progress of/in learning from the problem; abbreviations of the competence facets refer to the competence model in Table 1; facets D1 and C2 have not been measured yet in this first study.

In addition to EES, data were also collected via the "Continuous State Sampling Method" (CSSM) and via a short questionnaire on test motivation and one's experience with the EES events directly after the problem scenario. Furthermore, the participants completed a longer questionnaire that included biographic information as well as several standardized scales, one of which was applied to measure a disposition toward socially desirable responding.

#### Measures

#### **Embedded experience sampling method (EES)**

In this pilot study of the technological implementation, four EES events were defined. However, due to a technical malfunction the fourth EES event was not presented to the participants. **Table 5** lists the remaining three EES events, the related competence facets, and the EES items that were applied.



In this validation study, the EES events were also presented within the office simulation for the first time. **Figure 2** shows the EES event "phone call."

Embedded experience sampling data were condensed in the same way as in the feasibility study (Study 1), resulting in four EES variables for the competence facets shown in **Table 1**, C1 (confidence in one's competence), C2 (uncertainty tolerance), D1 (interest in the problem content), and D2 (positive emotional states).

#### **Continuous state sampling method (CSSM)**

Continuous state sampling method data were collected during the problem scenario via mobile devices (PalmOne Tungsten®). In 5-min intervals, the participants were requested to rate three statements on a scale from 0 to 100. The items were: (1) Right now, this is very interesting. (2) Right now, I am making great efforts. (3) Right now, I am making great progress. Participants were carefully instructed that this data collection was not part of the assessment and that they were expected to answer honestly according to their actual experience, while no such announcement was made for the EES events. To familiarize participants with the method, the first point of measurement was before the problem scenario and was not included in the analysis. Six measurement points followed during the problem scenario at minutes 4, 9, 14, 19, 24, and 29. Scales were calculated from the six ratings of each statement. Internal consistencies (Cronbach's alpha) were C.A. = 0.70 for "interesting," C.A. = 0.78 for "effort" and C.A. = 0.67 for "progress."

#### **Social desirability**

Two measures were applied to investigate whether social desirability played a role in answering the EES items. First, we administered seven items from the scale "impression management" from the "Balanced Inventory of Desirable Responding (BIDR)" (Paulhus, 1994) in a German version by Musch et al. (2002). Paulhus (1994) defined and measured "impression management" as purposeful deception aimed at looking good to someone. Participants were to rate statements that referred to misconduct that one is usually not willing to admit to such as, for instance, "I sometimes tell lies if I have to" (inverse item) or "I never take things that do not belong to me." Responses were given on a four-point Likert scale. The internal consistency (Cronbach's alpha) was C.A. = 0.71. Second, immediately after the completion of the scenario, the participants completed a short questionnaire. One question aimed at "impression management" during EES responses. Participants had to rate the statement "Concerning the interposed questions, I thought hard about which answer would make me look good" on a five-point scale from 1 = strongly disagree to 5 = strongly agree.

#### **Experience of EES**

In the same short questionnaire directly after the problem scenario two additional questions were aimed at assessing the authenticity of the EES events ("The interposed questions [phone call, visit to my office etc.] are very realistic") and the additional burden due to the EES events ("I would have arrived at a better solution without these interposed questions [phone call, visit to my office etc.]").

#### **Test motivation**

We administered an adapted version of the Effort Thermometer which Kunter et al. (2002) originally developed for and applied in the Programme for International Student Assessment (PISA). The participants were requested to indicate the effort that they had invested in the previous problem scenario on a 10-point scale compared to the maximum effort they would have invested in a test situation of very high personal relevance. The Effort Thermometer was administered directly after the problem scenario.

#### Data Analysis

For correlation analysis, Kendall's tau-b correlations were calculated because the data were not normally distributed and the sample size was small. The data were analyzed using IBM SPSS 24.
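Kendall's tau-b is a rank-based coefficient that corrects for tied observations, which are common with short rating scales. The following pure-Python sketch shows the standard pairwise definition. The function name `kendall_tau_b` and the example values are hypothetical; the study's coefficients were computed in IBM SPSS 24, not with this code.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long sequences, allowing ties.

    tau_b = (C - D) / sqrt((C + D + Tx) * (C + D + Ty)), where C and D are
    the numbers of concordant and discordant pairs, Tx counts pairs tied
    only on x, and Ty counts pairs tied only on y. Pairs tied on both
    variables do not enter the formula.
    """
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_pairs = concordant + discordant
    denominator = sqrt(
        (untied_pairs + tied_x_only) * (untied_pairs + tied_y_only)
    )
    if denominator == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denominator


# Hypothetical ratings with one tie in each variable (assumed data).
tau = kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3])
```

The quadratic pairwise loop is adequate for samples of about 20 participants, as in this pilot study.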

### Results

#### Descriptive Statistics and Social Desirability (RQ1 and RQ2)

On average, the participants experienced the problem scenario as not being very interesting (see EES variable D1 and CSSM scale "interesting"). They invested medium effort according to the CSSM scale "effort" and showed a correspondingly medium test motivation as measured by the Effort Thermometer. With regard to the EES events, the participants did not report that they tried to "look good" when answering the EES items. On average, they experienced the EES events as being quite authentic and hardly as an additional burden (see descriptive statistics in **Appendix Table A2**). Altogether, the data support H1b and H2b (see **Table 2**). The average CSSM ratings of "interesting," "progressing," and "effort" did not vary very much during the course of the problem scenario. The curve for "effort" resembles an inverted U-shape while ratings of "interesting" and "progressing" increased toward the end of the 30-min problem scenario (see **Appendix Figure A2**).

#### Relations Between EES Data and CSSM Data (RQ4)

**Table 6** shows the correlations of selected EES items and corresponding CSSM items.

**Table 6** shows that there are substantial correlations between the Embedded Experience Sampling (EES) and the Continuous State Sampling (CSSM) of situational interest (supporting H4a) while there are smaller correlations between EES data and CSSM data of confidence in one's competence and subjectively perceived progress, respectively.

#### Relations Between EES Data and Impression Management and Test Motivation (RQ1 and RQ4)

We analyzed the extent to which EES data are influenced by social desirability or impression management and how they relate to test motivation. **Table 7** shows the results of the respective correlation analysis.

As shown in **Table 7**, there are almost zero correlations between dispositional impression management and the EES variables. Furthermore, there are only small correlations between the EES variables and situational impression management (i.e., having ". . . thought hard about which answer would make me look good"). There are medium to large correlations between some EES variables and test motivation, which is in line with our theoretical argument. Altogether, the data support H2b and H4b (see **Table 2**).

TABLE 6 | Correlations of selected EES items and corresponding CSSM items in Study 2.


18 < n < 21. Kendall's tau-b correlations; MP, measurement point.



17 < n < 21, Kendall's tau-b correlations.

### STUDY 3: CALIBRATION STUDY OF MEASURING NON-COGNITIVE FACETS OF COMPETENCE VIA EES

Finally, the computer-based assessment of domain-specific problem-solving competence was implemented in a large-scale study with almost 800 participants in vocational schools in six German federal states. Parts of this final step of the test development are published in Rausch et al. (2016). Hence, parts of the following description are borrowed from Rausch et al. (2016).

### Materials and Methods

#### Participants

A total of 786 VET students participated in the study, of whom six were excluded from the analyses due to missing data (due either to lack of willingness or a technical malfunction of the test software). The participating VET students were in the 2nd or 3rd year of a 3-year commercial apprenticeship program, 50.1% were female, and the sample showed a typical right-skewed age distribution (M = 21.3 years; SD = 2.69; min = 17; max = 44).

#### Procedure

All data were collected in computer-equipped classrooms in vocational schools. At the beginning of the data collection sessions the researchers introduced the project and the agenda. All participants provided written informed consent. Before and after the problem scenarios, the participants completed several self-report questionnaires including scales on work-related interest and work-related self-concept. In the following, we focus on the internal consistency and internal validity of the assessment of the non-cognitive facets of domain-specific problem-solving competence.

#### Measures

#### **Embedded experience sampling (EES)**

For the main study, four EES events were defined. The first three EES events were the same as those used in the previous pilot study (see **Table 5**: short email response after the reception of the task, phone call from the sender of the task, short visit by a colleague). **Table 8** lists only the additional fourth EES event, the related competence facets, and the EES items that were used.

#### **Generalized self-reports of work-related self-efficacy and work-related interest**

We administered a questionnaire on work-related self-efficacy (Abele et al., 2000) which consisted of six statements that had to be rated on a five-point Likert scale ranging from 1 = disagree to 5 = agree (e.g., "I do not worry about work-related challenges because I can always trust my abilities."). Cronbach's alpha was 0.69. An adapted version of a scale to measure dispositional interests in students (Schiefele et al., 1993) was administered to measure dispositional work-related interest. Six statements had to be rated on a four-point Likert scale ranging from 1 = disagree to 4 = agree (e.g., "I am sure that I have chosen an apprenticeship program which reflects my personal interests"). Cronbach's alpha was 0.76.

#### Data Analysis

To assess the cognitive facets of competence (see competence model in **Table 1**), a complex three-step method (similar to Bennett et al., 2003) was applied: (1) Fine-grained results from a highly structured content analysis were condensed into (2) partial credit items on the basis of consensual expert judgments. (3) Finally, these partial credits were subject to psychometric scaling using a multidimensional Rasch model. For further details see Rausch et al. (2016) and Seifried et al. (unpublished).

### Results

#### Requirements of IRT (RQ3)

The variables of non-cognitive facets were calibrated in a six-dimensional partial credit model (Masters, 1982). However, facet D3 ("interest in the progress of/in learning from the problem") showed insufficient reliability (EAP/PV reliability = 0.30) and was therefore excluded. Thus, the final estimation included only five dimensions and was estimated including background information such as gender, age, vocation, intelligence, competence scores for the cognitive facets, and other relevant variables. All calculations were conducted using the R package TAM (Kiefer et al., 2015). **Table 9** shows the EAP/PV reliabilities (on the diagonal) and the latent correlations between the five remaining non-cognitive competence facets (Rausch et al., 2016).
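For readers unfamiliar with the partial credit model, the following sketch computes the category probabilities of a single polytomous item for one person on one latent dimension. It illustrates only the response function of the model (Masters, 1982), not the multidimensional estimation with background variables that was carried out in TAM. The function name `pcm_probabilities` and the parameter values are hypothetical.

```python
from math import exp


def pcm_probabilities(theta, step_difficulties):
    """Category probabilities of one item under the partial credit model.

    theta: person parameter on the latent dimension.
    step_difficulties: delta_1 .. delta_m for an item scored 0 .. m.
    The probability of category x is proportional to
    exp(sum over k <= x of (theta - delta_k)), with an empty sum for x = 0.
    """
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    weights = [exp(value) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical three-category item with symmetric step difficulties.
probabilities = pcm_probabilities(0.0, [-1.0, 1.0])
```

With a single step difficulty the function reduces to the dichotomous Rasch model, so `pcm_probabilities(0.0, [0.0])` yields equal probabilities for both categories.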

#### Correlations With Generalized Retrospective Measures (RQ4)

Furthermore, **Table 9** shows correlations between non-cognitive facets as measured by EES and the corresponding generalized self-report measures of work-related self-efficacy and work-related interest.

**Table 9** shows that the EES data meet the requirements of IRT with the exception of D3 (see above). This supports H3b. There are only small correlations between EES-based scores and scores that are based on generalized self-reports, supporting H4c.


TABLE 8 | Additional fourth EES events, competence facets, and EES items in Study 3.

### DISCUSSION

#### Summary of Results

Non-cognitive facets of competence are often neglected in competence assessments. In this paper we introduced Embedded Experience Sampling (EES) as an approach to measuring non-cognitive facets of domain-specific problem-solving competence within a computer-based office simulation. The feasibility and validity of EES were investigated throughout three studies by using different measures and analysis approaches. Most of the results support the validity of EES. The results are discussed with regard to the research questions and hypotheses that were outlined previously (see **Table 2**).

Research question 1 aimed at the test-takers' perception of the EES events in terms of ecological validity. It was hypothesized that participants in low-stake tests do not experience EES events as an additional and unrealistic burden, a finding supported by group discussions and individual interviews in study 1 and by survey data in study 2. Despite experiencing the scenario as quite difficult, they considered it to be authentic and, on average, did not evaluate EES as an additional burden.

TABLE 9 | EAP/PV reliabilities (diagonal) and latent correlations of the non-cognitive facets and generalized self-reports in Study 3.


Parts of these results were published in Rausch et al. (2016).

Research question 2 aimed at social desirability as a potential bias in terms of construct validity. In study 1, the participants' responses in group discussions and individual interviews gave no reasons to assume biases from social desirability. In study 2, the dispositional tendency for impression management was uncorrelated with the EES responses and situational impression management (i.e., having thought about which response to the EES items would make someone look good) showed very small correlations with the EES responses. Altogether, social desirability does not appear as a source of bias in EES responses.

Research question 3 aimed at assessing the consistency of the EES data with assumptions of the Multitrait-Multimethod approach (MTMM) and the requirements of a multidimensional model based on Item Response Theory (IRT) in terms of internal validity. In study 1, low correlations of heterotrait-monomethod combinations and higher correlations of monotrait-heteromethod combinations support the assumption of internal validity; however, the differences are only small. In study 3, the EES data were calibrated in a multidimensional IRT model and showed satisfactory EAP/PV reliabilities for five of the six facets while one facet had to be excluded due to low reliability. Altogether, our analysis supports the assumption of internal validity.

Research question 4 aimed at the correlation of EES data with CSSM data (Continuous State Sampling Method) and test motivation in terms of convergent validity and the correlation between EES data and generalized retrospective self-reports in terms of divergent validity. Substantial correlations between EES data and test motivation support the assumption of convergent validity, while the correlations between EES data and CSSM data were more heterogeneous. Low (almost zero) correlations between EES data and generalized retrospective self-reports in study 1 and study 3 emphasize the significance of the measurement approach.

Altogether, we collected data on the feasibility and validity of EES throughout three field studies on problem-solving competence in the business domain and found very promising results. Embedding self-reports on situational experience into the "storyline" of authentic problem scenarios produces reliable and valid data on non-cognitive facets of problem-solving competence.

#### Limitations and Further Research

Both the methodological approach and the empirical studies have their limitations. First and foremost, we have not tested for external validity, namely by measuring whether emotional states in the test situations are good proxies for emotional states in respective work situations, which constitutes a strong assumption, not only for the non-cognitive facets but also for the cognitive facets of competence. However, it is very difficult to put together an appropriate research design and collect the respective data to investigate these assumptions. Furthermore, data collection in EES is still based on self-reporting. In our studies, we did not find indications of social desirability or of an additional burden due to EES. However, these studies comprised low-stakes testing. In high-stakes testing, responses to EES items are prone to manipulation and EES events might be experienced as more disruptive. The operationalization of the non-cognitive facets of problem-solving competence is arguable. In study 1, the internal consistencies of EES items measuring the same facet were not satisfactory. Many alternative items would have been just as appropriate or maybe more appropriate as indicators of the respective facet. We have not experimented widely with the operationalization of the facets. One significant alteration concerned the facet D3 "interest in the progress of/in learning from the problem." However, this alteration worsened the model fit and resulted in the exclusion of facet D3 from the IRT model in Study 3, while the correlations within the MTMM analysis in Study 1 had been quite promising. We will vary the item content and the item format in future studies and we encourage other research teams to apply similar approaches in their studies, too.

Limitations of Study 1 and Study 2 were the smaller sample sizes that did not allow for more sophisticated analyses. In Study 2, the CSSM items could have been more similar to the other EES items. Only the items regarding situational interest were very similar. In future studies, more appropriate CSSM items should be applied. Moreover, physiological measures such as heart rate (HR), heart rate variability (HRV), skin conductance or cortisol may be used to further validate the EES data. One such study was conducted by Kärner et al. (2018) who also used the above office simulation and found that CSSM data and physiological data (HR, HRV, and cortisol) showed very similar trends in the course of problem solving. A further data source for validation is the log files from the office simulation. Novak and Johnson (2012) discuss how this non-intrusive data source can be used to measure emotion. Finally, an experimental study in which the participants' emotional experience is manipulated would allow the sensitivity of EES to be tested.

### CONCLUSION

Twenty years ago, Weinert (1999) stated that "when assessing competencies, current motivational influences on performance cannot be measured. [. . .] It is feasible only to measure competence-specific motivational attitudes, for example, with reliable and valid questionnaires" (Weinert, 1999, p. 20). In this paper, we introduced Embedded Experience Sampling (EES) as an alternative method to measure non-cognitive facets of competence within the performance assessment instead of relying on decontextualized general self-reports. The idea behind EES is that the repeated measurement of emotional or motivational states during domain-specific tasks allows for an inference to be made regarding non-cognitive traits, similar to Chomsky's (1965) distinction between manifest performance and latent competence. This helps to overcome the asymmetry in the content and breadth in the measurement of the cognitive and non-cognitive constructs (Brunswik, 1956).

Drawing on our experience, EES is a feasible and informative approach to measuring non-cognitive facets of competence under the following conditions: (1) The computer-based performance assessment is embedded in an immersive and authentic simulation of a real-life domain. (2) The participants are confronted with comprehensive scenarios that require a sustained performance. (3) The participants are introduced to EES within a tutorial prior to the performance assessment. Drawing on our empirical studies, we found indications of the validity of EES. We would like to encourage other researchers to implement EES or similar approaches into their studies of competence assessment because further research is needed for the subsequent development and validation of the method.

### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Ethical Board of the University of Bamberg, Germany and the Board of Data Protection of the Federal State Authority of Bavaria (Germany) with written informed consent from all subjects.

### AUTHOR CONTRIBUTIONS

AR, KK, and JS contributed to method, data collection, and preparation of the manuscript. AR led the project.

### FUNDING

This research was funded by the German Federal Ministry of Education and Research under Grant No. 01DB081119–01DB1123. The publication of this article was funded by the Ministry of Science, Research and the Arts Baden-Württemberg and the University of Mannheim.

### ACKNOWLEDGMENTS

The authors would like to thank the DomPL-IK research group: Christin Siegfried, Detlef Sembill, Eveline Wuttke, Jan Küster, Karsten D. Wolf, Marc Egloffstein, Rebecca Eigenmann, Steffen Brandt, and Thomas Schley.

### REFERENCES




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Rausch, Kögler and Seifried. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

### APPENDIX




FIGURE A1 | Paper-based EES event "Phone call" with one EES item (translated from German).

TABLE A1 | Descriptive statistics of EES items in Study 1.


See Table 3 for corresponding EES items.

TABLE A2 | Descriptive statistics of EES items in Study 2.


See Table 5 for corresponding EES items.

# Evaluating Different Equating Setups in the Continuous Item Pool Calibration for Computerized Adaptive Testing

Sebastian Born<sup>1</sup>\*, Aron Fink<sup>2</sup>, Christian Spoden<sup>3</sup> and Andreas Frey<sup>2,4</sup>

<sup>1</sup> Department of Research Methods in Education, Institute of Educational Science, Friedrich Schiller University Jena, Jena, Germany, <sup>2</sup> Educational Psychology: Measurement, Evaluation and Counseling, Institute of Psychology, Goethe University Frankfurt, Frankfurt, Germany, <sup>3</sup> German Institute for Adult Education, Leibniz Centre for Lifelong Learning, Bonn, Germany, <sup>4</sup> Faculty of Educational Sciences, Centre for Educational Measurement, University of Oslo, Oslo, Norway

#### Edited by:

Ronny Scherer, University of Oslo, Norway

#### Reviewed by:

Alvaro J. Arce-Ferrer, Pearson, United States

Alexander Robitzsch, Christian-Albrechts-Universität zu Kiel, Germany

> \*Correspondence: Sebastian Born sebastian.born@uni-jena.de

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 23 November 2018 Accepted: 15 May 2019 Published: 06 June 2019

#### Citation:

Born S, Fink A, Spoden C and Frey A (2019) Evaluating Different Equating Setups in the Continuous Item Pool Calibration for Computerized Adaptive Testing. Front. Psychol. 10:1277. doi: 10.3389/fpsyg.2019.01277

The increasing digitalization in the field of psychological and educational testing opens up new opportunities to innovate assessments in many respects (e.g., new item formats, flexible test assembly, efficient data handling). In particular, computerized adaptive testing provides the opportunity to make tests more individualized and more efficient. The newly developed continuous calibration strategy (CCS) from Fink et al. (2018) makes it possible to construct computerized adaptive tests in application areas where separate calibration studies are not feasible. Due to the goal of reporting on a common metric across test cycles, the equating is crucial for the CCS. The quality of the equating depends on the common items selected and the scale transformation method applied. Given the novelty of the CCS, the aim of the study was to evaluate different equating setups in the CCS and to derive practical recommendations. The impact of different equating setups on the precision of item parameter estimates and on the quality of the equating was examined in a Monte Carlo simulation, based on a fully crossed design with the factors common item difficulty distribution (bimodal, normal, uniform), scale transformation method (mean/mean, mean/sigma, Haebara, Stocking-Lord), and sample size per test cycle (50, 100, 300). The quality of the equating was operationalized by three criteria (proportion of feasible equatings, proportion of drifted items, and error of transformation constants). The precision of the item parameter estimates increased with increasing sample size per test cycle, but no substantial difference was found with respect to the common item difficulty distribution and the scale transformation method. With regard to the feasibility of the equatings, no differences were found for the different scale transformation methods. However, when using the moment methods (mean/mean, mean/sigma), quite extreme levels of error for the transformation constants A and B occurred. Among the characteristic curve methods, the Stocking-Lord method performed slightly better than the Haebara method. Thus, while no clear recommendation can be made with regard to the common item difficulty distribution, the characteristic curve methods turned out to be the most favorable scale transformation methods within the CCS.

Keywords: computerized adaptive test, item response theory, equating, continuous calibration, simulation

### INTRODUCTION

fpsyg-10-01277 June 6, 2019 Time: 9:17 # 2

The shift to using digital technology (e.g., laptops, tablets, and smartphones) for psychological and educational assessments provides the opportunity to implement computer-based state-of-the-art methods from psychometrics and educational measurement in day-to-day testing practice. In particular, computerized adaptive testing (CAT) has the potential to make tests more individualized and to enhance efficiency (e.g., Segall, 2005). CAT is a method of test assembly that uses the responses given to previously presented items for the selection of the next item (e.g., van der Linden, 2016), whereby the item that satisfies a statistical optimality criterion best is selected from a precalibrated item pool. Therefore, the calibrated item pool is an essential and important building block in CAT (e.g., Thompson and Weiss, 2011; He and Reckase, 2014). A set of items is called a calibrated item pool if the item characteristics, such as item difficulty and item discrimination, were estimated on the basis of an item response theory (IRT; e.g., van der Linden, 2016) model beforehand. However, in some contexts, such as higher education, clinical diagnosis, or personnel selection, the item pool calibration for CAT often poses a critical challenge because separate calibration studies are not feasible, and sample sizes are too low to allow for stable item parameter estimation.

To overcome this problem, Fink et al. (2018) proposed a continuous calibration strategy (CCS), which enables a step-by-step build-up of the item pool across several test cycles during the operational CAT phase. In the context of the CCS, a test cycle is understood as the whole test procedure, including steps like test assembly, test administration and analysis of test results. As the item parameter estimates of existing and new items are continuously updated within the CCS, equating is a critical factor to enable interchangeable score interpretation across test cycles. The equating procedure implemented in the CCS is based on a common-item non-equivalent group design (Kolen and Brennan, 2014) and is carried out in four steps: (1) common item selection, (2) scale transformation, (3) item parameter drift (IPD; e.g., Goldstein, 1983) detection, and (4) fixed common item parameter (FCIP; e.g., Hanson and Béguin, 2002) calibration.

In their study, Fink et al. (2018) evaluated the performance of the CCS for different factors (sample size per test cycle, calibration speed, and IRT model) with respect to the quality of the person parameter estimates. Although the results were promising, two issues remained open. First, the study of Fink et al. (2018) was conducted under ideal conditions (i.e., constant ability distribution of the examinees across test cycles). Second, despite the importance of the equating procedure in the CCS, its performance with respect to different setups of the procedure (i.e., selection of common items, scale transformation method, item drift detection) was not investigated in detail. For example, it became apparent that the CCS did not work as intended for very easy or very difficult items when using small sample sizes (i.e., 50 or 100 examinees) per test cycle. In these cases, item parameter estimates were biased due to a few inconsistent responses, with the consequence that these items were no longer selected by the adaptive algorithm in the following test cycles. Therefore, it was not possible to continuously update the item parameter estimates for these items.

Against this background, the aim of the present study was to investigate the performance of the equating procedure for different setups conducted under more realistic conditions (i.e., examinees' average abilities and variance differ between test cycles). The remainder of the article is organized as follows: First, we provide the theoretical background for the present study by introducing the underlying IRT model and by describing the CCS. Next, we discuss both the previously implemented equating procedure and alternative specifications. Then, we examine the performance of different setups of the different equating procedures in a simulation. Finally, we discuss the results and make recommendations for the implementation of the CCS.

### THEORETICAL BACKGROUND

### IRT Model

The IRT model used in this study was the two-parameter logistic (2PL) model (Birnbaum, 1968) for dichotomous items. The 2PL model defines the probability of a correct response u<sub>ij</sub> = 1 of examinee j = 1, . . ., N with a latent ability level θ<sub>j</sub> to an item i by the following model, whereby a<sub>i</sub> is the discrimination parameter and d<sub>i</sub> is the easiness parameter of item i:

$$P\left(u_{ij} = 1 \mid \theta_j, a_i, d_i\right) = \frac{\exp\left(a_i\theta_j + d_i\right)}{1 + \exp\left(a_i\theta_j + d_i\right)} \tag{1}$$

In the traditional IRT metric, where a<sub>i</sub>θ<sub>j</sub> + d<sub>i</sub> = a<sub>i</sub>(θ<sub>j</sub> − b<sub>i</sub>), the a<sub>i</sub> parameters are identical in both parametrizations, while the item difficulty parameter b<sub>i</sub> is calculated as b<sub>i</sub> = −d<sub>i</sub>/a<sub>i</sub>.
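The equivalence of the two parametrizations can be illustrated with a minimal Python sketch. The function names and the numerical values are illustrative assumptions and are not taken from the study, which used R.

```python
import math

def p_correct(theta, a, d):
    """2PL probability of a correct response in slope-intercept form (Eq. 1)."""
    z = a * theta + d
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to exp(z) / (1 + exp(z))

def difficulty_from_easiness(a, d):
    """Convert the easiness parameter d to the traditional difficulty b = -d / a."""
    return -d / a

# Illustrative values (assumed, not from the study)
a, d, theta = 1.2, -0.6, 0.3
b = difficulty_from_easiness(a, d)

p_slope_intercept = p_correct(theta, a, d)
p_traditional = 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

Both expressions yield the same probability, and an examinee whose ability equals b has a success probability of 0.5.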

### Continuous Calibration Strategy

In the following paragraphs, we briefly outline the CCS as introduced by Fink et al. (2018) and detail the equating procedure implemented. The CCS consists of two phases, a non-adaptive initial phase and a partly adaptive continuous phase. In the initial phase, which is the first test cycle of the CCS, the same items are presented to all examinees and only the item order can vary between examinees. In the continuous phase, the tests assembled consist of three types of item clusters (calibration cluster, linking cluster, adaptive cluster), whereby a cluster is comprised of several items. Each type of cluster has a specific goal. The calibration cluster offers the opportunity to include new items in the existing item pool, the linking cluster utilizes common items to allow a scale to be established across test cycles, and the adaptive cluster aims at the enhancement of measurement precision. The items in the calibration and the linking clusters are the same for all examinees and are administered sequentially, whereas the items in the adaptive cluster can differ between examinees due to the adaptive selection algorithm. Each test cycle in the continuous phase can be broken down into seven steps: (1) common item selection for the linking cluster, (2) test assembly and test administration, (3) temporary item parameter estimation, (4) scale transformation of the common items, (5) IPD detection for the common items, (6) FCIP calibration, and (7) person parameter estimation. The equating procedure consists of four of these steps, which will be detailed in the following four paragraphs. The first three steps of the equating procedure serve as quality assurance of the common items to ensure feasible equating in the fourth step.

In the common item selection, items that have already been calibrated in the previous test cycles are selected as common items for the linking cluster. To ensure that the common items represent the statistical characteristics of the item pool (Kolen and Brennan, 2014), such as the range of the item difficulty, the items are assigned to five categories (very low, low, medium, high, and very high) based on their easiness parameters d<sub>i</sub>. Fink et al. (2018) selected the items from the categories in such a way that the difficulty distribution of the common items corresponded approximately to a normal distribution. Besides the representation of the statistical item pool characteristics, it is important that the common items adequately reflect the content of the item pool. This can be done by using content balancing approaches (e.g., van der Linden and Reese, 1998; Cheng and Chang, 2009; Born and Frey, 2017) within the common item selection and within the adaptive cluster.

After test assembly and test administration, the parameters for the common items are estimated based on the responses of the current test cycle. In the second step of the equating procedure, a scale transformation of the common items has to be conducted, because the ability distribution of the examinees usually differs between test cycles and, therefore, the item parameter estimates obtained are not directly comparable across cycles. The comparability of the parameter estimates is a necessary condition to check whether the common items are affected by IPD. For this reason, scale transformation methods (e.g., Marco, 1977; Haebara, 1980; Loyd and Hoover, 1980; Stocking and Lord, 1983) are important for the equating procedure. Fink et al. (2018) used the mean/mean method (Loyd and Hoover, 1980) for the scale transformation.

As IPD may have a serious impact on equating results such as scaled scores and passing rates (Hu et al., 2008; Miller and Fitzpatrick, 2009), the IPD detection as the third step of the equating procedure is important if the method is to operate optimally. A number of tests for IPD can be used in IRT-based equating procedures, such as Lord's χ<sup>2</sup> test (Lord, 1980) and the likelihood-ratio test (Thissen et al., 1988). In an iterative process of scale transformation and testing for IPD, common items that show significant IPD are excluded from the final set of common items. The iterative purification continues until none of the remaining common items shows significant IPD or fewer than two common items are left. The rationale behind the latter stopping rule is that at least two link items are necessary to keep the scale comparable across test cycles. Nevertheless, it should be mentioned that with a smaller number of link items, the equating procedure is more prone to sampling errors (Wingersky and Lord, 1984). Fink et al. (2018) used a one-sided t-test to examine whether the parameter estimates of a common item from the current test cycle differed significantly from the parameter estimates of the same item from the preceding test cycle.
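The control flow of the iterative purification can be sketched as follows. This is a minimal Python illustration under stated assumptions: `purify_common_items`, `estimate_constants`, and `shows_drift` are hypothetical names, and the toy drift rule in the usage part is a simple threshold on difficulty differences, not Lord's χ<sup>2</sup> test or the t-test mentioned above.

```python
from statistics import median

def purify_common_items(common_items, estimate_constants, shows_drift, min_items=2):
    """Iteratively drop common items flagged for drift.

    estimate_constants(items) -> (A, B) and shows_drift(item, A, B) -> bool
    are hypothetical callables supplied by the caller.
    Returns (remaining_items, breakdown).
    """
    remaining = list(common_items)
    while len(remaining) >= min_items:
        A, B = estimate_constants(remaining)      # scale transformation step
        flagged = [item for item in remaining if shows_drift(item, A, B)]
        if not flagged:
            return remaining, False               # no drift left: final common items
        remaining = [item for item in remaining if item not in flagged]
    return remaining, True                        # fewer than two items left: breakdown

# Toy usage (assumed data): items are (name, b_previous, b_current)
items = [("i1", 0.0, 0.1), ("i2", 1.0, 1.1), ("i3", -1.0, -0.9), ("i4", 0.5, 1.6)]

def toy_constants(current_items):
    # Stand-in for a real scale transformation: slope fixed at 1, shift = median difference
    return 1.0, median(b_curr - b_prev for _, b_prev, b_curr in current_items)

def toy_drift(item, A, B):
    # Stand-in for a significance test: flag large deviations from the common shift
    _, b_prev, b_curr = item
    return abs((b_curr - A * b_prev) - B) > 0.5

final_items, breakdown = purify_common_items(items, toy_constants, toy_drift)
```

In this toy example, item "i4" is removed in the first pass and the remaining three items form the final linking set. If every item were flagged, the function would report a breakdown, which in the CCS triggers a concurrent calibration.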

The last step of the equating procedure, the FCIP calibration, involves the parameter estimation of all items using marginal maximum likelihood (MML; Bock and Aitkin, 1981) based on the responses from all test cycles. Because one aim of the CCS is to maintain the original scale from the initial calibration (first test cycle), the use of one-step procedures (e.g., concurrent calibration; Wingersky and Lord, 1984) for estimating all item parameters of the different test cycles in one run is not suitable. If maintaining the scale from the initial calibration over the following test cycles has no priority, promising methods exist for equating multiple test forms simultaneously (Battauz, 2018). In the FCIP calibration, the parameters of the final common items are fixed at the item parameters estimated from the previous test cycle, whereas all the other items are estimated freely. If a "breakdown" occurs, which means that fewer than two common items remain after the IPD detection, a concurrent calibration (Wingersky and Lord, 1984) is used to establish a new scale.

### Specifications of the Common Item Selection

The common item selection and the scale transformation of the common items are crucial parts of the CCS because they ensure that the procedure functions well. In terms of the common item selection, different distributional assumptions such as an approximated normal distribution, as used in Fink et al. (2018), or a uniform distribution may underlie the item selection. Up to now, only Vale et al. (1981) examined the impact of different common item distributions on the accuracy of the item parameter estimates using the mean/sigma method (Marco, 1977). The authors selected the common items in such a way that the test information curves of the common items were peaked (with the most information at theta equals zero) or had an approximately normal or uniform shape. In terms of the bias of the item parameter estimates, the peaked test information curve performed worst. There were only slight differences in the performance, depending on whether normally or uniformly shaped test information curves were used for the common items. As an alternative, items with extreme difficulties (bimodal distribution) might be selected as common items for the linking cluster and, therefore, might be administered to all examinees. As a consequence, the number of responses for these items increases and the impact of the few inconsistent responses that might cause bias in the estimates and prevent later administration and parameter updating in the following test cycles would be reduced. Because the quality of the equating highly depends on the common items selected, it may be argued that especially a bimodal distribution of the common items threatens the goal of maintaining the scale across test cycles. However, the item drift test implemented in the CCS ensures that significant changes in the parameter estimates of the common items between test cycles do not affect the later FCIP calibration that is used to maintain the scale.

### Scale Transformation

When item parameters are estimated using different groups of examinees, the obtained parameters are often not comparable due to arbitrary decisions that have been made to fix the scale of the item and person parameter space (Yousfi and Böhme, 2012). In that case, the comparability of the item parameters can be attained by an IRT scale transformation. If the underlying IRT model holds for two groups of examinees, K and L, then the logistic IRT scales differ by a linear transformation for both the item parameters and the person parameters (Kolen and Brennan, 2014). The linear equation for the θ-values can be formulated as follows:

$$\theta_{Lj} = A \theta_{Kj} + B \tag{2}$$

where A and B represent the transformation constants (also referred to as slope and shift) and θ<sub>Kj</sub> and θ<sub>Lj</sub> the person parameter values for an examinee j on scale K and scale L. The item parameters for the 2PL model on the two scales are defined in Eqs 3 and 4, where a<sub>Ki</sub>, b<sub>Ki</sub>, and a<sub>Li</sub>, b<sub>Li</sub> represent the item parameters on scale K and on scale L, respectively.

$$a_{Li} = \frac{a_{Ki}}{A} \tag{3}$$

$$b_{Li} = A b_{Ki} + B \tag{4}$$
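That Eqs 2 to 4 leave the response probabilities unchanged can be checked numerically. The following Python sketch uses assumed function names and arbitrary illustrative values.

```python
import math

def transform_item(a_k, b_k, A, B):
    """Map 2PL item parameters from scale K to scale L (Eqs. 3 and 4)."""
    return a_k / A, A * b_k + B

def transform_theta(theta_k, A, B):
    """Map an ability value from scale K to scale L (Eq. 2)."""
    return A * theta_k + B

def p_correct(theta, a, b):
    """2PL response probability in the traditional metric a * (theta - b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Illustrative values (assumed)
A, B = 1.3, -0.4
a_k, b_k, theta_k = 0.9, 0.7, -0.2

a_l, b_l = transform_item(a_k, b_k, A, B)
theta_l = transform_theta(theta_k, A, B)

p_on_k = p_correct(theta_k, a_k, b_k)
p_on_l = p_correct(theta_l, a_l, b_l)
```

The two probabilities agree because a<sub>Li</sub>(θ<sub>Lj</sub> − b<sub>Li</sub>) = (a<sub>Ki</sub>/A)(Aθ<sub>Kj</sub> + B − Ab<sub>Ki</sub> − B) = a<sub>Ki</sub>(θ<sub>Kj</sub> − b<sub>Ki</sub>).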

To obtain the transformation constants A and B, several scale transformation methods can be used. The moment methods such as the mean/mean and the mean/sigma express the relationship of scales by using the means and standard deviations of item or person parameters, whereas the characteristic curve methods minimize a discrepancy function with respect to the item characteristic curves (Haebara, 1980) or the test characteristic curve (Stocking and Lord, 1983). Research comparing these methods has found that characteristic curve methods produced more stable results compared to the moment methods (e.g., Baker and Al-Karni, 1991; Kim and Cohen, 1992; Hanson and Béguin, 2002). Within the moment methods, the mean/mean method turned out to be more stable (Ogasawara, 2000). Furthermore, Kaskowitz and de Ayala (2001) found that characteristic curve methods were robust against moderate estimation errors and were more accurate with a larger number of common items (15 or 25 compared to only five common items). In sum, the moment methods are easily implementable, but the characteristic curve methods seem to be more robust against estimation errors.
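To make the difference between the two families of methods concrete, the following Python sketch computes the moment-method constants and evaluates the two characteristic curve criteria on a fixed ability grid. It is a simplified illustration under assumptions: the function names are hypothetical, population standard deviations are used, the criteria are evaluated in one direction only on an equally spaced grid, and no optimizer is included. Operational implementations, such as those in the R packages cited in this article, may differ in these details.

```python
import math
from statistics import mean, pstdev

def icc(theta, a, b):
    """2PL item characteristic curve in the traditional metric."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def mean_sigma(b_k, b_l):
    """Mean/sigma method: A from the SDs, B from the means of the difficulties."""
    A = pstdev(b_l) / pstdev(b_k)
    return A, mean(b_l) - A * mean(b_k)

def mean_mean(a_k, b_k, a_l, b_l):
    """Mean/mean method: A from the mean discriminations, B from the mean difficulties."""
    A = mean(a_k) / mean(a_l)
    return A, mean(b_l) - A * mean(b_k)

def haebara_loss(A, B, a_k, b_k, a_l, b_l, grid):
    """Sum over grid points and items of squared ICC differences."""
    return sum(
        (icc(t, al, bl) - icc(t, ak / A, A * bk + B)) ** 2
        for t in grid
        for ak, bk, al, bl in zip(a_k, b_k, a_l, b_l)
    )

def stocking_lord_loss(A, B, a_k, b_k, a_l, b_l, grid):
    """Sum over grid points of squared test characteristic curve differences."""
    total = 0.0
    for t in grid:
        tcc_l = sum(icc(t, al, bl) for al, bl in zip(a_l, b_l))
        tcc_k = sum(icc(t, ak / A, A * bk + B) for ak, bk in zip(a_k, b_k))
        total += (tcc_l - tcc_k) ** 2
    return total

# Error-free toy data (assumed): scale L is an exact linear transformation of scale K
a_k, b_k = [0.8, 1.0, 1.5], [-1.0, 0.2, 1.1]
A_true, B_true = 1.25, 0.3
a_l = [a / A_true for a in a_k]
b_l = [A_true * b + B_true for b in b_k]
grid = [x / 2 for x in range(-8, 9)]

A_ms, B_ms = mean_sigma(b_k, b_l)
A_mm, B_mm = mean_mean(a_k, b_k, a_l, b_l)
```

With error-free parameters, both moment methods recover the constants exactly and both criteria are zero at the true constants. With estimated parameters, the characteristic curve methods would search for the A and B that minimize the respective criterion.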

#### RESEARCH QUESTIONS

As the purpose of equating procedures in the CCS is to enable an interchangeable score interpretation across test cycles, the selection of the common items is a crucial factor for feasible equating. Up to now, only recommendations for the number of common items that should be used when conducting IRT equating have been made (Kolen and Brennan, 2014). Furthermore, it is suggested that the common items should represent the content and statistical characteristics of the test or rather the complete item pool. For example, modifying the common item selection in such a way that more items with extreme item difficulty levels are included may enhance the precision of these items, but it could threaten the quality of the equating. Therefore, our first two research questions can be formulated as follows:


Fink et al. (2018) used the mean/mean method for scale transformation because of its simple and user-friendly implementation. Given prior research on scale transformation methods, this might not be the best choice when the sample size per test cycle is low. Furthermore, there are several packages for the open-source software R (R Core Team, 2018) available to implement the characteristic curve methods (e.g., Weeks, 2010; Battauz, 2015). As already mentioned above, the scale transformation method used and the IPD detection implemented in the CCS could serve as quality assurance to ensure that significant changes in the parameter estimates of the common items between test cycles do not affect the later FCIP calibration. For this reason, our third research question is:

3. What effect does the scale transformation method used in the CCS have on the quality of the equating?

As the CCS was developed for a context in which separate calibration studies are often not feasible and sample sizes are too low to allow for stable item parameter estimation, it is important to evaluate whether the results for these three research questions were affected by the sample size. Consequently, each of the three research questions was investigated with a special focus on additional variations of the sample size.

### MATERIALS AND METHODS

#### Study Design

Many factors can affect the quality of the equating within the CCS. These include, among others, the number of common items, the test length, the characteristics of the common items, the scale transformation method applied, the number of examinees per test cycle, the presence of IPD and the test applied for IPD. In the present study, some of these factors were kept constant (e.g., number of common items, test length, the presence of IPD, test applied for IPD) to ensure the comprehensibility of the study results.

To answer the research questions stated above, a Monte Carlo simulation based on a full factorial design with three independent variables (IVs) was conducted. With the first IV, difficulty distribution, the distribution of easiness parameters d<sub>i</sub> of the common items (normal, uniform, and bimodal with very low and very high difficulties only) was varied. The second IV, transformation method, compared the most common scale transformation methods (mean/mean, mean/sigma, Haebara, and Stocking-Lord) used for computing the transformation constants to conduct the scale transformation. The third IV, sample size, reflected the number of test takers per test cycle (N = 50; N = 100; N = 300). Because the CCS uses the responses from multiple test cycles, the number of test takers per test cycle chosen for the study is small compared to the recommendations (e.g., a minimum of 500 responses per item for the 2PL model; de Ayala, 2009). The fully crossed design comprised 3 × 4 × 3 = 36 conditions. For each of the conditions, 200 replications were conducted and analyzed with regard to various evaluation criteria (see below).
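The fully crossed design can be enumerated in a few lines. The Python sketch below uses assumed variable names and only reproduces the counts stated in the text.

```python
from itertools import product

# Factor levels as described in the study design
difficulty_distributions = ["normal", "uniform", "bimodal"]
transformation_methods = ["mean/mean", "mean/sigma", "Haebara", "Stocking-Lord"]
sample_sizes = [50, 100, 300]

# Every combination of the three factors is one simulation condition
conditions = list(product(difficulty_distributions, transformation_methods, sample_sizes))

n_replications = 200
total_runs = len(conditions) * n_replications
```

This gives 36 conditions and, with 200 replications each, 7,200 simulated runs of the CCS.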

The simulations were carried out in R (R Core Team, 2018) using the "mirtCAT" package (Chalmers, 2016) for simulating adaptive tests and the "mirt" package (Chalmers, 2012) for item and person parameter estimation. Transformation constants were calculated based on the common items of consecutive test cycles using the "equateIRT" package (Battauz, 2015). The test for IPD was also conducted with the "equateIRT" package. We decided to use the "equateIRT" package in the simulations because it enables a direct import of results from the "mirt" package and offers an implemented test for IPD. The corresponding functions were called in an R script, which was written to carry out the CCS.

## Simulation Procedure

#### Data Generation

In each replication, the discrimination parameters a<sub>i</sub> were drawn from a lognormal distribution, a<sub>i</sub> ∼ logN(0, 0.25), and the easiness parameters d<sub>i</sub> were drawn from a truncated normal distribution, d<sub>i</sub> ∼ N(0, 1.5), d<sub>i</sub> ∈ (−2.5, 2.5). Since this study was not designed to investigate IPD detection rates (e.g., Battauz, 2019), no IPD was simulated in the data. Therefore, the true item parameters a<sub>i</sub> and d<sub>i</sub> remained unchanged over the test cycles.

The ability parameters of the examinees in the first test cycle in each replication were randomly drawn from a standard normal distribution, θ ∼ N(0, 1). For the subsequent test cycles t within a replication, the ability parameters followed a normal distribution, θ ∼ N(µ<sub>t</sub>, σ<sub>t</sub>), whereby the mean µ<sub>t</sub> ∈ {−0.5, 0.0, 0.5} and the standard deviation σ<sub>t</sub> ∈ {0.7, 1.0, 1.3} were randomly drawn. This was done to mimic the fact that examinees of different test cycles usually differ with respect to the mean and variance of their ability distribution. The examinees' responses to the items were generated in line with the 2PL model.
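The data generation described above could be sketched in Python as follows. This is an illustration under assumptions, not the authors' R code: the function names are hypothetical, the second argument of each distribution is treated as a standard deviation (the text does not state whether 0.25 and 1.5 are standard deviations or variances), and the truncation is implemented by rejection sampling.

```python
import math
import random

def draw_item_parameters(n_items, rng, sd_log_a=0.25, sd_d=1.5, bound=2.5):
    """Draw (a, d) pairs: lognormal discriminations, truncated normal easiness."""
    items = []
    for _ in range(n_items):
        a = rng.lognormvariate(0.0, sd_log_a)
        d = rng.gauss(0.0, sd_d)
        while not -bound < d < bound:      # redraw until d lies inside the bounds
            d = rng.gauss(0.0, sd_d)
        items.append((a, d))
    return items

def draw_abilities(n, rng, first_cycle):
    """Standard normal in the first cycle, randomly shifted and scaled afterwards."""
    if first_cycle:
        mu, sigma = 0.0, 1.0
    else:
        mu = rng.choice([-0.5, 0.0, 0.5])
        sigma = rng.choice([0.7, 1.0, 1.3])
    return [rng.gauss(mu, sigma) for _ in range(n)]

def simulate_responses(thetas, items, rng):
    """Generate dichotomous responses under the 2PL model (Eq. 1)."""
    return [
        [int(rng.random() < 1.0 / (1.0 + math.exp(-(a * theta + d)))) for a, d in items]
        for theta in thetas
    ]

rng = random.Random(2019)                  # arbitrary seed for reproducibility
items = draw_item_parameters(60, rng)
thetas = draw_abilities(50, rng, first_cycle=False)
responses = simulate_responses(thetas, items, rng)
```

For simplicity, the sketch lets every simulated examinee answer every item. In the CCS, only the items assembled for the respective test cycle would be administered.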

#### Specification of the CCS

The CCS in the current study was applied with all seven steps proposed by Fink et al. (2018), including the IPD detection of the common items. Although no IPD was simulated in the data, in realistic settings the untested assumption of item parameter invariance is questionable. Even in the absence of IPD, item parameter estimates can differ significantly between test cycles because of sampling error. The number of test cycles within the CCS was set to 10, whereby the first test cycle represented the initial phase and the subsequent test cycles the continuous phase. The test length was kept constant at 60 items. The calibration cluster in the continuous phase consisted of 20 items, resulting in an item pool size of I<sub>t</sub> = 60 + (t − 1) · 20 after test cycle t, and a total item pool size of 240 items after the 10th test cycle. Following the recommendation of Kolen and Brennan (2014) that the number of common items should be at least 20% of the test length, the number of common items in the linking cluster was set to 15 items. Consequently, the adaptive cluster in each test cycle of the continuous phase contained 25 items. Within the adaptive cluster, the maximum a posteriori (MAP; Bock and Aitkin, 1981) was used as the ability estimator and the maximum information criterion (Lord, 1980) was applied for the adaptive item selection.
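The maximum information criterion used within the adaptive cluster can be illustrated with a short Python sketch. The function names and the three-item pool are assumptions for illustration only; the study itself relied on the "mirtCAT" package. For the 2PL model, the Fisher information of item i at ability θ is a<sub>i</sub><sup>2</sup>P(θ)(1 − P(θ)).

```python
import math

def item_information(theta, a, d):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(a * theta + d)))
    return a * a * p * (1.0 - p)

def select_next_item(theta_hat, pool, administered):
    """Pick the not-yet-administered item with maximum information at theta_hat.

    pool maps item ids to (a, d) tuples; administered is a set of item ids.
    """
    candidates = [item_id for item_id in pool if item_id not in administered]
    return max(candidates, key=lambda item_id: item_information(theta_hat, *pool[item_id]))

# Assumed toy pool: equal discrimination, difficulties b = -d / a of -2, 0, and 2
pool = {"easy": (1.0, 2.0), "medium": (1.0, 0.0), "hard": (1.0, -2.0)}

first_pick = select_next_item(0.0, pool, administered=set())
second_pick = select_next_item(1.5, pool, administered={"medium"})
```

With equal discriminations, the item whose difficulty is closest to the current ability estimate is selected, so the sketch picks "medium" for an estimate of 0.0 and "hard" for an estimate of 1.5.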

For the common item selection within the equating procedure, only items that had already been calibrated in the previous test cycles and that did not serve as common items in the preceding test cycle were eligible. The selection procedure for the common items differed depending on the intended distribution. For the normal distribution, the procedure of Fink et al. (2018) was applied. The eligible items were first assigned to five categories (very low, low, medium, high, and very high) based on their easiness parameters d<sub>i</sub>. Then, five items from the "medium" category, three items each from the "low" and "high" categories, and two items from each of the extreme categories were chosen to mimic a normal distribution. For the uniform distribution, the eligible items were assigned to 15 categories based on their easiness parameters d<sub>i</sub> and one item from each category was drawn. The interval limits of the categories were determined as quantiles of the item difficulty distribution. For the bimodal distribution, the eligible items were ordered according to their easiness parameters d<sub>i</sub> and two subsamples were formed containing the 11 easiest and the 11 hardest items, respectively. Then, 15 items in total were randomly drawn from the two subsamples (seven easy and eight difficult items, or vice versa). As already mentioned, the selected common items in periodical assessments should be comparable also with regard to content characteristics. Content balancing approaches like the maximum priority index (Cheng and Chang, 2009) and the shadow testing approach (van der Linden and Reese, 1998) may be used for this purpose. Because no substantial impact was expected on the measurement precision of the item parameters or on the quality of the equating, content balancing was not considered as a factor in the study.
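The three selection procedures can be sketched in Python as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the five categories of the normal condition are formed as equally sized rank groups (the text does not specify how their limits were set), and the pool of eligible items is assumed to be large enough (at least 25 items).

```python
import random

def split_into_groups(sorted_items, n_groups):
    """Split an ordered list into n_groups contiguous groups of near-equal size."""
    n = len(sorted_items)
    bounds = [round(k * n / n_groups) for k in range(n_groups + 1)]
    return [sorted_items[bounds[k]:bounds[k + 1]] for k in range(n_groups)]

def select_common_items(eligible, distribution, rng):
    """Select 15 common items; eligible maps item ids to easiness parameters d."""
    ordered = sorted(eligible, key=eligible.get)   # ascending d, i.e., hardest items first
    if distribution == "normal":
        counts = [2, 3, 5, 3, 2]                   # very low, low, medium, high, very high
        groups = split_into_groups(ordered, 5)
        return [item for g, c in zip(groups, counts) for item in rng.sample(g, c)]
    if distribution == "uniform":
        return [rng.choice(g) for g in split_into_groups(ordered, 15)]
    if distribution == "bimodal":
        hardest, easiest = ordered[:11], ordered[-11:]
        n_easy = rng.choice([7, 8])                # seven easy and eight hard, or vice versa
        return rng.sample(easiest, n_easy) + rng.sample(hardest, 15 - n_easy)
    raise ValueError(f"unknown distribution: {distribution}")

# Assumed toy pool of 45 eligible items with random easiness parameters
rng = random.Random(7)
eligible = {f"item{k:02d}": rng.uniform(-2.5, 2.5) for k in range(45)}
selections = {dist: select_common_items(eligible, dist, rng)
              for dist in ("normal", "uniform", "bimodal")}
```

Each call returns 15 distinct items. In the bimodal condition, all selected items come from the 22 most extreme eligible items.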

For the scale transformation, one of the four transformation methods (mean/mean, mean/sigma, Haebara, and Stocking-Lord) was applied. A modified version of Lord's chi-squared method (Lord, 1980) that is implemented in the "equateIRT" package (Battauz, 2015) was used as the test for IPD with a type I error level of 0.05. In an iterative purification process (Candell and Drasgow, 1988) of scale transformation and testing for IPD, items that showed significant IPD were removed from the set of common items. In each test cycle, MML estimation was used to obtain the item parameters for both the temporary item parameter estimation and the FCIP calibration. The lower and the upper bound for the item discrimination a<sub>i</sub> were set to –1 and 5, respectively. For the item easiness parameters d<sub>i</sub>, the bounds were set to –5 and 5.

### Evaluation Criteria

The mean squared error (MSE) of the item parameters a<sup>i</sup> and d<sup>i</sup> , respectively, was calculated after each test cycle t as the averaged squared difference between the item parameter estimates and the true item parameters for all items I<sup>t</sup> across all replications R = 200. Thus, a high degree of precision is denoted by low values for the MSE.

$$MSE_{t}\left(a_{i}\right) = \frac{1}{R \cdot I_{t}} \sum_{r=1}^{R} \sum_{i=1}^{I_{t}} \left(\hat{a}_{ir} - a_{ir}\right)^2 \tag{5}$$

$$MSE_{t}\left(d_{i}\right) = \frac{1}{R \cdot I_{t}} \sum_{r=1}^{R} \sum_{i=1}^{I_{t}} \left(\hat{d}_{ir} - d_{ir}\right)^2 \tag{6}$$

Because our aim was to evaluate whether the modified common item selection could prevent a dysfunction of the CCS in terms of more precise item parameter estimates for items with very low and very high values for d<sub>i</sub>, the conditional MSE was used as a criterion. Therefore, the MSE was calculated for seven easiness intervals: d<sub>i</sub> ∈ (−Inf, −2], d<sub>i</sub> ∈ (−2, −1], d<sub>i</sub> ∈ (−1, −0.25], d<sub>i</sub> ∈ (−0.25, 0.25], d<sub>i</sub> ∈ (0.25, 1], d<sub>i</sub> ∈ (1, 2], and d<sub>i</sub> ∈ (2, Inf).
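A minimal Python sketch of the conditional MSE is given below. It assumes that items are assigned to the seven intervals by their true easiness parameter (the text does not state whether true or estimated values were used) and that the parameters are stored as one list per replication; the function names are hypothetical.

```python
import math

# Upper limits of the seven right-closed easiness intervals; the last
# interval (2, Inf) is open, which math.inf reproduces for finite values.
UPPER_LIMITS = [-2.0, -1.0, -0.25, 0.25, 1.0, 2.0, math.inf]


def interval_index(d):
    """Return the index (0 to 6) of the easiness interval that contains d."""
    for k, upper in enumerate(UPPER_LIMITS):
        if d <= upper:
            return k
    raise ValueError("d must be a number")


def conditional_mse(true_d, est_d):
    """MSE per easiness interval, pooled over replications and items.

    true_d and est_d are lists (replications) of lists (items).
    Intervals without any item yield None.
    """
    sums = [0.0] * len(UPPER_LIMITS)
    counts = [0] * len(UPPER_LIMITS)
    for true_rep, est_rep in zip(true_d, est_d):
        for d, d_hat in zip(true_rep, est_rep):
            k = interval_index(d)
            sums[k] += (d_hat - d) ** 2
            counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Dropping the binning step and dividing the pooled sum by the total number of item-by-replication pairs gives the unconditional MSE of Equations (5) and (6).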

Three criteria were used to evaluate the equating quality. As a first criterion, we used the proportion of test cycles in which no breakdown of the common items occurred. Second, we calculated the proportion of drifted items for each of the 36 conditions. Third, we computed the accuracy (Error) of the scale transformation constants A and B for each replication r in which no breakdown occurred, as the difference between the estimated and the true transformation constants for every test cycle in the continuous phase. The average of the Error corresponds to the Bias of the transformation constants.

$$Error\left(A_{tr}\right) = \hat{A}_{tr} - A_{tr} \tag{7}$$

$$Error\left(B_{tr}\right) = \hat{B}_{tr} - B_{tr} \tag{8}$$

The true transformation constants A and B were calculated based on the true abilities of the examinees in all previous test cycles p and in the current test cycle t (Kolen and Brennan, 2014).

$$A_{t} = \frac{\sigma\left(\theta_{t}\right)}{\sigma\left(\theta_{p}\right)} \tag{9}$$

$$B_{t} = \mu\left(\theta_{t}\right) - A_{t}\,\mu\left(\theta_{p}\right) \tag{10}$$

The estimated transformation constants Â<sub>t</sub> and B̂<sub>t</sub> were obtained based on the parameter estimates of the final set of common items from the previous and the current test cycles using one of the four scale transformation methods implemented in the "equateIRT" package (Battauz, 2015). The third criterion was calculated only for the cases where at least two common items remained after the IPD detection.
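Equations (7) to (10) translate into a few lines of Python. The sketch below assumes that the population standard deviation is used for σ (the text does not specify this) and that the estimated constants are supplied from elsewhere; the function names are hypothetical.

```python
from statistics import mean, pstdev


def true_constants(theta_t, theta_p):
    """True constants from current-cycle (t) and previous-cycle (p) abilities."""
    a_const = pstdev(theta_t) / pstdev(theta_p)        # Equation (9)
    b_const = mean(theta_t) - a_const * mean(theta_p)  # Equation (10)
    return a_const, b_const


def errors_and_bias(estimates, truths):
    """Per-replication Error (estimate minus truth) and its average, the Bias."""
    errors = [est - true for est, true in zip(estimates, truths)]
    return errors, mean(errors)
```

For example, abilities of −1, 0, and 1 in the previous cycles and 0, 2, and 4 in the current cycle give A = 2 and B = 2.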

### RESULTS

Note that the conditions with the mean/mean method as the scale transformation method and normally distributed common items mimic the setup of the equating procedure from Fink et al. (2018).

#### Conditional Precision of Item Parameters

To answer the first research question regarding the precision of the item parameter estimates, we analyzed the conditional MSE of the item discrimination parameters a<sub>i</sub> and the item easiness parameters d<sub>i</sub> depending on the scale transformation method, the common item difficulty distribution, and the sample sizes per test cycle. For the sake of clarity, the results are only presented for the second, the sixth, and the 10th test cycles of the CCS. **Figures 1**–**3** illustrate the conditional MSE of the item discrimination parameter estimates a<sub>i</sub>, and **Figures 4**–**6** illustrate the conditional MSE of the item easiness parameter d<sub>i</sub>. As can be expected based on the findings from Fink et al. (2018), the MSE for the item discrimination parameter estimates and the item easiness parameter estimates decreased as the number of test cycles in the CCS increased and as the sample size per test cycle increased. With regard to the precision of the item parameter estimates, no substantial differences were found between the different scale transformation methods, independent of the common item difficulty distribution and the sample size per test cycle. When a bimodal difficulty distribution of common items was chosen, the precision of the item parameter estimates for the very easy and very difficult items was higher compared to a normal or uniform difficulty distribution of common items (**Figures 1**, **4**). However, this minimal gain came at the expense of a lower precision of the item parameter estimates for items with medium difficulty. This effect was found for very small sample sizes per test cycle (N = 50), and diminished for larger sample sizes (N = 100, N = 300).

### Quality of Equating

The second and third research questions focused on the equating procedure. The first evaluation criterion was the proportion of feasible equatings (at least two items remained after the IPD detection). Most strikingly, no breakdown of the common items occurred in any test cycle across all replications. Furthermore, for all 36 conditions, the median number of eligible common items over all test cycles and replications ranged from 14 to 15.

The second evaluation criterion was the proportion of drifted items. As IPD was not simulated in the study and because the type I error level of the test for IPD was set to 0.05, it was expected that approximately five percent of the common items would show significant IPD. **Figure 7** shows the proportion of drifted common items depending on the common item difficulty distribution, the scale transformation method, and the sample size per test cycle. It is evident from this figure that, independent of the scale transformation method and the common item difficulty distribution, the type I error rates increased with increasing sample size per test cycle. This effect was stronger for the moment methods. Furthermore, it became apparent that if the difficulty distribution of the common items was uniform or normal, the type I error rates of the scale transformation methods did not differ considerably from the nominal level of 0.05. The only exception to this result was the mean/sigma method, which generally led to considerably smaller type I error rates when the sample size was small (N = 50). All in all, using the Stocking-Lord method resulted in type I error rates that did not differ considerably from the type I error level of 0.05 in all conditions.
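The expectation of roughly five percent flagged items can be made concrete with a back-of-the-envelope calculation. The Python sketch below is not part of the study: it treats the IPD tests of the 15 common items as independent single-pass tests at exactly the nominal level, which ignores the iterative purification, and the function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k flagged items among n independent tests."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def expected_flagged(n_items, alpha):
    """Expected number of falsely flagged common items per test cycle."""
    return n_items * alpha


def prob_breakdown(n_items, alpha, min_remaining=2):
    """Probability that fewer than min_remaining items remain unflagged."""
    first_k = n_items - min_remaining + 1
    return sum(binom_pmf(k, n_items, alpha) for k in range(first_k, n_items + 1))
```

Under these assumptions, 15 common items and a level of 0.05 imply 0.75 falsely flagged items per test cycle on average, and a breakdown by chance alone is practically impossible, which agrees with the absence of breakdowns reported above.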

The third evaluation criterion was the accuracy of the transformation constants A and B when no breakdown occurred. **Figures 8**, **9** show violin plots for the Error of the transformation constants A and B depending on the common item difficulty distribution, the scale transformation method, and the sample size per test cycle. A violin plot displays the frequency distribution of a numeric variable (here, the Error). Note that the average error (= Bias; represented by the dot in the violin) for both transformation constants A and B did not differ substantially from zero for all scale transformation methods, independent of the common item difficulty distribution and the sample size per test cycle. However, the variation of the error (represented by the height of the violin) differed between the scale transformation methods, and rather high levels of error occurred especially for the moment methods. The characteristic curve methods showed the lowest variation in error. With increasing sample size per test cycle, the variation of the error decreased, but there were still extreme levels of error for the mean/mean and the mean/sigma method.

In summary and in terms of the three research questions, the study provided the following results:

1. The difficulty distribution of the common items in the CCS did not have a substantial impact on the precision of the item parameter estimates, although small differences existed between the common item distributions; these differences were in opposite directions for extreme and medium-range item easiness parameters d<sub>i</sub> when the sample size was very small.



### DISCUSSION

The objective of the present study was to evaluate different setups of the equating procedure implemented in the CCS and to provide recommendations on how to apply these setups. For this purpose, the quality of the item parameter estimates and of the equating was examined in a Monte Carlo simulation for different common item difficulty distributions, different scale transformation methods, and different sample sizes per test cycle.

The following recommendations can be made based on the results obtained: First, no clear advantage of using any of the three common item difficulty distributions was identified. Regarding the precision of the item parameter estimates, the results show a slight increase in the precision of the item parameter estimates for items with extreme difficulties when using a bimodal common item difficulty distribution compared to a normal or uniform distribution. However, the precision of the item parameter estimates for items with medium difficulty decreased. These effects were only found for very small sample sizes per test cycle (N = 50) and no differences were found for larger sample sizes (N = 100, N = 300). Furthermore, the use of different scale transformation methods did not have a substantial effect on the precision of the item parameter estimates.

Note that exposure control methods (e.g., Sympson and Hetter, 1985; Revuelta and Ponsoda, 1998; Stocking and Lewis, 1998) might be an alternative to increase the number of responses to items with extreme difficulty levels and, in consequence, the precision of the item parameter estimates for these items. However, using these methods would sacrifice adaptivity to a certain degree and, thus, the efficiency of the computerized adaptive test (e.g., Revuelta and Ponsoda, 1998). This is even more relevant to tests assembled within the partly adaptive CCS, because only one of the three cluster types used is based on an adaptive item selection. Furthermore, in the early stages of the CCS, the item pool is rather small, which also limits the adaptivity of the tests. For these reasons, it can be expected that exposure control methods do not offer an ideal option for the CCS to increase the precision of item parameter estimates for items with extreme difficulties. This point might be examined by future research.

Second, with respect to the quality of the equating, no difference was found for the scale transformation methods with regard to the proportion of feasible equatings, independent of the common item difficulty distribution used and the sample size available per test cycle. The rule for evaluating an equating as feasible (at least two common items remained after the test for IPD) is worthy of discussion for two reasons: first, with a small number of remaining common items, the equating procedure is more prone to sampling error (Wingersky and Lord, 1984), and second, it is rather unlikely that the content of the item pool is adequately reflected by the remaining common items. However, even if the criterion for evaluating an equating as feasible had been set to ten remaining common items, the proportion of feasible equatings would have been at least 99% in all conditions. With regard to the type I error rate and the error of the transformation constants, the characteristic curve methods outperformed the moment methods, especially for small sample sizes. This is in line with the result of Ogasawara (2002), who found that the characteristic curve methods are less affected by imprecise item parameter estimates and lead to more accurate transformations than moment methods. Among the characteristic curve methods, the Stocking-Lord method was slightly better than the Haebara method in almost all conditions. Thus, although our results do not facilitate a clear recommendation regarding the most favorable common item difficulty distribution, they do enable a clear recommendation in terms of the preferred scale transformation method: The Stocking-Lord method should be used as the scale transformation method within the CCS.

### AUTHOR CONTRIBUTIONS

SB conceived the study, conducted the statistical analyses, drafted the manuscript, and approved the submitted version. AFi made substantial contributions to the conception of the study, contributed to the programming needed for the simulation study (R), reviewed the manuscript critically for important intellectual content, and approved the submitted version. CS made substantial contributions to the interpretation of the study results, reviewed the manuscript critically for important intellectual content, and approved the submitted version. AFr provided advice in the planning phase of the study, reviewed the manuscript critically for important intellectual content, and approved the submitted version.

### FUNDING

The research reported in the article was supported by a grant from the German Federal Ministry of Education and Research (Ref: 16DHL1005).



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling Editor declared a shared affiliation, though no other collaboration, with one of the authors AFr at the time of review.

Copyright © 2019 Born, Fink, Spoden and Frey. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Collaborative Problem Solving: Processing Actions, Time, and Performance

Paul De Boeck<sup>1,2</sup>\* and Kathleen Scalise<sup>3</sup>\*

<sup>1</sup> Department of Psychology, The Ohio State University, Columbus, OH, United States, <sup>2</sup> Department of Psychology, KU Leuven, Leuven, Belgium, <sup>3</sup> Department of Educational Methodology, Policy, and Leadership, University of Oregon, Eugene, OR, United States

#### Edited by:

Ronny Scherer, University of Oslo, Norway

#### Reviewed by:

Sharlene D. Newman, Indiana University Bloomington, United States Maria Bolsinova, University of Amsterdam, Netherlands

#### \*Correspondence:

Paul De Boeck deboeck.2@osu.edu Kathleen Scalise kscalise@uoregon.edu

#### Specialty section:

This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology

Received: 18 December 2018 Accepted: 15 May 2019 Published: 07 June 2019

#### Citation:

De Boeck P and Scalise K (2019) Collaborative Problem Solving: Processing Actions, Time, and Performance. Front. Psychol. 10:1280. doi: 10.3389/fpsyg.2019.01280

This study is based on one collaborative problem solving task from an international assessment: the Xandar task. It was developed and delivered by the Organization for Economic Co-operation and Development Program for International Student Assessment (OECD PISA) 2015. We have investigated the relationship of problem solving performance with invested time and number of actions in collaborative episodes for the four parts of the Xandar task. The parts require the respondent to collaboratively plan a process for problem solving, implement the process, reach a solution, and evaluate the solution. (For a full description, see the Materials and Methods section, "Parts of the Xandar Task.") Examples of an action include posting to a chat log, accessing a shared resource, or conducting a search on a map tool. Actions taken in each part of the task were identified by PISA and recorded in the data set numerically. A confirmatory factor analysis (CFA) model looks at two types of relationship: at the level of latent variables (the factors) and at extra dependencies, which here are direct effects and correlated residuals (independent of the factors). The model, which is well-fitting, has three latent variables: actions (A), times (T), and level of performance (P). Evidence for the uni-dimensionality of performance level is also found in a separate analysis of the binary items. On the whole for the entire task, participants with more activities are less successful and faster, based on the United States data set employed in the analysis. By contrast, successful participants take more time. By task part, the model also investigates relationships between activities, time, and performance level within the parts. This was done because one can expect dependencies within parts of such a complex task. Results indicate some general and some specific relationships within the parts; see the full manuscript for more detail. We conclude with a discussion of what the investigated relationships may reveal. We also describe why such investigations may be important to consider when preparing students for improved skills in collaborative problem solving, considered a key aspect of successful 21st century skills in the workplace and in everyday life in many countries.

Keywords: problem solving, strategy, factor model, measurement, collaboration

### INTRODUCTION


The construct explored here, collaborative problem solving (CPS), was first introduced to the Program for International Student Assessment (PISA) in 2015. Attempts to explore process data collected in complex activities such as CPS are emerging rapidly in education. Yet which models might best fit process data and the analytic techniques to employ to investigate patterns in the data are not well understood at this time. So here we investigate whether relationships seen in the actions taken by PISA respondents, as coded by PISA, might shed light on approaches for modeling complex CPS tasks.

In the CPS task released by PISA, the Xandar task, there are four parts. The parts of the task require the respondent to collaborate to plan a process for problem solving, implement the process, reach a solution, and evaluate the solution. (For a full description of these parts, see the Materials and Methods section, "Parts of the Xandar Task.") Examples of actions in Part 1, for instance, include posting to a chat log, accessing a shared resource, or conducting a search on a shared map tool.

In each of the parts, process data are available on time spent and number of actions, as well as on the performance on specific items within the four parts. We explore modeling these Xandar data to address three research questions:

RQ1. Does a factor model employing process data (actions and time) support evidence for a latent variable differentiation between the types of process data (actions, time) and between the latter two and quality of performance? The expected latent variables are Actions, Time, and Performance.

RQ2. Do extra dependencies at the level of the observed variables improve model fit, including direct effects and correlated residuals (independent of the factors)? If they do, they reveal direct relationships between process aspects and performance, independent of the latent variables. These direct relationships are indications of the dynamics underlying collaborative problem solving, whereas the latent variables and their correlations inform us about global individual differences in process approaches and performance.

RQ3. Can the performance also be considered as unidimensional at the specific level of the individual items (from all four Xandar parts)?

In this Xandar investigation, each factor (latent variable) is composed of four corresponding measures from the four Xandar parts. Data are fit with a latent variable model to answer RQ1. Dependencies within parts can be expected between the three measures. So we address the extra dependencies in RQ2. The dependencies are not only considered for methodological reasons when variables stem from the same part, but they may also reveal how subjects work on the tasks. Finally, because a good-fitting factor model would imply uni-dimensionality of the performance sum scores from the four parts, we also explore uni-dimensionality at the level of the individual items in RQ3.

Sections in this paper first discuss the PISA efforts to explore problem solving in 2012 and 2015 assessments, then offer a brief summary of the literature on CPS. Next in the Materials and Methods section, we discuss the PISA 2015 collaborative complex problem solving released task, "Xandar," including the availability of the released code dictionary and data set. In the Results and Discussion, we model United States data from the Xandar task and report results to address the three research questions.

### PISA AND A BRIEF SUMMARY OF LITERATURE ON CPS

The PISA 2015 CPS construct, which included measuring groups in collaboration, was built on PISA's 2012 conception of individual problem solving (OECD, 2014). In PISA 2012, some student individual characteristics related to individual problem solving were measured. These measures were openness to learning, perseverance, and problem solving strategies.

For the 2015 PISA collaborative framework (OECD, 2013), the construct of problem solving was extended from 2012 in order to include measures of group collaboration. For this new assessment in 2015, it was recognized that the ability of an individual to be successful in many modern situations involves participating in a group. Collaboration was intended to include such challenges as communicating within the group, managing conflict, organizing a group, and building consensus, as well as managing progress on a successful solution.

The PISA framework described the importance of improving collaboration skills for students (Rummel and Spada, 2005; Vogel et al., 2016). The measurement of collaboration skills was at the heart of problem solving competencies in the PISA CPS 2015 framework. The framework specified first that the competency being described remained the capacity of an individual, not the group. Secondly, the respondent must effectively engage in a process whereby two or more agents attempt to solve a problem, where the agents can be people or simulations. Finally, the collaborators had to show efficacy by sharing the understanding and effort required to come to a solution, such as pooling knowledge to reach solutions.

Approaches to gathering assessment evidence cited by the PISA CPS framework (OECD, 2013) ranged from allowing actions during collaboration to evaluating the results from collaboration. Measures of collaboration in the research literature include solution success, as well as processes during the collaboration (Avouris et al., 2003). In situ observables for such assessments could include analyses of log files in which the computer keeps a record of student activities, sets of intermediate results, and paths taken along the way (Adejumo et al., 2008). Group interactions also offer relevant information (O'Neil et al., 1997), including quality and type of communication (Cooke et al., 2003; Foltz and Martin, 2008; Graesser et al., 2008) and judgments (McDaniel et al., 2001).

The international Assessment and Teaching of 21st Century Skills (ATC21S) project also examined the literature on disposition to collaboration and to problem solving in online environments. ATC21S described how interface design feature issues and the evaluation of CPS processes interact in the online collaboration setting (Scalise and Binkley, 2009; Binkley et al., 2010, 2012).

In the PISA 2015 CPS assessment, a student's collaborative problem-solving ability is assessed in scenarios where the student must solve a problem. For collaboration, the problem is solved by working with "agents," or computer avatars that simulate collaboration. The CPS framework describes that a problem need not be a subject-matter specific task. Rather, it could also be a partial task in an everyday problem. Examples of subject-matter specific problem solving include setting up a sustainable fish farm in science, planning the construction of a bridge using engineering and mathematics, or writing a persuasive letter using language arts and literacy. Examples of an "everyday" problem include communicating with others to delegate roles during collaboration for event planning, monitoring to ensure a group remains on task, and evaluating whether collaboration is complete. All these actions can be directed toward the ultimate goal.

In the PISA 2015 perspective, assessment is continuous throughout the unit and can incorporate students' interactions with the digital agents. Each student response on a traditional question follows a stream of actions during which the student has chosen how to interact and collaborate with standardized agents in each particular task situation. Very few of the collaborative actions and tasks are released by PISA, but the number of collaborative actions in each part of the task is released and made available in the PISA data sets. So here we accept that PISA has coded the action as taking place, and analyze the numeric results provided.

## MATERIALS AND METHODS

### Parts of the Xandar Task

Here we analyze numeric data provided for the PISA 2015 Xandar unit (OECD, 2017a,b). In the unit Xandar:

"A three-person team consisting of the student test-taker and two computer agents takes part in a contest where [the team] must answer questions about the fictional country of Xandar. The questions [involve] Xandar's geography, people and economy. This unit involves decision-making and coordination tasks, requires consensus-building collaboration, and has an inschool, private, and non-technology-based context."

Xandar is a fictional planet appearing in comic books published by Marvel Comics. In the PISA Xandar task, it is treated as a mythical location to be investigated collaboratively. The Xandar task has four parts:

• Part 1 – Agreeing on a Strategy. This part of the Xandar activity familiarizes the student with how the contest will proceed, the chat interface and the task space including buttons that students can click to take actions in particular situations and a scorecard that monitors team progress. In Part 1, the student is assigned to work in a team with digital agents named Alice and Zach. A variety of actions are available. The respondent and the agents interact to generate a stream of actions. The respondent is expected to follow the rules of engagement provided for the contest and to effectively establish collaborative and problem-solving strategies that were the goal of Part 1.


Each of the four parts comes with a number of items to score the performance. The complete Xandar released task is presented in an OECD PISA report that illustrates the items that students faced in the 2015 PISA collaborative problem-solving assessment (OECD, 2016). The released code dictionary and data are also available on the 2015 PISA website. We do not repeat the Xandar information here (due in part to copyright), but summarize only. The Xandar released unit presents:


### Sample

As described earlier, this study employed data publicly released from the Organization for Economic Co-operation and Development Program for International Student Assessment (OECD PISA) for the optional collaborative problem solving (CPS) assessment. It was administered in 2015 to nationally representative samples of students of approximately age 15. Since PISA is designed to have systematically missing data in a matrix sample, only students who took the Xandar task were included. Students were sampled according to the PISA sample frame. Data analyzed here are from representatively sampled United States participants who took the Xandar released task. See **Table 1** for descriptives by age, gender and race/ethnicity of the United States Xandar task sample used.

From the 994 students who took the Xandar task, 986 have complete Xandar data. The descriptive statistics and all analyses are based on N = 986. (Note that limitations to be discussed later in this manuscript include only United States data examined to date in this exploration. Extensions to more countries and comparisons across countries are an exciting and interesting potential to the work. However, the international extensions are out of scope for this article.) For the purposes of the current study, the school variable was not employed. All students were treated as one group.

Regarding ethical approval and consent for human subjects data collection in PISA, OECD gains ethical approval and consent through PISA processes. Processes are established in coordination with each country for public release of some deidentified data collected in PISA main study assessments. Data sets made available for release are intended for purposes of secondary research. The CPS data set used here is available through the OECD data repository website<sup>1</sup> .

As discussed earlier, for the Xandar task, released data are available for actions, time and level of performance. The data for the current study included four indicators each of CPS actions taken (parts 1–4), time taken (parts 1–4), and success scores (parts 1–4). These become the three latent traits, or factors, in this study. To measure CPS actions, we used number of collaboration actions as measured by the data provided in the log transformation of C1A, C2A, C3A, and C4A. "C" indicates this was a collaborative assessment, the numeral indicates the Xandar part, and "A" indicates number of actions taken. To measure timing, we used timing as measured by data provided in the log transformation of C1T, C2T, C3T, and C4T. "C" indicates this was a collaborative assessment, the numeral indicates the Xandar part, and "T" indicates time taken. To measure student success, we used the sum of the binary item response success scores for each of the four parts, C1P, C2P, C3P, and C4P (based on 5, 3, 2, and 2 items within the Xandar parts).

<sup>1</sup>www.oecd.org/pisa/data/

TABLE 1 | Descriptives for collaborative problem solving Xandar assessment for the United States sample.

Exploratory data analysis following log transformation as described above for some variables revealed only minor deviations from normality. A skewness range of −2 to 2 was used as the criterion for all observed variables (Cohen et al., 2002). Note, however, that this is not a strongly conservative range, as discussed in the limitations. We therefore also report that skewness was approximately in the range −1 to 1 for all observed variables except C1A (1.52) and C2A (1.48). Because there were no major deviations, the analysis proceeded without further transformation of the observed variables. Other descriptives for all observed variables are provided in **Table 2**.
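The screening step can be illustrated with a short Python sketch. It assumes the natural logarithm and the moment-based (population) skewness coefficient; the study does not state which log base or skewness estimator was used, and the function names are hypothetical. Values of zero would need separate handling before the log transformation.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_skewness(raw):
    """Skewness of the natural-log transformed values (all must be positive)."""
    return skewness([math.log(v) for v in raw])


def within_range(value, limit=2.0):
    """Check a skewness value against the symmetric criterion (-limit, limit)."""
    return -limit < value < limit
```

With this criterion, the reported values for C1A (1.52) and C2A (1.48) pass the range of −2 to 2 but would fail the stricter range of −1 to 1.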

We fit the model using lavaan (Rosseel, 2012) in R version 3.5.1 (R Core Team, 2018). We used the weighted least squares "WLSMV" option which employs the diagonally weighted least squares (DWLS) estimator with robust standard errors and a mean and variance adjusted test statistic. We have estimated a confirmatory factor analysis (CFA) model with three factors (each with standardized latent variables). The factors are Actions, Time, and Performance. Each one has the four corresponding measures from the four Xandar parts.

Because dependencies within parts can be expected between the three measures, some parameters were added to the model. They are direct within-part effects of actions on time (more actions implies more time), direct within-part effects of performance on time (better performance may take more time), and correlated residuals for actions and performance within each part (exploring the relationship between actions and performance level).

Direct effects and residual correlations are two different types of dependencies. Direct effects are effects of one variable on another (e.g., of Y<sub>1</sub> on Y<sub>2</sub>). The two directions, Y<sub>1</sub> → Y<sub>2</sub> and Y<sub>2</sub> → Y<sub>1</sub>, are not mathematically equivalent. Correlated residuals are equivalent to the effect of the residual of one variable on the other variable (e.g., of ε<sub>Y1</sub> on Y<sub>2</sub>). Here the two directions are mathematically equivalent, and both are equivalent to the covariance of the residuals. To be clear, neither type of dependency proves a causal relation. A causal hypothesis can be the basis for hypothesizing a direct effect, whereas correlated residuals can be used for exploratory purposes, without specifying a direction. For the present study, we hypothesized that more actions take more time and that a higher level of performance requires more time. For number of actions and level of performance, we explore the dependency with correlated residuals.
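The distinction between the two types of dependency can be written out for two observed variables from the same part. The notation below (loadings $\lambda$, factors $\eta$, direct effect $\beta_{21}$, residual covariance $\psi_{12}$) is introduced here for illustration only and is not taken from the article.

```latex
% Direct effect of Y_1 on Y_2 (directional):
\begin{aligned}
Y_1 &= \lambda_1 \eta_1 + \varepsilon_{Y_1}, \\
Y_2 &= \lambda_2 \eta_2 + \beta_{21} Y_1 + \varepsilon_{Y_2}.
\end{aligned}

% Correlated residuals (no direction):
\begin{aligned}
Y_1 &= \lambda_1 \eta_1 + \varepsilon_{Y_1}, \\
Y_2 &= \lambda_2 \eta_2 + \varepsilon_{Y_2}, \\
\operatorname{Cov}(\varepsilon_{Y_1}, \varepsilon_{Y_2}) &= \psi_{12} = \psi_{21}.
\end{aligned}
```

Reversing the direct effect would move a term $\beta_{12} Y_2$ into the equation for $Y_1$, which is a different model. The residual covariance, by contrast, is symmetric by definition, so no direction needs to be chosen.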

See the row heads of **Tables 3**, **4** and **Figure 1** for a definition of the model estimated. It includes the latent variable structure as well as the dependencies. The model can also be derived from the R code for the analysis, which is available in the **Supplementary Material**.
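As a rough illustration of the model structure just described, the sketch below assembles a lavaan-style model string. The authors' actual R code is in the Supplementary Material; the indicator naming scheme used here (`C1A`, `C1T`, `C1P` for actions, time, and performance in Part 1) is an assumption extrapolated from the variable names C1A and C2A mentioned in the text, and the helper functions are hypothetical.

```python
PARTS = [1, 2, 3, 4]

def indicator(part, measure):
    """Hypothetical variable name, e.g. C1A = actions in Part 1."""
    return f"C{part}{measure}"

def build_model_syntax(parts=PARTS):
    lines = []
    # Measurement model: three factors, one indicator per Xandar part.
    for factor, measure in [("Actions", "A"), ("Time", "T"), ("Performance", "P")]:
        rhs = " + ".join(indicator(p, measure) for p in parts)
        lines.append(f"{factor} =~ {rhs}")
    # Direct within-part effects on time: actions -> time, performance -> time.
    for p in parts:
        lines.append(f"{indicator(p, 'T')} ~ {indicator(p, 'A')} + {indicator(p, 'P')}")
    # Correlated residuals between actions and performance within each part.
    for p in parts:
        lines.append(f"{indicator(p, 'A')} ~~ {indicator(p, 'P')}")
    return "\n".join(lines)

model_syntax = build_model_syntax()
print(model_syntax)
```

The resulting string contains three measurement lines, eight direct effects (two per part), and four residual correlations, matching the counts given in the Results. In R, a string of this kind could be passed to lavaan's `sem()` with options such as `estimator = "WLSMV"` and `std.lv = TRUE`; whether the original analysis used exactly this call is not stated in the text.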

### RESULTS

TABLE 4 | Extra dependencies in CFA model for Xandar measures.

→ indicates direct effects and ↔ indicates correlated residuals.

In this section we describe the results of the modeling. With the dependencies as described in the Methods section added to the model, the model fit was good (close), with a TLI of 0.95 and RMSEA of 0.038 (90% CI 0.029 to 0.048). Without the dependencies (without the eight direct effects and four residual correlations), the model fit is clearly worse, with a TLI of 0.574 and RMSEA of 0.112 (90% CI 0.104 to 0.119). These results address RQ1 and RQ2.
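For reference, commonly used definitions of the two fit indices are sketched below. The symbols are introduced here and do not appear in the article: $\chi^2_M$ and $df_M$ refer to the fitted model, $\chi^2_B$ and $df_B$ to the baseline (independence) model, and $N$ to the sample size. Software using WLSMV typically reports scaled variants of these statistics, so the values in the text may follow a slightly different formula.

```latex
\mathrm{TLI} = \frac{\chi^2_B / df_B - \chi^2_M / df_M}{\chi^2_B / df_B - 1},
\qquad
\mathrm{RMSEA} = \sqrt{\frac{\max\left(\chi^2_M - df_M,\, 0\right)}{df_M \,(N - 1)}}.
```

A higher TLI and a lower RMSEA both indicate better fit, which is the direction of the difference between the models with and without the extra dependencies.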

The correlations between the latent variables are −0.473, p < 0.001 (Actions and Time), −0.732, p < 0.001 (Actions and Performance), and 0.190, p < 0.01 (Time and Performance). The loadings and dependencies are shown in **Tables 3**, **4**, respectively. As expected, the indicators of actions, time, and performance all showed significant positive factor loadings on the corresponding factors (see **Table 3**). The standardized coefficients in the last column indicate that the loadings of the Part 4 indicators are lower than those of the other three parts: 0.19 (Actions), 0.43 (Time), and 0.38 (Performance).

**Table 4** shows the estimates of the dependencies:


Although the factor model with these dependencies fits well, we wanted to check whether the performance is also uni-dimensional at the level of the individual items (RQ3). Uni-dimensionality of the four sum scores, as implied by the factor model, does not imply uni-dimensionality at the level of the 12 individual binary items. This is especially so because the items represent four processes (exploring and understanding, representing and formulating, planning and executing, and monitoring and reflecting) and three competencies (establishing and maintaining shared understanding, taking appropriate action to solve the problem, and establishing and maintaining team organization), but not with a perfectly crossed design.

The answer to the dimensionality question based on the analysis with this data set is that the 12 items can be considered uni-dimensional based on the empirical data, although they are designed to tap a diversity of processes and competencies. The uni-dimensional model fit was good (close), with a TLI of 0.94 and RMSEA of 0.037 (90% CI 0.029, 0.046). The uni-dimensional model is the result of an ordinal confirmatory factor model for the binary items using WLSMV and the same lavaan version as for the earlier analysis. Under the delta parameterization, the loadings vary between 0.272 and 0.776, and they are all significant (p < 0.001).
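To give a sense of what the reported loading range means, the following sketch computes the share of latent-response variance explained by the factor. It rests on two assumptions that are not confirmed by the text: that the delta parameterization fixes the variance of each latent response variable to 1, and that the factor has unit variance. The function names are hypothetical.

```python
def explained_share(loading, factor_variance=1.0):
    """Share of the latent response variance explained by the factor,
    assuming the total latent response variance is fixed to 1."""
    return loading ** 2 * factor_variance

def residual_variance_delta(loading, factor_variance=1.0):
    """Implied residual variance of the latent response under the same assumption."""
    return 1.0 - explained_share(loading, factor_variance)

lowest, highest = 0.272, 0.776  # range of loadings reported in the text

for lam in (lowest, highest):
    print(lam, round(explained_share(lam), 3), round(residual_variance_delta(lam), 3))
```

Under these assumptions the weakest item shares roughly 7% of its latent-response variance with the common factor and the strongest roughly 60%, so a single factor is supported even though the items differ considerably in how strongly they reflect it.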

dependency values are omitted to avoid clutter in the figure. The correlations between the latent variables can be found in the text, the factor loadings are presented in Table 3, and the dependency values in Table 4.

### DISCUSSION

For the model with the loadings and dependencies shown in **Tables 3**, **4**, the latent variable correlations of Actions with Time and with Performance are negative. Hence, participants showing more activities are faster and perform less well in their collaborative problem solving. This is based on the United States dataset with the Xandar task. Successful participants take more time, perhaps as a consequence of the previous two relationships. Multiplying the two negative correlations yields −0.473 × −0.732 = 0.346, which is higher than the 0.190 estimate of the correlation between Time and Performance. This explains why, in an alternative but formally equivalent model with an effect of Actions on Time and on Performance, the correlation between the residuals of the latent variables Time and Performance is negative. However, the correlation in question, −0.260, is not significant (p > 0.05).
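The arithmetic behind this argument can be checked with a short sketch that uses the three latent correlations reported in the Results (−0.473, −0.732, and 0.190). It treats the residual correlation in the alternative model as a partial correlation of Time and Performance given Actions, which is an interpretation adopted here for illustration; the function name is hypothetical.

```python
import math

def residual_correlation(r_xy, r_xz, r_yz):
    """Correlation of Y and Z after removing the linear effect of X
    from both (the partial correlation of Y and Z given X)."""
    return (r_yz - r_xy * r_xz) / math.sqrt((1 - r_xy ** 2) * (1 - r_xz ** 2))

r_actions_time = -0.473
r_actions_perf = -0.732
r_time_perf = 0.190

# Correlation between Time and Performance implied by Actions alone.
implied = r_actions_time * r_actions_perf  # about 0.346

# What remains between Time and Performance once Actions is accounted for.
resid = residual_correlation(r_actions_time, r_actions_perf, r_time_perf)  # about -0.260

print(round(implied, 3), round(resid, 3))
```

Because the implied value (0.346) exceeds the observed correlation (0.190), the remainder is negative, and the partial correlation reproduces the −0.260 reported in the text.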

The negative correlation between Actions and Time suggests that highly active students are fast and less active students are slow. The combination of fast and active on the latent variables seems to reflect an impulsive and fast trial-and-error style. This strategy shows itself in the Xandar task as not very successful, in contrast to a slower, more thoughtful and apparently more successful style. It makes sense that respondents who are more deliberative may have more knowledge to bring to considering a successful solution, or may be exhibiting more test effort in the Xandar context. We do not have the information to examine what is happening during the deliberation. This is in part because descriptions of the possible actions are not available in the data set. In addition, no interpretive information is provided by PISA for the sample. This could have included think-alouds where students describe why they are doing what they are doing. It could also have included qualitative response process information in which students explain their processes, in-depth interviews, or other approaches that supply interpretive information.

However, it of course makes sense that more actions take more time, which shows in the analysis of the dependencies between observed actions and time. This illustrates why it is informative to differentiate relationships between latent variables from relationships that show in dependencies.

Other important dependencies concern Part 4, which is a clearly reflective task, a kind of reflective and evaluative pause. The nature of the task may explain why performance is associated with more actions and requires more time, in contrast with Part 1 (agreeing on a strategy) where the association between actions and performance is negative. For instance, too much discussion on a strategy may signal a lack of structure.

The result that the items examined can be considered uni-dimensional, although they are designed to tap a diversity of processes and competencies, suggests that the collaborative ability generalizes across processes. In other words, the collaborative competencies rely on a general underlying ability. The specificities of the processes are reflected in the extra dependencies. Part 4 involves monitoring and reflecting. This may explain why more activities and more time are associated with better performance. Part 1, by contrast, involves planning and executing as well as representing and formulating. This may lead to better results if it is based not on trial and error (many actions) but on a structured and goal-oriented approach (fewer actions).

These dependencies suggest that, depending on the task, the collaborative ability may rely on a general underlying ability but be implemented through a different approach in various collaborative actions, as has been discussed in the literature (Fiore et al., 2017; OECD, 2017b; Eichmann et al., 2019). The special and specific status of Part 4 is also reflected in its lower loadings on all three latent variables (see standardized loadings).

Note that the extra dependencies here are not only considered for methodological reasons when variables stem from the same part. They may also reveal how subjects work on the tasks. This is consistent with the findings here. Parts such as 1 and 4 have a distinct theoretical description in the PISA framework. But how they draw on the collaborative ability can be seen in the empirical data to seemingly require different approaches as indicated in the process data.

Taken together, these results for the United States data set are consistent with problem solving performance modeled as invested time and number of actions.

Potential impacts underscore that it seems possible both to collect and to scale information on the collaborative ability. Measures may help provide intervention support, since in today's world especially, teams with good collaborative skills are necessary in any group. Groups can range from families to corporations, public institutions, organizations, and government agencies (OECD, 2013). Previously, dispositions to collaborate were reported based on the PISA data (Scalise et al., 2016). Indicators of collaborative ability also may be needed to create adequate interventions to train collaboration skills and to change current levels of individual collaboration.

As previously reported, the disposition dimensions of collaborate, negotiate, and advocate/guide might be useful starting points for creating such interventions (Scalise et al., 2016; OECD, 2017a). Alternatively, the factor structure here may yield suggestions on additional interesting starting points. This could include structures by which a student may approach collaboration (OECD, 2017b; Wilson et al., 2017) but more interpretive information would be needed. This could be combined with how participatory a student is disposed to be in collaboration, along with his or her team leadership inclinations, and beliefs in the value or efficacy of collaboration (Scalise et al., 2016).

Limitations to the analysis here include that only the United States data set of many countries available in the PISA data was analyzed. So this analysis should be extended to more countries and results compared in future work.

Also, from a statistical standpoint as discussed earlier, missing data were excluded listwise. In addition, minor but not major skewness was seen in two of the observed variables. Finally, multilevel modeling was not employed so the nested nature of students within schools was not taken into account.

TLI and RMSEA were reported here as the two fit indices since they seem most commonly used in the educational assessment field for large scale analyses. But there have been limited considerations for CPS on this topic.

For limitations from a conceptual standpoint, OECD releases a limited range of information; for instance, items for only one of the 2015 collaborative problem solving tasks (Xandar) were released, and collaborative actions were numbered but not described in the data set and data dictionary.

This study has several implications for future work. First, the era of analyzing process data and not only item response data in robust assessment tasks is upon us (many researchers including Praveen and Chandra, 2017). Approaches such as those used here could be applied to other constructs, not just problem solving. Models can consider how to explore two types of relationship:


These extra dependencies may provide a window on the underlying process dynamics (see **Figure 1**). For future work, it would also be helpful if a range of simplified visualizations could be developed for such complex analyses. Standard plots after including dependencies seemed too complex to be fully useful.

For extensions to the specific modeling here, it would be important, as discussed earlier, to explore fitting the same or similar models across data sets from other countries (Thomas and Inkson, 2017). This could be augmented by also modeling potential country-level effects at the item level, by exploring differential item functioning. Furthermore, it would be interesting to consider covariates available in the PISA student questionnaire data set (SQ) in relation to the collaborative ability examined here. This could include indicators for dispositions for collaborative problem solving that moved forward to the main PISA study (Scalise et al., 2016). These indicators include student-level indicators available in the CPS SQ data set regarding self-report of dispositions toward cooperation, guiding, and negotiating.

It should also be mentioned that other very interesting student-level indicators regarding additional preferences in collaboration had to be dropped from the PISA main study. This was due to time limitations. Dropped indicators included dispositions toward collaborative leadership, as well as student-level indicators of in-school and out-of-school collaborative opportunities. While these were not possible to include in the main study due to time limitations for the PISA administration, the indicators were part of the field testing. They could be very interesting to administer at the country level in other national or international assessments.

Teacher-level indicators are also available in the PISA data set that provide information on opportunity to learn (OtL) for students in the PISA CPS. Data include classroom-level OtL reports of team activities, grouping practices, types of collaborative activities, and types of rewards provided for engaging in successful team work. Exploring relationships here might allow more reflection on connections to potential interventions. The PISA data are cross-sectional but might help to inform research studies within countries.

In closing, it is important to mention that the creation and delivery of the innovative PISA CPS instrument included both simulated collaboration of a hard-to-measure construct (Scalise, 2012) and sharing of some process data. This was critical to the examination here, as has been the case for other collaboration-oriented assessments (Greiff et al., 2014, 2015, 2016). This analysis underscores that addressing challenges of education in the 21st century may continue to require new data sources, to address new challenges for education worldwide.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.01280/full#supplementary-material

#### REFERENCES

http://www.oecd.org/pisa/data/2015-technical-report/ (accessed September 21, 2017).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 De Boeck and Scalise. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Making the Psychological Dimension of Learning Visible: Using Technology-Based Assessment to Monitor Students' Cognitive Development

#### Gyöngyvér Molnár<sup>1</sup>\* and Benő Csapó<sup>2</sup>

<sup>1</sup> Department of Learning and Instruction, University of Szeged, Szeged, Hungary, <sup>2</sup> MTA–SZTE Research Group on the Development of Competencies, University of Szeged, Szeged, Hungary

Edited by:

Ronny Scherer, University of Oslo, Norway

#### Reviewed by:

Alison Margaret Gilmore, University of Otago, New Zealand Andreas Rausch, Universität Mannheim, Germany

> \*Correspondence: Gyöngyvér Molnár gymolnar@edpsy.u-szeged.hu

#### Specialty section:

This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology

Received: 07 December 2018 Accepted: 27 May 2019 Published: 10 June 2019

#### Citation:

Molnár G and Csapó B (2019) Making the Psychological Dimension of Learning Visible: Using Technology-Based Assessment to Monitor Students' Cognitive Development. Front. Psychol. 10:1368. doi: 10.3389/fpsyg.2019.01368

Technology-based assessment offers unique opportunities to collect data on students' cognitive development and to use that data to provide both students and teachers with feedback to improve learning. The aim of this study was to show how the psychological dimension of learning can be assessed in everyday educational practice through technology-based assessment in reading, mathematics and science. We analyzed three related aspects of the assessments: cognitive development, gender differences and vertical scaling. The sample for the study was drawn from primary school students in Grades 1–8 (ages 7 to 14) in Hungary. There were 1500 to 2000 students in each grade cohort. Online tests were constructed from 1638 items from the reading, mathematics, and science domains in the eDia system. The results confirmed that the disciplinary, application and psychological dimensions of learning can be distinguished empirically. Students' cognitive development was the most steady (and effective) in mathematics, where the greatest development occurred in the first years of schooling. Path models suggested that the psychological dimension of learning can be predicted at a moderate level based on students' level of school knowledge consisting of the disciplinary and application dimensions of learning as latent constructs. The predictive power was almost the same in both dimensions. Generally, girls developed faster in the psychological dimension of reading, mathematics and science learning; however, the size of gender differences varied by age and domain. This study (1) provides evidence that the psychological dimension of learning can be made visible even in an educational context, (2) highlights the importance of the explicit development of the psychological dimension of learning during school lessons, and (3) shows that there are gender differences in the developmental level of the psychological dimension of learning in favor of girls but that this varies by grade and domain.

Keywords: technology-based assessment, online assessment, assessment for learning, visible learning, cognitive development

## INTRODUCTION


Improving students' cognitive abilities has been a goal of schooling since the very beginning of formalized education (Hattie and Anderman, 2013). However, despite the theoretical foundations, assessment instruments and pedagogical practices that have evolved over time, this aim has not yet been met; in many school systems students' cognitive abilities are not optimally enhanced. In the 20th century, several research schools and paradigms sought to conceptualize cognition, define its key constructs and make them measurable (see e.g., Binet and Simon, 1916; Inhelder and Piaget, 1958; Adey, 2007). Among these, research on intelligence and the related psychometric tradition, Piaget and his school, and the cognitive revolution have all had a major impact on redefining the goals of education. The implications of the research within these paradigms were drawn for educational practice, and a number of mostly standalone programs were initiated in the 1970s, outside classroom instruction (Feuerstein et al., 1980; Klauer, 1989a,b, 1991, 1993, 1997). Later on, in the 1990s, developmental effects were embedded in school subjects using the content of learning (Adey and Shayer, 1994; Shayer, 1999; Adey et al., 2001; Shayer and Adey, 2002; Shayer and Adhami, 2007). The related research, including a number of experiments, resulted in a better understanding of the role that cognitive processes play in school learning, but it has had only a modest impact on educational practice.

At the beginning of this millennium, more or less the same ideas emerged in a new wave of teaching 21st-century skills. Several projects sought to define, operationalize, measure and teach these skills (see e.g., Trilling and Fadel, 2009; Griffin and Care, 2014), but, as with previous similar attempts, the same constraints appeared to hinder progress in putting these ideas into practice in mass education. There were no proper tools for assessing and monitoring changes in students' cognition. The availability of appropriate assessment instruments is a necessary condition for any pre-test – post-test experimental design as well. However, what can be created and applied in specific experimental conditions cannot always be scaled up for broader practical applications. Similarly, the roots of a number of practical educational challenges can be traced back to the fact that significant determinants of school learning are not visible (Hattie, 2009). They are also not easy to observe, nor can developmental deficiencies always be identified by teachers (MacGilchrist et al., 2004). The lack of thinking skills – the cognitive tools required for successful learning – is not identified; thus, it remains untreated, and this significantly hampers further learning.

Thinking, or more specifically, the set of cognitive skills essential for learning, is not observable in the everyday educational context. Students are not aware of the existence of the required processes, and teachers, even if they receive training in identifying the cognitive processes underlying learning, are not able to observe them, or they simply have no time or capacity to determine each student's individual needs. Although the developmental levels of crucial thinking skills might be measured with traditional paper-based instruments, the immense costs, the human resources required, and the time between assessment and feedback exclude the possibility of using them diagnostically. Technology may be a solution for making thinking processes visible by creating simpler, faster, frequently applicable and cost-effective assessments (Mayrath et al., 2012).

In this paper, which is part of a larger project, we present the results of work in identifying cognitive processes relevant for learning, making them measurable in normal educational contexts, and providing students and teachers with frequent feedback. One of the most challenging aspects of this work is establishing the validity of diagnostic instruments for assessing cognitive processes, that is, showing that the tests measure something more than mastery of the current teaching material. To do this, we empirically validated a 3-dimensional framework developed for diagnostic assessment and explored the psychometric characteristics of an item bank devised for the assessment of the psychological dimension of learning.

### THEORETICAL BACKGROUND

The idea of making learning visible was introduced into educational research and development by John Hattie. He made a great step forward in initiating evidence-based educational practice when he synthesized the results of over 800 meta-analyses (Hattie, 2009). He translated his findings into actual classroom work, and in his book for teachers, he explained:

Visible teaching and learning occurs when learning is the explicit and transparent goal, when it is appropriately challenging, and when the teacher and the student both (in their various ways) seek to ascertain whether and to what degree the challenging goal is attained, when there is deliberate practice aimed at attaining mastery of the goal, when there is feedback given and sought, and when there are active, passionate, and engaging people (teachers, students, peers, and so on) participating in the act of learning (Hattie, 2012, p. 18).

As he emphasizes, feedback plays a central role in successful learning, which, at a higher level of learning, includes self-monitoring, self-evaluation and self-assessment. However, he also explains how difficult a task it is to provide proper feedback: "Learners can be so different, making it difficult for a teacher to achieve such teaching acts: students can be in different learning places at various times, using a multiplicity of unique learning strategies, meeting different and appropriately challenging goals" (Hattie, 2012, p. 18).

Student diversity, i.e., students at different levels in different cognitive attributes, is not the most challenging phenomenon when proper feedback is considered. A major problem is that a number of learning outcomes, sometimes the most important ones, are not visible and cannot easily be made visible. While the majority of the studies Hattie reports on deal with organizational issues, methods and classroom practices for teaching curricula, there are far fewer studies that cover the underlying cognitive processes, e.g., reasoning skills, required to understand mathematics and science or precursors of reading, such as phonemic awareness. Some studies have focussed on the most hidden aspects of learning. For example, Ritchhart et al. (2011) identify a broad range of teaching and learning practices to make thinking visible. They identify the crucial problem in a simplified conception of learning (reduced to memorization) and knowledge (reduced to information, facts and figures): "When we demystify the thinking and learning processes, we provide models for students of what it means to engage with ideas, to think, and to learn. In doing so, we dispel the myth that learning is just a matter of committing the information in the textbook to one's memory" (Ritchhart et al., 2011, p. 28).

Taking into account diversity among students, the limited capacity of teachers and the need to provide feedback on the most relevant but least visible aspects of school learning – promoting students' cognitive development – we may conclude that students and teachers need a different approach to assessment to improve learning. The online assessment system, eDia, was designed for this purpose. It assesses "thinking," or "cognitive development," as a separate dimension, which we call the psychological dimension of learning. We briefly introduce the 3-dimensional theoretical framework that forms the basis for the diagnostic assessment system, and then we elaborate on the psychological dimension in more detail, as that is the focal topic of the present study. Finally, we discuss the crucial role of technology, arguing that its widespread availability in schools makes the time right for such a system to be introduced and integrated into regular educational processes.

### Learning and Cognitive Development: A 3-Dimensional Model of Learning Outcomes

An online diagnostic assessment system, eDia, has been constructed to provide teachers and students with relevant feedback information (Csapó and Molnár, unpublished). The eDia system covers the three most frequently assessed domains of school education: reading, mathematics and science. Large item banks have been developed for use in regular classroom assessments in Grades 1 to 6 of primary school, and for Grades 7 and 8 to explore the developmental trends in a broader age range.

The objectives of each item bank are defined in its assessment framework, similarly to international comparative studies, such as Trends in International Mathematics and Science Study (TIMSS; Mullis et al., 2005) and Progress in International Reading Literacy Study (PIRLS; Mullis and Martin, 2015); they are based on a 3-dimensional model of the goals of learning that forms a common foundation for diagnostic assessment. The three dimensions include thinking/reasoning, application and disciplinary knowledge. [The 3-dimensional framework was published in several articles and book chapters before the assessment frameworks were elaborated (see e.g., Csapó, 2010; Nunes and Csapó, 2011; Adey and Csapó, 2012; Blomert and Csépe, 2012)]. The framework for reading was somewhat different from those for mathematics and science (Csapó and Csépe, 2012; Csapó et al., 2015c), which were more similar to each other (Csapó and Szendrei, 2011; Csapó and Szabó, 2012; Csapó et al., 2015a,b).

The intention of "cultivating the mind" – developing cognitive abilities – may be traced back to ancient philosophy. To set goals in this direction, a model of mind is needed; more specifically, knowledge of how internal psychological attributes are structured and how psychological processes play a role in learning (see more details in the next section). In the eDia frameworks, this is the "thinking" dimension (a term mostly used in the context of mathematics and science), or, more generally, the "psychological dimension." According to the model we propose, the psychological dimension of knowledge contains not only "domain-specific reasoning skills" but also general reasoning skills embedded in different content and contexts, which have lately been referred to as transversal skills; it is not the same as procedural knowledge. We assume that there are natural cognitive developmental (psychological) processes. These processes, as described by Piaget, take place in the interaction between the child and his/her environment. School education may stimulate this development if it provides a student with proper environmental stimuli and if these stimuli are within the zone of proximal development (ZPD) of the child (Vygotsky, 1978). Very often, school instruction is not adjusted to the individual needs of the students; usually the stimuli are far beyond their ZPD. In these cases, students benefit little from instruction; they memorize the rules and develop specific skills through a large amount of drill practice, without any real impact on their cognitive development. For example, students may learn rules to deal with ratios and proportions without this learning having much impact on the development of proportional reasoning. Schools may teach students a great deal about combinatorics, probability and correlation without having a real impact on the development of combinatorial, probabilistic or correlative reasoning.
In this way, we distinguish the psychological dimension from the disciplinary dimension, which may include procedural knowledge (e.g., skills for solving linear equations or proving geometric theorems) or domain-specific reasoning skills. This model and approach open the door to fostering domain-general reasoning skills in a domain-specific context.

Application deals with another ancient goal – that school should teach something that is applicable beyond the school context. Applying knowledge and transferring it to new contexts require a deeper conceptual understanding and usually specific exercises to facilitate application. Therefore, most knowledge mastered at school remains inert and not applicable in new contexts (Alexander and Murphy, 1999; Bransford and Schwartz, 1999; Csapó, 2010). PISA, conducted by the OECD, has focussed on this dimension from the very beginning. The PISA expert groups elaborated the concept of applicable knowledge and defined it as competencies students need in a modern society. To develop such a framework, the social relevance of knowledge, i.e., the needs of societies, also had to be taken into account. For the frameworks of the first and second PISA assessments, the concept of literacy was extended to include the objects of the assessment in the three domains as reading literacy, mathematical literacy and scientific literacy (OECD, 1999, 2003).

Disciplinary knowledge is the third dimension and is most commonly known as curricular content. Arts and sciences content constitutes the major source of disciplinary knowledge. The first major international comparative studies (e.g., First and Second International Mathematics Study – Husén, 1967; Burstein, 1993; First and Second International Science Study – Bloom, 1969; IEA, 1988), the precursors to TIMSS, assessed this dimension. The first assessments were based on an analysis of the curricula in the participating countries. More recently, the TIMSS frameworks organize the objects of the assessment into three groups: content, application and reasoning. This classification bears some similarity to the 3-dimensional eDia frameworks (for PISA assessment frameworks, see OECD, 2003).

Education must not be reduced to providing the right answer quickly, but must deal with the ongoing cognitive work of understanding new ideas and information that will serve students as learners in the future (Costa and Kallick, 2009). In modern society, students are expected to apply their knowledge in a wide range of contexts, and they should be able to solve problems in unknown, novel situations. Thus, these goals must reinforce and interact with each other as they are strongly connected (Molnár and Csapó, 2019).

It is reasonable that the earliest efforts to measure knowledge learnt at school focussed on areas that were the easiest to measure: the disciplinary (knowledge) dimension of learning (see e.g., IEA TIMSS). The goal of applying that knowledge in a new context (the application dimension) and assessing students' ability to do so is a more complicated task (see e.g., OECD PISA). The goal of developing students' thinking abilities (the psychological dimension) is even more complex. To be able to make thinking visible, we must be clear about, and draw on, our understanding of what thinking is and what types of thinking we want to assess and enhance.

## Assessment Beyond the Content of Actual Learning

In the 20th century, several research paradigms conceptualized the development of thinking and its relationship to school education. Among these, research on "intelligence" was the first that was closely linked to education. The first intelligence test (Binet–Simon test, Binet and Simon, 1916) was constructed to assess children's preparedness for schooling, and the Scholastic Aptitude Test (SAT) (see Grissmer, 2000) served a similar purpose at the transition from secondary to tertiary education. Several new approaches, models and interpretations of the concept of intelligence have been proposed. From the perspective of education, the more useful ones consider intelligence as modifiable (able to be taught, learnt, or improved within educational contexts). Our psychological dimension in each domain may thus overlap with the inductive reasoning components of "fluid" intelligence. The psychological dimension can be embedded within the conception of plastic general ability (see Adey et al., 2007), and a number of cognitive skills covered by the psychological dimension of our frameworks are explicitly identified in Carroll's three-strata model of abilities (Carroll, 1993) and the Specialized Cognitive Systems of Demetriou's model (Demetriou et al., 1992, 1993; Adey et al., 2007). On the other hand, we emphasize that all the cognitive skills discussed in the psychological dimension of the frameworks are embedded within the content and context of each particular domain, and the tasks developed from the frameworks are adjusted to the developmental level of the cohort of students to be assessed.

The work of Jean Piaget and his school was characterized by another approach. Piaget described students' reasoning skills with well-defined operations, which correspond with certain mathematical structures (see e.g., Inhelder and Piaget, 1958). He mostly used basic science content for his experiments (e.g., the pendulum), and the operations he identified may be found in various learning contexts as well as in everyday problems.

The cognitive revolution in psychology provided a new impetus to research efforts in school learning. It led to a more differentiated conception of knowledge and learning, allowing a more precise definition of the goals of education. Recent studies in psychology and education have shown that these skills are especially crucial at the beginning years of schooling, as students' developmental level determines later success (see Nguyen et al., 2016).

The psychological dimension has been conceptualized as the interaction between the development of students' thinking skills and learning at school (Nunes and Csapó, 2011; Adey and Csapó, 2012; Blomert and Csépe, 2012) and must address how students learn in reading, mathematics and science.

In this study, we explored the prospects of making the psychological dimension of learning visible by using technology-based assessments to monitor the development of students' thinking skills. The aim of this study was to show how the psychological dimension of learning (thinking) can contribute to the development of specific reasoning skills.

In reading, assessment of the psychological dimension (thinking and reasoning) covers the cognitive mechanisms of development from laborious phonological decoding to the automatic recognition of whole words, and from prerequisite skills of reading through phonemic, phonological and morphological awareness to metacognitive aspects (Blomert and Csépe, 2012). In mathematics (Nunes and Csapó, 2011) and science (Adey and Csapó, 2012), there are generic objects and domain-specific objects. For example, number sense is specific to mathematics, while the control of variables and scientific reasoning are better covered within the science framework. Operational reasoning (e.g., seriation, class inclusion, classification, combinatorial reasoning, probabilistic reasoning, proportional reasoning) and some higher-order thinking skills (e.g., inductive reasoning and problem solving) are more generic and can be assessed in both mathematics and science.

## AIMS, RESEARCH QUESTIONS AND HYPOTHESES

In this study, we explored the prospects of making the psychological dimension of learning visible by using technology-based assessments to monitor the development of students' reasoning skills. The aim of the study was to show how the psychological dimension of learning (thinking) can be assessed in everyday educational practice and how it is related to students' level of subject matter content knowledge. Three domains were explored from this perspective: reading, mathematics and science. Reading is the basis for all further learning, including mathematics and science, while mathematics provides foundations for learning in various areas of science. These domains are central in many education systems, and large-scale international comparative studies, such as TIMSS, PIRLS, and PISA, have focussed on these areas. We analyzed three aspects of the assessments: cognitive development, gender differences and vertical scaling.

Many initiatives and computer-based tests are available worldwide in the domains of reading, mathematics and science. However, they mainly focus on the disciplinary knowledge dimension (content) or the application dimension (literacy) of learning (e.g., TIMSS, PIRLS, and OECD PISA). There are no regular large-scale assessments that include the psychological dimension of learning (cognitive development) in primary school. The available assessment systems in reading, mathematics and science have been designed to assess older students' reading, mathematics and science knowledge (e.g., TIMSS, PIRLS, and PISA). The present study sought to: (1) define and examine the different dimensions of learning in reading, mathematics and science; (2) monitor and compare cognitive development (the psychological dimension of learning) in the three domains over time; (3) analyze the proportion of unexplained variance in cognitive development if school knowledge (the application and disciplinary dimensions) is taken into account in reading, mathematics and science; and (4) identify any gender differences in cognitive development in the three domains. We sought to answer five research questions.

RQ1: Can the three dimensions of learning be distinguished empirically? We explored this question to see if cognitive development, the development of reasoning skills, can be assessed separately and be made visible in everyday educational practice. We hypothesize that the psychological, application and disciplinary dimensions of learning can be distinguished empirically, assessed and monitored in everyday educational practice (Csapó and Szendrei, 2011; Csapó and Csépe, 2012; Csapó and Szabó, 2012). We also hypothesize that they will interact and correlate with each other.

RQ2: Is the psychological dimension of learning the same across the three domains? That is, is the same construct being measured in the psychological dimension of learning across the three main domains? The roots of cognitive development may be universal as early neurocognitive development in children is similar across cultures and societies (Molnár and Csapó, 2019). Therefore, based on the conceptualization of the psychological dimension of learning as the interaction between students' cognitive development and learning at school (Nunes and Csapó, 2011), we hypothesize that the 1-dimensional model will fit the data better than the 3-dimensional model. However, we argue that the 3-dimensional model will take into account results from research on knowledge transfer. According to McKeachie (1987), "Spontaneous transfer is not nearly as frequent as one would expect" (p. 709).

RQ3: How does the psychological dimension of reading, mathematics and science develop over time during primary schooling? Based on previous research results on reasoning skills, we hypothesize that children's cognitive development is slow (Molnár et al., 2013, 2017), indicating the need for more stimulating school lessons. Based on Polya's (1981) theory of problem solving, and results from research on mathematics teaching (e.g., Nunes and Csapó, 2011), we hypothesize that the psychological dimension of learning in mathematics will develop the most readily.

RQ4: How can the psychological dimension of learning be explained by students' level of school knowledge in reading, mathematics and science? That is, how can learning in reading, mathematics and science contribute to the development of the psychological dimension of learning, and how effectively does it stimulate students' general cognitive development? Research in this field provides rich resources ranging from the classical work of Piaget (see e.g., Inhelder and Piaget, 1958) to the most recent neurocognitive studies (such as Geake and Cooper, 2003; Thomas et al., 2019). We hypothesize that learning reading, mathematics and science will contribute to students' development in the psychological dimension of learning but that the transfer effect will be low. We base our hypothesis on empirical research that has found that reasoning skills develop relatively slowly during primary and secondary education with the average pace of development being about one quarter of a standard deviation per year (Csapó, 1997; Molnár and Csapó, 2011; Greiff et al., 2013; Molnár et al., 2013, 2017). The development of reasoning skills is a "by-product" of teaching rather than guided by explicit instruction (de Koning, 2000).

RQ5: How does the developmental level of the psychological dimension of learning differ by gender, grade and domain? Based on the most prominent international studies (Martin et al., 2016; Mullis et al., 2016, 2017; OECD, 2016) and the research results on gender differences in students' development of reasoning skills (Wüstenberg et al., 2014), we hypothesize that gender differences in the development of the psychological dimension of learning will vary by grade and domain. The PISA studies indicated that the achievement of 15-year-old Hungarian girls in the application dimension of reading was significantly better than that of boys, while there were no statistically significant gender differences in mathematics and science (OECD, 2016). In contrast, the TIMSS studies that focus on younger students (Grades 4 and 8; 10- to 14-year-olds) mainly assess the disciplinary dimension of mathematics and science knowledge. Their findings indicated that boys significantly outperformed girls in mathematics in Grade 8 (Mullis et al., 2016), but there was no statistically significant gender difference in Grade 4. In science, boys significantly outperformed girls at both grade levels (Martin et al., 2016). In PIRLS, Grade 4 Hungarian girls significantly outperformed boys in reading (Mullis et al., 2017). Please note that the present study focussed on the psychological dimension and not on the application or disciplinary dimensions of learning in mathematics, science or reading.

### MATERIALS AND METHODS

### Participants

The sample of students for the study was chosen from the partner school network of the Center for Research on Learning and Instruction at the University of Szeged in Hungary. As schools participated voluntarily in the project, representative sampling of school classes or students was not a goal. However, based on the data collected from the schools, it was possible to generate nationally representative indicators for the main variables. We noted that schools with relatively large numbers of students of low socioeconomic status (SES) were under-represented in the present study, possibly due to the lack of ICT available in those schools.

The sampling unit was a school class. Classes were drawn from primary and secondary schools from Grades 1–8 (aged 7–14). A total of 656 classes from 134 schools in different regions were involved in the study, resulting in a wide-ranging distribution of students' background variables. The total number of students involved in the study was 14,062 (**Table 1**). The proportion of boys and girls was about the same. As participation was voluntary, not all students completed tests in all three domains or in each dimension within each domain. Thus, data were potentially available for students who completed nine elements: the assessment of three dimensions of learning (psychological, application, and disciplinary) in three domains (reading, mathematics, and science). After the scaling procedure, we excluded students from the analyses where, because of missing data, it was not possible to compute an ability level in at least one of the nine elements. Thus, 5,714 students from 310 classes and 97 schools were involved in the analyses.

### Tests

An item bank was constructed for diagnostic assessments in reading, mathematics and science based on the three dimensions of learning described in the previous section. These item banks collectively contained almost 17,000 tasks, with most tasks having several items. There were 6,685 tasks for reading, 6,691 for mathematics and 3,535 for science. The tests for the study, which measured the psychological, application, and disciplinary dimensions of learning in reading, mathematics and science among students in Grades 1–6 (aged 6–7 to 12–13), were drawn from these item banks. Students in Grades 7 and 8 received tasks originally written for students in Grades 5 and 6 (see **Table 2**).

For each grade level, nine tasks with different difficulty levels (three easy, three medium-difficulty and three difficult) were chosen from each item bank to assess each dimension. After this procedure, there were 543 tasks in reading, 604 in mathematics and 492 in science.

The tasks were grouped into clusters, with 10–15 items per cluster for students in the lower grades and 15–20 items for students in the higher grades. One 45-min test consisted of four clusters of tasks for students in Grades 1 and 2 (50–55 items) and five clusters for students in Grades 3 to 6 (60–85 items). Each test contained clusters of tasks from each learning dimension with the clusters positioned in a different order to avoid the item-position effect in the scaling procedure. Anchor items were used within and between the different grades for the horizontal and vertical scaling of the data. The clusters contained easier or harder tasks from lower or higher grades. A total of 483 strongly anchored, but different clusters were developed from the items selected.

To optimize the measurement error of the test, the clusters contained tasks from the same dimension of learning, ranging in task difficulty for the different grade levels. That is, students received more tasks from one learning dimension if those tasks were originally prepared for students in lower or higher grades. The structure of the mathematics test, presented in **Table 2**, paralleled the structure of the reading and science tests. Based on this structure, 162 different tests (nine in each grade and each domain) were constructed from the item banks for the vertical scaling of students in Grades 1–8.

TABLE 2 | The structure of the tests in mathematics by cluster of tasks for each grade level.


M, mathematics; D, disciplinary dimension; A, application dimension; R, reasoning dimension; 1–6, grade for which the task was originally designed; (NUMBER), number of items in the cluster.

In Grades 1–3, instructions were provided in written form, onscreen, and with a pre-recorded voiceover to avoid any reading difficulties and to ensure greater validity of the assessments. Thus, students used headphones during the administration of the tests. After listening to the instructions, they indicated their answer by using the mouse or keyboard (in the case of desktop computers, which are most commonly used in Hungarian schools) or by directly tapping, typing or dragging the elements of the tasks using their fingers on tablets.

The tasks presented in **Figure 1** assess students' mathematical and scientific reasoning. Based on the framework for the diagnostic assessment of mathematics (Csapó and Szendrei, 2011) and science (Csapó and Szabó, 2012), the main questions in this psychological dimension related to how well mathematics and science education was adjusted to students' psychological development, how learning mathematics and science could contribute to the development of specific reasoning skills and how effectively they could stimulate students' general cognitive development. Items developed to measure the psychological dimension of learning encompassed a long list of skills, such as inductive reasoning, deductive reasoning, analogical reasoning, combinatorial reasoning, systematization skills, proportional reasoning and correlative reasoning. Two examples of tasks for assessing students' inductive reasoning are presented in **Figure 1**. Students had to discover regularities by detecting dissimilarities with respect to attributes of different objects. They completed the tasks by dragging the elements to different areas, thereby defining the proper sets. The scoring of all tasks was automated, including items with several correct answers.

**Figure 2** presents a task measuring students' disciplinary knowledge in science and a mathematics task measuring the application dimension. In the science task, students retrieve disciplinary knowledge of the phases of the water cycle. In the mathematics task, students have to select flowers and place them in the vase by drag and drop; only the number of flowers counts. The task measures the application of adding up to 10 in a realistic context.

### Procedures

The tests were administered over a period of 7 weeks in computer rooms within the participating schools during regular school hours. Each test lasted approximately 45 min. Test sessions were supervised by teachers who had been thoroughly trained in test administration. The tests were delivered on the eDia online platform. After students entered the system and chose the domain (reading, mathematics, or science), the system randomly selected a test for that student from the nine tests available in the appropriate grade level.
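The random assignment described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each test form is identified by a (domain, grade, form number) triple and that the draw is uniform; the source does not describe how eDia implements the selection, and all names below are illustrative. The sketch also recounts the 162 forms (three domains, six grade levels, nine forms each).

```python
import random
from itertools import product

# Illustrative constants; the identifiers are assumptions, not eDia names.
DOMAINS = ("reading", "mathematics", "science")
TEST_GRADES = range(1, 7)   # tests were written for Grades 1-6
FORMS_PER_CELL = 9          # nine tests per grade and domain


def all_test_forms():
    """Enumerate every (domain, grade, form number) combination."""
    return list(product(DOMAINS, TEST_GRADES, range(1, FORMS_PER_CELL + 1)))


def assign_test_form(domain, grade, rng):
    """Draw one of the nine forms for a domain and grade uniformly at random."""
    if domain not in DOMAINS or grade not in TEST_GRADES:
        raise ValueError(f"no tests defined for {domain!r}, grade {grade}")
    return (domain, grade, rng.randint(1, FORMS_PER_CELL))  # bounds inclusive


print(len(all_test_forms()))  # 3 domains x 6 grades x 9 forms = 162
rng = random.Random(42)       # seeded only to make the sketch reproducible
print(assign_test_form("mathematics", 3, rng))
```

How students in Grades 7 and 8 were mapped onto the Grade 5 and 6 material is not specified in enough detail to model here, so the sketch covers test grade levels 1 to 6 only.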

To learn to use the program, students were provided with instructions and a trial (warm-up) task with immediate feedback. The instructions explained that: (1) a yellow bar at the top of the screen showed how far along they were in the test; (2) they had to click on the speaker icon to listen to the task instructions; and (3) they had to click on the "next" button to move on to the next task. In addition, (4) pupils in Grades 1 and 2 received extra warm-up tasks to enhance keyboarding and mouse skills, and (5) after completing the last task, participants received immediate visual feedback with a display of 1 to 10 balloons, where the number of balloons was proportionate to their achievement.
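The balloon feedback can be sketched as a simple mapping from achievement to a count between 1 and 10. The exact rule is not given in the text; the version below assumes achievement is expressed as a percentage of correct items, rounds up to the next tenth, and always shows at least one balloon. The function name is hypothetical.

```python
import math


def balloons_for(percent_correct):
    """Map 0-100 percent correct to 1..10 balloons.

    Assumed rule (not from the source): one balloon per started
    10 percentage points, with a minimum of one balloon.
    """
    if not 0 <= percent_correct <= 100:
        raise ValueError("percent_correct must be between 0 and 100")
    return max(1, math.ceil(percent_correct / 10))


for score in (0, 55, 100):
    print(score, balloons_for(score))
```

Rounding up rather than to the nearest value is a design choice made here so that no student sees an empty display.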

The feedback system available for the teacher was more elaborate. Due to the large number of students and items, the Rasch analyses were run with the built-in analytic module in the eDia system. As the tasks in the item bank were scaled using IRT, it was possible to compare students' achievement. Teachers received feedback on students' achievement both as a percentage of correct items and as ability scores. For each grade and domain, the national average achievement (ability score) was set at 500 with a standard deviation of 100 (Carlson, 2009; Ferrão et al., 2015; Weeks, 2018). This was the point of reference for interpreting students' achievement.

We used confirmatory factor analyses (CFA) within structural equation modeling (SEM) (Bollen, 1989) to test the underlying measurement models of reading, mathematics and science knowledge in the three dimensions of learning: psychological (reasoning), application (literacy), and disciplinary knowledge (RQ 1). We used the preferred estimator for categorical variables, the mean- and variance-adjusted weighted least squares (WLSMV) estimator (Muthén and Muthén, 2012). We tested a 3-dimensional model to distinguish the three different dimensions of learning, and we also tested a 1-dimensional model with all three dimensions combined under one general factor. To test which model fitted the data better, we carried out the special χ<sup>2</sup>-difference test in Mplus. We also used CFA to test the underlying measurement model, and to determine the invariance behavior of the psychological dimension across the three domains of learning (RQ 2).

To establish a developmentally valid scale, we used the Rasch model with the vertical and horizontal scaling of the data (RQs 2 and 4) and then a linear transformation of the logit metric. As indicated above, for each domain and at each grade level, the mean achievement of each dimension was set to 500 with a standard deviation of 100. We used path models to test the effect and predictive power of school learning on the psychological dimension of learning (RQ 3).
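The two scaling steps named above can be illustrated with a short sketch: the Rasch model's response probability and a linear transformation of logit ability estimates onto a scale with mean 500 and standard deviation 100. This is not the eDia analytic module. The function names, the choice of a reference group, and the use of the population standard deviation are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def rasch_probability(theta, b):
    """Rasch model: probability of a correct response for a person with
    ability theta on an item with difficulty b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def to_reporting_scale(logits, reference, target_mean=500.0, target_sd=100.0):
    """Linearly rescale logit abilities so that the reference group has
    the target mean and standard deviation.

    Assumption: the population SD of the reference group is used.
    """
    ref_mean = mean(reference)
    ref_sd = pstdev(reference)
    if ref_sd == 0:
        raise ValueError("reference group has zero variance")
    return [target_mean + target_sd * (x - ref_mean) / ref_sd for x in logits]


# Made-up logit estimates, used only to show the mechanics.
reference_group = [-1.2, -0.3, 0.0, 0.4, 1.1]
print(rasch_probability(0.0, 0.0))  # equal ability and difficulty
print(to_reporting_scale([0.0, 1.1], reference_group))
```

Because the transformation is linear, it changes only the unit and origin of the scale; the ordering of students and the relative distances between them in logits are preserved.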

### RESULTS

### The Psychological Dimension of Learning

Results showed that the psychological (reasoning/thinking), application and disciplinary dimensions of learning can be distinguished empirically and are independent of domain and grade. The χ<sup>2</sup>-difference test in Mplus showed that the 3-dimensional model fitted significantly better than the 1-dimensional model in each grade and in each domain (see **Tables 3**–**5** for reading, mathematics and science, respectively). Generally, the 3-dimensional measurement model for each domain showed a good model fit (**Tables 3**–**5**), based on Hu and Bentler's (1999) recommended cut-off values. The comparative fit index (CFI) and the Tucker–Lewis index (TLI) values above 0.95 and the root mean square error of approximation (RMSEA) below 0.06 indicated a good global model fit.
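The cut-off rule stated above can be written as a small check. The class and function names are hypothetical, and the example values are invented for illustration; they are not taken from Tables 3–5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitIndices:
    cfi: float
    tli: float
    rmsea: float


def good_global_fit(fit, cfi_tli_min=0.95, rmsea_max=0.06):
    """Apply the cut-offs quoted in the text: CFI and TLI above 0.95
    and RMSEA below 0.06."""
    return fit.cfi > cfi_tli_min and fit.tli > cfi_tli_min and fit.rmsea < rmsea_max


# Invented values, for illustration only.
print(good_global_fit(FitIndices(cfi=0.97, tli=0.96, rmsea=0.04)))
print(good_global_fit(FitIndices(cfi=0.93, tli=0.92, rmsea=0.07)))
```

Such thresholds are conventions rather than strict tests, so a model narrowly missing one of them would usually be judged alongside the other indices and the χ<sup>2</sup>-difference result.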


TABLE 3 | Goodness of fit indices for testing the dimensionality of reading from Grades 1 to 8.

df, degrees of freedom; CFI, comparative fit index; TLI, Tucker–Lewis index; RMSEA, root mean square error of approximation; χ<sup>2</sup> and df were estimated by WLSMV. Δχ<sup>2</sup> was estimated with the difference test procedure in Mplus (see Muthén and Muthén, 2012). C.I., confidence interval.

TABLE 4 | Goodness of fit indices for testing the dimensionality of mathematics from Grades 1 to 8.


df, degrees of freedom; CFI, comparative fit index; TLI, Tucker–Lewis index; RMSEA, root mean square error of approximation; χ<sup>2</sup> and df were estimated by WLSMV. Δχ<sup>2</sup> was estimated with the difference test procedure in Mplus (see Muthén and Muthén, 2012). C.I., confidence interval.


df, degrees of freedom; CFI, comparative fit index; TLI, Tucker–Lewis index; RMSEA, root mean square error of approximation; χ<sup>2</sup> and df were estimated by WLSMV. Δχ<sup>2</sup> was estimated with the difference test procedure in Mplus (see Muthén and Muthén, 2012). C.I., confidence interval.

In most cases, the 3-dimensional models fitted the data significantly better than the 1-dimensional models. In some cases, mostly in Grades 7 and 8, the 3-dimensional model fit indices were lower. This could have been because the tasks were originally developed for students in lower grades.

The fit indices dropped in the case of mathematics and science in Grade 8 but were still significantly higher than those of the 1-dimensional model. Thus, the psychological, application and disciplinary dimensions of learning could be distinguished. The psychological dimension of learning could be made visible independently of the measured domain in everyday educational settings, thus supporting Hypothesis 1.

## The Psychological Dimension of Learning Across Domains

The bivariate correlations of the psychological dimensions between pairs of domains (mathematics and reading, mathematics and science, and reading and science) ranged from 0.29 to 0.49 and were statistically significant (**Table 6**). At each grade level, the correlations of the psychological dimension (reasoning/thinking) tended to be the highest between mathematics and reading and lowest between mathematics and science. The strongest set of correlations, independent of the measured domain, was found in Grade 8, indicating that the psychological dimensions of learning in reading, mathematics and science were highly correlated, but not identical constructs.

TABLE 6 | Correlations of the psychological dimension between pairs of domains from Grades 1 to 8.

All coefficients are significant at the p < 0.0001 level.

The invariance of the psychological dimension of learning across the three domains was tested by comparing the 3-dimensional measurement model, which distinguishes the psychological dimension of reading, mathematics and science, and the 1-dimensional measurement model, which combines the psychological dimension of the different learning domains under a single factor. The special χ<sup>2</sup>-difference test in Mplus showed that the 3-dimensional model fitted significantly better at each grade level than the 1-dimensional model (**Table 7**).

### The Rate of Development in the Psychological Dimension

**Figure 3** presents the mean cognitive development scale scores in the psychological dimension of learning reading, mathematics and science. Please note that in each domain, the mean score of Grade 8 students was set at 500 with a standard deviation of 100, thereby constructing the point of reference for interpreting students' achievement. This means that we cannot compare the development of the psychological dimension of learning across domains, but we can compare the rate of development.

We found that the amount and rate of cognitive development were almost the same in each domain between Grades 6 and 8 and that there was no appreciable development in reading and science between Grades 2 and 6. The greatest rate of progress occurred in Grade 1 in reading and science, but not mathematics. Generally, there was a steady increase in the psychological dimension of learning in mathematics, especially in the first 4 years of schooling. The results confirmed our hypothesis that children's cognitive development is slow (Molnár et al., 2013; Molnár et al., 2017), thus indicating the importance of the explicit development in this dimension in school lessons. Overall, these results highlighted the importance, sensitivity and potential of the development of thinking skills in the early years of schooling.

### Relationship Between the Three Dimensions of Learning

The possibility and practical relevance of separating the psychological dimension of learning can be explored from another perspective by examining the proportion of its variance that remains unexplained if the more readily visible disciplinary and application dimensions (referred to together as school knowledge) are taken into account. Technically, these dimensions may be considered as potential predictors of the psychological dimension.

TABLE 7 | Goodness of fit indices for testing the dimensionality of the psychological dimension in reading, mathematics, and science using 1- and 3-dimensional models for Grades 1 to 8.

df, degrees of freedom; CFI, comparative fit index; TLI, Tucker–Lewis index; RMSEA, root mean square error of approximation; χ<sup>2</sup> and df were estimated by WLSMV. Δχ<sup>2</sup> was estimated with the difference test procedure in Mplus (see Muthén and Muthén, 2012). C.I., confidence interval.

FIGURE 3 | The speed of the cognitive development in the psychological dimension of learning within the domains of mathematics, science and reading (Please note that in each measured domain the mean of the 8th graders' achievement was artificially set to 500 with a standard deviation of 100).

We used continuous factor indicators in SEM analyses to examine the relationships between school knowledge and the psychological dimension of learning in each domain. School knowledge as a latent factor was specified as the application and disciplinary dimensions of learning. According to the results, school knowledge predicted the psychological dimension of learning in all domains, but a significant amount of variance remained unexplained (see **Figures 4**–**6**). This indicates that existing aspects of the psychological dimension of learning can be separated from school knowledge as measured by the disciplinary and application parts of students' knowledge. That is, it is relevant to measure the psychological dimension of learning in addition to measuring the disciplinary and application dimensions of learning. So our hypothesis was confirmed.

The amount of explained variance was statistically significant and almost the same for mathematics and reading and somewhat higher for science. This suggests that there may be more common reasoning aspects in the three dimensions of science. The model for each domain fitted well (CFI = 1.000, TLI = 1.000, RMSEA = 0.000).

### Gender Differences in the Psychological Dimension of Learning

In the present study, girls outperformed boys in the psychological dimension of learning in reading, mathematics and science (Mathematics: F = 0.272, t = −6.696, p < 0.001; Science: F = 3.578, t = −11.525, p < 0.001; Reading: F = 3.224, t = −4.370, p < 0.001); however, this varied by grade level (see **Table 8**). The largest, statistically significant differences in favor of girls were found in Grades 4 and 5, where girls outperformed boys in all three domains, and in Grades 6 to 8, where girls outperformed boys in two of the three domains. Girls also outperformed boys in reading in Grades 3–8, in mathematics in Grades 1 and 4–6, and in science in Grades 1, 4, 5, 7, and 8.
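The effect size reported alongside these tests (Cohen's d in Table 8) can be computed as sketched below. This is a generic pooled-standard-deviation version, not necessarily the exact variant the authors used, and the scores are invented. The negative sign arises here because the boys' group is entered first, which is an assumption about how the negative t-values in the text were obtained.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (sample variances)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Invented scale scores, for illustration only.
boys = [480, 500, 520]
girls = [530, 550, 570]
print(cohens_d(boys, girls))  # negative: the second group scored higher
```

Reporting d next to the t-test matters in a sample of this size, where even small differences reach statistical significance.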

In this section, we examine gender differences among Grade 8 students – the grade level of students in PISA, TIMSS and our study. The results confirm our hypothesis that an assessment which focuses on students' disciplinary knowledge or application does not replace an assessment of the psychological dimension of learning. In the case of mathematics, no gender differences were detected in the application and psychological dimensions of learning, but girls scored significantly higher, on average, than boys in the disciplinary dimension of learning. The results were different in the case of science. There were no gender differences in the application dimension of science learning. Boys scored significantly higher in the psychological dimension.

### DISCUSSION

Previous research has already identified several characteristics of learning reading, mathematics and science. However, it has mainly focussed on only one dimension, either the disciplinary dimension or the application dimension of learning, and on the reading, mathematics and science learning of older students. There have been significant attempts to concentrate on the application and reasoning dimensions, but educational practice has mostly focussed on the assessment of the content of individual curriculum subjects. The application of knowledge has seldom been assessed, although the PISA assessments have highlighted its importance. Because of the lack of easy-to-use assessments, the psychological dimension of learning (cognitive development and reasoning) remains hidden. Therefore, neither the students nor their teachers receive feedback on their level or development in this dimension. This study provides evidence that the psychological dimension of learning can be made visible and that technology-based assessment may be applied in an everyday educational context. This evidence highlights the importance of the assessment and the explicit development of the psychological dimension of learning in a school context. Further, it points to gender differences in the developmental rate of the psychological dimension of learning in favor of girls, although this varies by grade and domain.

FIGURE 6 | A structural model of reading school knowledge as a predictor of students' cognitive development in the domain of the psychological dimension of reading.

TABLE 8 | Gender differences in the psychological dimension of learning in reading, mathematics and science in Grades 1 to 8.

R, reading; M, mathematics; S, science; F, F-value; t, t-value; p, significance level; d, Cohen's d.

Results support our hypotheses that the three dimensions of learning can be distinguished empirically and can be assessed separately. The 3-dimensional frameworks derived from previous research, including international comparative studies (Csapó and Szendrei, 2011; Csapó and Csépe, 2012; Csapó and Szabó, 2012), showed relatively good validity, and the results from the current analyses confirmed that they may form evidence-based foundations for diagnostic assessment. The most important finding from these analyses was that the psychological dimension of learning can be measured at the primary school level in the context of three of the most important domains of learning – reading, mathematics and science.

The present results also confirmed that, although the roots of the psychological development of different domains are universal and the domains of learning build on each other (Molnár and Csapó, 2019), there are still significant developmental differences between them. While there is a close connection between the development of early literacy and numeracy, and later mathematics learning builds on reading, and science builds on both (McKeachie, 1987), our results support the notion that the transfer is not obvious between the different domain contexts. There were statistically significant correlations between the development scores in the psychological dimension of reading, mathematics and science learning, but they were not identical constructs.

Previous studies have indicated that children's cognitive development is slow (Molnár et al., 2013, 2017) but that it can be taught effectively (de Koning et al., 2002; Klauer and Phye, 2008; Perret, 2015). Our results confirmed both of these notions as there was no appreciable development in the psychological dimension of learning in reading and science for students in Grades 2–6, and students' cognitive development was the most steady (and effective) in mathematics, where the greatest development took place in the first years of schooling. This confirms previous research findings and highlights the potential of developing thinking skills in the early years of schooling.

The results of the SEM indicated the complex nature of learning in reading, mathematics and science. An examination of the predictive power of school knowledge on the psychological dimension of learning showed that the disciplinary and application dimensions of learning together predicted the psychological dimension of learning at a moderate, but statistically significant level, while a significant amount of variance remained unexplained. This indicates that school knowledge in reading, mathematics, and science can contribute to the development of the psychological dimension of learning and can stimulate students' general cognitive development, but the transfer effect may not be high. The results suggest that aspects of the psychological dimension of learning exist and can be separated from the learning dimensions assessed most often at school and in international comparative studies. This highlights the importance and relevance of developing measures of the psychological dimension of learning as well.

To provide context to interpret the size of the gender difference in the psychological dimension of learning in reading, mathematics and science, we compared our results to findings on gender differences in the most prominent international comparative studies. The gender differences in the international studies at Grade 4 and 8 were found in our study. We found gender differences in reading over almost all the primary school grade levels, including Grades 4 and 8, indicating that girls perform better in reading, irrespective of the dimension of learning.

### LIMITATIONS OF THE PRESENT STUDY

As the PISA 2015, TIMSS 2015, and PIRLS 2016 studies have also indicated, there are large differences between countries not only on the level of reading, mathematics and science performance, but also in gender differences. Therefore, results found in one country cannot be generalized across countries and cultures. Although general trends have been found, the generalizability of the results may be limited. The method we applied in this study is generalizable and may be useful for making the psychological dimension of learning visible in any educational context. A further limitation of the study is that "common method bias" and "test motivation" are possible sources of shared variance across tests and domains. Participation in the study was voluntary, and although the large sample sizes and the diversity of the schools made the results sufficiently robust, the actual samples were not nationally representative. Thus, the present study does not provide a complete picture of the Hungarian education system. Nevertheless, the analyses did reveal some generalizable trends.

## CONCLUSION

The 3-dimensional frameworks for the diagnostic assessment used in the present study were devised on the basis of current results from a number of research fields ranging from cognitive neuroscience to research on cognitive development, standard setting and the theoretical frameworks of large-scale international comparative studies. The item banks for assessing reading, mathematics and science were developed through the careful mapping of assessment tasks onto frameworks. The next step in scientifically establishing and further developing the diagnostic system is to empirically validate the 3-dimensional framework. We first presented the results of the comprehensive analyses in this study. In the present analyses, we focused on the psychological dimension of learning, which determines the dimensions of disciplinary knowledge and application, but is less visible or observable in the school context.

The results confirmed the theoretical foundations of the project and made clear that the psychological dimension can be distinguished and measured in the context of the most important domains of learning in the beginning phase of schooling. These findings indicate directions for further research as well. Item development for this study was based on the theoretical frameworks without empirical evidence of dimensionality. Based on the empirical confirmation of the three dimensions in this study, the validity of the assessment scales constructed from the item banks may be improved by exploring how well the items fit particular scales.

Establishing scales empirically to assess the psychological dimension of learning paves the way to improving learning as well. The evidence that cognitive development is measurable provides a basis for large-scale systematic diagnostic monitoring of the development of students' thinking skills, one of the most sorely lacking elements in the current spectrum of assessment practices. It also supports different types of intervention studies from teacher-initiated practical improvements to well-controlled, randomized experiments.

### ETHICS STATEMENT

The authors only had access to anonymized data, and hence an ethics approval and parental consent were not required as per applicable institutional and national guidelines and regulations. The assessment data collected for this study formed integrated parts of the normal educational processes of the participating schools. The coding system for the online platform masked students' identity; the researchers would thus have been unable to tie the data to the students. The results from the low-stakes diagnostic assessments were only disclosed to the participating students (as immediate feedback) and to their teachers. Because of the anonymity and low-stakes testing design of the assessment process, it was not required or possible to request and obtain written informed parental consent from the participants.

### AUTHOR CONTRIBUTIONS

GM and BC took responsibility for the content, including participation in the concept, design, analysis, drafting the manuscript, writing and final approval of the manuscript, and agreed to be accountable for all aspects of the study.

## FUNDING

This study was funded by OTKA K115497 and EFOP 3.2.15.

### ACKNOWLEDGMENTS

We would like to express our appreciation to the members of the eDia Research Group, who provided continuous support for the participating schools during data collection. Special thanks go to Dr. Géza Makay and Dóra Mokri, who assisted us with data filtering and scaling. We also wish to thank the two reviewers for their valuable comments on earlier versions of this manuscript.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Molnár and Csapó. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Skilled, the Knowledgeable, and the Motivated: Investigating the Strategic Allocation of Time on Task in a Computer-Based Assessment

#### Johannes Naumann\*

Institute of Educational Research, University of Wuppertal, Wuppertal, Germany

#### Edited by:

Ronny Scherer, University of Oslo, Norway

### Reviewed by:

Marlit Annalena Lindner, Christian-Albrechts-Universität zu Kiel, Germany

Gaston Saux, National Council for Scientific and Technical Research (CONICET), Argentina

> \*Correspondence: Johannes Naumann j.naumann@uni-wuppertal.de

#### Specialty section:

This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology

Received: 02 January 2019 Accepted: 04 June 2019 Published: 27 June 2019

#### Citation:

Naumann J (2019) The Skilled, the Knowledgeable, and the Motivated: Investigating the Strategic Allocation of Time on Task in a Computer-Based Assessment. Front. Psychol. 10:1429. doi: 10.3389/fpsyg.2019.01429

In large scale low stakes assessments, students usually choose their own speed at which to work on tasks. At the same time, previous research has shown that in hard tasks, the time students invest is a positive predictor of task performance. From this perspective, a relevant question is whether student dispositions other than the targeted skill might affect students' time on task behavior, thus potentially affecting their task performance and in turn their estimated skill in the target domain. Using PISA 2009 computer based assessment data, the present research investigated for the domain of reading digital text whether three variables that can be assumed to predict performance in digital reading tasks, comprehension skill, enjoyment of reading, and knowledge of reading strategies would also predict how much time students would devote to digital reading tasks, and in particular, whether they would adapt time on task to task difficulty. To address this question, two linear mixed models were estimated that predicted the time students spent on a task, and the average time students spent on relevant pages within each task, by the interaction of task difficulty with comprehension skill, enjoyment of reading, and knowledge of reading strategies. To account for time on task being nested in students and tasks, random effects for persons and tasks were included. The interaction of task difficulty with gender and Socio-Economic Status (SES) was included for control purposes. Models were estimated individually for 19 countries, and results integrated meta-analytically. In line with predictions, for both time on task indicators, significant positive interactions were found with comprehension skill, enjoyment of reading, and knowledge of reading strategies. These interactions indicated that in students with high comprehension skill, enjoyment of reading, and knowledge of reading strategies there was a stronger association of task difficulty with time on task than in students low in either of these variables. Thus, skilled comprehenders, students enjoying reading, and students in command of reading strategies behaved more adaptively than lower skilled, motivated, or knowledgeable students. Implications of these findings for the validity of self-paced computer-based assessments are discussed.

Keywords: time on task, PISA, educational assessment, test taking motivation, reading skill, reading strategies, validity

### INTRODUCTION


In educational assessments, the goal is to infer a test-taker's latent ability from their performance on a number of tasks. From a psychological perspective however, it is never the latent ability per se that determines a test-taker's performance. For the notion of a latent variable to be meaningful, and for the latent variable to be of explanatory value, there has to be some notion of which psychological (and/or neural) mechanisms account for the latent variable taking on a specific value within a specific individual (e.g., Sternberg, 1986; Borsboom et al., 2003). This means that it is always specific cognitive and metacognitive, as well as motivational, processes that are executed during the test takers' engagement with the assessment tasks, which determine the test takers' responses, and thus their estimated abilities. One fundamental process test-takers need to engage in is the allocation of time to individual tasks. This is for two reasons: Firstly, even assessments that are not supposed to be "speeded", i.e., where test-takers are assumed to have ample time to complete all tasks, in fact do have a time limit. Thus, even in these assessments test-takers need to employ some sort of metacognitive strategy to allocate time to individual tasks. Secondly, the time test-takers spend on assessment tasks is a fairly strong predictor of their task performance, where the strength and direction of the association is dependent on characteristics of both the test-taker and the tasks. Apparently, it is especially in hard tasks, which cannot be solved by routine cognitive processing but instead require deliberate, controlled cognitive processing (see Schneider and Shiffrin, 1977; Shiffrin and Schneider, 1977) or metacognitive processing (see Pressley et al., 1989; Winne and Hadwin, 1998), that positive associations between time on task and task performance ("time on task effects") arise.
This is e.g., true for tasks from domains such as problem solving in technology-based environments (Goldhammer et al., 2014) or reading digital text (Naumann and Goldhammer, 2017). Against this background, it appears beneficial for a test-taker to invest their time especially in hard tasks. Thus, a natural question seems to be which characteristics of a test taker, either cognitive or motivational, will put them in a position where they adequately allocate their cognitive resources, and thus their time on task, to a task's difficulty. The present research addresses this question for the domain of reading digital text (see e.g., OECD, 2011; Naumann, 2015; Cho et al., 2018). In the following, I will address the ideas that especially students skilled in comprehension ("The skilled"), students knowledgeable of reading strategies ("The knowledgeable"), and students who enjoy reading as such ("The motivated") are successful in adapting the time they invest in a digital reading task to the tasks' difficulty, both overall and regarding the processing of relevant parts of the text materials. These ideas will be derived from describing digital reading as task-oriented reading from the perspective of Rouet et al.'s (2017; see also Britt et al., 2018) RESOLV (REading as problem SOLVing)-model, from Pressley et al.'s (1989) model of the Good Information Processor, as well as the literature on item position effects in assessments (e.g., Debeer et al., 2014), and their moderation through motivation (e.g., Nagy et al., 2018a) and self-control (Lindner et al., 2017).

### Comprehension Skill and Task Representation

Reading in an assessment situation is an instance of task-oriented reading (e.g., Vidal-Abarca et al., 2010; Salmerón et al., 2015b; Serrano et al., 2018). In many situations, reading as an activity also is not only the processing of textual information to the end that an adequate situation model of the text contents is being built, as described by cognitive models of text comprehension such as Kintsch's (1998) theory. Rather, especially in opaque information environments such as on line, or when faced with multiple texts that might propose conflicting stances, accomplishing a reading task will entail elements of problem solving (Rouet et al., 2017). When a person reads to solve a task in a reading assessment, they first need to build a representation of the task's requirements. This includes a judgement of whether the question might be answered by a mere memory search (which will not be the case in most reading assessments, which are designed to not rely on prior knowledge). Then, the person will have to judge which parts of the text, or in a multiple text or hypertext reading scenario, which texts are likely to provide the information needed to answer the question. In addition, the task model might include a judgement of the task's difficulty, and thus the required degree of scrutiny in processing the textual information. Consider e.g., the task in **Figure 1**. In this task, students need to compose an e-mail, containing a recommendation to a friend concerning visiting a concert. To accomplish this, students have first to realize that they will need to consult the text. Then, they need to figure out where to find information on the two concerts mentioned in the task instructions, and to match these with the information in the e-mail. 
As there is no obvious (literal) match between the e-mail and the text on the menu labels in the Seraing Cultural Center's website, they need to figure out a navigation route, finding the Center's program, either by "Date" or by "Event type", to get to the required information. To adequately process this information, they need to figure out that they have to evaluate it on a semantic level to judge the concert descriptions against the preferences mentioned in the e-mail. In short, students will have to develop a notion that the task displayed in **Figure 1** is a fairly complex one which requires a good deal of cognitive effort to be solved.

Consider, in contrast, the task displayed in **Figure 2**. Solving this task is possible on the basis of comparatively shallow processing that on a mere lexical level matches the name "Heritage Days" appearing in the question to the same name appearing on the page. The only inference needed was that, due to restrictions on screen resolution in the assessment, students needed to scroll down to find the relevant information. An appropriate task model in this instance will include the fact that only limited cognitive resources, and time, will be needed to solve it (see also OECD, 2015; Naumann and Goldhammer, 2017).

It is likely that skilled comprehenders will be in a better position to arrive at the judgement that the task displayed in **Figure 1** needs ample time to be invested in it, while the task displayed in **Figure 2** might be solved relatively quickly. Similar to the earlier MD-Trace-Model ("Multiple-Document Task-based Relevance Assessment and Content Extraction", see Rouet and Britt, 2011), the RESOLV model postulates a process whereby initially only very coarse reading goals are being set. These reading goals are then constantly updated, and the information acquired is judged against some standard specifying whether enough, and correct, information was acquired to meet the reading goal. According to the standards involved in this process, readers may e.g., judge that they need to re-read a passage, that a passage might be skipped, that it might be sufficient to just skim the passage (e.g., the website in **Figure 2** for the phrase "heritage days"), or that it might be necessary to carefully read a passage, such as the concert descriptions in the task displayed in **Figure 1**.

Previous research has indeed found that skilled comprehenders are better at making decisions such as these, compared to lesser skilled comprehenders. One central ingredient of building an adequate task model is to note when, and what, information to search for. In line with the notion that an adequate task model is built more easily by better comprehenders, Mañá et al. (2017) found that decisions to search a text for information were predicted by comprehension skills. Moreover, these authors found that only students with average to good comprehension skills had their search decision, and subsequently task performance, boosted in a condition with a delay between reading the text and reading the questions. In line with these results, Hahnel et al. (2018) found that skilled comprehenders were more likely to seek out additional information when necessary in a task that required the evaluation of online information provided in Search Engine Results Pages (SERPs).

Again in line with the idea that comprehension skills are a condition for building adequate task models, both Cerdán et al. (2011) and Salmerón et al. (2015a) found that students with higher comprehension skills when studying a text comprising multiple documents were much better in selecting relevant materials, and discarding irrelevant materials. This difference was especially pronounced when there were surface cues present, such as a literal match between a phrase in a passage and in the question, but (other than in the task in **Figure 2**) the passage was in fact irrelevant. Thus, in this scenario, it apparently was good comprehenders who built a task model that (correctly) contained the notion that the surface cue was misleading, and a deeper semantic analysis of the relation between question and text was needed. Similar results were reported by Rouet et al. (2011). These authors found that students in higher grades were less likely to be distracted by semantically irrelevant cues, such as capitalizing, when they had to select hyperlinks from a SERP, than were students in lower grades. A second study showed that indeed parts of this effect could be attributed to students in higher grades having better comprehension skills.

Thus, all in all, if the construction of an adequate task model, that correctly specifies the amount of cognitive effort that has to be invested into a task, is driven by good comprehension skills, we might expect good comprehenders to be better at adapting their time on task to task difficulty in a digital reading situation.

### Reading Strategies and Monitoring

As already mentioned in the introductory section of this article, readers in an assessment need to regulate their allocation of time to tasks. Allocating time on task, and monitoring this allocation through the course of completing a reading task can be seen as an instance of the application of cognitive (e.g., planning) and metacognitive (e.g., monitoring) strategies (see Weinstein and Mayer, 1986). As Pressley et al. (1989, p. 858) put it: "Good strategy users employ efficient procedures to accomplish complex, novel tasks. . . They possess essential metacognitive knowledge for implementing strategies, including knowing when and where each strategy might be useful, as well as the costs associated with the strategy, such as the amount of cognitive effort it requires" [emphasis added]. In line with this notion, a number of studies have found that in basic cognitive tasks, subjects tended to align their allocation of time to task difficulty. For instance, Dufresne and Kobasigawa (1989) found that when children in grades 1, 3, 5, and 7 were given a paired association task, where items in one condition were hard (unrelated) and in one condition were easy (related), 5th and 7th graders spent more time on studying the hard, as compared to the easy items, while 1st and 3rd graders showed no such adaptation of study time (see Lockl and Schneider, 2002, for a replication). Consistent with the idea that these differences in study time reflect metacognitive regulation, Lockl and Schneider (2003) demonstrated that indeed judgements of learning ease (estimated effort to learn the items) were higher for hard than for easy items. Consistent with the idea that subjects differ in their ability to effectively regulate their actual study behavior, they also found that 3rd graders showed higher associations between judgements of learning ease, and actual study time than 1st graders.

Such negative associations between judgement of learning ease (the task being perceived as easy) and time on task are, however, not uniformly found. For example, Son and Metcalfe (2000, experiment 1) had undergraduate students study eight biographies of famous people, and answer questions about them. Using these rather complex materials (compared to those used by Dufresne and Kobasigawa, 1989; Lockl and Schneider, 2002), Son and Metcalfe found that students indeed spent less time studying the biographies they then judged to be harder. One caveat in this case is, however, that judgements of effort were confounded with judgements of interest: Not only were the biographies studied longer that were judged to be easier, but also those that were perceived as more interesting. Thus, it might have been the case that the judgement of effort, at least among other things, reflected a lack of interest: The subjectively less interesting biographies were studied quicker, and at the same time judged harder just because they were less interesting and thus more effort would have to be put in, to compensate for the lacking interest.

All in all, there appears to be ample, though not unanimous, evidence that students who are able to metacognitively regulate their learning activities spend more time on harder, and less time on easier tasks. There is however, only little direct evidence how knowledge of reading strategies – apart and on top of comprehension skill – would shape the time on task behavior of adolescent students in task-oriented digital reading scenarios, that is, in tasks that are way more complex than even the biographies studied by Son and Metcalfe (2000). Once again from the perspective of the RESOLV-Model, we might expect students knowledgeable of metacognitive reading strategies to be especially apt to align their time on task with task difficulty. This is because during the reading or (in the case of an assessment) task solution process, the task model, i.e., a representation of the reading goal and the resources required and available to achieve it, needs to be constantly updated, and this updating metacognitively regulated (Rouet et al., 2017, see last section, see also Winne and Hadwin, 1998).

### Reading Enjoyment and Test-Taking Motivation

Even students who are in good command of comprehension skills, and possess the reading strategy knowledge to successfully build, and through the course of task completion maintain, an adequate task model, might not all alike be motivated to put in the cognitive effort that is required to solve especially hard digital reading tasks. Amongst other lines of research, this is evidenced by studies investigating position effects in low stakes assessments such as PISA. Usually, students' performance declines over the course of an assessment in the sense that the same task will have a lower probability of being answered correctly when it is presented later in the assessment, conditional on a student's skill (Debeer and Janssen, 2013; Debeer et al., 2014; Borgonovi and Biecek, 2016; Weirich et al., 2017; Nagy et al., 2018a). Not all groups of students however are prone to show position effects to equal degrees. For example, Borgonovi and Biecek (2016) analyzed position effects using data from the PISA major domains in 2006, 2009, and 2012, i.e., science, reading, and mathematics, respectively. They found performance declines due to item position in each domain to be especially strong in boys, and in students from lower socioeconomic status (SES) backgrounds. Similarly, Nagy et al. (2018b) found that especially for the domain of reading, position effects were strong in boys, and lower SES students.

What mechanisms might account for item position effects in general, and for inter-individual variance in the strength of these effects? The decline in performance in general has been attributed to students, over the course of the assessment, being less willing and/or able to put effort into solving the assessment tasks. For example, Weirich et al. (2017) measured test taking effort at two points in time during 9410 ninth-graders' completion of a science assessment in Germany. They found not only position effects, but these effects, on an individual level, were predicted by the change in test-taking effort that occurred between the two points in time. Lindner and colleagues (Lindner et al., 2017, 2018) discuss position effects in the context of exercising self-control. They define self-control in accordance with Inzlicht and Schmeichel's (2012) process model of self-control. According to this model, exercising self-control at one point in time will decrease especially the motivation to attend to aversive tasks, and increase the likelihood of attendance to pleasing stimuli at a later point in time. Consistent with this idea, Lindner et al. (2018) found that the decline of performance over the course of a 140 min assessment of mathematics and science was predicted by waning state self-control, measured at seven points in time. Also consistent with this idea, Lindner et al. (2017) found that participants who had been forced to exercise self-control in a later mathematics assessment task exhibited a steeper decline in performance (i.e., stronger position effects) than participants who had not had to exercise self-control. However, contrary to their expectations Lindner et al. (2017) did not find any effects of self-control expenditure on time on task as an indicator of task engagement.

According to Inzlicht and Schmeichel (2012), it is especially effort-requiring and for this reason aversive tasks that are affected by previous expenditure of self-control. In the context of cognitive assessments, this assumption implies that waning self-control (and thus a decline in performance) should be strong especially in those students who perceive the assessment tasks as aversive. A reading task, for instance, might be especially aversive for a person who struggles already with basic reading processes, such as decoding, and in general does not enjoy reading. A fluent reader, in contrast, who also enjoys reading as an activity, from this perspective should be much less prone to exhibit position effects. In line with these ideas, Nagy et al. (2018a) indeed found position effects in a reading assessment to decrease with increasing decoding skill and reading enjoyment on the student level.

Taken altogether, we might expect, both from previous research, and from the perspective of theoretical models such as Inzlicht and Schmeichel's (2012) model of self-control, that the adaptation of time on task to task difficulty is dependent not only on cognitive variables such as comprehension skill and knowledge of reading strategies, but also on motivational variables. In particular, we might expect that especially students who perceive reading as an enjoyable activity might be willing to invest extra time when encountering a hard task. Students for whom reading is aversive, in contrast, might refrain from this investment, so that the adaptation of time on task to task difficulty should be especially pronounced in motivated readers, who report a high level of reading enjoyment.

### The Present Research

To the best of the author's knowledge, there is as yet no study that investigates how, in reading digital text, students' adaptation of total time on task to task difficulty is conjointly predicted by comprehension skill, knowledge of reading strategies and reading enjoyment. In task-oriented reading of multiple texts in general, and in task-oriented reading situations using digital text in particular, readers need to select which texts, or parts of the text available, to access and to use, in which order to accomplish their goals, and which to discard ("navigation", see Lawless and Schrader, 2008; Naumann, 2015; Salmerón et al., 2018). Then they have to decide, for each text or part of a text selected, how much cognitive effort they want to invest into processing. Naturally, especially in hard tasks, it seems beneficial to devote time to processing task-relevant parts of the available materials (see e.g., Rouet and Le Bigot, 2007). Thus, besides investigating the differential adaptation of total time on task to task difficulty, the present research specifically examined how the time students spend on relevant parts of the text stimulus is adapted to task difficulty by students varying in comprehension skill, knowledge of reading strategies, and reading enjoyment. These questions are addressed using data from one of the first computer-based large-scale assessments, the PISA 2009 Digital Reading Assessment.

The Digital Reading Assessment was an International Option in PISA 2009, which was chosen by 19 countries and economies. It was targeted specifically at students' skill in engaging with, comprehending, and using digital texts that were prevalent at the time the assessment was conceived (early 2007 to early 2008), such as websites (personal, educational or corporate), blogs, e-mails, or forums. It comprised a total of 29 tasks, which were distributed across nine units. Each unit consisted of a text stimulus and between one and four tasks. Each text stimulus was made up of several pages, which in most cases belonged to different texts, such as an e-mail and a website (see **Figure 1**). Tasks differed in how many pages students needed to access to complete the task, with some tasks requiring students to read only the task's prompting page (see **Figure 2**), and some tasks requiring the student to perform as many as 13 steps of navigation. Besides pages necessary to complete the task, tasks also varied in their number of relevant pages. Relevant pages were defined as those pages that either contained information that was needed, or could be used, to solve the task, or that needed to be visited in order to arrive at this information. In addition, pages were considered relevant that, from their labels, could be assumed to hold information instrumental either to solving the task, or to completing navigation, such as a "site map". The mean number of relevant pages was 3.61 (SD = 3.42, Md = 2, Min = 1, Max = 14). However, in each task, all pages of the unit's text stimulus were available to students, making it possible to visit not only pages that were relevant to the task, but also non-relevant pages.

The PISA 2009 Digital Reading data set lends itself to addressing the issues raised in a couple of ways. First, computer-based assessments allow for the measurement of time on task, and more so, for a detailed investigation of what parts of a task stimulus (in this case: the text[s]) students encountered for how long, and in which sequence. This makes it possible to derive measures of task engagement, such as the average time spent on relevant pages, which are not routinely available from paper and pencil tests (see Greiff et al., 2015). Second, a number of tasks large enough to model a random effect for tasks is available. Thus, unlike in fixed effects models such as ANOVA or OLS regression, which allow generalization only to other persons, but not to other situations, conditions, or tasks than those specifically employed in the respective design, here the obtained results can in principle be generalized to other tasks that were constructed according to the same framework through modeling task as a random effect (De Boeck, 2008). Third, since reading was a major domain in PISA 2009, rather detailed student-level measures are available, not only as to their comprehension skill, but also as to their knowledge of reading strategies, and their enjoyment of reading. Finally, large scale databases provide not only good variation in terms of students' backgrounds, but also good opportunities to control for background variables such as SES and gender. In the present case, this seems especially crucial, as on the grounds of the results reported by Nagy et al. (2018b) and Borgonovi and Biecek (2016, see section "Reading Enjoyment and Test-Taking Motivation" above), it might well be expected that higher SES students and girls are more likely to adapt their time on task behavior to task difficulty than are their lower SES peers or boys: As it seems, higher SES students, as well as girls, are more prepared than their lower SES or male peers to maintain cognitive effort in an assessment. This means that these background variables also are likely to affect students' preparedness to adapt their time on task behavior to task difficulty. Thus, any analysis targeting time on task behavior conditional on task difficulty should control for the interaction of SES and gender with task difficulty.

### MATERIALS AND METHODS

#### Subjects

Subjects were those students who participated in the PISA 2009 Digital Reading Assessment and for whom time on task for at least two tasks, comprehension skill, knowledge of reading strategies, and enjoyment of reading were available (N = 32,669, country-wise 930 ≤ N ≤ 2800, see **Supplementary Material 1** for country-wise N's). Overall, 50% of the students were boys, with between 46 and 53% boys in each country sample. Due to PISA's sampling scheme, which samples students at the end of compulsory education, students were between 15.17 and 16.33 years old (M = 15.78, SD = 0.29; country-wise M between 15.67 and 15.87).

### Measures

#### Total Time on Task

Time on task was read from log files. It was defined as the time that elapsed between the onset of the task, and the time the student gave a response. It thus comprised the time a student spent reading the task instruction, reading potentially both relevant and irrelevant parts of the text, and deciding on a response. To account for the skew of the time on task distribution, the natural logarithm of the total time on task was used.

#### Time on Relevant Pages

To compute time on relevant pages, each navigation sequence was segmented by page transitions. Then the time elapsed between each transition to, and from, a page classified as task-relevant was summed up across each task-completion sequence. Since in each task the prompting page was defined as relevant, time on relevant pages also comprised the time spent reading the task instruction. It did however not comprise time a student might have spent reading task-irrelevant parts of the stimulus. Because tasks varied considerably in the number of relevant pages they comprised, time on relevant pages was standardized at the number of relevant pages available in each task. To account for the skewness of the distribution, the natural logarithm of time on relevant pages was used.

#### Comprehension Skill

Comprehension skill was measured through the PISA 2009 print reading assessment. Being a major domain in 2009, print reading skill was measured with a total of 131 items in 37 units (a unit consists of a text stimulus accompanied by either a single, or multiple items). These 131 items were allocated to 13 clusters, each worth approximately 30 min of testing time. The clusters were assigned to 13 different booklets together with items from the PISA mathematics and science assessments. Each booklet contained four clusters. Of these 13 booklets, one contained four clusters of reading items, three contained three clusters of reading items, seven contained two clusters of reading items, and two contained one cluster of reading items. Thus, each student completed at least 30 min of print reading, with 11 out of 13 booklets providing at least 60 min (see OECD, 2012, p. 29–30 for details). Items had been constructed according to an assessment framework (OECD, 2009) specifying three different reading aspects, or cognitive operations: (1) Accessing and retrieving, (2) integrating and interpreting, and (3) reflecting and evaluating textual information, as well as two different text formats, continuous and non-continuous texts (see OECD, 2009). It is important to note that in both continuous and non-continuous texts in the print reading assessment students were prompted with the complete text; thus, no navigation in the sense of physical access to text through hyperlinks was required. Comprehension skill was scaled according to the Rasch Model. Weighted Maximum Likelihood Estimates (WLEs) were used in the present analysis. The WLE reliability was 0.84 (see OECD, 2012, p. 194, Table 12.3).

#### Knowledge of Reading Strategies

Knowledge of reading strategies was measured with two reading scenarios. In each scenario students were prompted with a specific reading situation. These reading situations were the following: (a) "You have just read a long and rather difficult two-page text about fluctuations in the water level of a lake in Africa. You have to write a summary", and (b) "You have to understand and remember the information in a text". Each of these reading scenarios was accompanied by either 5 (summary scenario) or 6 (understanding and remembering scenario) possible strategies such as "I try to copy out accurately as many sentences as possible" (summary) or "I quickly read through the text twice" (understanding and remembering). In each scenario, each strategy had to be rated by students on a 6-point rating scale from "not useful at all" to "very useful". It is important to note that the students did not rank-order the strategies themselves, but rated them for their usefulness independently of each other, and these ratings were then, in a second step, rank-ordered within each scenario and student. At the same time, the strategies had been rated, and rank-ordered, by reading experts. The scoring then was accomplished on the basis of the agreement between the rank-order of each student's ratings and the rank-order of the experts' ratings. Specifically, 1 point was awarded for each pairwise comparison in students' ratings that agreed with the respective pairwise comparison in the experts' rating for those 9 (understanding and remembering) and 8 (summarizing) pairs of strategies where there was consensus amongst the experts which strategy was more useful. A point was only awarded when students, in agreement with experts, ranked a strategy to be more useful than another. Thus, when two strategies that entered the score were ranked as equally useful by a student, no point was awarded (see OECD, 2012, p. 282). The possible score thus ranged between 0 (no agreement) and 17 (agreement in all 17 pairwise comparisons considered).<sup>1</sup> The reliability (Cronbach's α) for the 17 pairwise comparisons entering the score was 0.84 in the present sample; the EAP reliability was 0.86.
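The pairwise scoring rule can be illustrated with a short Python sketch. The strategy labels, ratings, and expert-consensus pairs below are hypothetical placeholders, not the PISA 2009 items. Because rank-ordering a student's ratings preserves ties, comparing the ratings directly gives the same result as comparing their ranks.

```python
def strategy_knowledge_score(ratings, expert_pairs):
    """Count expert-consensus pairs that the student's ratings reproduce.

    ratings: dict mapping strategy label -> usefulness rating (1 to 6).
    expert_pairs: list of (more_useful, less_useful) strategy labels for
        which the experts agreed on the ordering.
    """
    score = 0
    for more_useful, less_useful in expert_pairs:
        # A point is awarded only for a strict ordering in the expert
        # direction; ties and reversed orderings earn nothing.
        if ratings[more_useful] > ratings[less_useful]:
            score += 1
    return score

# Hypothetical student ratings for five strategies of one scenario.
ratings = {"A": 6, "B": 4, "C": 4, "D": 1, "E": 2}
# Hypothetical expert-consensus pairs (first element judged more useful).
expert_pairs = [("A", "B"), ("A", "D"), ("B", "C"),
                ("C", "D"), ("E", "D"), ("B", "E")]

score = strategy_knowledge_score(ratings, expert_pairs)
```

Here the student agrees with the experts on five of the six pairs; the pair ("B", "C") earns no point because both strategies received the same rating.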

#### Reading Enjoyment

Enjoyment of reading was measured through 11 items such as "Reading is one of my favorite hobbies" or "For me, reading is a waste of time", which were to be answered on a 4-point Likert scale ranging from "Strongly disagree" to "Strongly agree". Item wordings and item parameters can be found in OECD (2012, p. 290). For the present research, the enjoyment of reading index provided in the OECD PISA 2009 data base was used. Reading enjoyment was scaled according to the partial credit model, providing a Weighted Maximum Likelihood Estimate (WLE) for each student. The reliability (Cronbach's α) for the present sample was 0.89.

#### Task Difficulty

Task difficulty was defined using the item difficulties of the PISA 2009 digital reading items. In PISA, items are scaled according to the Rasch model. The simple logistic model is applied to dichotomous items, while partial credit items are scaled according to the partial credit model (Masters, 1982). Of the 29 reading tasks in the digital reading assessment, eight had partial credit. Item difficulties (delta) were taken from the international calibration of the PISA 2009 digital reading items, which are provided in OECD (2012, Table A4, p. 343). For partial credit items, this parameter marks the location on the latent ability continuum where the likelihoods of responses in the highest and the lowest response category are equal (see e.g., Adams et al., 2012).

#### Socio-Economic Status (SES)

To measure students' SES, the PISA ESCS index was used, which is composed of students' parents' occupational status, students' parents' education, and wealth, as well as cultural and educational resources in students' homes (including, but not limited to, the number of books at home). Technically, the ESCS is a factor score from a principal components analysis of the HISEI (highest parental occupation amongst a student's parents), and the PISA home possessions index (HOMEPOS). Details on how the ESCS was computed in PISA 2009 can be found in OECD (2012, p. 312–313).

#### Procedure

Students were tested in schools during school hours. First, students completed the paper-based cognitive assessment (reading, mathematics and science), which lasted for two hours. Students could take a break after one hour. Afterwards, the student questionnaire was administered. Last, students completed the computer-based reading assessment. In PISA 2009 digital reading skill was the only domain in the computer-based assessment. Digital reading items were presented in a secure test environment where a browser was simulated that had all typical features of commercial web browsers at the time the assessment was conceived. Items were presented unit by unit, and in each item, the unit's text(s) were accessible, regardless of whether they were relevant to the item at hand or not. After giving a response, students could not go back to correct their response. Testing time in the Digital Reading Assessment was 40 min. Students knew in advance how much time in total there was to complete the assessment. In addition, students first completed a 10-min tutorial where they could make themselves familiar with the testing environment and simulated web browser. The assessment was not speeded, as indicated by a small number of not-reached items (0.4 on average, see OECD, 2012, chapter 12).

<sup>1</sup>Note that in PISA 2009, two separate indices were built on the basis of the two scenarios. In the present research, the two scenarios were combined into one score in accordance with the intentions of the authors of the original instrument from which the idea of measuring strategy knowledge employed in PISA 2009, as well as the scenarios and to-be-rated strategies, were derived (see Artelt et al., 2009).

All testing and other data collection instruments and procedures were approved by the PISA governing board, composed of country representatives of all countries that participated in the assessment, as well as by the PISA consortium, led by the Australian Council for Educational Research. Implementation of data collection and management was overseen by national centers, led by national project managers, in each country (see OECD, 2012, p. 24–25 for details). The data that are used for the present research are either in the public domain, and can be found at http://www.oecd.org/pisa/data/ (accessed March 01, 2019), or, where this was not the case, the author had received written consent from OECD to use the Digital Reading Assessment log file data for scientific purposes to be published in scholarly journals. An ethics approval was thus not required for this study as it presents a secondary analysis of OECD data. The author of the present article at no point had access to information identifying individual subjects.

### Statistical Modeling Approach

#### Linear Mixed Model and Estimation

To account for item-specific response times being nested both in items and students, a linear mixed model (LMM) framework was employed that specified crossed random effects for student and item intercepts, and an additional random effect for schools to account for the fact of students being nested in schools due to the PISA sampling procedure. The central research questions were addressed by regressing time on task on the student-level variables comprehension skill, knowledge of reading strategies, and reading enjoyment, and the task-level variable task difficulty, as well as, most importantly, the interaction of each student-level variable with task difficulty. On top of the main effects and the three two-way interactions of comprehension skill, knowledge of reading strategies, and reading enjoyment with task difficulty, the model contained all other possible two-, three- and four-way interactions between the four theoretically relevant variables. Gender and SES were entered as control variables. Since the theoretically relevant effects were two-way interactions involving task difficulty, the two-way interaction of each of gender and SES with task difficulty was entered into the model as well. No other or higher-order interaction terms involving gender and SES were specified.
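As a reading aid, the model described above can be written compactly as follows. The notation is introduced here for illustration and is an assumption, not taken from the original analysis: $y_{ij}$ is the log time of student $i$ on task $j$, $D_j$ is task difficulty, $C_i$, $K_i$, and $E_i$ are comprehension skill, strategy knowledge, and reading enjoyment, $G_i$ and $S_i$ are gender and SES, and $s(i)$ is the school of student $i$. The normality of the random effects is the usual LMM assumption.

$$
y_{ij} = \beta_0
+ \sum_{\emptyset \neq A \subseteq \{D, C, K, E\}} \beta_A \prod_{x \in A} x_{ij}
+ \gamma_1 G_i + \gamma_2 S_i + \gamma_3 G_i D_j + \gamma_4 S_i D_j
+ u_i + v_j + w_{s(i)} + \varepsilon_{ij}
$$

$$
u_i \sim N(0, \sigma_u^2), \quad v_j \sim N(0, \sigma_v^2), \quad w_{s(i)} \sim N(0, \sigma_w^2), \quad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)
$$

The sum runs over the 15 non-empty subsets of the four theoretically relevant variables (4 main effects, 6 two-way, 4 three-way, and 1 four-way interaction), so that together with the intercept and the four terms involving gender and SES the model has 20 fixed effects.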

All models were estimated in the R environment (R Development Core Team, 2008) using the function lmer from the package lme4 (Bates et al., 2015), version 1.1-15. For better interpretability of regression coefficients, all metric variables were centered and standardized within each country or economy. This means that regression coefficients represent expected changes in the criterion variable in terms of its within-country standard deviation, per within-country standard deviation of each predictor. Standard deviations of all variables in the analyses did not vary much across countries (see **Supplementary Material 1**). Gender was entered dummy-coded with girls as the reference group.
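The within-country centering and standardization can be sketched as follows. This is a minimal Python illustration rather than the R code used in the analysis; the function name and the toy data are hypothetical, and the use of the sample standard deviation (denominator n − 1) is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev

def standardize_within(values, groups):
    """z-standardize each value using the mean and SD of its own group."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    # Mean and sample standard deviation per group (here: per country).
    stats = {g: (mean(v), stdev(v)) for g, v in by_group.items()}
    return [(value - stats[g][0]) / stats[g][1]
            for value, g in zip(values, groups)]

# Hypothetical scores from two countries measured on different scales.
scores = [1, 2, 3, 10, 20, 30]
countries = ["A", "A", "A", "B", "B", "B"]

z_scores = standardize_within(scores, countries)
```

After this step, a value of 1 means "one within-country standard deviation above the country mean" in either country, which is what makes the regression coefficients comparable across countries.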

#### Integration of Country-Specific Results

Country-specific results (fixed effects) were integrated using a random-effects meta-analytic model (Hedges and Vevea, 1998), using the R-package metafor (Viechtbauer, 2010). Meta-analysis lends itself to the analysis of data such as the present for multiple reasons. In educational assessments such as PISA, sampling occurs at the level of countries, so that an analysis pooling data from all countries would not be appropriate. However, besides effects for individual countries, it is of interest how an effect turns out in general, i.e., across countries. A random-effects meta-analytic model that discriminates a fixed (total) effect from a random, study-specific effect seems especially suitable in this situation: The fixed effect may be interpreted as a general effect, which is the same across countries. The variance of the study-specific (i.e., country-specific) effect gives an estimate, and allows a significance test, for the variance of country specifics adding to the total effect size, over and above sampling variance.<sup>2</sup> To conduct the meta-analysis, one vector containing the country-specific estimates was created for each effect by reading the respective effect from the respective lmer object using the function fixef from the lme4 package. A second vector containing each effect's standard error for each country was created using the se.fixef function from the package arm (Gelman and Su, 2016). These two vectors (after taking the square of each effect's standard error to arrive at the variance) were given to the rma function from the metafor package. An alpha level of 0.05 was set for all significance tests.

<sup>2</sup>Theoretically, an alternative to the meta-analytic approach used here would have been a model where country is treated as a random effect, and a random slope is estimated for each effect across countries. However, apart from the fact that, given the number of fixed effects in the present analysis, such a model would probably have been computationally intractable, it would only tell us if an effect varies across countries, but not in which way. Including country as another fixed effect, and estimating its interaction with each of the other fixed effects in the analysis would have added at least another 20 fixed effects to an already complex model. Thus, in the present case, the meta-analytic approach appeared to be the best compromise between comprehensiveness and parsimony that could be found.
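The pooling step described in this section can be illustrated with a self-contained Python sketch. Note the assumptions: the sketch uses the DerSimonian-Laird estimator of between-country variance because it has a closed form, whereas the default estimator of the rma function is REML, so results from the two can differ slightly. The function name and the example numbers are hypothetical.

```python
import math

def random_effects_meta(estimates, std_errors):
    """Random-effects pooling with the DerSimonian-Laird estimator.

    Returns (pooled effect, its standard error, tau^2, Cochran's Q).
    """
    variances = [se ** 2 for se in std_errors]   # sampling variances
    weights = [1.0 / v for v in variances]       # inverse-variance weights
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, estimates)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    k = len(estimates)
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    # Between-country variance over and above sampling variance.
    tau2 = max(0.0, (q - (k - 1)) / c)
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = (sum(w * y for w, y in zip(re_weights, estimates))
              / sum(re_weights))
    pooled_se = math.sqrt(1.0 / sum(re_weights))
    return pooled, pooled_se, tau2, q

# Hypothetical country-specific estimates and standard errors of one effect.
pooled, pooled_se, tau2, q = random_effects_meta([0.0, 2.0], [0.5, 0.5])
ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
```

When tau2 is estimated as zero, the country-specific effects vary no more than sampling variance would predict, which corresponds to the "homogeneous across countries" cases reported in the Results.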

#### Illustration of Interaction Effects Through Simple Slopes

To illustrate the interaction effects between task difficulty and comprehension skill, knowledge of reading strategies, and reading enjoyment, respectively, simple slopes were computed and tested for significance at the upper and lower boundaries of the respective distributions (2.5th and 97.5th percentiles, or ±1.96 standard deviations). For comprehension skill (and task difficulty) these percentiles represent the boundaries between the highest and second to highest competency level (levels 5 and 6), and the lowest and second to lowest competency level (levels 1a and 1b), respectively (see OECD, 2010, for the interpretation and description of reading competency levels). The values at which to compute simple slopes for knowledge of reading strategies and reading enjoyment were chosen accordingly. It is important to note that irrespective of the values chosen for the computation of simple slopes, the interaction effect as such relates to the whole sample, and simple slopes could, in principle, be computed for any value of each predictor in the model (see Aiken et al., 2003).
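The logic of a simple slope can be made concrete with a short Python sketch. In a model with a two-way interaction, the slope of task difficulty at a given moderator value z equals the main effect plus z times the interaction coefficient. The coefficient values below are used for illustration only, and the variances and covariance are hypothetical. Because the simple slopes reported in this article were pooled across countries in their own meta-analyses, values computed this way from pooled coefficients need not match the reported slopes exactly.

```python
import math

def simple_slope(b_main, b_interaction, z):
    """Slope of the focal predictor at moderator value z."""
    return b_main + b_interaction * z

def simple_slope_se(var_main, var_interaction, cov, z):
    """Standard error of the simple slope at moderator value z."""
    return math.sqrt(var_main + z ** 2 * var_interaction + 2 * z * cov)

b_difficulty = 0.39   # illustrative main effect of task difficulty
b_interaction = 0.09  # illustrative difficulty x comprehension interaction

slope_high = simple_slope(b_difficulty, b_interaction, 1.96)   # strong readers
slope_low = simple_slope(b_difficulty, b_interaction, -1.96)   # weak readers

# Hypothetical sampling variances and covariance of the two coefficients.
se_high = simple_slope_se(0.0004, 0.000025, 0.0, 1.96)
```

With a positive interaction, the slope of task difficulty is steeper at the upper end of the moderator distribution than at the lower end, which is the pattern that the simple slopes are meant to illustrate.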

### RESULTS

Means, standard deviations and correlations for all variables in the analyses pooled across countries and economies are provided in **Table 1**. Country-specific statistics are provided in **Supplementary Material 1**.

### Random Effects

There was significant variation of time on task, as well as time on relevant pages, between tasks, subjects, and schools in each country and economy. The corresponding variance components can be seen in detail in the model summaries that are provided as **Supplementary Material 2**. **Supplementary Material 3** provides the respective significance tests. In the following, all estimates are meta-analytic fixed effects across countries and economies. Country-specific effects can be found in **Supplementary Material 2**. Most of the fixed effects of theoretical interest showed significant variability across countries and economies, over and above sampling variance. Since, however, this variability was not of theoretical interest in the present research, the estimates and the significance of between-country variance are presented as **Supplementary Material 4**. In the following, if a fixed effect showed no variance across countries over and above sampling variance (the exception to the rule), this is explicitly mentioned.

### Fixed Effects

#### Main Effects of Task Difficulty, Comprehension Skill, Strategy Knowledge and Reading Enjoyment

As expected, there was a significant main effect of task difficulty, meaning that students on average took more time in harder tasks (meta-analytic effect: b = 0.39, SE = 0.02, 95%-CI: [0.35; 0.43]), and on average spent more time on task-relevant pages (meta-analytic effect: b = 0.18, SE = 0.03, 95%-CI: [0.13; 0.24]). Neither main effect of task difficulty varied across countries over and above sampling variance. Also, both time on task indicators were positively predicted by comprehension skill. More skilled comprehenders spent more time on the tasks in general, and they spent more time on relevant pages (meta-analytic effect for both time on task indicators: b = 0.09, SE = 0.01, 95%-CI: [0.08; 0.10]).

On top of the main effect for comprehension skill, there was a positive main effect of strategy knowledge on both time on task indicators. For both time on task indicators this effect was b = 0.03 (SE < 0.01), 95%-CI: [0.02; 0.03]. In addition to the main effects of comprehension skill and strategy knowledge, reading enjoyment had a positive main effect, meaning that students enjoying reading both spent more time on the tasks in total (meta-analytic effect: b = 0.02, SE < 0.01, 95%-CI: [0.01; 0.03]), and on relevant pages (meta-analytic effect: b = 0.02, SE < 0.01, 95%-CI: [0.01; 0.02]).

### Interactions of Task Difficulty With Comprehension Skill, Strategy Knowledge and Reading Enjoyment

#### Comprehension Skill

The main effects of task difficulty and comprehension skill were qualified by a significant positive two-way interaction (see **Figure 3**, left panel, and **Figure 4** for an illustration). Meta-analytically, this interaction amounted to b = 0.09 (SE < 0.01), 95%-CI: [0.08; 0.09] for total time on task, and b = 0.08 (SE < 0.01), 95%-CI: [0.07; 0.08] for time on relevant pages, representing a medium-sized effect each.


<sup>a</sup>k = 640,482 task responses. <sup>b</sup>Seconds. <sup>c</sup>Averaged across the number of relevant pages available. <sup>d</sup>n = 32,699 students. <sup>e</sup>1 = female, 2 = male.

To further interpret these interactions, simple slopes were computed depicting the effect of task difficulty in very strong comprehenders and very weak comprehenders (z<sub>comprehension</sub> = ±1.96, see **Figure 4** for an illustration). Likewise, the effect of comprehension skill was estimated in very hard and very easy items (z<sub>difficulty</sub> = ±1.96). These analyses revealed the following: There was a strong effect of task difficulty on both time on task indicators in strong readers, amounting meta-analytically to b = 0.57 (SE = 0.02), 95%-CI: [0.52; 0.61] for total time on task, and to b = 0.33 (SE = 0.03), 95%-CI: [0.28; 0.39] for time on relevant pages. Neither of these slopes varied significantly across countries. For poor comprehenders at the lower end of the comprehension skill distribution, the effect of task difficulty on total time on task was much reduced, though still significant; the meta-analytic effect was b = 0.22 (SE = 0.02), 95%-CI: [0.18; 0.26]. No significant effect of task difficulty on time on relevant pages was found in poor comprehenders, b = 0.03 (SE = 0.03), 95%-CI: [−0.02; 0.09]. Once again, these two simple slopes displayed no variance over and above sampling variance.

FIGURE 4 | Simple slopes for the regression of total time on task and time on relevant pages on task difficulty in students high (97.5th perc.) and low (2.5th perc.) in comprehension skill for one sample country (Australia). Data points are raw data. Regression intercepts and slopes are model-based estimates.

Correspondingly, in hard tasks, there was a strong positive association of comprehension skill with both total time on task, b = 0.26 (SE = 0.01), 95%-CI: [0.23; 0.28], and time on relevant pages, b = 0.23 (SE = 0.01), 95%-CI: [0.22; 0.26]. In easy tasks, in contrast, this association was negative for both total time on task, b = −0.08 (SE = 0.01), 95%-CI: [−0.10; −0.07], and time on relevant pages, b = −0.06 (SE = 0.01), 95%-CI: [−0.07; −0.05].

Taken altogether, these results suggest the following: Skilled comprehenders align their total time on task, as well as the time they spend on task-relevant hypertext pages, closely to the tasks' difficulties. In contrast, much less of such an adaptive behavior occurs in poor comprehenders. These readers show some alignment of their total time on task with task difficulty, but none of the time they spend on relevant parts of the text. Correspondingly, when tasks were hard, skilled comprehenders appeared to invest more time in these tasks than poor comprehenders. Easy tasks in contrast were more quickly solved by skilled, as opposed to poor comprehenders.

#### Knowledge of Reading Strategies

The positive main effect of strategy knowledge on both total time on task and time on relevant pages was in each case qualified by a significant positive interaction with task difficulty, amounting to b = 0.02 (SE < 0.01), 95%-CI: [0.02; 0.02] both for total time on task and time on relevant pages (see the middle panel in **Figure 3**, and **Figure 5** for an illustration), which represented a small effect each. To interpret these interactions, simple slopes were computed to estimate the effect of task difficulty for students at the upper and lower ends of the strategy knowledge distribution (z<sub>strategyknowledge</sub> = ±1.96), and, correspondingly, the effect of strategy knowledge in easy and hard items. For students high in knowledge of reading strategies, the effect of task difficulty on total time on task was estimated as b = 0.44 (SE = 0.02), 95%-CI: [0.40; 0.47], and on time on relevant pages as b = 0.22 (SE = 0.03), 95%-CI: [0.16; 0.29]. Both these effects were homogeneous across countries and economies. For students low in knowledge of reading strategies, the effects of task difficulty on time on task were still significant, but reduced in magnitude. They amounted to b = 0.35 (SE = 0.02), 95%-CI: [0.31; 0.39] for total time on task and b = 0.15 (SE = 0.03), 95%-CI: [0.09; 0.21] for time on relevant pages. Once again, these two effects showed no variability over and above sampling variance across countries and economies.

In hard tasks, the effect of strategy knowledge on total time on task was estimated as b = 0.07 (SE = 0.01), 95%-CI: [0.06; 0.08], and the effect on time on relevant pages as b = 0.06 (SE < 0.01), 95%-CI: [0.05; 0.07]. These positive associations were reversed to negative in easy tasks, where the effect of strategy knowledge on total time on task was b = −0.02 (SE < 0.01), 95%-CI: [−0.02; −0.01], and on time on relevant pages b = −0.01 (SE < 0.01), 95%-CI: [−0.01; 0.00].

Taken together these results suggest that over and above the effect of comprehension skill, students with better knowledge of reading strategies do a better job in aligning their time on task behavior with task difficulty. Students with better knowledge of reading strategies at the same time invest more time in hard tasks, and are quicker in solving easy tasks, than their less knowledgeable peers.

#### Reading Enjoyment

As were the main effects of comprehension skill and knowledge of reading strategies, the main effect of reading enjoyment was moderated by task difficulty through a significant positive interaction, b = 0.02 (SE < 0.01), 95%-CI: [0.01; 0.02] for both total time on task and time on relevant pages (see the right hand panel in **Figure 3**, and **Figure 6** for an illustration), which represented a small effect each. Simple slopes analyses (see **Figure 6** for an illustration) revealed that in students high in reading enjoyment, there were strong or medium-sized effects of task difficulty on both total time on task, b = 0.43 (SE = 0.02), 95%-CI: [0.39; 0.47], and time on relevant pages, b = 0.22 (SE = 0.03), 95%-CI: [0.16; 0.27]. These effects were reduced, but remained positive and significant in students low in reading enjoyment, where they amounted to b = 0.36 (SE = 0.02), 95%-CI: [0.32; 0.40] for total time on task, and b = 0.16 (SE = 0.03), 95%-CI: [0.09; 0.21] for time on relevant pages. None of the simple slopes for task difficulty in students low and high in reading enjoyment displayed variance over and above sampling variance.

As for comprehension skill and knowledge of reading strategies, a positive effect of enjoyment of reading was found in hard tasks, which amounted to b = 0.05 (SE = 0.01), 95%-CI: [0.04; 0.06] for both total time on task and time on relevant pages. In easy tasks, this effect was once again reversed to negative, and amounted to b = −0.02 (SE < 0.01), 95%-CI: [−0.03; 0.00] for total time on task, and b = −0.01 (SE < 0.01), 95%-CI: [−0.02; 0.00] for time on relevant pages. Thus, on top of the corresponding effects for comprehension skill and knowledge of reading strategies, students who enjoy reading more appear to invest more time in difficult tasks, but are quicker when they work on easy tasks than their peers who report less enjoyment in reading. It should be noted, though, that the negative effect of enjoyment was small, and that simple slopes were computed for tasks at the lower and upper end of the task difficulty distribution. Thus, for easy to moderately difficult tasks, the effect of enjoyment of reading on time on task would be zero, or slightly positive.
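The logic behind these simple slopes can be illustrated with a short sketch. In a model containing an enjoyment by task difficulty interaction, the conditional effect of enjoyment at a given (standardized) task difficulty is the main effect of enjoyment plus the interaction coefficient multiplied by that difficulty. In the sketch below, only the interaction coefficient of 0.02 is taken from the text; the main effect `B_ENJOYMENT`, the difficulty values, and the function names are hypothetical and chosen for illustration only, so the output does not reproduce the reported estimates.

```python
# Illustrative sketch only: B_ENJOYMENT and the difficulty values are
# hypothetical; B_INTERACTION = 0.02 is the interaction reported in the text.
B_ENJOYMENT = 0.015    # assumed effect of enjoyment at average task difficulty
B_INTERACTION = 0.02   # enjoyment x task difficulty interaction


def simple_slope(difficulty: float) -> float:
    """Conditional effect of enjoyment on time on task at a given difficulty."""
    return B_ENJOYMENT + B_INTERACTION * difficulty


def crossover_difficulty() -> float:
    """Task difficulty at which the effect of enjoyment changes sign."""
    return -B_ENJOYMENT / B_INTERACTION


if __name__ == "__main__":
    for d in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"difficulty = {d:+.1f}: effect of enjoyment = {simple_slope(d):+.3f}")
    print(f"effect changes sign at difficulty = {crossover_difficulty():+.2f}")
```

Under these assumed values, the effect of enjoyment is negative only for tasks well below average difficulty and is already positive at average difficulty, which mirrors the caveat that the negative effect is confined to the easy end of the difficulty distribution.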

### DISCUSSION

The present article examined the task-adaptive allocation of time, and time spent on relevant pages, while reading digital text, dependent on students' comprehension skills, knowledge of reading strategies, and enjoyment of reading. Although these three student characteristics are positively correlated (see **Table 1** and **Supplementary Material 1**), independent effects (that is, while controlling for each other) could be secured, indicating that students high in each of these variables showed a more pronounced adaptation, both of total time on task and of time on relevant pages, to the tasks' difficulties. This was evidenced by significant positive interactions of each of these student characteristics with task difficulty in predicting time on task and time on relevant pages, which were found consistently across 19 countries and economies (the only exceptions being Colombia and Hungary, where no interaction of task difficulty with reading enjoyment was found, see **Figure 3**).

### The Present Results Viewed From Previous Theory and Findings

These results are much in line with research from cognitive, educational, and social psychology that describes how students build models of the task when reading, how they monitor the reading process, and how they maintain effort when encountering a lengthy assessment comprising multiple tasks, such as PISA. Specifically, the finding that time on task and time on task-relevant pages are more positively predicted by task difficulty in strong comprehenders is much in line with the RESOLV model (Rouet et al., 2017), as strong comprehenders can be expected to be better at creating adequate task models. It is also in line with previous research pointing to better comprehenders behaving in a more task-adequate way when it comes to selecting relevant, and discarding non-relevant, text materials (Cerdán et al., 2011; Salmerón et al., 2015a). The result that knowledge of reading strategies is predictive of the adaptivity of time on task behavior is also in line with the RESOLV model, as well as with earlier models of metacognitive engagement while learning, such as Winne and Hadwin's (1998) COPES (Conditions, Operations, Products, Evaluations, Standards) model. Finally, the interaction of task difficulty with reading enjoyment is consistent with research describing position effects, or performance declines, in low-stakes assessments as a result of failing self-control and, in consequence, failing motivation to mobilize mental resources (e.g., Lindner et al., 2017; Nagy et al., 2018a). These effects are moderated by students' enjoyment of reading, presumably because students who enjoy reading view the assessment task as less aversive, and thus suffer less from failing self-control (Inzlicht and Schmeichel, 2012). From this perspective, it was to be expected that students enjoying reading as an activity would also be more likely to invest time especially in hard tasks. This latter result is also nicely aligned with recent descriptions of "engaged" reading as proposed by Guthrie and colleagues (Guthrie et al., 2012). In their model, a direct predictor of reading achievement is behavioral engagement, which they also term "dedication" (p. 604). Behavioral engagement in itself is dependent on motivations to read. In the present context, we might well assume behaviorally dedicated students to be those who allocate their time especially to hard tasks, and who devote extra time to reading relevant parts of the text when the task is hard.

### Implications of the Present Results for Assessment and Education

As mentioned in the introductory part of this article, completion of an assessment task, and, in turn, the estimated ability of a student is not merely the reflection of a latent variable. Rather, it is always the result of intertwined cognitive and motivational processes carried out at the time of task completion. One of these processes is the task-adequate mobilization of cognitive resources, and thus the expenditure of time. From the perspective of the present results, the question thus arises whether time on task, or time spent on relevant pages, is governed by variables that can be regarded as part of the to-be-measured construct "digital reading skill". In other words: If a crucial process of task engagement that is predictive of task performance is functionally dependent on processes and dispositions that are clearly outside the definition of the targeted construct, this would pose a threat to validity arguments made on the basis of the respective test scores (AERA et al., 2014). The largest interaction effects found in the present research were those of task difficulty with comprehension skill. Comprehension, however, clearly is part of the construct "digital reading", as digital reading is reading in the first place. Thus, if a student is in a better position to solve a digital reading task due to better comprehension skill, in part because these superior comprehension skills enable them to better align their effort with the task's requirements, this does not necessarily pose a threat to the assessment's validity. Rather, one might argue, it describes an additional pathway whereby good comprehension skills predict good performance in digital reading, and thus explains the positive correlation that is usually found between offline and online measures of reading skill and performance (e.g., Coiro, 2011; OECD, 2011; Naumann and Salmerón, 2016).

A similar argument might be made for knowledge of reading strategies. A long tradition of previous research has pointed to the necessity of strategic control, especially in reading situations involving digital text, web-based text, hypertext, or multiple texts (e.g., Bannert, 2003; Azevedo and Cromley, 2004; Naumann et al., 2008; see Cho and Afflerbach, 2017 for an overview). If, however, from a construct perspective, metacognitive regulation is one central aspect of reading digital text, it would be counterintuitive to view it as a threat to validity when knowledge of reading strategies governs the adaptive allocation of time on task, and possibly thereby performance on tasks. Rather, as for comprehension skill, the present results evidence one particular mechanism by which knowledge of reading strategies might translate into successful reading of digital text.

This notion does not necessarily hold for enjoyment of reading. According to the reasoning put forward in the present research, students high in reading enjoyment do a better job in aligning their time on task behavior with task difficulty because they see reading as a less aversive task. For this reason, it is easier for them than for their peers lower in reading enjoyment to maintain effort and invest time in difficult tasks. Thus, according to the present reasoning, the positive association of reading enjoyment, or reading motivation in general, and reading skill does not only arise because students higher in reading enjoyment, or motivation, come from higher SES backgrounds, where they can also acquire better skill (e.g., Artelt et al., 2010). Also, it is not (only) that higher enjoyment or motivation longitudinally brings about better skills, or the reverse (e.g., Becker et al., 2010; Retelsdorf et al., 2011). Rather, just like comprehension skill and knowledge of reading strategies, reading enjoyment seems to be among the variables that govern the process of task engagement in the assessment situation itself and thereby may bring about better task performance and thus a higher level of estimated skill.

Unlike comprehension skill and knowledge of reading strategies, however, reading enjoyment is not necessarily to be seen as a part of the construct "skill in reading digital text". In other words: A skilled digital reader who is not in command of comprehension skills is as self-contradictory an idea as a skilled digital reader who is not in possession of knowledge of reading strategies. In contrast to this, a skilled digital reader who simply does not enjoy reading might be a rare observation, as reading skill and enjoyment are usually positively correlated. The notion of such a reader, however, is not at all a contradictory idea.

From these perspectives, practical implications for the design of assessments, and practical implications for reading in other task-oriented reading situations such as learning, are not quite aligned with one another: The finding that reading enjoyment, even if only to a small extent, enhances the adaptive allocation of time might pose a threat to valid interpretations of test scores. On the other hand, it once again highlights the crucial role of motivation in bringing about dedicated and engaged reading behavior, which in turn has been found to be a crucial determinant of learning from text (Guthrie et al., 2012, 2013). This, in turn, once again highlights the need for students to develop motivational traits and attitudes that help them to put in the effort required to cope with difficult and demanding digital texts. Obviously, this notion holds also for knowledge of reading strategies, and, last but not least, comprehension skills. Putting students in a position to adequately mobilize cognitive resources when dealing with digital text seems especially important, as digital text to an increased degree requires students not only to "navigate" (see section "The present research" above), but also to evaluate text (Salmerón et al., 2018), a process which is cognitively demanding (Richter and Maier, 2017), and which many students find difficult to perform (e.g., Brante and Strømsø, 2018).

### Limitations and Directions

Obviously, the interpretations of the present results put forward here are not without alternative. This is a result of the correlational nature of many large-scale assessment data sets, the present amongst them. This means that there is a host of person-related variables that might, in theory, account for the present results but were unaccounted for in the present research. One candidate here is, for example, dispositional, or trait, self-control, a variable that was found to be related to test-taking effort (Lindner et al., 2017), and thus may very well predict how well students are prepared to align their time on task behavior with task difficulties. Another aspect not taken into account here is students' specific preparedness to cope with digital text, such as their navigation skills. Against the background of navigation being a central requirement of reading digital text (Salmerón et al., 2018), students' preparedness to cope with navigation demands might also govern how much time they are prepared to invest in hard, and how little time they might need to complete easy, digital reading tasks. Future research thus should seek out additional variables that might affect students' preparedness to adapt their time on task behavior. Analyses such as these might also explain why some lesser skilled readers in fact did align their time on task behavior with task difficulties, while others did not (see **Figure 4**): Perhaps some poorer comprehenders are in possession of skills other than comprehension, which compensate for their lesser comprehension skill, allowing them to nevertheless build an adequate task model. For instance, recent research has shown that problem solving skills interact with comprehension skills in predicting digital reading in such a compensatory fashion (Naumann et al., 2018).

A second limitation comes from the fact that the three predictors used in the present analyses were measured with largely varying numbers of items (although the reliabilities were comparable). Thus, in an assessment using a more comprehensive measurement of reading enjoyment, or knowledge of reading strategies, the interactions of these variables with task difficulty might have been even stronger, maybe at the expense of the interaction between comprehension skill and task difficulty. Future research will have to seek out whether the small effect size for the interaction between reading enjoyment and task difficulty is indeed a function of the comparatively small number of items, or if – to the contrary – after controlling for comprehension skill, there is little variance left to be explained for reading enjoyment due to these variables being positively correlated (see **Table 1** and **Supplementary Material 1**).

In a similar vein, future research should overcome not only the limited number of items, but also the limited operationalization of reading motivation used in the present research. For example, in real-life task-oriented reading it might well be the case that topic interest is even more important than reading enjoyment in shaping the interaction between task difficulty and time on task: It might well be that a person who only moderately enjoys reading (and might even be a modest comprehender) will invest time even in a hard task if they have a very high interest in the topic. Research such as this, however, must be left to future experiments, as large-scale reading assessments usually cannot provide data on topic interest due to the variety of topics addressed by the texts in the assessment. Finally, future research into the role of motivational variables might consider not only linear (as in the present research), but also more complex nonlinear effects. For example, a motivated reader who is in possession of only moderate comprehension skills might adapt time on task behavior to task difficulty in a nonlinear fashion. Such a reader might invest time especially in moderately difficult tasks, while realizing that very hard tasks are beyond their skill level.

A third limitation, and possible avenue for future research, comes from the fact that only one domain was investigated in this research. Future studies might look at how, for example, time on task behavior in mathematics might be shaped by students' mathematical skills. For example, the ability to "formulate" a mathematical problem, i.e., to "translate from a real-world setting to the domain of mathematics and provide the real-world problem with mathematical structure, representations, and specificity" (OECD, 2013, p. 28) might be conceptually related to building an adequate task model in a reading task. Also, subjective interest in mathematics might moderate the relationship between task difficulty and time on task in a fashion similar to the respective effects of reading enjoyment that were found in the present research. With reference to tasks, requirements in the present research were operationalized as the tasks' overall difficulties, as estimated by the international calibration of the Digital Reading Assessment items (OECD, 2012). Building on the present results, future research might seek out which specific features that make a digital reading task "hard" (on the word, sentence, text, or intertextual level) in particular drive time on task behavior in conjunction with person level variables such as the ones addressed here. From an analysis such as this, the question might also be addressed how digital reading assessment tasks might be constructed in a way that variables such as reading enjoyment, or other person level variables that are not part of the targeted construct, do not interact with task features in bringing about task engagement processes that presumably impact task performance and thus estimated abilities.
With large scale assessments such as TIMSS or PISA moving toward being computer-based in general (Mullis, 2017; OECD, 2017), analyses such as these could be carried out routinely as part of field trials, and thereby potentially increase the validity of the assessments and in turn the veridicality of conclusions drawn for educational policy and practice.

### ETHICS STATEMENT

This study is based on a secondary analysis of OECD PISA 2009 data. Data collection was in accordance with APA ethical standards. The author had OECD's permission to utilize the PISA 2009 Digital Reading Data log files, where not publicly available, for scholarly research and thanks the OECD for this permission.

### AUTHOR CONTRIBUTIONS

JN helped in conceiving the assessment materials, conceived and conducted the analyses, and wrote the manuscript.

### FUNDING

This study was supported by the German Ministry of Education and Science (BMBF), Grant 01LSA1504A.

### ACKNOWLEDGMENTS

I would like to thank the BMBF for their support. I would like to thank the OECD for permission to use the PISA 2009 Digital Reading Logfile Data for scholarly work. I would also like to thank Mr. Adrian Peach for careful proofreading of this manuscript.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.01429/full#supplementary-material



**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Naumann. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Fine-Grained Assessment of Children's Text Comprehension Skills

*Marije den Ouden1,2, Jos Keuning1 and Theo Eggen1,2\**

*1 Cito, Arnhem, Netherlands, 2 Faculty of Behavioural, Management and Social Sciences, University of Twente, Enschede, Netherlands*

#### *Edited by:*

*Frank Goldhammer, German Institute for International Educational Research (LG), Germany*

#### *Reviewed by:*

*Maj-Britt Isberner, University of Kassel, Germany Tobias Doerfler, Heidelberg University of Education, Germany*

> *\*Correspondence: Theo Eggen theo.eggen@cito.nl*

#### *Specialty section:*

*This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology*

*Received: 23 October 2018 Accepted: 20 May 2019 Published: 28 June 2019*

#### *Citation:*

*den Ouden M, Keuning J and Eggen T (2019) Fine-Grained Assessment of Children's Text Comprehension Skills. Front. Psychol. 10:1313. doi: 10.3389/fpsyg.2019.01313*

Text comprehension is an essential skill for achievement in personal, academic, and professional life. Therefore, it is tremendously important that children's text comprehension skills are actively monitored from an early stage. Text comprehension is, however, a complex process in which different reading abilities continuously interact with each other on the word, sentence, and text levels. In educational practice, various tests are used to measure these different reading abilities in isolation, which makes it very difficult to understand why a child scores high or low on a specific reading test and to adequately tailor reading instruction to the child's needs. Dynamic assessment has the potential to offer insights and guidance to teachers as cognitive processes that are important for learning are examined. In dynamic tests, students receive mediation through instruction when answering test questions. Although computer-based dynamic assessment in the reading domain holds potential, there is almost no support for the validity of dynamic measures of text comprehension. The aim of the present study is to determine design principles for the intended use of computer-based dynamic assessment of text comprehension. Based on the dynamic assessment literature, we developed a model for assessing the different reading abilities in conjunction. The assumption is that this model gives a fine-grained view of children's strengths and weaknesses in text comprehension and provides detailed information on children's instructional needs. The model was applied in a computer-based (fourth-grade) reading assessment and evaluated in practice through a three-group experimental design. We examined whether it is possible to (1) measure different aspects of the reading process in conjunction in order to obtain a full understanding of children's text comprehension skills, (2) measure children's learning potential in text comprehension, and (3) provide information on their instructional needs. 
The results show that while the model helped in explaining the children's text comprehension scores, unexpectedly, mediation did not clearly lead to progress in text comprehension. Based on the outcomes, we substantiate design principles for computer-based dynamic assessment of text comprehension.

Keywords: computer-based assessment, design principles, dynamic assessment, instructional needs, learning potential, reading process, text comprehension

## INTRODUCTION

Text comprehension is an important skill for personal fulfillment and for achieving academic and professional success. Nevertheless, it is also a very complex skill involving different cognitive abilities that interact on different levels. At lower levels, word identification skills and knowledge of word meanings are essential for understanding text (Perfetti and Hart, 2001; Perfetti, 2017). At higher levels, text comprehension is influenced by the ability to make inferences or monitor comprehension (Perfetti et al., 2005). This complex nature results in a variety of possible causes underlying problems encountered in text comprehension (Cain and Oakhill, 2006; Colenbrander et al., 2016; Kleinsz et al., 2017). Whereas most primary school teachers underline the importance of developing good text comprehension skills, they also point to difficulties understanding the reading problems children encounter. We aim to develop a framework for fine-grained assessment of text comprehension skills that supports teachers in understanding children's text comprehension problems.

### Measuring Text Comprehension Skills

Research on text comprehension has advanced a number of theories on the different parts of the reading process. Due to the complex nature of text comprehension, interactive models of the reading process arguably provide the best framework for understanding and studying this concept (Stanovich, 1980; Perfetti, 1999; Cain et al., 2017). These models have in common that they describe the reading process as an interaction on different levels (e.g., word, sentence, and text levels) and often make a distinction between processing information explicitly stated in the text and deriving information that is only implied by the text. One of the most influential models is the construction-integration model (Van Dijk and Kintsch, 1983; Kintsch, 1988, 1998), which describes the reciprocal relation between the *construction* of a text-based model and its *integration* into a situation model. A distinction is made between combining all information that is explicitly stated in the text on the word, sentence, and text levels (text model) and interpreting this information, together with prior knowledge, as a coherent whole (situation model). Verhoeven and Perfetti (2008) place greater emphasis on the role of word knowledge by conceptualizing text comprehension as an interaction between word identification and word-to-text integration (see **Figure 1**). Words are identified by combining orthographic, phonological, and semantic representations. The quality of these representations significantly influences text comprehension (Perfetti and Hart, 2001). Identified words can be linked to each other in order to give meaning to a sentence, and sentences can be linked through inferences based on explicit (text model) and implicit (situation model) information.

In order to obtain a full understanding of children's text comprehension skills, educational assessment should cover the various aspects of the reading process. In educational practice, a variety of tests are used to measure these different aspects. For example, nationally standardized tests (NSTs) are deployed for student monitoring, i.e., monitoring students' progress on skills such as text comprehension, vocabulary, and word decoding. Additionally, tests originating from teaching materials are administered to evaluate knowledge acquired through education. Consequently, different aspects of the reading process are evaluated through different reading tests, and even tests that are supposed to measure the same construct show only modest intercorrelation (Nation and Snowling, 1997; Keenan et al., 2008). This fragmentary way of measuring reading ability makes it hard to interpret the test results in a coherent way, and therefore very difficult to understand why a child scores high or low on a specific reading test and to adequately tailor reading instruction to the child's needs. Moreover, measuring different aspects of the reading process in isolation is questionable given the interactive nature of that process. It might also be difficult to eliminate every aspect other than that intended for measurement. For example, poor vocabulary can result in an underestimation of inference-making skills (Segers and Verhoeven, 2016; Daugaard et al., 2017; Swart et al., 2017). These issues could be addressed by measuring text comprehension in a more comprehensive way, i.e., measuring different aspects of the reading process in conjunction.

Furthermore, commonly used tests usually provide insufficient diagnostic information, e.g., information about students' misconceptions and learning potential (Fani and Rashtchi, 2015). Thus, these tests provide only little support for teachers in aligning their reading instruction to the educational needs of their students. *Dynamic assessment* has the potential to offer insights and guidance to teachers as cognitive processes that are important for learning are examined (Lidz and Elliott, 2000; Elliott, 2003). In dynamic tests, students receive mediation through instruction when answering test questions. Dynamic assessment of text comprehension skills can provide teachers with information to identify students' capabilities as well as their specific needs for training in the reading domain (Dörfler et al., 2017).

### Dynamic Assessment

The idea of dynamic assessment is based on Vygotsky's (1978) theory of the zone of proximal development (ZPD), wherein human abilities are perceived in a constant state of flux and are sensitive to sources of mediation that can feed learning mechanisms. Lantolf and Poehner (2007) describe two approaches to dynamic assessment: the interactionist and interventionist approaches. The interactionist approach involves the traditional dynamic assessments, whereby the type and amount of instruction provided depend on one-on-one interaction between the teacher and student. The instruction is completely attuned to the responsiveness of the student (Lantolf and Poehner, 2007). In the interactionist approach, the goal is to reach the maximum performance for each individual student. By contrast, the interventionist approach involves standardized instruction that is arranged in advance and quantified during the assessment. This approach focuses on determining the amount and nature of instruction a student needs in order to reach a pre-specified performance level. An interventionist dynamic assessment is less time-consuming, and its results are more comparable across students, since every student is tested according to the same procedure. It enhances efficiency in terms of the number of students that can be tested simultaneously, especially when the assessment is digitalized (Poehner and Lantolf, 2013).

Computer-based interventionist dynamic assessment can be elaborated through different designs. Sternberg and Grigorenko (2002) distinguish between the sandwich and cake designs. The sandwich design can be defined as a test-train-test design in which a pretest is followed by some intervention or instruction (see **Figure 2**), and a posttest comparable to the pretest is subsequently administered to all students. With this design, one can determine the extent to which students are able to improve when instruction is offered (Tzuriel, 2000). Performance before and after this instruction can be compared in order to examine students' ZPD or their potential to learn. The cake design can be defined as a train-within-test design in which instruction follows immediately after an incorrect response to an item (see **Figure 2**). The instruction can be presented as a graded series of instructional hints that guide the student toward the correct response, referred to as the graduated prompts approach (Brown and Ferrara, 1985; Campione and Brown, 1987). This approach determines the amount of aid a student needs to solve the problem (Tzuriel, 2000). The number of hints needed to find the correct response is often used as an indication of students' ZPD or learning potential.
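The two scoring ideas described above can be sketched in a few lines. This is a minimal illustration, not the scoring procedure of any cited study: the function names, the response encoding (a sequence of correct/incorrect attempts), and the use of a raw difference score are all assumptions made for the example.

```python
from typing import Optional, Sequence


def sandwich_gain(pretest_score: float, posttest_score: float) -> float:
    """Test-train-test (sandwich) design: gain from pretest to posttest."""
    return posttest_score - pretest_score


def hints_needed(attempts: Sequence[bool]) -> Optional[int]:
    """Train-within-test (cake) design with graduated prompts.

    attempts[0] is the unaided response; each later entry is the response
    given after one further hint. Returns the number of hints used before
    the first correct response, or None if the item was never solved.
    """
    for hints_used, correct in enumerate(attempts):
        if correct:
            return hints_used
    return None
```

For instance, `hints_needed([False, False, True])` yields 2, meaning the student solved the item after the second hint, whereas `sandwich_gain(12, 17)` yields a gain of 5 points between pretest and posttest. Fewer hints or a larger gain would each be read as an indication of greater learning potential under the respective design.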

### Model for Fine-Grained Assessment

Measuring text comprehension in a comprehensive and dynamic way given the discussed purpose holds some challenges. First, a test of this nature should provide a full understanding of students' text comprehension skills. All parts and interactions of the reading process should ideally be examined in conjunction. Using the model of Verhoeven and Perfetti (2008), these can be summarized in the constructs word-form knowledge (orthographic and phonological representations), word-meaning knowledge (semantic representations), local cohesion inferences (word-to-text integration), and global understanding (text and situation model). Moreover, this test should inform teachers about the educational needs of their students as well as of the efficacy of intervention. Furthermore, the administration of this test should be feasible. It should take a limited amount of time and ought to be clearly beneficial to both the teacher and student. All these considerations have been accounted for in the assessment model presented in **Figure 3**, which presents an amalgamation of the sandwich and cake designs.

**Figure 2** | [...] (above) with pre-, posttest, and separate training sessions and a train-within-test design (below) with only one session, including training parts (Dörfler et al., 2009).

Both the sandwich and cake designs show some difficulties when used for dynamic reading assessment. In the cake design, quantifying the amount of instruction students need in order to find correct responses can provide an indication of their learning potential; however, it does not allow for modeling the effect of instruction. In the sandwich design, change in performance level caused by the training can be modeled. However, this overall effect cannot be linked to specific types of instruction, since there is only one intervention phase. Multiple training sessions and posttests can address this issue but would nonetheless be highly time-consuming. By combining the sandwich and cake designs, the overall effect of instruction (i.e., learning potential) can be determined and can also be linked to the amount and nature of the instruction offered.

Following the proposed assessment model, a test with the same set of items measuring global understanding is presented in two respective measurement occasions. At the first measurement occasion, a set of items is presented without instruction, and at the second measurement occasion, a set of items is presented with item-level instructions. The instructions consist of several supportive scaffolding questions related to word-form knowledge, word-meaning knowledge, and local cohesion inferences, along with corresponding feedback. At the second measurement occasion, children are thus trained in successfully completing the global text comprehension task by first teaching them the necessary knowledge at word- and sentence level.

### Scaffolding and Feedback

As discussed earlier, dynamic assessment is characterized by the inclusion of instruction during test administration. In this way, dynamic tests provide information about the educational needs of students as well as about possible interventions. Research has shown that students with similar initial abilities can benefit differentially from instruction (Tzuriel, 2000). Moreover, different shortcomings in the process of reading might require different approaches from teachers in providing guidance and instruction (Fuchs et al., 2012). In the proposed design, the instruction phase consists of scaffolding questions and feedback on the responses to these questions.

Scaffolding can be defined as providing cognitive support by breaking down tasks into smaller, more manageable parts that are within the student's understanding (Dennen and Burner, 2008). In the case of reading comprehension, determining the main idea of the text is a cognitively demanding task that can be broken down into smaller tasks, such as determining (the meaning of) important words and making required inferences between sentences or paragraphs. According to Vygotsky's (1978) theory of ZPD, students can achieve their potential level of development if scaffolding is applied to them (Magno, 2010), which can be applied in the form of questions, as recommended by Feng (2009). By using a series of scaffolding questions that focus on different cognitive abilities on different levels of the reading process, we can gradually guide a student toward global understanding of a text. Also, we can determine the extent to which a student is capable of making the necessary intermediate steps for gaining global understanding of the text. Moreover, different aspects of the reading process can be measured in this way.

Feedback on students' responses to scaffolding questions is essential for letting them acquire the intended knowledge. Item-based feedback can be presented as either verification or elaboration. Elaborated feedback is more effective than verification; however, they are most effective when combined (Dörfler et al., 2009; Van der Kleij et al., 2015). Verification feedback simply consists of a confirmation of an (in)correct response. Elaborated feedback could contain error-specific explanations and solution-oriented prompts or could address meta-cognitive processes. In the proposed design, standardized solution-oriented prompts are preferable, since non-contingent feedback has been shown to be more predictive of future achievement than contingent feedback in dynamic assessment (Caffrey et al., 2008).

### The Present Study

Although computer-based dynamic assessment in the reading domain holds potential, there are only a few approaches to dynamic assessment available, and thus, there is almost no support for the validity of dynamic measures of text comprehension (Dörfler et al., 2017). The aim of the present study is to determine design principles for the intended use of computer-based dynamic assessment of text comprehension. The proposed, theoretically based assessment model was applied in a computer-based dynamic assessment for text comprehension and tested and evaluated in practice. We examined whether it is possible to (1) measure different aspects of the reading process in conjunction in order to obtain a full understanding of children's text comprehension skills, (2) measure children's learning potential in text comprehension, and (3) provide information on their instructional needs. Learning potential was defined as the difference between two measurement occasions, one in which a global understanding task was administered without scaffolding and one in which the same task was administered in combination with several supportive scaffolding questions related to word-form knowledge, word-meaning knowledge, and local cohesion inferences. In this study, learning potential thus reflected the child's ability to use the help they get in completing the global understanding task. Instructional needs referred to the children's performance on the different scaffolding questions. Failure on one specific subskill implied that there was a need for additional instruction on that subskill. Based on the conclusions, we substantiate design principles for computer-based dynamic assessment of text comprehension.

### MATERIALS AND METHODS

### Participants

The study was conducted in cooperation with a school consortium of which four schools participated with their fourth-grade students. Three schools participated with one school class, and one school participated with two school classes. The schools were located in neighborhoods with average and above-average scores in income, employment, and education level, in comparison with the national standard (The Netherlands Institute for Social Research, 2016). The pretest was administered to 169 fourth-grade students aged approximately 10–11 years old. From the pre- to posttest, one school class consisting of 29 students dropped out.

### Materials

#### Texts

A total of 80 texts were selected from a database managed by Cito Institute for Educational Measurement. The database contained texts from existing sources, e.g., children's books, informative books, and websites. Texts were evaluated by T-scan, an analysis tool for Dutch texts to assess the complexity of the text (Pander Maat et al., 2014). The selected texts were found to be appropriate in terms of difficulty following an evaluation of different text attributes, e.g., word difficulty, sentence complexity, verbiage, and referential and causal coherence. Both informative and narrative texts were included. The selected texts contained between 112 and 295 words, averaging 205 words.

#### Tasks

For every text, four tasks were constructed and screened by a group of reading experts. All tasks corresponding to one text were constructed by one reading expert, screened by two other reading experts and, when necessary, adjusted by the first reading expert. The tasks covered different parts of the reading process, as modeled by Verhoeven and Perfetti (2008), as they represented the constructs *word-form knowledge*, *word-meaning knowledge*, *local cohesion inferences*, and *global understanding*.

The different tasks were separately pre-examined in a trial with paper-based tests. Each test consisted of 40 tasks that measured the same construct. Each test was administered to at least two school classes, which resulted in 40–97 administrations per test with a total of 629 administrations. In this preliminary research, all tasks were found to be highly reliable and appropriate with respect to level of difficulty. The resulting item bank consisted of 80 texts and corresponding tasks and was used for the assembly of the final test. Item statistics (i.e., percentage of correct answers and item-total correlation) were used for the test assembly so as to ensure item quality and to maximize task reliability. Texts with too hard (percentage correct < 0.35) or too easy (percentage correct > 0.90) items, or items with a low item-total correlation (<0.20), were not included in the final test.
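The screening rule described above can be illustrated with a short sketch. It is not the procedure used in the study: it assumes dichotomous (0/1) item scores, complete data, and an uncorrected item-total correlation (the text does not state whether the correlation was corrected for item overlap). The function names `item_statistics` and `keep_item` are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_statistics(responses):
    """Return (proportion correct, item-total correlation) for every item.

    responses: one row per student, each row a list of 0/1 item scores.
    """
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        stats.append((mean(item), pearson(item, totals)))
    return stats


def keep_item(p, r_it, p_min=0.35, p_max=0.90, r_min=0.20):
    """Apply the thresholds reported in the text (assumed to be inclusive)."""
    return p_min <= p <= p_max and r_it >= r_min


# Toy data: 4 students, 3 items (illustrative values only, not study data).
data = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 0]]
decisions = [keep_item(p, r) for p, r in item_statistics(data)]
```

According to the description above, the exclusion was applied at the level of texts, so in practice a check like this would be aggregated over the items belonging to each text.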

The final test consisted of 30 texts and was administered twice, as displayed by the squares in **Figure 3**. During a pretest, each text was presented with one task regarding global understanding of the text. During a posttest, each text was presented with up to four tasks; one task regarding global understanding of the text preceded by, depending on the experimental condition, up to three scaffolding questions with feedback. An example of the tasks is shown in **Figure 4**.

#### *Global Understanding*

For every text, the students were asked about the main idea of the text in a multiple choice question with four possible choices. This task measured the ability to integrate all the information provided by the text into a situation model. During the pretest, children had to derive this information from the text themselves. During the posttest, children could use the knowledge acquired from the preceding scaffolding and feedback as guidance for finding the correct response.

FIGURE 4 | Examples of the four tasks regarding (from above to below) word-form knowledge (SQ1), word-meaning knowledge (SQ2), local cohesion inferences (SQ3), and global understanding.

#### *Word-Form Knowledge (SQ1)*

For every text, children were asked to type in three words in three separate open-ended questions. The words were blurred in the text and presented to the students auditorily (see upper part of **Figure 4**). As feedback on an incorrect response, the correct word form was shown in the text for 3 s. This task measured the quality of phonological and orthographical representations of words that were essential for understanding the text. By applying scaffolding and feedback on word-form knowledge, the children could get acquainted with the key words of the text.

#### *Word-Meaning Knowledge (SQ2)*

For every text, the students were asked for the meaning of two words in two separate multiple choice questions, each with three possible choices of word definitions. The word in question was bolded in the text. The feedback on an incorrect response included a picture of the word in question. This task measured the quality of the semantic representations of words that were essential for understanding the text. By applying scaffolding and feedback on word-meaning knowledge, children received information about the meaning of the key words of the text.

#### *Local Cohesion Inferences (SQ3)*

For every text, the students were asked to make an inference, relevant for understanding the main idea of the text, in one multiple choice question with four possible choices. As feedback on an incorrect response, the relevant phrases or sentences were highlighted in yellow in the text. This task measured the ability to integrate text phrases that were essential for understanding the text. By applying scaffolding and feedback on inference-making, the children were encouraged to think about the cohesion of different text parts.

#### Procedure

The pretest was divided into two subtests, each with 15 texts that were administered on separate occasions on the same day. The posttest was divided into three subtests, each with 10 texts that were administered on separate occasions spread over two consecutive days. All test administrations took place in the classroom, with a duration of 45 min for each occasion. The posttest was administered 4 weeks after the pretest.

All groups received the same pretest. For the posttest, all students were randomly assigned, within the school classes, to one of three conditions. The first experimental condition (*n* = 47) received the posttest that included all three different types of scaffolding and feedback for every text, SQ1, SQ2, and SQ3. The second experimental condition (*n* = 48) received the posttest that included two different types of scaffolding and feedback for every text, SQ1 and SQ2. The control condition received the posttest that included no scaffolding or feedback (*n* = 45). To ensure active processing of feedback, the students had a second attempt at the scaffolding questions following an incorrect response.

#### Statistical Analyses

In order to determine to which extent we were able to measure different aspects of the reading process in conjunction, the psychometric quality (i.e., reliability and validity) of the developed test was investigated. Classical test and item analyses were conducted for all scales. Internal consistency was assessed with Cronbach's alpha (*α*), a lower-bound estimate of reliability, with a value of ≥0.80 indicating good reliability, a value of ≥0.70 indicating sufficient reliability, and a value of <0.70 indicating insufficient reliability (Evers et al., 2010).
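For reference, Cronbach's alpha for a scale of k items is k/(k − 1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. The sketch below computes it and applies the cut-offs cited above; it assumes complete data (no missing responses) and is not the software used in the study. The function names are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for complete data.

    scores: one row per student, each row a list of item scores.
    """
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def interpret_alpha(alpha):
    """Classify alpha with the cut-offs cited in the text (Evers et al., 2010)."""
    if alpha >= 0.80:
        return "good"
    if alpha >= 0.70:
        return "sufficient"
    return "insufficient"


# Toy data: 4 students, 3 items (illustrative values only, not study data).
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
alpha = cronbach_alpha(toy)
label = interpret_alpha(alpha)
```

Because the computation requires a score for every item, items that many children did not reach have to be handled before alpha is computed, for example by excluding them.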

Furthermore, construct validity was evaluated through the analysis of a multitrait-multimethod matrix (MTMM; Campbell and Fiske, 1959). For this MTMM, scores on the dynamic assessment scales were linked to previously obtained scores on NSTs for text comprehension, vocabulary, orthography, and math. These tests were administered 4 months earlier with the purpose of monitoring students' progress through primary school. Pearson correlation (*r*) between the scores on the subscales of the dynamic assessment and NSTs was computed and interpreted as high when *r* ≥ 0.50, moderate when *r* ≥ 0.30, and low when *r* < 0.30 (Cohen, 1988).

In order to determine to which extent we were able to measure children's learning potential in text comprehension and to provide information on their instructional needs, we investigated learning potential and the effect of scaffolding and feedback on *global understanding*. First, the experimental conditions were compared to the control condition on the posttest performance after controlling for pretest performance through a regression analysis. Second, posttest performance was predicted through performance on the scaffolding types and the contribution of feedback.

### RESULTS

## Psychometric Quality

#### Reliability

As shown in **Table 1**, the 30 *global understanding* items from the pretest together showed good reliability (*α* = 0.82). The same items from the posttest showed even better reliability in the second experimental and control conditions (both *α* = 0.89). However, in the first experimental condition, these items showed very low reliability (*α* = 0.36). An overview of the percentage of missing values per item on the posttest is shown in **Figure 5**. For every subtest, the percentage of missing values in both experimental conditions increased considerably as the test continued, indicating that the test was excessively long in these conditions; a large proportion of the children were not able to finish the subtests.

When the observations for the last four items of every subtest were excluded from the analyses, Cronbach's alpha for the *global understanding* scale exceeded 0.70 in the first experimental condition (*α* = 0.73) and decreased only slightly in the second experimental and control conditions (*α* = 0.85 and *α* = 0.83). Thus, in the first experimental condition, the observations made for the last few items of every subtest showed a negative effect on the reliability of the total scale. Therefore, we chose to proceed with all analyses with only the items corresponding to the six texts presented at the beginning of every post-subtest, leaving a total of 18 texts. As presented in **Table 2**, the corresponding 54 *word-form knowledge* items together showed good reliability in both experimental conditions (*α* = 0.90 and *α* = 0.91) as well as the 36 *word-meaning knowledge* items (*α* = 0.79 and *α* = 0.88). The 18 *local cohesion inference* items, which were only administered in the first experimental condition, together showed poor reliability (*α* = 0.47). Therefore, we cannot make any statements about the children's ability to make local cohesion inferences.

#### Validity

The multitrait-multimethod matrix (MTMM) for the dynamic assessment scales and NSTs is presented in **Table 3**. The scale *global understanding* shows a high correlation with the NST that measures the similar construct of text comprehension (*r* = 0.51) as well as with the NST measuring the construct of vocabulary (*r* = 0.50). The scale *word-meaning knowledge* shows a high correlation with the NST that measures the similar construct of vocabulary (*r* = 0.52) and a slightly higher correlation with the NST measuring the construct of text comprehension (*r* = 0.59). The scale *word-form knowledge* shows a high correlation with the NST that measures the similar construct of orthography (*r* = 0.83) and lower correlations with the NSTs measuring less related constructs. Furthermore, from the intercorrelations between the subscales, we can conclude that *word-form knowledge* is more distinct from *global understanding* and *word-meaning knowledge* (*r* = 0.47 and *r* = 0.52) than the latter two are from each other (*r* = 0.68).

#### Learning Potential and Instructional Needs

To determine children's learning potential, posttest performance on *global understanding* was predicted through the conditions after controlling for pretest performance. Compared to the control condition, both experimental conditions showed a negative effect on posttest performance, indicating that scaffolding deteriorated posttest performance (see **Table 4**).
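As a sketch, this comparison can be written as the following regression model. The symbols are introduced here for illustration and do not appear in the original analysis; the dummy coding with the control condition as the reference category follows the note to **Table 4**.

$$
\text{posttest}_i = b_0 + b_1\,\text{pretest}_i + b_2\,C_{1i} + b_3\,C_{2i} + \varepsilon_i
$$

Here, $C_{1i}$ and $C_{2i}$ equal 1 if child $i$ was assigned to the first or second experimental condition, respectively, and 0 otherwise, so $b_2$ and $b_3$ express the difference from the control condition at equal pretest performance. With 47 + 48 + 45 = 140 children and three predictors, the residual degrees of freedom are 140 − 3 − 1 = 136, which agrees with the *F*(3, 136) reported for this model.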

To determine whether we were still able to provide information about children's instructional needs, posttest performance on *global understanding* was predicted by performance on the scaffolding tasks and the contribution of feedback. Scaffolding was operationalized as a percentage of the items that were answered correctly during the first attempt. Feedback was operationalized as a percentage of the items that were answered incorrectly during the first attempt and correctly during the second attempt. Since both experimental conditions received word-level scaffolding and feedback, we chose to include both groups in the same model, with the condition as a control variable and the first condition as the reference category. The predictors explained a significant part of the variation in posttest performance, *R*<sup>2</sup> = 0.477, *F*(5, 89) = 16.25, *p* < 0.001. From the results shown in **Table 5**, we can conclude that scaffolding on both *word-form knowledge* (*β* = 0.209, *p* = 0.073) and *word-meaning knowledge* (*β* = 0.784, *p* < 0.001) was a relevant predictor of *global understanding*. The feedback showed no significant contribution. Although no significant effects could be demonstrated, the high standardized beta for feedback on *word-meaning knowledge* suggests the potential relevance of this type of feedback (*β* = 0.212, *p* = 0.240).
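The two operationalizations used as predictors can be sketched as follows. This is a minimal illustration, not the scoring code of the study. It assumes that both percentages are taken over all administered scaffolding items of a given type (the text does not state the denominator explicitly) and that items not reached have been removed beforehand. The function name and the data layout are hypothetical.

```python
def scaffolding_and_feedback(attempts):
    """Return (scaffolding %, feedback %) for one child and one task type.

    attempts: one (first_correct, second_correct) pair per item;
    second_correct is None when no second attempt was needed.
    """
    n = len(attempts)
    # Scaffolding: items answered correctly at the first attempt.
    first_ok = sum(1 for first, _ in attempts if first)
    # Feedback: items answered incorrectly first and correctly at the second attempt.
    recovered = sum(1 for first, second in attempts if not first and second)
    return 100.0 * first_ok / n, 100.0 * recovered / n


# Toy data for four items (illustrative values only, not study data).
example = [(True, None), (False, True), (False, False), (True, None)]
scaffolding_pct, feedback_pct = scaffolding_and_feedback(example)
```

Under this assumption, the two scores cannot sum to more than 100, and a child who answers every item correctly at the first attempt necessarily has a feedback score of zero.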

Since the experimental conditions did not perform better on *global understanding* than the control condition, children's learning potential could not be assessed. Within the experimental conditions, however, scaffolding proved to be relevant for explaining performance on *global understanding*. Therefore, we were able to provide diagnostic information on children's text comprehension skills.

### DISCUSSION

In order to define design principles for fine-grained assessment of text comprehension skills, a computer-based dynamic assessment based on the proposed assessment model was developed and evaluated in an experimental design. We examined to what extent we were able to measure a combination of the different aspects of the reading process by evaluating the quality of all scales. We found that a large proportion of the children in both experimental conditions were unable to finish the subtests of the posttest, indicating that these tests were excessively long. In relation to *global understanding*, the test length showed a negative effect on the reliability of the scale in the first experimental condition, where children received both word- and sentence-level scaffolding and feedback. Thus, in particular, scaffolding and feedback on the sentence level (i.e., *local cohesion inferences*) resulted in inconsistent response behavior on the *global understanding* scale, indicating concentration and motivation challenges. The inclusion of six texts per subtest proved to be the maximum for obtaining a reliable *global understanding* scale. Proceeding with the analyses with only these texts, *word-form knowledge* and *word-meaning knowledge* were also evaluated to be reliable scales.

TABLE 4 | Regression coefficients predicting posttest performance controlled for pretest performance.


*R<sup>2</sup> = 0.387, F(3, 136) = 28.61, p < 0.001. \*Condition 3 served as the reference category.*

TABLE 5 | Regression coefficients predicting posttest performance on global understanding.


*Scaffolding was operationalized as a percentage of the items that were answered correctly during the first attempt. Feedback was operationalized as a percentage of the items that were answered incorrectly during the first attempt and correctly during the second attempt. \*Condition 1 served as the reference category.*

TABLE 3 | Multitrait-multimethod matrix for the dynamic assessment scales and nationally standardized tests.


*The dynamic assessment scales are based on the posttest data for Conditions 1 and 2 together. The values in parentheses represent the reliability (α) of the scale. The values of the validity diagonal are shown in bold and are expected to be high. All correlations were found to be significantly different from zero (p < 0.05).*

The *local cohesion inferences* scale was found to be highly unreliable on the computer-based dynamic assessment, though it was found to be highly reliable and well-constructed when administered in isolation with a paper-based test in the preliminary research. Two possible explanations are conceivable for this difference. First, these items could function differently on a computer-based test than on a paper-based test. To find the correct response to the items, it is necessary to read the text and find the relevant text phrases. Reading a text presented on a computer screen usually entails a higher cognitive workload than reading a text presented on paper (Mangen et al., 2013). Second, the items could function differently when administered in isolation, compared to when they are administered together in a series of tasks. For every text, the children were first presented with the items regarding their word knowledge. In order to find the correct responses to these items, reading the text could have helped, though it was not necessary. Finding the correct response to the *local cohesion inferences* item, however, required the children to use information from the text. Moreover, making inferences is perceived as a higher-level ability and is, therefore, more cognitively demanding than activating word knowledge, which is perceived as a lower-level ability. The change in the required approach to problem-solving might have caused confusion or motivational problems.

We, therefore, concluded that the underlying constructs measured by the scales *global understanding* and *word-meaning knowledge* overlapped considerably. Also, both scales showed almost equal coherence with other tests that measured text comprehension and vocabulary in isolation. This could be explained by the essential role of vocabulary in text comprehension as well as by the inability of the other tests to measure text comprehension and vocabulary as separate abilities, since these abilities continuously interact and influence each other (Verhoeven et al., 2011; Oakhill and Cain, 2012). Correlations of 0.80 between tests for reading comprehension and vocabulary are not uncommon (e.g., Tomesen et al., 2017, 2018), and this supports the assumption that different reading abilities should be measured in conjunction.

To determine whether we were able to measure learning potential, the posttest performance on *global understanding* was compared for the experimental conditions versus the control condition after controlling for the pretest performance. On average, those children who received scaffolding and feedback were found to perform slightly worse than those who received no scaffolding or feedback. Several explanations are possible.

As discussed earlier, children might benefit differently from instruction. Therefore, linking learning potential to specific characteristics might provide more meaningful information, since it allows for the identification of groups with similar educational needs. However, the sample size of the present study was too small to determine learning potential for smaller groups. A larger sample size would also allow for the estimation of test scores with the use of item-response theory models. These models can provide more accurate scores, as they take into account the difference between the difficulty of an item and the ability level of a student.

The most likely explanation for the lack of finding a positive effect of scaffolding and feedback concerns the possible incomparability between pre- and posttest performance on *global understanding* as well as between the experimental and control groups. When the *global understanding* task is integrated in a series of tasks, the conditions under which the children perform change. The required change in approach to problem solving between the different tasks, as pointed out earlier with respect to the *local cohesion inferences* scale, can affect children's performance or the difficulty of the tasks. Shifting the focus from measuring learning potential to measuring instructional needs would provide teachers with more valuable information.

Another probable explanation is that the information retrieved from the scaffolding and feedback was not used for the *global understanding* task. The children did not receive explicit information about the structure of the test; consequently, they themselves had to realize that they could use the previously collected information to solve the task. Moreover, previous research suggests that computer-delivered elaborated feedback is likely to be neglected in a low-stakes assessment setting on higher-order processes of text comprehension (Golke et al., 2015). Motivating children to use the information provided might address this problem.

To determine whether we were able to provide information on children's instructional needs, performance on *global understanding* was predicted by performance on scaffolding and the contribution of feedback. Scaffolding on *word-form knowledge* and *word-meaning knowledge* proved to be relevant for *global understanding*. The contribution of feedback on *word-meaning knowledge* could not be demonstrated, although there were indications that it might emerge in a larger sample. Previous research has indicated the efficacy of using pictures when learning new words (Gruhn et al., 2019, under review). Therefore, further research is necessary to determine the contribution of this type of feedback within the assessment model. Feedback on *word-form knowledge* showed no contribution to the prediction of *global understanding*. This might be due to the lack of repetition (Gruhn et al., 2019).

Thus, some important information on children's instructional needs could be provided. However, further research is required, since inference-making skills could not be reliably assessed, and being able to integrate multiple sentences is essential for achieving *global understanding* of a text (Best et al., 2005).

## CONCLUSION

Based on the findings, we can conclude that the assessment model can be used as a framework for fine-grained assessment of text comprehension skills when some design principles are taken into account. First, children should be informed about the test structure in advance. In this way, they can be explicitly instructed to use information retrieved from the scaffolding or feedback. Second, (sub)tests should be relatively short so as to avoid fatigue effects that cause biased results. We recommend a maximum of six short (ca. 200 words) texts per subtest. Third, a pretest where one's ability is measured in isolation might not be a good baseline for establishing learning potential. Further research is required, however, since the inability to establish learning potential in the present study might have had other causes. In any case, comparability of response behavior elicited by pre- and posttest measures should be examined beforehand. Fourth, the negative effects of changes in the required approach to problem-solving or the cognitive workload between different tasks should be diminished. These effects might be reduced by presenting a visual indication of the level of difficulty or a sign reflecting the task type of every task. In the present study, the change from the *word-meaning knowledge* task to the *local cohesion inference* task seemed to cause problems in particular. In addition to a visual indicator or (warning) sign, reducing the cognitive workload for this specific task may also contribute. This could be achieved by adding a new, in-between task that serves as an extra intermediate step or by highlighting the relevant text passages directly instead of only after an incorrect response. However, attention must be paid to the influence of such adjustments on the validity of the task itself and the following tasks.

To conclude, we have tried to bridge the first gap between theory and assessment by evaluating the theoretically based assessment model for fine-grained assessment of text comprehension skills in practice. We were able to measure a combination of different aspects of the reading process. Furthermore, we suggested that it might be more valuable to focus on instructional needs rather than on learning potential. Through the design principles discussed, we can move further toward fine-grained assessment of text comprehension skills.

### ETHICS STATEMENT

We obtained written informed consent from all participating schools and the children's parents were informed about the study by letter. The parents had the opportunity to refuse participation of their child in the study. Active written informed parental consent was not considered necessary because we only collected non-special personal data; biographical information or any other private or health-related data were not part of our study. This particular sub-study was conducted under a research grant of the Netherlands Initiative for Education Research (NWO/NRO file number 405-15-548), which was approved by a scientific and ethical review committee. Therefore, because of institutional (Cito) and national (NWO/NRO) guidelines and regulations, the separate approval by an Ethics Committee was not required.

### AUTHOR'S NOTE

The following data were collected in our study: (1) correct/incorrect scores on a dynamic assessment for text comprehension and (2) test scores for reading comprehension, spelling, vocabulary and math from a Dutch Student Monitoring System.

### AUTHOR CONTRIBUTIONS

MO is a PhD student. JK is the daily supervisor. TE is the supervisor of the project. The authors are equally responsible for the content.

### FUNDING

Our study was conducted under a research grant of the Netherlands Initiative for Education Research (NWO/NRO filing reference 405-15-548) and was approved by a scientific review committee.


**Conflict of Interest Statement:** TE and JK were employed by Cito in The Netherlands. They are affiliated with Stichting Cito (Foundation Cito), the not-for-profit part of Cito that is dedicated to applied scientific research into educational measurement. MO was employed at the University of Twente.

*Copyright © 2019 den Ouden, Keuning and Eggen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Online Diagnostic Assessment in Support of Personalized Teaching and Learning: The eDia System

*Benő Csapó<sup>1</sup>\* and Gyöngyvér Molnár<sup>2</sup>*

*<sup>1</sup>MTA-SZTE Research Group on the Development of Competencies, University of Szeged, Szeged, Hungary, <sup>2</sup>Department of Learning and Instruction, University of Szeged, Szeged, Hungary*

#### *Edited by:*

*Frank Goldhammer, German Institute for International Educational Research (LG), Germany*

#### *Reviewed by:*

*Birgit Schütze, University of Münster, Germany Diego Zapata-Rivera, Educational Testing Service, United States*

> *\*Correspondence: Benő Csapó csapo@edpsy.u-szeged.hu*

#### *Specialty section:*

*This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology*

*Received: 17 December 2018 Accepted: 17 June 2019 Published: 03 July 2019*

#### *Citation:*

*Csapó B and Molnár G (2019) Online Diagnostic Assessment in Support of Personalized Teaching and Learning: The eDia System. Front. Psychol. 10:1522. doi: 10.3389/fpsyg.2019.01522*

The aims of this paper are: to provide a comprehensive introduction to eDia, an online diagnostic assessment system; to show how the use of technology can contribute to solving certain crucial problems in education by supporting the personalization of learning; and to offer a general reference for further eDia-based studies. The primary function for which the system is designed is to provide regular diagnostic feedback in three main domains of education, reading, mathematics, and science, from the beginning of schooling to the end of the 6 years of primary education. The cognitive foundations of the system, the assessment frameworks, are based on a three-dimensional approach in each domain, distinguishing the psychological (reasoning), the application, and the disciplinary (curricular content) dimensions of learning. The frameworks have been carefully mapped into item banks containing over 1,000 innovative (multimedia-supported) items in each dimension. The online assessments were piloted, and the system has been operating in experimental mode in over 1,000 schools for several years. This paper outlines the theoretical foundations of the eDia system and summarizes how results from research on the cognitive sciences, learning and instruction, and technology-based assessment have been integrated into a working system designed to assess a large population of students. The paper describes the main functions of eDia and discusses how it supports item writing, constructing tests, online test delivery, automated scoring, data processing, scaling and the provision of feedback both for students and teachers. It shows how diagnostic assessments can be implemented in school practice to facilitate differentiated instruction through regular measurements and to provide instruments for teachers to make formative assessments.
Beyond its main function (supporting development toward personalizing education), the eDia platform has been used for assessments in a number of areas from pre-school to higher education both in Hungary and in a number of other countries as well. The paper also reviews results from eDia-based studies and highlights how technology-based assessment extends the possibilities of educational research by making more constructs measurable.

Keywords: technology-based assessment, online assessment, diagnostic assessment, assessment framework, item banking

## INTRODUCTION

The eDia online assessment system has been built and developed by the Centre for Research on Learning and Instruction, University of Szeged. The principal function for which the system is designed is to provide regular diagnostic information in three main domains of education, reading, mathematics, and science, from the beginning of schooling to the end of the 6 years of primary education. In its present form, the eDia system is an integrated assessment system that is based on sophisticated frameworks and supports assessment processes from item development through test administration and data analyses to well-interpretable feedback. It is one realization of the "integrated, learning-centered assessment systems" envisioned by Pellegrino and Quellmalz (2010).

One of the main challenges of school education stems from the fact that students are different. Looking at the problem from a historical perspective, two main approaches may be identified as school systems have attempted to respond to this challenge: (1) selecting students (ability grouping, tracking, etc.) in the hope that homogeneous classrooms can be set up and (2) accepting different students for heterogeneous classrooms, then differentiating instruction to adjust teaching to the different individual needs of the students (personalization, individualization, etc.). The first option has failed, mostly for two reasons: (1) students are different not only in one dimension but also in a number of different ways, with the differences changing dynamically over time; therefore, (2) the intention of selection has generally resulted in social selection (segregation) with numerous negative side effects. The second option is more promising, and a number of progressive initiatives have emerged in recent decades. However, there have also been a great many difficulties that have stood in the way of personalizing learning; among these, the most prominent is continuously identifying the critical differences between students, differences that determine successful learning options. The most crucial issue in teaching a heterogeneous classroom is teaching students with temporary or permanent difficulties in learning, thus requiring that the difficulties that block their progress be identified.

From a cognitive point of view, the core of the problem was best conceptualized by Ausubel in his frequently cited observation: "The most important single factor influencing learning is what the learner already knows. Ascertain this and teach him accordingly" (Ausubel, 1968, p. vi). As simple as this idea is, it is equally difficult to implement in heterogeneous classrooms. To realize this in practice, teachers should know "what the learner already knows." The problem of "knowing what students know," as has been formulated by several authors (Pellegrino et al., 2001; Opfer et al., 2012), has been solved in general, but to make this knowledge usable in practice, teachers should have it in "real time," or at least should receive feedback with sufficient frequency to be able to adjust teaching to the knowledge currently possessed by learners. It is clear that due to material costs and human resources requirements, systematic large-scale diagnostic assessments cannot be conducted with traditional instruments.

In this paper, we first outline the theoretical foundations of the eDia system, including the role of diagnostic assessment, the content of assessment, and the ways to use feedback. Then, we introduce the eDia system, describe its structure, and highlight how technology serves its functions. Finally, we review research studies that have been carried out using eDia.

Throughout this paper, we emphasize that there are a number of innovations that technology brings into numerous aspects of instructional processes, including assessment. However, currently, there is still unexploited potential in the use of technology, including the possibilities of personalizing learning, adjusting teaching and learning processes to the individual needs of students. From a cognitive point of view, if students are always taught what they are prepared for (as Vygotsky's theory of the zone of proximal development proposes), then they will better comprehend and master the teaching material. From an affective perspective, if each student individually always faces an optimally challenging learning task (as Csíkszentmihályi's theory of optimal experiences proposes, see Csíkszentmihályi, 2000), both boredom and anxiety are eliminated from learning processes and motivation is maintained. The optimal level of challenge supports students' need for competence, which has a positive impact on students' intrinsic motivation as well (Ryan and Deci, 2000a,b). We note here that large item banks also allow personalization of assessment so that each student receives tests adjusted to their actual developmental level (adaptive testing), thus reducing anxiety in the assessment process as well. Both cognitive and affective demands require regular, personalized feedback, which is what eDia is designed for.

### THEORETICAL FRAMEWORK

The eDia system constitutes the core of a complex, novel educational model which synthesizes a number of progressive initiatives to improve education. It is designed to support learning and development in the first phase of schooling and takes into account certain realities that determine the possibilities of using technologies. We consider three sets of conditions under which problems must be solved.


more efficient, with both quality and equity potentially ensured simultaneously; however, teaching in heterogeneous classes may be more difficult. The major challenge is to adjust instruction to the individual needs of every student. Diagnostic assessment may help, as it provides information on the actual developmental level of each pupil.

3. We assume that regular feedback is essential for learning. A major trend to provide students with proper feedback has been promoted through formative assessments. We agree with its importance, but at the same time, we assume that teachers are not able to observe every major aspect of learning without an objective assessment instrument. Furthermore, traditional paper-based instruments are not suitable for rapid and frequent feedback. Technology-based diagnostic assessments may fill this gap.

Given these conditions, four major research trends offer results for integration and synthesis that serve as a theoretical foundation for a complex online diagnostic assessment system. (1) In research and development, there is a shift from summative to formative assessment, which provides immediate feedback and direct support for learning. (2) Technology-based assessment has shown enormous progress in the past decade, and ICT infrastructure in schools has improved so that assessment can enter into everyday school practice. (3) Progress in cognitive and educational psychology has produced results which have not yet been exploited in practice and which may contribute to a solution for certain crucial problems, especially in the first year of schooling. (4) Finally, a number of promising models for personalizing learning have had limited influence on practice, mostly because of the lack of easy-to-use assessment instruments. Although efforts within this last trend (4) highlight the need for regular diagnostic feedback and the reformed teaching methods provide adequate educational context for the assessments, in this section, we only deal in detail with the first three trends (1–3) as they have determined the development of the eDia system more directly.

### Formative and Diagnostic Assessment

Large-scale international assessment programs (Trends in International Mathematics and Science Study – TIMSS, Progress in International Reading Literacy Study – PIRLS, and Program for International Student Assessment – PISA) have had an immense impact on the development of educational systems in many different ways and have inspired the introduction or expansion of national assessment programs. These programs have also advanced testing in a number of areas, including framework development, test administration, data analyses, and reporting. This progress has also highlighted some deficiencies in educational assessment from the perspective of practice, for example, the long time between test administration and feedback, the limited usefulness of summative test results with regard to personalized intervention, and the lack or limitations of student-level feedback in general. Another source of dissatisfaction with testing has been the way summative tests have been used in certain countries, especially for high-stakes assessments, e.g., for test-based accountability. These types of testing have caused some negative effects, such as teaching to the test and test score inflation (see, e.g., Koretz, 2018), as well as a harmful influence on school climate and teacher stress (Saeki et al., 2018).

These deficiencies have lent a new impetus for other directions in the development of educational assessment and shifted the focus of attention from summative to formative assessment (Clarke, 2001, 2005; Ainsworth and Viegut, 2006; Bennett and Gitomer, 2009; Bennett, 2011; Sheard and Chambers, 2014), or assessment for learning, as it is often called (Black et al., 2003; Hattie and Brown, 2007; Heitink et al., 2016), or diagnostic assessment, to use yet another term (Leighton and Gierl, 2007). There are many different ways formative assessment is used in practice, but a common feature of these assessments is that they reflect students' learning needs, facilitate understanding in a given context and provide students with immediate feedback (Black and Wiliam, 1998a,b; Black et al., 2004; Good, 2011). There is no sharp distinction between formative and diagnostic assessment, nor does a universal definition for diagnostic assessment exist. However, it is usually described as a kind of assessment which focuses on problems, explores possible difficulties, assesses if students are prepared for a learning task, and thus may measure prerequisite knowledge as well. Furthermore, diagnostic assessment is often followed by a kind of "therapy": compensatory instruction to eliminate obstacles and offer various forms of supportive activities (e.g., in mathematics: Brendefur et al., 2018), which facilitates data-based decision making (e.g., in reading: Filderman et al., 2018).

One typical and most traditional form of formative assessment takes place in the context of classroom interaction, with evaluation based on teachers' observation and personal judgment. Further forms are evaluations of students' work and learning artifacts (performances, presentations, essays, worksheets, projects, documents, lab results, etc.). Although there is a need for frequent personal feedback from teachers, the subjective nature of such evaluation has prompted the use of objective instruments; thus, formative tests have been proposed for this purpose. As these tests have been customized and adjusted to contexts and actual needs, they have usually been teacher-made tests of questionable psychometric quality. Formative tests have been used most systematically in personalized models of instruction, but in any case, their production, administration, and scoring have required immense resources. The use of technology has been proposed to solve these problems, to support certain aspects of the assessments (Feng and Heffernan, 2005; Brown et al., 2008; Feng et al., 2009) or to devise comprehensive assessment systems (Perie et al., 2009).

### Evolution of Technology-Based Assessment

Although technology-based assessment (TBA) is almost as old as the computer itself, modern TBA has a much shorter history. Its potential in assessment has been clear for decades, but it has required several initiatives and the development of the infrastructure at schools to fulfill its promise. We review here only a few major projects and programs that have aided in the realization of eDia as well.

The European Union has launched several initiatives to modernize education, including the expansion of educational assessments to new areas with new technologies. The EU's Joint Research Centre has organized conferences and workshops to collect experience with TBA projects (Scheuermann and Guimarães Pereira, 2008). One such workshop was held in Reykjavik, Iceland, in September–October 2008 with the participation of over 100 experts presenting several parallel developments (Scheuermann and Björnsson, 2009). Among other software, the TAO program (open source software developed by the Centre de Recherche Public Henri Tudor and EMACS, University of Luxembourg) was introduced in several presentations, indicating that it was being used not only in the PISA studies but also in national initiatives (Csapó et al., 2009; Haldane, 2009). The MicroDYN approach (Greiff and Funke, 2009), which later became the core of the PISA 2012 problem-solving assessment and which is also implemented in eDia, was also presented at this meeting. In a volume based on the workshop presentations, three chapters summarized the results of the PISA Computer-Based Assessment of Science by authors from the participating countries (Iceland, Korea, and Denmark; see Halldórsson et al., 2009; Lee, 2009; Sørensen and Andersen, 2009). The same volume also included a chapter by Kozma (2009), a call for action to assess and teach 21st-century skills and a manifesto of the program started around that time.

The Assessment and Teaching of 21st-Century Skills (ATC21S) project was located at the intersection of two major trends in research and development: the need to re-define the purpose of education in the new millennium with a greater focus on the skills required in modern societies and to make these skills measurable through TBA. In the first phase of the project, four working groups were formed to define the targeted skills (Binkley et al., 2012) and to explore methodological, psychometric (Wilson et al., 2012), and technological (Csapó et al., 2012) issues, as well as contextual and environmental issues (Scardamalia et al., 2012). The volume that published the results contained a further chapter on the policy frameworks for the assessments (Darling-Hammond, 2012). In the second phase, the project focused on two prominent and closely related 21st-century skills, collaborative problem-solving and learning in digital networks (Griffin and Care, 2015), thus also contributing to the theoretical and empirical foundations for the 2015 PISA collaborative problem-solving assessment.

The PISA assessments have had an impact on the development of TBA in two major ways: (1) they have advanced the technological background and (2) they have tested the preparedness of individual countries for the assessments, identified deficiencies and exercised some pressure to ensure the necessary conditions to make large-scale TBA possible. The application of TBA started in 2006, when Computer-Based Assessment of Science was an optional domain (OECD, 2010). Only three countries completed the assessments (Denmark, Iceland, and Korea), but this provided an impetus for TBA within PISA. In 2009, the assessment of digital reading was an optional domain. Altogether countries participated, making the comparison of achievement in print and digital reading possible and exploring the new information-processing demands of networking and hyperlinking (OECD, 2011).

The 2012 PISA cycle brought a breakthrough in two respects. First, although paper-based tests remained the main delivery method, the TBA version of assessments was offered as an option for reading and mathematics, making the two delivery methods comparable and linking paper-based and TBA achievement (OECD, 2013). Second, in this cycle, dynamic (creative) problem-solving was the fourth, innovative assessment domain; it used simulation and interaction for the first time in PISA (OECD, 2014). This assessment has had a further impact on the development of TBA. The members of the problem-solving expert group continued meeting, invited further researchers in the field, and published an edited volume, which reported a number of further applications of and innovations in TBA (Csapó and Funke, 2017). The computerized solutions devised for the interaction in the assessment of dynamic problem-solving were adapted and further developed; they were used in 2015 for interactive science items (OECD, 2016) and for collaborative problem-solving (OECD, 2017). In 2015, the transition of PISA to TBA was complete, with all the assessments administered by computer.

The projects and programs reviewed here have influenced the development of the eDia system in several ways. PISA re-defined the content to be measured, while ATC21S linked the skills and technology used for assessment and highlighted the importance of framework development. The technology was developed in interaction with the communities running the projects under review; the major forum, beyond several meetings at conferences, was the Szeged Workshop on Educational Evaluation, held annually at the University of Szeged between 2009 and 2016. The programs reviewed here focused on summative testing among older age groups (secondary schools), underscoring the lack of formative assessment and neglecting the needs of younger students, while recent research in education has emphasized both aspects. The experiences gained from the technological realization of these programs (e.g., the itembuilder technology) have been transferred to diagnostic assessments, and eDia has extended them with a number of novel solutions (e.g., item banking, a feedback system, visualization, etc.).

Beyond the developments reviewed here, a parallel evolution took place related to computer-aided instruction (Chauhan, 2017) and intelligent tutoring systems (Kulik and Fletcher, 2016) with significant assessment and feedback components (Conejo et al., 2004). The rapid development of online learning has also advanced TBA, including progress in adaptive testing (e.g., Conejo et al., 2004) and most recently in learning analytics (Avella et al., 2016), which broadens the possibilities of assessing students' learning and forms of feedback. Strategies based on several forms of computer-aided instruction and online learning designed for older students limit the role of teachers and teach students in specific domains (see, e.g., Chi et al., 2010). They open a different route for personalization and only partially overlap with the type of assessment-based differentiation for which the eDia system is devised (as for these differences, see also Scandura, 2017).

### Determining What to Measure: Three-Dimensional Frameworks for Diagnostic Assessments

Previous assessment projects have stressed the importance of defining the content of assessments, and this is even more significant for diagnostic assessments in the early phases of schooling. Diagnosis requires a better understanding not only of the teaching and learning processes but also of the cognitive and affective development of pupils. Therefore, framework development has been a prominent component in establishing the eDia system. With a brief description of framework development, we demonstrate that only the use of technology (large item banks and assessments tailored to students' individual needs) has made it a realistic goal to differentiate the special aspects of learning by defining the three dimensions of assessments.

The reading, mathematics, and science frameworks have been based on a three-dimensional model of learning outcomes. This model takes into account the traditions of defining learning objectives (e.g., creating taxonomies, developing curricula and setting standards; see Csapó, 2004, 2010) and recent research findings in fields ranging from cognitive neuroscience (e.g., Ansari and Coch, 2006) through early childhood education (e.g., McLachlan et al., 2018) to research on teaching and learning in the domains assessed.

The most traditional dimension of learning outcomes is mastering the learning material, i.e., subject matter knowledge, represented in textbooks and defined more generally in the school curricula. This type of knowledge is the easiest for teachers to observe. As the most frequently assessed and graded dimension, it is termed the *disciplinary dimension* in the diagnostic frameworks. It has been the central part of many curriculum- or textbook-oriented summative assessments as well as of the first international assessment programs. The PISA frameworks have re-defined the conception of valid knowledge and expanded the interpretation of literacy in a parallel form for the three assessment domains (e.g., OECD, 1999, 2003). The same type of knowledge is assessed in the eDia diagnostic system, where it is called the *application dimension*. The third dimension focuses on students' cognitive development, the processes underlying learning, and is called the *psychological dimension* (for the cognitive foundations, see also the CBAL approach, Bennett, 2010). Although PISA also assesses disciplinary knowledge in mathematics and science, it does so through the applications, while the psychological dimension appears in the innovative domain (e.g., complex problem-solving in 2003, creative problem-solving in 2012, and collaborative problem-solving in 2015). The predecessors to TIMSS focused on knowledge defined in the curricula of the participating countries, so the main resource was disciplinary knowledge, while recent frameworks deal with content, application, and reasoning as well (see, e.g., Mullis et al., 2001, 2005), which is somewhat similar to the eDia framework. None of the large-scale international assessment programs can measure how well disciplinary knowledge defined in the actual curricula is mastered, but it is defined and assessed in the disciplinary dimension of the diagnostic system.

The three-dimensional frameworks for reading (Csapó and Csépe, 2012), mathematics (Csapó and Szendrei, 2011), and science (Csapó and Szabó, 2012) have been developed by experts in the particular domains and dimensions. In the three domains, a total of nine dimensions are distinguished and defined; the theoretical foundation and previous research on each one are presented in a chapter in the framework volumes. There are similarities between mathematics and science, while reading is somewhat different. The theoretical chapters are followed by the detailed frameworks developed for primary school Grades 1–6. The descriptions are illustrated by sample items showing possible computerized, multimedia-supported item formats to assess a particular dimension. These frameworks served as training materials for the item writers, who then carefully mapped the frameworks into assessment items (over 1,500 items per dimension). They were also used to familiarize the teachers who use eDia with the content of the assessment. These items were empirically piloted, and a further set of books was published, one volume for each domain with detailed descriptions of the assessment dimensions and illustrated by a larger number of items taken from the item banks in the eDia system (Csapó et al., 2015a,b,c). These books help prepare teachers to use the system, to interpret the feedback provided by eDia, and to determine the intervention concluded from the assessment results. Sample items presented in these books also demonstrate that assessing certain aspects of learning (especially the psychological dimension) would be difficult (and almost impossible in school practice) without the use of technology.

The validity of the three-dimensional model has already been empirically tested. Based on the data collected *via* the eDia system, confirmatory factor analyses were performed separately in each grade for each domain. The results confirmed that, although there are usually significant correlations between the dimensions, they assess different psychological constructs (Molnár and Csapó, submitted). The psychometric indicators for the assessments (e.g., reliability) are constantly monitored, items with poor parameters are modified or deleted from the system, and new items are added to improve coverage of the content defined in the frameworks. (Results from quality improvement processes will be published elsewhere.)

## THE eDia SYSTEM

Development of the eDia system began in April 2007, when researchers at the University of Szeged implemented the TAO open source software (Plichart et al., 2004) on university servers and began to explore its possibilities in close cooperation with, and with the continuous support of, the developers of TAO at the Centre de Recherche Public Henri Tudor, University of Luxembourg. Several pilot studies were completed with TAO, as well as a media effect study to compare the paper-and-pencil and online administration of an inductive reasoning test (Csapó et al., 2009). Although the first results were promising, and by that time several TAO modules had been used in the PISA assessments as well, it soon became obvious that TAO had not been designed for the type of diagnostic assessment system the researchers had aimed to build. This led to a decision to develop new software from scratch optimized for the complex requirements of the diagnostic assessments.

The eDia online diagnostic assessment system can be divided into two main parts. One is the hardware infrastructure (a server farm) and the software that operates the system. This has been developed and optimized for diagnostic assessment, e.g., being continuously accessible for the entire Grade 1–6 student population (up to 600,000 students), and for the management of large item banks (with tens of thousands of items). In addition, this infrastructure can also be used for several other assessment purposes. The other part is the main content of the system, the item banks prepared for the diagnostic assessment of reading, mathematics, and science.

The eDia system is functionally ready for the implementation of systematic assessments and has operated in experimental mode since 2015. At present, there are more than 1,000 partner schools (approx. one-third of the primary schools in Hungary), where it is used on a regular basis. It contains over 25,000 items. The software has been continuously developed, with both the number of partner schools and the number of items available in the system growing.

Currently, three different testing procedures are run with eDia. There are central assessments initiated by the assessment center three times in a school year, at the beginning, in the middle, and at the end of the year. These assessments provide data to establish item parameters and normative reference points. There are teacher-initiated assessments which are used for frequent diagnostic assessments adjusted to the needs of a class or of individual students. The teachers may compile tests out of the items available in the item banks for their own assessment activities. Furthermore, there is testing for research in numerous projects using either items from the item banks or specific tests developed for research purposes.

### Structure of the System: Functions to Serve the Needs of Educational Practice

#### Item Writing

The system contains an item builder module that makes the task of item writing as easy as writing multimedia documents. Item developers receive extensive training in the content of the assessment and in test theory and psychometrics, enabling them to master the use of the item builder module easily (Molnár et al., 2015a,b, 2018). Items are written online, with the draft versions of items undergoing several phases of review (content, language, technical fitness, and format) before they are entered into the item pool for empirical testing. A number of tools are available to support item writing, including templates and scoring schemes. Several items can be created for one stimulus or a set of closely related stimuli; these items together form the tasks. The items in a task can be moved (e.g., added to a test) together.

#### Test Editing

In the present mode of operating the system, tests consisting of a number of tasks form the units of the assessment. Tests may be constructed out of the tasks in several ways. Typically, booklets are formed out of the tasks, and then they can be combined variously into tests, for example, to eliminate the position effect or to optimize linking/anchoring options. Tests can be constructed with adaptive testing techniques, i.e., based on the answers given to all previous items or to items present in the last cluster, to minimize the difference between the students' ability level and the test difficulty level.
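The adaptive principle described above, matching test difficulty to the student's ability level, can be illustrated with a minimal Python sketch. It assumes a Rasch (one-parameter IRT) model and a simple "closest difficulty" selection rule; the source does not state which IRT model or selection rule eDia uses, and the function names and item identifiers are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_ability(responses, start=0.0, iterations=20, bound=4.0):
    """Newton-Raphson maximum-likelihood ability estimate.

    `responses` is a list of (difficulty, score) pairs with score 0 or 1.
    The estimate is clamped to [-bound, bound] because the likelihood has
    no finite maximum when all answers are correct or all are incorrect.
    """
    ability = start
    for _ in range(iterations):
        probs = [rasch_probability(ability, b) for b, _ in responses]
        gradient = sum(x - p for (_, x), p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        if information < 1e-9:
            break
        ability = max(-bound, min(bound, ability + gradient / information))
    return ability


def next_item(ability, item_bank, administered):
    """Pick the unused item whose difficulty is closest to the ability.

    Under the Rasch model this is also the most informative item, since
    item information p * (1 - p) peaks where difficulty equals ability.
    """
    candidates = {k: b for k, b in item_bank.items() if k not in administered}
    return min(candidates, key=lambda k: abs(candidates[k] - ability))


# Hypothetical item bank: item id -> calibrated difficulty (logits).
bank = {"i1": -1.5, "i2": -0.5, "i3": 0.0, "i4": 0.6, "i5": 1.4}
answered = [(-0.5, 1), (0.0, 1), (0.6, 0)]  # (difficulty, score) so far
theta = estimate_ability(answered)
print(next_item(theta, bank, administered={"i2", "i3", "i4"}))
```

Selecting after every item corresponds to adapting "based on the answers given to all previous items"; a cluster-based variant would re-estimate ability only after each block of items.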

#### Online Test Delivery

Students complete the diagnostic tests as part of their school activities using the available school infrastructure. The tests can be taken on practically any device equipped with an internet browser, but the items are optimized for keyboard, mouse, and a large screen. For central assessments, eDia is open for the actual assessment during a window of approximately two weeks. Teacher-initiated testing can take place any time teachers find it useful (at this phase, how frequently they use it is left to them). Students log into the system with a personal, secret assessment identification code.

#### Automated Scoring

The eDia system is designed for both automated and human scoring. However, the items in the item banks that are prepared for the regular diagnostic assessments are scored automatically, with human scoring reserved for research and specific applications. Automatic scoring makes it possible to provide immediate feedback, and it is necessary for the rapid scoring of a large number of assessments. The system offers a variety of scoring options, adjusted to item type and form of response capture.

#### Built-In Data Processing and Statistical Analyses

The eDia system contains a statistical analytics module, which can perform every computation required by the assessment from descriptive statistics through classical test theory to IRT modeling. The computations are programmed using the open source "R" programming language and are continuously adapted to the developing system. The data can be exported from the system for further analyses.

#### Teacher-Assembled Tests

Teachers have been encouraged to use objective assessment instruments since the very beginning of educational testing; however, most tests available for classroom assessment are summative tests. Such tests are difficult to adapt to the actual needs of a class, not to mention individual students. Another option is teacher-made tests, but the time and resources needed to prepare and score them hinder practical use. The teacher-assembled tests in eDia fill this gap. Participating teachers are granted access to the item banks, so they can assemble tests out of available tasks. These tests can then be administered to individual students, a group of students or an entire class, with the results made available immediately after testing. Models for the co-existence of centrally initiated tests and teachers' assessment are under development. The current model is that central assessments serve a screening function, while teacher-initiated tests are mostly used for formative and diagnostic purposes if needed. Further options are being explored, e.g., automated recommendations for testing based on previous assessment results.

#### Feedback

At present, there are two basic forms of feedback. One is the immediate feedback students receive right after the test has been completed in the form of percentage of total score of a particular test. Another form is contextualized information based on normative reference data, available only after the central assessments. After the general assessments, both students and teachers receive detailed information about the results for each assessment dimension. Students may download a PDF file with a detailed description of the content of the assessment and their own achievement compared to the national norm and class mean. Teachers receive similar information on their students individually in each dimension as well as a comprehensive, contextualized picture of their class, comparing it to other members of the same age group in the entire school, school district, region, and country. This feedback is provided in graphic form as well to help teachers comprehend and use the data.

#### Scaling and Setting Norms

An IRT model is used to establish assessment scales. There are nine distinct scales in the eDia system as they are defined in the assessment framework; each one is developed separately. Establishing normative scales is a long process, one which requires several steps in the case of the eDia system. The results of the end-of-year assessments are used to establish the scales. In the first step, separate norms are defined for the different grades, with the mean for a grade set at 500 with an SD of 100. This phase has already been completed, and the 54 (6 grades × 3 domains × 3 dimensions) reference scales have been established.
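The grade-level norming step can be illustrated with a short Python sketch of a linear transformation that sets the mean of a norming sample to 500 and its SD to 100. This is an assumption about how such a rescaling is commonly done, not a description of the eDia implementation; the function name `to_reporting_scale`, the ability values, and the use of the population SD are all hypothetical choices.

```python
from statistics import mean, pstdev


def to_reporting_scale(thetas, target_mean=500.0, target_sd=100.0):
    """Linearly rescale ability estimates to the target mean and SD."""
    m = mean(thetas)
    s = pstdev(thetas)  # population SD of the norming sample (an assumption)
    if s == 0:
        raise ValueError("all estimates are identical; SD is zero")
    return [target_mean + target_sd * (t - m) / s for t in thetas]


# Hypothetical ability estimates for one grade in one domain and dimension.
grade_thetas = [-1.0, 0.0, 1.0, 2.0]
scaled = to_reporting_scale(grade_thetas)

# Number of grade-specific reference scales named in the text.
n_scales = 6 * 3 * 3  # grades x domains x dimensions
```

Because the transformation is applied separately within each grade, a score of 500 always denotes the grade mean; comparing grades requires the vertical scaling discussed next.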

The next step is to devise developmental scales with vertical scaling of the data, linking the achievement of the different grades. This can be done easily with a psychological dimension, where a more or less continuous development can be assumed. As cognitive development is stimulated by out-of-school experiences as well, there may be large differences within a given cohort; some students' achievement may be closer to the mean for a different cohort. Thus, linking the grades causes no difficulties. These considerations are only partially appropriate for the application dimensions, while the disciplinary dimensions are based on the material taught. Therefore, students in a particular grade may only be offered tasks from earlier grades, but not from later ones. Because of these complications, the first vertical scales have so far been prepared only for the psychological dimensions (see Molnár and Csapó, submitted), while vertical scaling in the other two dimensions requires more sophisticated statistical procedures (e.g., multidimensional IRT).

Finally, longitudinal scales will also be devised, making it possible to monitor how a student progresses within a given period compared with his or her own previous change and with the mean change of others. Developing such scales requires even more care and time and is especially difficult because collecting longitudinal data from the period covered by eDia takes at least 5 years, while the social and contextual conditions are also rapidly changing in the meantime. On the other hand, eDia does not provide high-stakes testing, nor is producing trend data a requirement. Thus, it can be flexible in establishing normative scales. Whatever the means used for scaling, scale development should also serve the formative, diagnostic function of the system.

### Novel Item Formats for Improving the Quality of Testing

Quality of testing can be defined in terms of validity (including predictive and diagnostic validity), reliability, and objectivity. In this section, we show how new item formats made possible by technology can improve the quality of testing. A number of media effect studies have been carried out in past decades to explore most aspects of assessments. The quality of TBA is usually compared to paper-and-pencil or face-to-face testing, so we also compare the eDia items to these traditional testing modes. Technology offers numerous new options both in presenting stimuli and in capturing students' responses that are not possible through traditional testing modes; in addition, technology improves objectivity and validity significantly (for a detailed discussion of technological issues, see Csapó et al., 2012).

#### New Forms of Stimuli

Use of technology expands the possibilities of creating more life-like situations and using more authentic stimuli. There are three ways to develop computer-based tests, tasks, and items. First, tests/tasks/items can be prepared according to traditional approaches with designs based on paper-and-pencil techniques. Texts, static images, schematic figures, and graphs are also available on paper, but their richness and variety represent an added value of TBA. We call these kinds of computer-based tasks first-generation tasks (Molnár et al., 2017). Second-generation tests contain tasks with new formats, including multimedia (e.g., animation, video and audio), constructed response, automatic item generation, and automatic scoring tests (Pachler et al., 2010), thus increasing the level of authenticity and the power of assessment. These types of tasks cannot be administered in paper-and-pencil format. Finally, third-generation tests dramatically increase the level of reality and the number of ways students can demonstrate their skills as they allow students to interact with complex scenarios (e.g., complex problem-solving items in the MicroDYN approach), simulations (html documents to imitate a closed internet environment), situations (e.g., GeoGebra elements), and dynamically changing items and/or to collaborate online with other students to solve dynamically changing, interactive problem-solving items. All of these options are implemented and available for item development in the eDia system.

Any kind of multimedia, animation, video, voice, etc. provides authentic content, improves validity, and serves specific functions. Special accommodations can be embedded into technology-based tests; for example, validity of test results can be enhanced by providing instructions both in an on-screen written form and with a pre-recorded voice, thereby preventing failures caused by students' reading difficulties. Thus, in the eDia system, students in Grades 1–3 can listen to instructions on headphones while the tests are being administered. It is also possible to standardize the test environment by controlling the presentation of information in different ways (e.g., timing and a given number of repetitions).

#### New Forms for Response Capture

Use of technology changes not only the forms of stimuli but also those of response capture. In the traditional test environment, responses were captured mainly by circling, ticking, X-ing, underlining, or writing letters, numbers, words or sentences. The TBA environment expands these options, but this expansion strongly depends on the technology used. The possibilities for response capture differ between a tablet and a desktop computer. The eDia system is prepared for both. However, as the keyboard and mouse are used for input in most Hungarian schools, the eDia task responses are optimized for them.

The TBA environment makes it possible to expand the possibilities of manipulation with task elements and to realize the following forms of response capture with a mouse: (1) clicking on form elements (radio button and checkbox), (2) using a drop-down menu, (3) clicking on pictures or parts of pictures, (4) clicking on texts or parts of texts, (5) coloring shapes or pictures or parts of them by clicking, (6) sequencing by ordering mouse clicks, (7) connecting two task elements with lines or arrows, (8) constructing answers with on-screen manipulations with drag-and-drop letters, words, sentences, numbers, shapes, pictures, voices, sounds, animations, simulations, etc., that is, all kinds of task elements, and (9) using sliders and functions or other changeable and interactive task elements. Other possibilities are available with the keyboard, such as typing letters, numbers, and words. Logging and analyzing log data by measuring response time, mouse movement, and navigation sequence to describe the activity of the students during testing can also contribute to more elaborated feedback; however, further studies are required to explore how to use these methods more effectively. All these possibilities for logging students' activities while they respond to items are available in the eDia system.

#### Complex Item Formats: Interactivity and Simulation

The eDia system was prepared to administer third-generation tests. The MicroDYN-based assessment of problem-solving (Greiff and Funke, 2009; Greiff et al., 2013; Molnár and Csapó, 2018) is available with a large number of items. One of the benefits of MicroDYN is that it allows various independent and dependent variables, and different connections may be defined between them for the simulated systems. The difficulty level of the task may thus easily be changed. A further expansion of this conception is the assessment of collaborative problem-solving. It makes it possible to use a real human-human scenario during data collection (Pásztor-Kovács et al., 2018). This allows more social interaction, compared to the PISA 2015 collaborative problem-solving assessment, which used human-agent interaction (OECD, 2017). Further simulation-based items were used on an ICT literacy test (Tongori, 2018). These complex item formats have been used for assessments beyond the diagnostic system and for experimentation and research, and these experiences will also be applied to the diagnostic assessments.

### BEYOND DIAGNOSTIC ASSESSMENT: eDia AS A RESEARCH INSTRUMENT

Beyond its main purpose of providing diagnostic assessments, the eDia platform has been used in a number of other domains and in research projects as well. In this section, we review the research in which data were collected by eDia.

### Further Assessment Domains Implemented in eDia

At present, there are over 20 further domains (called minor domains) for which tests or test batteries are implemented on the eDia platform. The general principle is that different tests, linked with anchor items, are prepared for the different age groups.

Supporting the kindergarten-school transition with assessment instruments is one of the current extensions of eDia. First, the DIFER test battery, a broadly used face-to-face instrument, was digitized, and then the traditional and online delivery methods were compared. Results from the media effect study indicated that the two versions (face-to-face vs. online) were equivalent and that the digitized version was not only more convenient to use, but its objectivity and reliability had also improved on some subtests (Csapó et al., 2014). Based on these experiences, a new school readiness test battery has been developed and optimized for online assessment, which can be used in kindergarten with tablets (Csapó et al., 2017, 2018).

Several instruments were devised for assessments of curricular areas beyond the three major domains. The media effect on composing skills was studied with primary school students (Nagy, 2015). A test of musical abilities used pre-recorded sound stimuli for melody and rhythm (Asztalos and Csapó, 2017). Several tests were prepared for English and German as a Second Language (reading, listening, and vocabulary), while the TBA made it possible to use authentic voice recordings to assess listening skills (Vígh et al., 2015; Nikolov and Csapó, 2017, 2018; Habók and Magyar, 2018a, 2019). Assessments of visual skills benefitted especially from the possibilities of rich illustrations (Kárpáti et al., 2015). Online tests have also been prepared for cross-curricular competencies, such as learning to learn (Habók, 2015; Vainikainen et al., 2015), health literacy (Nagy et al., 2015), financial literacy (Tóth, 2015), ICT literacy (Molnár et al., 2015b), and civic competencies (Kinyó, 2015).

Assessment of a variety of reasoning skills, mostly operational reasoning skills, is embedded in the psychological dimension of mathematics and science. However, there are some skills that play a distinct role in learning and cognitive development; therefore, comprehensive instruments have been prepared to assess them. Inductive reasoning is one of the most frequently assessed higher-order thinking skills, and several inductive reasoning tests have been developed for eDia as well. First, a widely used paper-and-pencil inductive reasoning test (verbal and numerical analogies, and number series, see Csapó, 1997) was migrated to the digital platform (Csapó et al., 2009). Later, other tests based on Klauer's model (see, e.g., Klauer and Phye, 2008) were prepared (Molnár et al., 2013) and used in a number of national and international projects. Specific item formats were developed to assess dynamic problem-solving (the MicroDYN base, see Molnár and Pásztor-Kovács, 2015; Csapó and Molnár, 2017a), collaborative problem-solving (e.g., interactivity and communicating with pre-defined messages, see Pásztor-Kovács et al., 2018), creativity (divergent thinking and a program for counting rare solutions, see Pásztor et al., 2015), and combinatorial reasoning (drag-and-drop to combine elements and an algorithm to distinguish valid and invalid combinations, see Pásztor et al., 2015).

Tests, test batteries, and questionnaires beyond the cognitive domain are also implemented through eDia. Some of them are essential for successful learning, but because of the lack of easy-to-use instruments, they are rarely assessed. Motivation is one such affective attribute, and a related mastery motivation questionnaire is available on eDia (Józsa et al., 2015; Zsolnai and Kasik, 2015), as well as a self-regulated foreign language learning strategy questionnaire (Habók and Magyar, 2018b). The PISA 2000 learning strategy questionnaire (Artelt et al., 2003) has also been implemented and used in several projects (e.g., Csapó and Molnár, 2017a). Experimenting with the assessment of further affective and social skills is also in progress (e.g., Zsolnai and Kasik, 2015).

The eDia platform has been used in higher education. For example, in 2015, the University of Szeged introduced an assessment system to explore how well incoming students are prepared for university studies. In the first year, six tests were administered through eDia: Hungarian language and literature (with a strong reading comprehension component), mathematics, history, science and English as a foreign language as well as a dynamic problem-solving test (Csapó and Molnár, 2017a). Since then, the system has evolved further (Molnár and Csapó, 2019b).

### Applications of eDia in International Assessments; Comparative Studies

The eDia system has been used for research within international collaborative projects carried out by the University of Szeged Centre for Research on Learning and Instruction and supports investigations by PhD students at the Doctoral School of Education at the same university. In this section, we review some results of these efforts, highlighting new opportunities for educational research offered by the online assessment.

In Finland, the Centre for Educational Assessment, University of Helsinki, cooperates with Vantaa city schools in using tablets in everyday teaching and learning processes. Within the framework of this project, Hungarian tests were translated into Finnish and assessments were carried out in both countries using the same instruments, with the tests delivered from the University of Szeged servers (Hotulainen et al., 2018; Pásztor et al., 2018). The first results may indicate the impact of frequent testing, but further studies would be required to uncover the mechanisms.

The tests for assessing thinking skills implemented in eDia have been used in several international studies. The knowledge acquisition phase of dynamic problem-solving involves two further skills, combinatorial reasoning (systematically combining possible values of independent variables) and inductive reasoning (rule induction and generalizing the experience of interactions). The relationships of these skills were explored; the dynamic problem-solving tests, together with combinatorial and inductive reasoning tests, were translated into Chinese and were administered to Chinese students. The results indicated a stronger impact of combinatorial reasoning than that of inductive reasoning (Wu and Molnár, 2018a). The relationship between problem-solving, creativity, inductive reasoning, and working memory was explored in a similar study (Wu and Molnár, 2018b). In Namibia, the relationship between scientific reasoning and motivation to learn science was examined (Kambeyo et al., 2017) as well as the possibilities of online assessment of scientific inquiry skills. These studies indicated that online assessment is feasible even with a modest school infrastructure.

Another set of studies was completed on learning foreign languages in three countries, Mongolia (Ragchaa, 2017), Kazakhstan (Akhmetova and Csapó, 2018), and Azerbaijan (Karimova and Csapó, 2018), where the two most frequently studied foreign languages are English and Russian. Thus, these countries offer different contexts and sets of conditions than those of Hungary, where the main foreign languages are English and German (see, e.g., Nikolov and Csapó, 2018). Another difference is that these countries use the Cyrillic alphabet. Several research questions were explored in these studies on learning foreign languages with eDia-based instruments, including the development of receptive skills, self-concept and learning strategies.

### Assessment Platform for the Hungarian Educational Longitudinal Program

The Hungarian Educational Longitudinal Program (HELP) was launched in 2003 and is maintained by the MTA-SZTE Research Group on the Development of Competencies (Csapó, 2007). A new cohort (a nationally representative sample of approx. 6,000 students) is added to the program every 4 years, with students being monitored from the beginning of schooling to the end of compulsory education. Data collection has focused on three main domains, reading, mathematics, and science, and data are systematically collected on a number of cognitive, affective, and contextual variables. The online assessment has been gradually introduced to the data collection effort (e.g., languages have been tested online, see Nikolov and Csapó, 2018), with the cohort that entered school in 2015 having been exclusively assessed with the eDia instruments. The benefit of longitudinal research from the perspective of developing the diagnostic system is that it offers a nationally representative sample for scale development and for determining the predictive power of certain instruments (e.g., school readiness tests, see Csapó et al., 2018).

### DISCUSSION AND CONCLUSIONS

### Practical Relevance and Limitations of the Online Assessment

Systematic feedback is a basic condition for the operation and development of any complex system, and providing students and teachers with an inexpensive, easy-to-use, valid, and reliable assessment system may significantly contribute to solving certain crucial problems of education today. Making it possible to measure the different dimensions of learning separately, especially the mostly hidden psychological dimension, i.e., thinking and cognitive development, may support meaningful learning and a deeper conceptual understanding. (Empirical studies concerning these assumptions are in progress; see also Molnár and Csapó, 2019a.)

Teachers see the differences between their students and realize if some of their students fail, but without proper instruments teachers cannot determine the nature and magnitude of the differences with precision. Diagnostic assessments support the personalization of learning, adjusting teaching to students' personal needs. Teachers routinely use certain types of formative assessment (mostly based on their subjective observation), and we may assume that with better instruments they will teach better. However, we may not assume that they will be able to fully exploit the potential of online diagnostic assessments; they need training to empower them. Several training programs (from one-day introductory workshops to two-year training of assessment experts) are available within the framework of the project. Ideally, the teacher-training component is an in-service adaptation of research-based teacher education (see, e.g., Munthe and Rogne, 2015).

As there is a growing concern among teachers about high-stakes testing and the use of its results for accountability (Tóth, 2011), monitoring their views on diagnostic assessment will be an important task. An indicator of acceptance of eDia is that teachers and schools have been participating in the assessments voluntarily, with informal communication confirming its acceptance as well. Formal surveys will be needed to gain a better understanding of teachers' opinions.

Finally, we have to emphasize that an assessment instrument alone does not improve the quality of learning; its practical impact depends on how the information it provides is used to change teaching and learning processes. To better use the power of feedback, the conception of classroom teaching should be fundamentally changed; there is a need for new models of teaching and learning, where students' individual needs are better served. Such models have existed for decades, but the lack of appropriate tools has hindered large-scale use. In the most general terms, Mastery Learning is one such model, which, supported with online pre-tests and post-tests, may gain a new impetus (Csapó and Molnár, 2017b). There are also several promising new models which stress the role of regular feedback and use of assessment data made possible by TBA, e.g., data-based teaching (Datnow and Hubbard, 2016) and assessment-powered teaching (Sindelar, 2010). Experience in the areas of computer-aided instruction and tutoring systems (Kulik and Fletcher, 2016; Chauhan, 2017) may be used, especially in stimulating students' development in the psychological dimensions when diagnostic assessments indicate the need for such intervention.

### Further Research Prospects

Regular diagnostic assessments generate large databases, which make possible the kinds of sophisticated analyses that are already under way in other areas (see research on the "data revolution" and "big data"). Educational data mining and process mining have already produced results that are applicable in practice (Tóth et al., 2017). Certain methods developed within the paradigm of learning analytics may also be used to process the databases produced by diagnostic assessments.

Log file analysis is the easiest and most appropriate new method for using new types of assessment data (metadata and log data). An easily recordable and already routinely used piece of information is the time students spend on certain activities when completing online tasks; time-on-task analyses, among other methods, may indicate students' attention and motivation. Some item types (combinatorial reasoning task enumerations, MicroDYN items and collaborative problem-solving activities) allow the recording of more detailed information on students' reasoning. Some analyses (e.g., latent class analyses) using data collected with eDia have already been conducted (Greiff et al., 2018; Molnár and Csapó, 2018), but further research is needed to find ways to make practical use of these results, adding new analytical modules to the eDia platform, creating new, log data-based indicators and supporting students' cognitive development in the long run.
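As an illustration of the time-on-task indicator mentioned above, the following Python sketch sums the seconds a student spends on each item from a stream of navigation events. The log format (tuples of student, item, timestamp in seconds, and an "enter" or "leave" action) and the function name `time_on_task` are hypothetical; the actual eDia log structure is not described in this paper.

```python
from collections import defaultdict


def time_on_task(events):
    """Sum the seconds each student spent on each item.

    events -- iterable of (student, item, timestamp_s, action) tuples, where
              action is "enter" or "leave" (a hypothetical log format).
    """
    opened = {}                  # (student, item) -> timestamp of last "enter"
    totals = defaultdict(float)  # (student, item) -> accumulated seconds
    for student, item, ts, action in sorted(events, key=lambda e: e[2]):
        key = (student, item)
        if action == "enter":
            opened[key] = ts
        elif action == "leave" and key in opened:
            totals[key] += ts - opened.pop(key)
    return dict(totals)


log = [
    ("s1", "item_a", 0.0, "enter"),
    ("s1", "item_a", 42.5, "leave"),
    ("s1", "item_b", 42.5, "enter"),
    ("s1", "item_b", 60.0, "leave"),
    ("s1", "item_a", 60.0, "enter"),  # student navigates back to item_a
    ("s1", "item_a", 70.0, "leave"),
]
print(time_on_task(log))
```

Repeated visits to the same item are accumulated, so the totals reflect all time spent on an item rather than only the first visit.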

### AUTHOR CONTRIBUTIONS

Both of the authors, BC and GM, certify that they have participated sufficiently in the study to take responsibility for the content, including writing and final approval of the manuscript. Each author agrees to be accountable for all aspects of the paper.

### FUNDING

Preparation of this article was funded by the Hungarian Academy of Sciences through the MTA-SZTE Research Group on the Development of Competencies and by OTKA K115497.

### ACKNOWLEDGMENTS

The authors wish to thank all the developers of eDia for their contributions, including framework developers, programmers, item writers and support staff at the Centre for Research on Learning and Instruction. Special thanks to the teachers at our partner schools who have been encouraging us and using the diagnostic system in their everyday work.

### REFERENCES


*for large-scale testing*. eds. F. Scheuermann and J. Björnsson (Luxemburg: Office for Official Publications of the European Communities), 63–67.


[Diagnostic assessment of young learners' basic English and German vocabulary]" in *Online diagnosztikus mérések az iskola kezdő szakaszában [Online diagnostic assessments in the beginning phase of schooling]*. eds. B. Csapó, and A. Zsolnai (Budapest: Oktatáskutató és Fejlesztő Intézet), 13–33.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Csapó and Molnár. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# The Argument for a "Data Cube" for Large-Scale Psychometric Data

Alina A. von Davier\*, Pak Chung Wong, Steve Polyak and Michael Yudelson

*ACTNext, ACT Inc, Iowa City, IA, United States*

In recent years, work with educational testing data has changed due to the affordances provided by technology, the availability of large data sets, and by the advances made in data mining and machine learning. Consequently, data analysis has moved from traditional psychometrics to computational psychometrics. Despite advances in the methodology and the availability of the large data sets collected at each administration, the way assessment data is collected, stored, and analyzed by testing organizations is not conducive to these real-time, data intensive computational methods that can reveal new patterns and information about students. In this paper, we propose a new way to label, collect, and store data from large scale educational learning and assessment systems (LAS) using the concept of the "data cube." This paradigm will make the application of machine-learning, learning analytics, and complex analyses possible. It will also allow for storing the content for tests (items) and instruction (videos, simulations, items with scaffolds) as data, which opens up new avenues for personalized learning. This data paradigm will allow us to innovate at a scale far beyond the hypothesis-driven, small-scale research that has characterized educational research in the past.

#### Edited by:

*Frank Goldhammer, German Institute for International Educational Research (LG), Germany*

#### Reviewed by:

*Pei Sun, Tsinghua University, China*

*Hendrik Drachsler, German Institute for International Educational Research (LG), Germany*

\*Correspondence:

*Alina A. von Davier*

*Alina.vonDavier@act.org*

#### Specialty section:

*This article was submitted to Educational Psychology, a section of the journal Frontiers in Education*

Received: *19 November 2018* Accepted: *03 July 2019* Published: *18 July 2019*

#### Citation:

*von Davier AA, Wong PC, Polyak S and Yudelson M (2019) The Argument for a "Data Cube" for Large-Scale Psychometric Data. Front. Educ. 4:71. doi: 10.3389/feduc.2019.00071*

Keywords: database alignment, learning analytics, diagnostic models, learning pathways, data standards

### INTRODUCTION

In recent years, work with educational testing data has changed due to the affordances provided by technology, the availability of large data sets, and advances made in data mining and machine learning. Consequently, data analysis has moved from traditional psychometrics to computational psychometrics. In the computational psychometrics framework, psychometric theory is blended with large scale, data-driven knowledge discovery (von Davier, 2017). Despite advances in the methodology and the availability of the large data sets collected at each test administration, the way the data (from multiple test forms at multiple test administrations) is currently collected, stored and analyzed by testing organizations is not conducive to these real-time, data intensive computational psychometrics and analytics methods that can reveal new patterns and information about students.

In this paper we primarily focus on data collected from large-scale standardized testing programs that have been around for decades and that have multiple administrations per year. Recently, many testing organizations have started to consider including performance or activity-based tasks in the assessments, developing formative assessments, or embedding assessments into the learning process, which has led to new challenges around data governance: data design, collection, alignment, and storage. Some of these challenges have similarities with those encountered and addressed in the field of learning analytics, in which multiple types of data are merged to provide a comprehensive picture of students' progress. For example, Bakharia et al. (2016), Cooper (2014) and Rayon et al. (2014) propose solutions for the interoperability of learning data coming from multiple sources. In recent years, testing organizations have started to work with logfiles, and even before data exchange standards for activities and events, such as the Caliper or xAPI standards, were developed, researchers had worked on designing the data schema for this type of rich data (see Hao et al., 2016). The approach presented in this paper builds conceptually on these approaches and focuses on data governance for testing organizations.

### Database Alignment

In this paper, we propose a new way to label, collect, and store data from large-scale educational learning and assessment systems (LAS) using the concept of the "data cube," which was introduced by data scientists in the past decade to deal with big data stratification problems in marketing contexts. This concept is also mentioned by Cooper (2014) in the context of interoperability for learning analytics. In statistics and data science the data cube is related to the concept of database alignment, where multiple databases are aligned on various dimensions under some prerequisites (see Gilbert et al., 2017). Applying this paradigm to educational test data is quite challenging, due to the lack of coherence in traditional content tagging, of a common identity management system for test-takers across testing instruments, of collaboration between psychometricians and data scientists, and, until recently, of proven validity for the newly proposed machine learning methods for measurement. Currently, data for psychometrics is stored and analyzed as a two-dimensional matrix (item by examinee). In the time of big data, the expectation is not only that one has access to large volumes of data, but also that the data can be aligned and analyzed on different dimensions in real time, including on various item features such as content standards.

The best part is that the testing data available from the large testing organizations is valid (the test scores measure what they are supposed to measure, and these validity indices are known) and data privacy policies have been followed appropriately when the data was collected. These are two important features that support quality data and the statistical alignment of separate databases (see Gilbert et al., 2017).

### Data Cubes

The idea of relational databases has evolved over time, but the paradigm of the "data cube" is easy to describe. Obviously, the "data cube" is not a cube, given that different data-vectors are of different lengths. A (multidimensional) data cube is designed to organize the data by grouping it into different dimensions, indexing the data, and precomputing frequently used queries. Psychometricians and data scientists can interactively navigate their data and visualize the results through slicing, dicing, drilling, rolling, and pivoting, which are various ways to query the data in a data science vocabulary. Because all the data are indexed and precomputed, a data cube query often runs significantly faster than a standard query. Once a data cube is built and precomputed, intuitive data projections on different dimensions can be applied to it through a number of operations. Traditional psychometric models can also be applied at scale and in real time in ways which were not possible before.

### Content as Data

Additionally, in this paper we expand the traditional definition of educational data (learning and testing data) to include the content (items, passages, scaffolding to support learning), taxonomies (educational standards, domain specification), the items' metadata (including item statistics, skills and attributes associated with each item), alongside the students' demographics, responses, and process data. Rayon et al. (2014) and Bakharia et al. (2016) also proposed including the content and context for learning data in their data interoperability structures for learning analytics, Scalable Competence Assessment through a Learning Analytics approach (SCALA), and Connected Learning Analytics (CLA) tool kit, respectively. The difference from their approach is in the specifics of the content for tests (items), usage in psychometrics (item banks with metadata), and domain structures such as taxonomies or learning progressions. In addition, we propose a natural language processing (NLP) perspective on these data types that facilitates the analysis and integration with the other types of data.

Any meaningful learning and assessment system is based on a good match of the samples of items and test takers, in terms of the difficulty and content on the items' side, and ability and educational needs on the students' side. In order to facilitate this match at scale, the responses to the test items, the items themselves and their metadata, and demographic data need to be aligned. Traditionally, in testing data, we collected and stored the students' responses and the demographic data, but the items, instructional content, and the standards have often been stored as narrative and have often not been developed, tagged, or stored in a consistent way. There are numerous systems for authoring test content, from paper-based, to Excel spreadsheets, to sophisticated systems. Similarly, the taxonomies or theoretical frameworks by which the content is tagged are also stored in different formats and systems, again from paper to open-source systems, such as OpenSALT. OpenSALT is an Open-source **S**tandards **AL**ignment **T**ool that can be used to inspect, ingest, edit, export and build crosswalks of standards expressed using the IMS Global Competencies and Academic Standards Exchange (CASE) format; we will refer to data standards and models in more detail later in the paper. Some testing programs have well-designed item banks where the items and their metadata are stored, but often the content metadata is not necessarily attached to a taxonomy.

We propose that we rewrite the taxonomies and standards as data in NLP structures that may take the form of sets, or mathematical vectors, and add these vectors as dimensions to the "data cube." Similarly, we should vectorize the items' metadata and/or item models and align them on different dimensions of the "cube."
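
As a minimal sketch of what "standards as data" could look like, the following Python snippet turns the text of a few standards into sets and into count vectors over a shared vocabulary, and compares them. The standard identifiers and their wording are invented for illustration, and the bag-of-words representation is only one simple assumption about the NLP structure; it is not the representation used in the work reported here.

```python
import math
import re
from collections import Counter

# Hypothetical standards from two taxonomies; ids and wording are invented.
standards = {
    "TAX_A.1": "Solve linear equations in one variable",
    "TAX_B.7": "Solve equations and inequalities in one variable",
    "TAX_B.9": "Compute the area of a circle",
}


def tokens(text):
    """Lowercase word tokens of a standard's description."""
    return re.findall(r"[a-z]+", text.lower())


def as_vector(text, vocabulary):
    """Count vector of the description over a fixed, ordered vocabulary."""
    counts = Counter(tokens(text))
    return [counts[word] for word in vocabulary]


def jaccard(a, b):
    """Set overlap: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Set form and vector form of every standard.
vocabulary = sorted({w for text in standards.values() for w in tokens(text)})
sets = {sid: set(tokens(text)) for sid, text in standards.items()}
vectors = {sid: as_vector(text, vocabulary) for sid, text in standards.items()}

print(jaccard(sets["TAX_A.1"], sets["TAX_B.7"]))
print(cosine(vectors["TAX_A.1"], vectors["TAX_B.9"]))
```

Once standards are stored in this form, each vector can be attached to the cube as one more dimension, and similarity between standards from different taxonomies becomes an ordinary computation rather than a manual reading task.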

### Data Lakes

The proposed data cube concept could be embedded within the larger context of psychometric data, such as ACT's data lake. At ACT, we are building the **LE**arning **A**nalytics **P**latform (LEAP) for which we proposed an updated version of this data-structure: the in-memory database technology that allows for newer interactive visualization tools to query a higher number of data dimensions interactively. A data lake is a storage solution based on an ability to host large amounts of unprocessed, raw data in the format the sender provides. This includes a range of data representations such as structured, semi-structured, and unstructured. Typically, in a data lake solution, the data structure, and the process for formally accessing it, are not defined until the point where access is required. An architecture for a data lake is typically based on a highly distributed, flexible, scalable storage solution like the Hadoop Distributed File System (HDFS). These types of tools are becoming familiar to testing organizations, as the volume and richness of event data increase. They also facilitate a parallel computational approach for the parameter estimation of complex psychometric models applied to large data sets (see von Davier, 2016).

### Data Standards for Exchange

Data standards allow those interoperating in a data ecosystem to access and work with this complex, high-dimensional data (see for example, Cooper, 2014). Several data standards exist in the education space that allow schools, testing, and learning companies to share information and build new knowledge, such as combining the test scores with the GPA, attendance data, and demographics for each student in order to identify meaningful patterns that may lead to differentiated instruction or interventions to help students improve. We will describe several of these standards and emphasize the need for universal adoption of data standards for better collaboration and better learning analytics at scale.

In the rest of the paper, we describe the evolution of data storage and the usefulness of the data cube paradigm for large-scale psychometric data. We then describe the approach we are considering for testing and learning data (including the content). In the last section, we present preliminary results from a real-data example of the alignment of two taxonomies from the taxonomy-dimension in the "data cube."

### THE FOUNDATIONS OF THE DATA CUBE AND ITS EXTENSIONS

#### Background and Terminology

In computer science literature, a data cube is a multidimensional data structure, or a data array in a computer programming context. Despite the implicit 3D structural concept derived from the word "cube," a data cube can represent any number of data dimensions such as 1D, 2D. . . nD. In scientific computing studies, such as computational fluid dynamics, data structures similar to a data cube are often referred to as scalars (rank 0), vectors (rank 1), or tensors (rank 2 and higher). We will briefly discuss the concept of the relational data model (Codd, 1970) and the corresponding relational database management system (RDBMS) developed in the 1970s, followed by the concept of the data warehouse (Inmon, 1992; Devlin, 1996) developed in the 1980s. Together they contributed to the development of the data cube (Gray et al., 1996) concept in the 1990s.


### Relational Data Model and Relational Database Management System (RDBMS)

In a relational data model, data are stored in a table with rows and columns that looks similar to a spreadsheet, as shown in **Figure 1**. The columns are referred to as attributes or fields, the rows are called tuples or records, and the table that comprises a set of columns and rows is called the relation in RDBMS literature.

The technology was developed when CPU speed was slow, memory was expensive, and disk space was limited. Consequently, design goals were influenced by the need to eliminate the redundancies (or duplicated information), such as "2015" in the Year column in **Figure 1**, through the concept of normalization. The data normalization process involves breaking down a large table into smaller ones through a series of normal forms (or procedures). The discussion of the normalization process is important, but beyond the scope of this paper. Readers are referred to Codd (1970) for further details.

Information retrieval from these normalized tables can be done by joining these tables through the use of unique keys identified during the normalization process. The standard RDBMS language for maintaining and querying a relational database is Structured Query Language (SQL). Variants of SQL can still be found in most modern day databases and spreadsheet systems.
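
To make the normalization and join steps concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names, column names, and scores are assumptions made for illustration and do not reproduce **Figure 1**; the point is only that names and subjects are stored once and recombined with the scores through keys.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE subject (subject_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE score (
        student_id INTEGER REFERENCES student(student_id),
        subject_id INTEGER REFERENCES subject(subject_id),
        year       INTEGER,
        value      INTEGER
    );
    INSERT INTO student VALUES (1, 'Noah'), (2, 'Chloe');
    INSERT INTO subject VALUES (1, 'Science'), (2, 'Math');
    INSERT INTO score VALUES (1, 1, 2015, 3), (1, 2, 2015, 4), (2, 2, 2015, 5);
    """
)

# Rebuild the wide, spreadsheet-like view by joining on the keys.
rows = conn.execute(
    """
    SELECT student.name, subject.title, score.year, score.value
    FROM score
    JOIN student ON student.student_id = score.student_id
    JOIN subject ON subject.subject_id = score.subject_id
    ORDER BY student.name, subject.title
    """
).fetchall()

for row in rows:
    print(row)
```

Each name and subject title is stored exactly once; the `score` table holds only keys and values, which is the redundancy reduction that normalization is meant to achieve.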

### Data Warehousing

The concept of data warehousing was presented by Devlin and Murphy in 1988, as described by Hayes (2002). A data warehouse is primarily a data repository from one or more disparate sources, such as marketing or sales data. Within an enterprise system, such as those commonly found in many large organizations, it is not uncommon to find multiple systems operating independently, even though they all share the same stored data for market research, data mining, and decision support. The role of data warehousing is to eliminate the duplicated efforts in each decision support system. A data warehouse typically includes some business intelligence tools, tools to extract, transform, and load data into the repository, as well as tools to manage and retrieve the data. Running complex SQL queries on a large data warehouse, however, can be time consuming and too costly to be practical.

### Data Cube

Due to the limitations of the data warehousing described above, data scientists developed the data cube. A data cube is designed to organize the data by grouping it into different dimensions, indexing the data, and precomputing frequently used queries. Because all the data are indexed and precomputed, a data cube query often runs significantly faster than a standard SQL query. In business intelligence applications, the data cube concept is often referred to as Online Analytical Processing (OLAP).
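
The effect of precomputation can be illustrated with a short Python sketch. It assumes a tiny, invented fact table and precomputes the sum and count of scores for every combination of dimensions, using `"*"` to mean "all values" on a dimension. The function and variable names are hypothetical, and real OLAP engines use far more elaborate storage and indexing.

```python
from itertools import combinations

DIMENSIONS = ("name", "subject", "year")
ALL = "*"  # wildcard meaning "aggregated over this dimension"

# Hypothetical fact records: (name, subject, year, score).
facts = [
    ("Noah", "Math", 2015, 4),
    ("Noah", "Math", 2016, 2),
    ("Chloe", "Math", 2015, 5),
    ("Chloe", "Science", 2015, 3),
]


def precompute(facts):
    """Store (sum, count) for every subset of dimensions each fact belongs to."""
    cube = {}
    positions = range(len(DIMENSIONS))
    for *coords, score in facts:
        for r in range(len(DIMENSIONS) + 1):
            for kept in combinations(positions, r):
                key = tuple(coords[i] if i in kept else ALL for i in positions)
                total, count = cube.get(key, (0, 0))
                cube[key] = (total + score, count + 1)
    return cube


CUBE = precompute(facts)


def mean_score(name=ALL, subject=ALL, year=ALL):
    """Answer an aggregate query with one dictionary lookup."""
    total, count = CUBE[(name, subject, year)]
    return total / count


print(mean_score(subject="Math"))
print(mean_score(name="Noah"))
```

The cost is paid once, when the cube is built; afterwards a query such as the mean Math score no longer scans the fact records.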

### Online Analytical Processing (OLAP) and Business Intelligence

The business sector developed OnLine Analytical Processing technology (OLAP) to conduct business intelligence analysis and look for insights. An OLAP data cube is indeed a multidimensional array of data. For example, the data cube in **Figure 2** represents the same relational data table shown in **Figure 1** with scores from multiple years (i.e., 2015–2017) of the same five students (Noah, Chloe, Ada, Jacob, and Emily) in three academic fields (Science, Math, and Technology). Once again, there is no limitation on the number of dimensions within an OLAP data cube; the 3D cube in **Figure 2** is simply for illustrative purposes. Once a data cube is built and precomputed, intuitive data projections (i.e., mapping of a set into a subset) can be applied to it through a number of operations.

Describing data as a cube has a lot of advantages when analyzing the data. Users can interactively navigate their data and visualize the results through slicing, dicing, drilling, rolling, and pivoting.

#### Slicing

Given a data cube, such as the one shown in **Figure 2**, users can, for example, extract a part of the data by slicing a rectangular portion of it from the cube, as highlighted in blue in **Figure 3A**. The result is a smaller cube that contains only the 2015 data in **Figure 3B**. Users can slice a cube along any dimension. For example, **Figure 4** shows an example of slicing along the Name dimension highlighted in blue, and **Figure 5** shows an example of slicing along the Subject dimension.
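
A slice can be sketched in a few lines of Python if the cube is represented, as an assumption for illustration, by a dictionary keyed by (name, subject, year). The scores below are placeholders generated by a formula, not the values shown in the figures.

```python
NAMES = ["Noah", "Chloe", "Ada", "Jacob", "Emily"]
SUBJECTS = ["Science", "Math", "Technology"]
YEARS = [2015, 2016, 2017]
AXIS = {"name": 0, "subject": 1, "year": 2}

# Placeholder scores between 1 and 5; not the values in the figures.
cube = {
    (n, s, y): (i + j + k) % 5 + 1
    for i, n in enumerate(NAMES)
    for j, s in enumerate(SUBJECTS)
    for k, y in enumerate(YEARS)
}


def slice_cube(cube, axis, value):
    """Keep the cells with the given value on one axis and drop that axis."""
    return {
        key[:axis] + key[axis + 1:]: score
        for key, score in cube.items()
        if key[axis] == value
    }


scores_2015 = slice_cube(cube, AXIS["year"], 2015)      # keyed by (name, subject)
emily_scores = slice_cube(cube, AXIS["name"], "Emily")  # keyed by (subject, year)

print(len(cube), len(scores_2015), len(emily_scores))
```

Slicing fixes a single value on one dimension, so the result has one dimension fewer than the original cube.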

#### Dicing

The dicing operation is similar to slicing, except dicing allows users to pick specific values along multiple dimensions. In **Figure 6**, the dicing operation is applied to both Name (Chloe, Ada, and Jacob) and Subject (Calculus and Algebra) dimensions. The result is a small 2 × 3 × 3 cube shown in the second part of **Figure 6**.
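
The dicing example above can be sketched in Python in the same spirit, again assuming a dictionary keyed by (name, subject, year) with placeholder scores. The helper names `dice` and `shape` are hypothetical.

```python
AXES = ("name", "subject", "year")
NAMES = ["Noah", "Chloe", "Ada", "Jacob", "Emily"]
SUBJECTS = ["Calculus", "Algebra", "Topology"]
YEARS = [2015, 2016, 2017]

# Placeholder scores between 1 and 5; not the values in the figures.
cube = {
    (n, s, y): (i + j + k) % 5 + 1
    for i, n in enumerate(NAMES)
    for j, s in enumerate(SUBJECTS)
    for k, y in enumerate(YEARS)
}


def dice(cube, **allowed):
    """Keep cells whose coordinate on each named axis is in the allowed set."""
    def keep(key):
        return all(key[AXES.index(axis)] in values for axis, values in allowed.items())
    return {key: score for key, score in cube.items() if keep(key)}


def shape(cube):
    """Number of distinct values on each axis, in AXES order."""
    return tuple(len({key[i] for key in cube}) for i in range(len(AXES)))


diced = dice(cube, name={"Chloe", "Ada", "Jacob"}, subject={"Calculus", "Algebra"})
print(shape(diced))
```

Unlike a slice, the diced result keeps all three dimensions; it is simply smaller along the axes that were restricted (three names, two subjects, and all three years, which matches the 2 × 3 × 3 cube described above up to the order of the axes).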

#### Drilling

Drilling-up and -down are standard data navigation approaches for multi-dimensional data mining. Drilling-up often involves an aggregation (such as averaging) of a set of attributes, whereas drilling-down brings back the details of a prior drilling-up process.

The drilling operation is particularly useful when dealing with core academic skills that can be best described as a hierarchy. For example, **Figure 7A** shows four skills of Mathematics (i.e., Number and Quantity; Operations, Algebra, and Functions; Geometry and Measurement; and Statistics and Probability) as defined by the ACT Holistic Framework (Camara et al., 2015). Each of these skill sets can be further divided into finer subskills. **Figure 7B** shows an example of dividing the Number and Quantity skill from **Figure 7A** into eight sub-skills—from Counting and Cardinality to Vectors and Matrices.

**Figure 8** shows a drill-down operation in a data cube that first slices along the Subject dimension with the value "Math." The result is a slice of only the Math scores for all five names from 2015 to 2017 in **Figure 8**. The drilling-down operation in **Figure 8** then reveals the three Math sub-scores of Calculus, Algebra, and Topology that the single Math score summarizes. For example, Emily's 2015 Math score is 2, which is the average of her Calculus (1), Algebra (3), and Topology (2) scores as depicted in **Figure 8**.

The drilling-up operation can go beyond aggregation and can apply rules or mathematical equations to multiple dimensions of a cube and create a new dimension for the cube. The idea, which is similar to the application of a "function" on a spreadsheet, is often referred to as "rolling-up" a data cube.
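
The drilling example can be written out as a short Python sketch. It uses the three 2015 sub-scores for Emily quoted above; the hierarchy dictionary and the function names are assumptions introduced for illustration.

```python
from collections import defaultdict
from statistics import mean

# Sub-skill to skill hierarchy (only the Math branch is shown).
HIERARCHY = {"Calculus": "Math", "Algebra": "Math", "Topology": "Math"}

# Detailed cells keyed by (name, sub-subject, year), using the scores quoted in the text.
detail = {
    ("Emily", "Calculus", 2015): 1,
    ("Emily", "Algebra", 2015): 3,
    ("Emily", "Topology", 2015): 2,
}


def drill_up(detail, hierarchy):
    """Aggregate sub-scores to their parent subject by averaging."""
    groups = defaultdict(list)
    for (name, sub, year), score in detail.items():
        groups[(name, hierarchy[sub], year)].append(score)
    return {key: mean(scores) for key, scores in groups.items()}


def drill_down(detail, hierarchy, name, subject, year):
    """Bring back the sub-scores behind one aggregated cell."""
    return {
        sub: score
        for (n, sub, y), score in detail.items()
        if n == name and y == year and hierarchy[sub] == subject
    }


summary = drill_up(detail, HIERARCHY)
print(summary[("Emily", "Math", 2015)])
print(drill_down(detail, HIERARCHY, "Emily", "Math", 2015))
```

Replacing `mean` with another rule or equation gives the more general "rolling-up" behavior described above.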

#### Pivoting

Pivoting a data cube allows users to look at the cube from different perspectives. **Figure 9** depicts an example of pivoting the data cube from showing the Name vs. Subject front view in the first part of **Figure 9** to a Year vs. Subject view in the third part of **Figure 9**, which shows not just Emily's 2015 scores but also scores from 2016 and 2017. The 3D data cube is indeed rotated backward along the Subject dimension from the middle image to the last image in **Figure 9**.
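
Pivoting can be sketched as choosing which two dimensions serve as rows and columns while the remaining one is held fixed. The Python snippet below assumes a dictionary-based cube with placeholder scores; the `view` helper is hypothetical.

```python
AXES = ("name", "subject", "year")
NAMES = ["Noah", "Chloe", "Ada", "Jacob", "Emily"]
SUBJECTS = ["Science", "Math", "Technology"]
YEARS = [2015, 2016, 2017]

# Placeholder scores between 1 and 5; not the values in the figures.
cube = {
    (n, s, y): (i + j + k) % 5 + 1
    for i, n in enumerate(NAMES)
    for j, s in enumerate(SUBJECTS)
    for k, y in enumerate(YEARS)
}


def view(cube, rows, cols, **fixed):
    """Return a table {row: {col: score}} for the cells matching the fixed values."""
    r, c = AXES.index(rows), AXES.index(cols)
    table = {}
    for key, score in cube.items():
        if all(key[AXES.index(axis)] == value for axis, value in fixed.items()):
            table.setdefault(key[r], {})[key[c]] = score
    return table


front = view(cube, "name", "subject", year=2015)       # Name vs. Subject
pivoted = view(cube, "year", "subject", name="Emily")  # Year vs. Subject

print(front["Emily"])
print(pivoted)
```

Both tables are read from the same cells; only the perspective changes, so Emily's 2015 row in the first view reappears as the 2015 row of the second.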

### Beyond Data Cubes

Data cube applications, such as OLAP, take advantage of pre-aggregated data along dimension-levels and provide efficient database querying using languages such as MDX (2016). The more pre-aggregations done on the disk, the better the performance for users. However, all operations are conducted at the disk level, which is slow and thus leads to CPU load and latency issues. As the production cost of computer memory continues to go down and its computational performance continues to go up simultaneously, it has become evident that it is more practical to query data in memory instead of pre-aggregating data on the disk as OLAP data cubes.

#### In-memory Computation

Today, researchers use computer clusters with as much as 1 TB of memory (or more) per computer node for high-dimensional, in-memory database queries in interactive response time. For example, T-Rex (Wong et al., 2015) is able to query billions of data records in interactive response time using a Resource Description Framework<sup>1</sup> (RDF, 2014) database and the SPARQL (2008) query language running on a Linux cluster with 32 nodes of Intel Xeon processors and ∼24.5 TB of memory installed across the 32 nodes. Because such a large amount of information can be queried from a database in interactive time, the role of data warehouses continues to diminish in the big data era as cloud computing becomes the norm.

### The Traditional Data Cubes Concept

Additionally, in-memory database technology allows researchers to develop newer interactive visualization tools to query a higher number of data dimensions interactively, which allows users to look at their data simultaneously from different perspectives. For example, T-Rex's "data facets" design, as shown in **Figure 10A**, shows seven data dimensions of a cybersecurity benchmark dataset available in the public domain. After the IP address 172.10.0.6 (in the SIP column) in **Figure 10A** is selected, the data facets update the other six columns simultaneously, as shown in **Figure 10B**. The query effort continues in **Figure 10B**, where the IP address 172.10.1.102 is queried in the DIP column. **Figure 10C** shows the results after two consecutive queries, shown in green in the figure.

<sup>1</sup>https://en.wikipedia.org/wiki/Resource\_Description\_Framework

The spreadsheet-like visual layout in **Figure 10** performs more effectively than many traditional OLAP data interfaces found in business intelligence tools. Most importantly, the data facets design allows users to query data in interactive time without the need for pre-aggregating data with pre-defined options. A video (Pacific Northwest National Laboratory, 2014) shows how T-Rex operates using a number of benchmark datasets available in the public domain.

The general in-memory data cube technology has extensive commercial and public domain support and is here to stay until the next great technology comes along.

### DATA CUBE AS PART OF A DATA LAKE SOLUTION AND THE LEAP FOR PSYCHOMETRIC DATA

The proposed data cube concept could be embedded within the larger context of collecting/pooling psychometric data in something that is known in the industry as a data lake (Miloslavskaya and Tolstoy, 2016). An example of this is ACT's data lake solution known as the LEarning Analytics Platform (LEAP). A data lake is a storage solution based on an ability to host large amounts of unprocessed, raw data in the format the sender provides. This includes a range of data representations such as structured, semi-structured, and unstructured. Typically, in a data lake solution, the data structure, and the process for formally accessing it, are not defined until the point where access is required.

A data lake changes the typical process of extracting data, transforming it (to a format suitable for querying), and loading it into tables (ETL) into one favoring extract, load, and transform (ELT), prioritizing the need to capture raw, streaming data prior to prescribing any specific transformation of the data. Thus, data transformation for future use in an analytic procedure is delayed until the need for running this procedure arises. We now describe how the technologies of a data lake help to embed the data cube analysis functionality we described above.

An architecture for a data lake is typically based on a highly distributed, flexible, scalable storage solution like the Hadoop Distributed File System (HDFS). In a nutshell, an HDFS instance is similar to a typical distributed file system, although it provides higher data throughput and access through the use of an implementation of the MapReduce algorithm. MapReduce here refers to the Google algorithm defined in Dean and Ghemawat (2008). ACT's LEAP implementation of this HDFS architecture is based on the industry solution Hortonworks Data Platform (HDP), which is an easily accessed set of open source technologies. This stores and preserves data in any format given across a set of available servers as data streams (a flow of data) in stream event processors. These stream event processors use an easy-to-use library for building highly scalable, distributed analyses in real time, such as analyses of learning events or (serious) game play events.

Using map/reduce task elements, data scientists and researchers can efficiently handle large volumes of incoming, raw data files. In the MapReduce paradigm:

"Users define the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disk" (Dean and Ghemawat, 2008).

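The division of labor in the quotation can be illustrated with a single-process Python sketch that computes the proportion of correct responses per item. The record layout and function names are assumptions for illustration; a real MapReduce runtime would distribute the map, shuffle, and reduce steps across a cluster rather than run them in one loop.

```python
from collections import defaultdict

# Hypothetical raw response records: (student, item, correct).
records = [
    ("s1", "i1", 1),
    ("s2", "i1", 0),
    ("s1", "i2", 1),
    ("s2", "i2", 1),
]


def map_fn(record):
    """Map: emit (item, correct) for each response record."""
    _student, item, correct = record
    yield item, correct


def reduce_fn(item, values):
    """Reduce: proportion correct for one item."""
    return item, sum(values) / len(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for the runtime: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


p_values = run_mapreduce(records, map_fn, reduce_fn)
print(p_values)
```

Only `map_fn` and `reduce_fn` are specific to the analysis; the grouping step in between is what the runtime parallelizes on behalf of the user.
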
Scripts for slicing, dicing, drilling, and pivoting [see Section Online Analytical Processing (OLAP) and Business Intelligence] in a data cube fashion can be written, executed, and shared via notebook-style interfaces, such as those implemented by the open source solutions Apache Zeppelin and Jupyter. Zeppelin and Jupyter are web-based tools that allow users to create, edit, reuse, and run "data cube"-like analytics using a variety of languages (e.g., R, Python, and Scala). Such scripts can access data on an underlying data source such as HDFS. Organizing analytical code into "notebooks" means combining the descriptive narration of the executed analytical or research methodology with the code blocks and the results of running them. These scripts are sent to sets of computing machines (called clusters) that manage the process of executing the notebook in a scalable fashion. Data cube applications in the data lake solution typically run as independent sets of processes, coordinated by a main driver program.

### Data Standards for Exchange

While data lakes provide flexibility in storage and enable the creation of scalable data cube analysis, it is also typically a good idea for those operating in a data ecosystem to select a suitable data standard for exchange. This makes it easier for those creating, transmitting, and receiving the data to avoid the need to create translations of the data from one system to the next. Data exchange standards allow for the alignment of databases (across various systems) and therefore facilitate high connectivity of the data stored in the data cube. Specifically, the data exchange standards impose a data schema (names and descriptions of the variables, units, format, etc.) that allows data from multiple sources to be accessed in a similar way.

There are several data standards in the education space that address data exchange for different types of data, such as:

• Schools Interoperability Framework (SIF)<sup>2</sup>

• Ed-Fi Data Standard<sup>3</sup>

The Ed-Fi Data Standard was developed to address the need for standard integration and organization of data in education. This integration and organization of information ranges across a broad set of data sources so that it can be analyzed, filtered, and put to everyday use in various educational platforms and systems.

<sup>2</sup>https://en.wikipedia.org/wiki/Schools\_Interoperability\_Framework (Retrieved May 7, 2018).

<sup>3</sup>https://www.ed-fi.org/

• Common Education Data Standards (CEDS)<sup>4</sup>

CEDS provides a lens for considering and capturing the data standards' relations and applied use in products and services. The emphasis of CEDS is on data items and representations across pre-kindergarten, typical K-12 learning, learning beyond high school, jobs and technical education, ongoing adult-based education, and workforce areas.

• IMS Global<sup>5</sup>

	- IMS Caliper, which allows us to stream in assessment item responses and process data that indicate dichotomous outcomes, processes, as well as grade/scoring.
	- IMS Global Competencies and Academic Standards Exchange (CASE), which allows us to import and export machine-readable, hierarchical expressions of standards knowledge, skills, abilities and other characteristics (KSAOs). A notable example can be found in Rayon et al. (2014).

• xAPI (Experience API)<sup>6</sup>

xAPI is a specification for education technology that enables collection of data on the wide range of experiences a person has (both online and offline). xAPI records data in a consistent format about an individual or a group of individual learners interacting with multiple technologies. The vocabulary of the xAPI is simple by design, and the rigor of the systems that are able to securely share data streams is high. On top of regulating data exchange, there exists a body of work toward using xAPI for aligning the isomorphic user data from multiple platforms (see Bakharia et al., 2016, who discuss an example of aligning activity across multiple social networking platforms and provide concrete code and data snippets).

• OpenSALT<sup>7</sup>

We have built and released a tool called OpenSALT which is an Open-source Standards ALignment Tool that can be used to inspect, ingest, edit, export and build crosswalks of standards expressed using the IMS Global CASE format.

As we outlined in the data cube overview, we are interested in fusing several main data perspectives:

	- Examples of complex outcomes may include: partial credit results, media interaction results (play), engagement results, and process data (e.g., time spent browsing), tutored interaction, synergetic activities (e.g., interactive labs).

	- Item contextualization that addresses multiple hypotheses of how the conceptualization is structured. Multiple hypotheses include accounts for human vs. machine indexing and alternative conceptualizations in the process for development.

The selection of which standards to use to accelerate or enhance the construction of data cubes (within data lakes) for large-scale psychometric data depends on the nature of the educational data for the application. For example, CASE is an emerging standard for injecting knowledge about academic competencies, whereas something like xAPI is used to inject the direct feed of learner assessment results (potentially aligned to those CASE-based standards) in a standards-based way into a data cube.
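
As a rough sketch of this flow, the Python snippet below takes one record shaped like an xAPI statement (actor, verb, object, result, timestamp) and turns it into a cell of a learner by competency by year cube. All identifiers, the example.org addresses, and the item-to-competency lookup (standing in for an alignment to CASE-based standards) are invented for illustration, and the statement is deliberately minimal rather than a complete, validated xAPI record.

```python
from datetime import datetime

# A minimal record in the actor-verb-object shape of an xAPI statement.
# The learner and item identifiers are hypothetical.
statement = {
    "actor": {"mbox": "mailto:emily@example.org"},
    "verb": {"id": "http://adlnet.gov/expapi/verbs/answered"},
    "object": {"id": "https://example.org/items/alg-017"},
    "result": {"success": True},
    "timestamp": "2017-03-04T10:15:00+00:00",
}

# Hypothetical lookup from item id to a competency identifier,
# standing in for an alignment to CASE-based standards.
ITEM_TO_COMPETENCY = {
    "https://example.org/items/alg-017": "competency:algebra",
}


def to_cube_cell(statement, item_to_competency):
    """Map one statement to ((learner, competency, year), score)."""
    year = datetime.fromisoformat(statement["timestamp"]).year
    learner = statement["actor"]["mbox"]
    competency = item_to_competency[statement["object"]["id"]]
    score = int(statement["result"]["success"])
    return (learner, competency, year), score


key, score = to_cube_cell(statement, ITEM_TO_COMPETENCY)
print(key, score)
```

Because every incoming record follows the same schema, the same small function can be applied to the whole stream, which is the practical benefit of committing to an exchange standard.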

By committing to these data standards, we can leverage the unique capability of the data lake (i.e., efficiently ingesting high volumes of raw data relating to item responses and item metadata) while also prescribing structured commitments to incoming data so that we can build robust, reliable processing scripts. The data cube concept then acts as a high-powered toolset that can take this processed data and enable the online analytical operations such as slicing, dicing, drilling, and pivoting. Moreover, the availability of the data cube and alignment of databases will influence the standards that will need to be available for a smooth integration. It is also possible that new standards will be developed.

### EXAMPLE OF APPLICATIONS OF THE DATA CUBE CONCEPT

#### Alignment of Instruments

One of the key elements of an assessment or learning system is the contextualization of the items and learning activities in terms of descriptive keywords that tie them to the subject. The keywords are often referred to as attributes in the Q-matrices (in psychometrics; see Tatsuoka, 1985), or as skills, concepts, or tags (in the learning sciences). We will use "concepts" as an overarching term for simplicity. Besides the items that psychometrics focuses on, the field of learning sciences has a suite of monikers for elements that cater to learning. The latter include: readings, tutorials, interactive visualizations, and tutored problems (both single-loop and stepped). To cover all classes of deliverable learning and assessment items we will use the term "content-based resources" or "resources" for short.

<sup>4</sup>https://en.wikipedia.org/wiki/Common\_Education\_Data\_Standards

<sup>5</sup>https://www.imsglobal.org/aboutims.html

<sup>6</sup>https://xapi.com/overview/

<sup>7</sup>http://opensalt.opened.com/about

The relationships between concepts and resources are often referred to as indexing. Producing such an index is time-consuming and laborious, and it requires trained subject matter experts. The intensive labor required to index a set of items can be reduced via machine learning/NLP techniques applied over a large corpus of items/resources. This large-scale application was not possible before we had present-day storage solutions and sophisticated NLP algorithms. There are multiple approaches to lowering the cost of producing indices that contextualize assessment items and learning resources. These approaches can come in the form of a machine learning procedure that, given training data from exemplary human indexing, performs automated indexing of resources.

Data cubes can offer affordances to support the process of production and management of concept-content/resource/item indices. First, even within one subject, such as Math or Science, there could be alternative taxonomies or ontologies that could be used to contextualize resources. See **Figures 7**, **8** for illustrations. Alternatives could come from multiple agencies that develop educational or assessment content or could rely upon an iterative process within one team.

Second, the case when multiple concept taxonomies are used to describe multiple non-overlapping pools of items or resources reserves room for a class of machine learning indexing procedures that could be described as taxonomy alignment procedures. These procedures are tasked with translating between the languages of multiple taxonomies to achieve a ubiquitous indexing of resources.

Third, all classes of machine learning procedures rely upon multiple features within a data cube. The definition and composition of these features is initially developed by subject matter experts. For example, the text that describes the item or resource, its content, or its rationale could be parsed into a high-dimensional linguistic space. Under these circumstances, a deck of binary classifiers (one per concept), or a multi-label classifier could be devised to produce the indexing.

Also, when we are talking about translation from one concept taxonomy to another, one could treat existing expert-produced double-coding of a pool of resources, in terms of the two taxonomies being translated, as a training set. A machine learning procedure would then learn the correspondence relationships, often in the form of n-to-m mappings, where one item/resource is assigned n concepts from one taxonomy and m from the other.
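
The idea of learning an n-to-m correspondence from double-coded resources can be illustrated with a deliberately simple Python sketch. It estimates, for each source concept, how often each target concept co-occurs with it, and translates by thresholding those proportions. This co-occurrence rule is a stand-in chosen for brevity, not the regression-based procedure reported below, and the concept labels and the threshold are invented.

```python
from collections import Counter, defaultdict

# Hypothetical double-coded resources: (source concepts, target concepts).
double_coded = [
    ({"S:linear-eq"}, {"T:equations"}),
    ({"S:linear-eq", "S:graphs"}, {"T:equations", "T:functions"}),
    ({"S:graphs"}, {"T:functions"}),
    ({"S:circles"}, {"T:geometry"}),
]


def learn_mapping(pairs):
    """Estimate the share of resources tagged s that are also tagged t."""
    cooccurrence = defaultdict(Counter)
    seen = Counter()
    for source, target in pairs:
        for s in source:
            seen[s] += 1
            for t in target:
                cooccurrence[s][t] += 1
    return {
        s: {t: n / seen[s] for t, n in counts.items()}
        for s, counts in cooccurrence.items()
    }


def translate(source_concepts, mapping, threshold=0.75):
    """Predict target concepts whose estimated share meets the threshold."""
    return {
        t
        for s in source_concepts
        for t, share in mapping.get(s, {}).items()
        if share >= threshold
    }


mapping = learn_mapping(double_coded)
print(translate({"S:linear-eq", "S:graphs"}, mapping))
```

A resource tagged with two source concepts is translated into two target concepts here, which is a small instance of the n-to-m case.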

One of our first attempts at translating between two alternative concept taxonomies, the ACT Subject Taxonomy and the ACT Holistic Framework, has yielded only modest results. We had only 845 items indexed in both taxonomies and 2,388 items that had only ACT Subject Taxonomy indexing. The active sets of concepts present in the combined set of 3,233 items comprised 435 concepts for the Subject Taxonomy and 455 for the Holistic Framework. A machine learning procedure based on a deck of multinomial regressions (one for each of the 455 predicted Holistic Framework concepts) yielded a 51% adjusted accuracy. The adjustment is needed because the index can be sparse, owing to the large size of the concept taxonomy and the low density of items per concept. The classic machine learning definition of accuracy (matched classifications over total cases classified) would therefore be inflated by the overwhelming number of cases in which the absence of a concept is easily confirmed; we consistently obtained classical accuracies at the 99% level. Adjusted accuracy addresses this phenomenon by limiting the denominator to the union of concepts that were present in the human coder-supplied ground-truth training data or in the prediction (the latter came in the form of pairings of source and target taxonomy concepts; see **Figure 11** for an example). Thus, our work so far and the 51% adjusted accuracy should be understood as a first step toward automating taxonomy alignment. We learned that it is significantly harder to align test items than to align instructional resources, because test items do not usually contain the words that describe the concepts, whereas instructional resources have richer descriptions. This motivated us to include additional data about the test items and the test takers, to increase the size of the training samples, and to refine the models. This is work in progress.
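The difference between classical and adjusted accuracy can be sketched for a single item as follows. The sketch assumes one reading of the definition above: a match is a concept present in both the ground truth and the prediction, and the denominator is the union of the two sets. The function names and concept labels are hypothetical, and pooling or averaging over items is left open.

```python
def classical_accuracy(truth, predicted, all_concepts):
    """Share of concepts whose presence or absence is classified correctly."""
    matches = sum((c in truth) == (c in predicted) for c in all_concepts)
    return matches / len(all_concepts)


def adjusted_accuracy(truth, predicted):
    """Matches counted only over concepts present in truth or prediction."""
    union = truth | predicted
    if not union:
        # Nothing indexed and nothing predicted: treated as a match (assumption).
        return 1.0
    return len(truth & predicted) / len(union)


# 455 target concepts, as in the Holistic Framework example; labels are invented.
all_concepts = {f"c{k}" for k in range(455)}
truth = {"c1", "c2", "c3"}
predicted = {"c2", "c3", "c4"}

print(classical_accuracy(truth, predicted, all_concepts))
print(adjusted_accuracy(truth, predicted))
```

Here only two of 455 presence/absence decisions are wrong, so classical accuracy is 453/455, whereas adjusted accuracy is 2/4 because two of the four concepts in the union are matched.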

### Diagnostic Models

In addition to the alignment of content, which is a relatively new application in education, the data cube can support psychometric models that use data from multiple testing administrations and multiple testing instruments. For example, one could develop cognitive diagnostic models (CDMs) that use the data from multiple tests taken by the same individual. CDMs are multivariate latent variable models developed primarily to identify the mastery of skills measured in a particular domain. CDMs provide fine-grained inferences about students' mastery, and these inferences are relevant to the student learning process.

Basically, a CDM in a data cube relates the response vector $\mathbf{X}_i = (X_{i11}, \ldots, X_{ijt}, \ldots, X_{iJT})$, where $X_{ijt}$ represents the response of the $i$th individual to the $j$th item from the testing instrument $t$, using a lower dimensional discrete latent variable $\mathbf{A}_i = (A_{i1}, \ldots, A_{ik}, \ldots, A_{iK})$, where $A_{ik}$ is a discrete latent variable for individual $i$ for latent dimension $k$ as described by the taxonomy or the Q-matrix. CDMs model the conditional probability of observing $\mathbf{X}_i$ given $\mathbf{A}_i$, that is, $P(\mathbf{X}_i \mid \mathbf{A}_i)$. The specific form of the CDM depends on the assumptions we make regarding how the elements of $\mathbf{A}_i$ interact to produce the probabilities of response $X_{ijt}$.
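To make the conditional probability concrete, the sketch below uses the DINA model, one well-known CDM in which an item is answered correctly with probability one minus a slip parameter when all skills required by the Q-matrix are mastered, and with a guessing probability otherwise. DINA is chosen here only as an example of a "specific form"; the text does not say which CDM the authors have in mind. Items from all instruments are assumed to be flattened into one list, responses are assumed to be conditionally independent given the skill profile, and all parameter values are invented.

```python
from math import prod


def dina_prob_correct(alpha, q_row, slip, guess):
    """P(X = 1 | alpha) for one item under the DINA model."""
    # eta is 1 only if every skill the item requires (q = 1) is mastered
    # (alpha = 1); note that 0 ** 0 == 1 in Python, so unrequired skills
    # never block a correct response.
    eta = prod(a ** q for a, q in zip(alpha, q_row))
    return (1 - slip) if eta == 1 else guess


def response_vector_prob(x, alpha, q_matrix, slips, guesses):
    """P(X_i | A_i) as a product over items (local independence assumed)."""
    p = 1.0
    for x_j, q_row, s, g in zip(x, q_matrix, slips, guesses):
        p_j = dina_prob_correct(alpha, q_row, s, g)
        p *= p_j if x_j == 1 else 1 - p_j
    return p


# Hypothetical two-item, two-skill example.
q_matrix = [[1, 0],   # item 1 requires skill 1 only
            [1, 1]]   # item 2 requires both skills
slips = [0.1, 0.2]
guesses = [0.2, 0.1]
alpha = (1, 0)        # skill 1 mastered, skill 2 not mastered
x = (1, 0)            # item 1 correct, item 2 incorrect

print(response_vector_prob(x, alpha, q_matrix, slips, guesses))
```

For this profile the first item is answered correctly with probability 0.9 and the second incorrectly with probability 0.9, so the response vector has a probability of about 0.81.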

Traditional data governance in testing organizations cannot easily support the application of CDMs over many testing administrations and testing instruments: usually the data from each testing instrument are saved in a separate database, which is often not aligned with the data from other instruments. In addition, under traditional data governance, the taxonomies (and the Q-matrices) across testing instruments are not part of the same framework and are not aligned.

### Learning Analytics and Navigation

FIGURE 11 | Examples of question items manually tagged with holistic framework and subject taxonomy.

Another example of the usefulness of a data cube is to provide learning analytics based on the data available about each student. As before, in a data cube, we start with the response vector $\mathbf{X}_i = (X_{i11}, \ldots, X_{ijt}, \ldots, X_{iJT})$, where $X_{ijt}$ represents the response of the $i$th individual to the $j$th item from the testing instrument $t$. Then, let's assume that we also have ancillary data about the student (demographic data, school data, attendance data, etc.) collected in the vector (or matrix) $\mathbf{B}_i = (B_{i1}, \ldots, B_{im}, \ldots, B_{iM})$, where $B_{im}$ represents a specific type of ancillary variable (gender, school type, attendance data, etc.). Let's assume that for some students we also have data about their success in college, collected under $\mathbf{C}$. These data, $\mathbf{X}$, $\mathbf{B}$, and $\mathbf{C}$, can now be combined across students, first to classify all the students and later to predict each student's success in the first year of college using only $\mathbf{X}_i$ and $\mathbf{B}_i$. Most importantly, these analytics can be used as the basis for learning pathways for different learning goals and different students, to support navigation through their educational and career journeys.
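The prediction step described above can be sketched as follows: response data and ancillary data are concatenated per student, a model is fitted on the students for whom a college outcome is known, and the model is then applied to a student without an outcome. Logistic regression fitted by plain gradient descent is an assumption made for illustration, since the text does not name a model, and all student identifiers, variables, and values are invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Logistic regression by batch gradient descent; returns (weights, bias)."""
    n_features = len(rows[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for k, xk in enumerate(x):
                grad_w[k] += err * xk
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b


def predict_proba(x, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical data: X holds scored responses, B one ancillary variable
# (attendance rate), C a first-year success flag known for some students only.
X = {"s1": [1, 1, 1, 0], "s2": [1, 1, 0, 1], "s3": [0, 0, 1, 0],
     "s4": [0, 0, 0, 0], "s5": [1, 1, 1, 1]}
B = {"s1": [0.95], "s2": [0.90], "s3": [0.60], "s4": [0.50], "s5": [0.92]}
C = {"s1": 1, "s2": 1, "s3": 0, "s4": 0}

features = {s: X[s] + B[s] for s in X}          # one row per student
train = [s for s in features if s in C]         # students with known outcome
w, b = fit_logistic([features[s] for s in train], [C[s] for s in train])

p_s5 = predict_proba(features["s5"], w, b)      # student without an outcome
print(round(p_s5, 3))
```

In a data cube the join on student identifier is already part of the data structure, which is what makes the `features` line a one-step operation rather than a cross-database matching exercise.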

### Learning, Measurement, and Navigation Systems

The ACTNext prototype app, Educational Companion, illustrates an applied instance of linking learning, assessment, and navigation data streams using the data governance described above as the data cube. The app was designed as a mobile solution for flexibly handling the alignment of learner data and content (assessment and instructional) with knowledge and skill taxonomies, while also providing learning analytics feedback and personalized resource recommendations based on the mastery theory of learning to support progress in areas identified as needing intervention. Educational Companion evaluates learning progress by continuously monitoring measurement data drawn from learner interactions across multiple sources, including ACT's portfolio of learning and assessment products. Using test scores from ACT's college readiness exam as a starting point, Companion identifies the underlying relationships between a learner's measurement data and skill taxonomies across core academic areas identified in ACT's Holistic Framework (HF). If available, additional academic assessment data is drawn from a workforce skills assessment (ACT WorkKeys), as well as Socio-Emotional Learning (SEL) data taken from ACT's Tessera exam. Bringing these data streams together, the app predicts skill and knowledge mastery at multiple levels in a taxonomy, such as the HF.

See **Figure 12** for an illustration of the architecture for the Educational Companion App. More details about this prototype are given in von Davier et al. (2019).

As explained in section Alignment of Instruments above, by aligning instructional resources and taxonomic structures using ML and NLP methods, and in conjunction with continuously monitoring updates to a learner's assessment data, Companion uses its knowledge of the learner's predicted abilities, along with an understanding of the hierarchical parent/child relationships within the content structure, to produce personalized lists of content and drive the learner's activities forward. Over time, as learners continue to engage with the app, Companion refines, updates, and adapts its recommendations and predictive analytics to best support an individual learner's needs. The Companion app also incorporates navigational tools developed by Mattern et al. (2017), which provide learners with insights related to career interests, as well as the relationships between their personal data (assessment results, GPA, etc.) and longitudinal data related to areas of study in college and higher education outcome studies. The Companion app was piloted with a group of Grade 11 and 12 high school students in 2017 (unpublished report, Polyak et al., 2018).

Following the pilot, components from the Educational Companion App were redeployed as capabilities that could extend this methodology to other learning and assessment systems. The ACTNext Recommendation and Diagnostics (RAD) API was released and integrated into ACT's free, online test preparation platform ACT Academy, offering the same mastery theory of learning and free agency via evidence-based diagnostics and personalized recommendations of resources.

### CONCLUSION

In this paper we discussed and proposed a new way to structure large-scale psychometric data at testing organizations based on concepts and tools that exist in other fields, such as marketing and learning analytics. The simplest concept is matching the data across individuals, constructs, and testing instruments in a data cube. We outlined and described the data structure for taxonomies, item metadata, and item responses in this matched multidimensional matrix that will allow for rapid and in-depth visualization and analysis. This new structure will allow real-time, big data analyses, including machine-learning-based alignment of testing instruments, real-time updates of cognitive diagnostic models during the learning process, and real-time feedback and routing to appropriate resources for learners and test takers. The data cube is almost like a Rubik's Cube, where one is trying to find the ideal or typical combination of data. There could be clear purposes for that search, for instance creating recommended pathways or recognizing typical patterns for students for specific goals.

In many ways, the large testing companies are well-positioned to create flexible and well-aligned data cubes as described previously. Specifically, the testing data are valid (the test scores measure what they are supposed to measure, and these validity indices are known), and data privacy policies were followed appropriately when the data were collected. These are two important features that support quality data and the statistical alignment of separate databases. Nevertheless, this new type of data governance has posed challenges for testing organizations. Part of the problem seems to be that the psychometric community has not yet embraced data governance as part of the psychometrician's duties. The role of this paper is to bring these issues to the attention of psychometricians and to underscore the importance of expanding the psychometric toolbox to include elements of data science and data governance.

More research and work is needed to refine and improve AI-based methodologies, but without flexible data alignment, the AI-based methods are not possible at all.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### REFERENCES


#### ACKNOWLEDGMENTS

The authors thank Andrew Cantine for his help editing the paper. The authors thank Drs. John Whitmer and Maria Bolsinova for their feedback on the previous version of the paper. The authors thank the reviewers for their feedback and suggestions.


**Conflict of Interest Statement:** AvD, SP, and MY are employed by ACT Inc. PW was employed by ACT Inc. at the time this work was conducted.

Copyright © 2019 von Davier, Wong, Polyak and Yudelson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Onset of Rapid-Guessing Behavior Over the Course of Testing Time: A Matter of Motivation and Cognitive Resources

#### Marlit Annalena Lindner<sup>1</sup>\*, Oliver Lüdtke<sup>1,2</sup> and Gabriel Nagy<sup>1</sup>

<sup>1</sup> IPN - Leibniz Institute for Science and Mathematics Education, Kiel, Germany, <sup>2</sup> Centre for International Student Assessment, Munich, Germany

#### Edited by:

Ronny Scherer, University of Oslo, Norway

#### Reviewed by:

Evangelia Karagiannopoulou, University of Ioannina, Greece

Grzegorz Szumski, University of Warsaw, Poland

#### \*Correspondence:

Marlit Annalena Lindner, mlindner@leibniz-ipn.de

#### Specialty section:

This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology

Received: 23 November 2018 Accepted: 18 June 2019 Published: 23 July 2019

#### Citation:

Lindner MA, Lüdtke O and Nagy G (2019) The Onset of Rapid-Guessing Behavior Over the Course of Testing Time: A Matter of Motivation and Cognitive Resources. Front. Psychol. 10:1533. doi: 10.3389/fpsyg.2019.01533

Digital tests make it possible to identify student effort by means of response times, specifically, unrealistically fast responses that are defined as rapid-guessing behavior (RGB). In this study, we used latent class and growth curve models to examine (1) how student characteristics (i.e., gender, school type, general cognitive abilities, and working-memory capacity) are related to the onset point of RGB and its development over the course of a test session (i.e., item positions). Further, we examined (2) the extent to which repeated ratings of task enjoyment (i.e., intercept and slope parameters) are related to the onset and the development of RGB over the course of the test. For this purpose, we analyzed data from N = 401 students from fifth and sixth grades in Germany (n = 247 academic track; n = 154 non-academic track). All participants solved 36 science items under low-stakes conditions and rated their current task enjoyment after each science item, constituting a micro-longitudinal design that allowed students' motivational state to be tracked over the entire test session. In addition, they worked on tests that assessed their general cognitive abilities and working-memory capacity. The results show that students' gender was not significantly related to RGB but that students' school type (which is known to be closely related to academic abilities in the German school system), general cognitive abilities, and their working-memory capacity were significant predictors of an early RGB onset and a stronger RGB increase across testing time. Students' initial rating of task enjoyment was associated with RGB, but only a decline in students' task enjoyment was predictive of earlier RGB onset. Overall, non-academic-school attendance was the most powerful predictor of RGB, together with students' working-memory capacity. The present findings add to the concern that there is an unfortunate relation between students' test-effort investment and their academic and general cognitive abilities. This challenges basic assumptions about motivation-filtering procedures and may threaten a valid interpretation of results from large-scale testing programs that rely on school-type comparisons.

Keywords: rapid-guessing behavior, motivation, test-taking effort, item position effect, low-stakes assessment, large-scale assessment (LSA), latent class analysis

## INTRODUCTION

Computer-based assessments are being implemented more and more in educational institutions and large-scale testing programs. This digitalization of tests makes response-time measures (i.e., time on task; e.g., Goldhammer et al., 2014) and log files (e.g., Greiff et al., 2015) easily available. This opens new paths to more objective and also deeper insights into students' test-taking behavior (e.g., Wise and Kong, 2005; Goldhammer et al., 2014; Finn, 2015), for example, by detecting rapid-guessing behavior (RGB). The term RGB basically means that a test-taker provides a response to an item in just a few seconds after the item has been presented. Given that it is highly implausible that students truthfully work on a given task in such a short time frame, RGB is interpreted as a reflection of non-effort (Wise and Kong, 2005; Goldhammer et al., 2016; Wise, 2017). Even though RGB has recently been subject to valuable investigations that shed more light on the nature of this undesirable test-taking behavior, the psychological determinants that are related to RGB in low-stakes assessment have not yet been sufficiently examined.

The present study takes a closer look at the correlates of RGB, placing a specific focus on students' individual probability of showing an early RGB onset over the course of testing time. Specifically, we aimed to investigate the role of two main explanatory psychological characteristics at a student level that are considered to be related to low test-taking effort, namely, a lack of motivational and cognitive resources.

#### Motivation and Test-Taking Behavior

Educational assessment is essential for the evaluation of learning outcomes and the determination of the proficiency levels of test takers in diverse contexts. Unfortunately, test takers are not always fully motivated to engage in solving test items, especially in low-stakes settings (e.g., Wise and DeMars, 2005, 2010; Wise, 2006; Finn, 2015). Low-stakes means that the test scores have no formal consequences at a student level (e.g., grades, graduation), although aggregated test scores often have major consequences at an institutional or governmental level (e.g., program funding, educational reforms). A high level of effort invested by students when working on a test is considered a prerequisite for a reliable and valid interpretation of achievement levels (Cronbach, 1960; Messick, 1989; Baumert and Demmrich, 2001; Goldhammer et al., 2016). If the problem of low test-taking effort is not treated, for example by statistical correction procedures, students' proficiency may be underestimated, which may lead—in turn—to biased conclusions (see e.g., Wise and DeMars, 2005; Wise et al., 2006b; Nagy et al., 2018b).

Low test-taking motivation in low-stakes assessments is often explained by Expectancy-Value Models (e.g., Eccles et al., 1983; Wigfield and Eccles, 2000; Eccles and Wigfield, 2002), which assume that achievement motivation for a given task (e.g., taking a test) is a function of (1) expectancy (i.e., students' expectation of success in solving the test items) and (2) value (i.e., the perceived importance and usefulness of the test). The expectancy component is determined by both students' abilities and task demands and is, for example, low when test items are too difficult for a student. The value component is considered to be more complex: Eccles and Wigfield (2002) distinguish between four value components, namely, (a) attainment value (e.g., task importance), (b) intrinsic value (e.g., task enjoyment), (c) utility value (e.g., relevance for future goals), and (d) perceived costs (e.g., effort). It can be assumed that all four of these value aspects and, thus, also the overall value component are rather low in low-stakes assessments. This is because, at least for some test takers, the lack of personal consequences and a lack of intrinsic value in taking the test may be in conflict with the effort that is required to successfully solve the items. This is especially true for students with lower competence levels (i.e., low expectancy) who need to invest more effort to successfully work on a test. Accordingly, based on expectancy-value models, achievement motivation can be expected to be lower in low-performing students than in high-performing students.

Lower levels of student motivation become a serious problem when they result in low test effort, which can be defined as a lack of mental work that is put into responding to test items (Wise and DeMars, 2005, 2010; Finn, 2015). Analyzing data sets that include such invalid responses threatens the interpretation of the test scores obtained because construct-irrelevant variance is introduced (Haladyna and Downing, 2004; Nagy et al., 2018a) and psychometric properties are deformed (see e.g., Rios et al., 2017). This issue is often addressed by motivation-filtering procedures (see e.g., Finn, 2015, for a review). As one option, filtering can be based upon self-report questionnaires that aim to assess students' global test-taking motivation (e.g., Student Opinion Scale; Thelk et al., 2009). Such measures are convenient in any type of assessment (including paper-pencil tests), but self-reports are more vulnerable to measurement errors and social desirability (Swerdzewski et al., 2011). As a second option, measuring response times in computer-based assessments provides unobtrusive, more objective insights into students' actual test-taking behavior (e.g., Wise and Kong, 2005; Greiff et al., 2015), and this measure does not disturb or influence students while they take the test. Typical sources of measurement error can thus be minimized when referring to students' response behavior as an indicator of effort (or non-effort).

#### Identifying Rapid-Guessing Behavior

The identification of RGB has proven useful for detecting test takers who do not exert their maximum effort in a test (e.g., Wise, 2006, 2017; Wise et al., 2006b; Finn, 2015). RGB is operationalized by unrealistically low response times that would not even allow the item content to be read and understood and especially would not allow an effortful response; any trial that is not identified as RGB is considered solution behavior, resulting in a dichotomous measure of RGB. However, it is noteworthy that responses that are categorized as solution behavior do not necessarily reflect effortful item solving (for a discussion see e.g., Finn, 2015; Wise, 2017). The main advantage of identifying RGB is that it can be measured for each student and each item. This means that all single trials (i.e., person × item interaction) can be classified as either RGB or solution behavior (see e.g., Wise and Kong, 2005), which makes it possible, for example, to trace the development of non-effort over the course of the test.

However, a reasonable response time threshold needs to be determined to separate (non-effortful) RGB responses from (probably effortful) solution behavior. In doing so, false-positive and false-negative classifications need to be avoided. Various approaches have been discussed (e.g., Wise and Kong, 2005; Wise, 2006; Kong et al., 2007; Wise and Ma, 2012; Lee and Jia, 2014; Finn, 2015; Goldhammer et al., 2016). Defining one constant threshold for every item (e.g., 3 s) is a basic option to determine RGB. However, item-specific, normative thresholds that vary as a function of the mean response time per item (i.e., a certain percentage of the item mean is used to separate RGB from solution behavior; see e.g., Wise and Ma, 2012; Lee and Jia, 2014) or item characteristics (Wise and Kong, 2005; Wise, 2006) often yield a more valid classification of RGB and solution behavior. This is because item attributes can substantially impact the meaning and interpretation of (short) response times. Nonetheless, the different approaches can be helpful in handling different types of data sets (see e.g., Wise, 2017). Thresholds further need to be cross-validated by a combination of different criteria for every test (see e.g., Goldhammer et al., 2016; Wise and Gao, 2017). For example, the accuracy of responses classified as RGB should equal the a priori guessing probability per item, thresholds should be validated by the visual inspection of response time distributions, and thresholds should not exceed 10 s. However, smaller threshold changes do not have a substantial impact on further analyses, suggesting that RGB can be classified with high reliability, more or less independently of the specific method applied (Kong et al., 2007).
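The two threshold options described above can be sketched in a few lines. The constant 3 s threshold and the 10 s upper bound come from the text; the 10% share of the item mean is an assumed value (the text only says "a certain percentage"), and the function names and response times are invented for illustration.

```python
from statistics import mean


def constant_threshold(response_times_by_item, seconds=3.0):
    """Same threshold (in seconds) for every item."""
    return {item: seconds for item in response_times_by_item}


def normative_threshold(response_times_by_item, share=0.10, cap=10.0):
    """Item-specific threshold: a share of the item's mean response time,
    capped so that it never exceeds `cap` seconds."""
    return {item: min(share * mean(times), cap)
            for item, times in response_times_by_item.items()}


def is_rgb(response_time, threshold):
    """Dichotomous classification: True = RGB, False = solution behavior."""
    return response_time < threshold


# Hypothetical response times in seconds, three test takers per item.
times = {"item1": [2.0, 40.0, 48.0],     # mean 30 s
         "item2": [5.0, 150.0, 205.0]}   # mean 120 s

constant = constant_threshold(times)
normative = normative_threshold(times)
print(is_rgb(5.0, constant["item2"]), is_rgb(5.0, normative["item2"]))
```

The 5 s response to the long second item is counted as solution behavior under the constant threshold but as RGB under the normative one, which illustrates why item attributes matter for interpreting short response times.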

In conclusion, from a pragmatic perspective, RGB can serve as a useful indicator of test-takers' non-effort in motivation-filtering procedures. However, it is also important to gain a better understanding of RGB at a theoretical level and from a psychological point of view.

#### Theories and Correlates of Rapid-Guessing Behavior

Expectancy-value models help to predict achievement motivation in low-stakes tests. Related assumptions that are more specific to the assessment context and the explanation of RGB have been proposed by Wise and Smith (2011) in the Demands-Capacity Model (DCM; see also Wise, 2017). The core of the DCM is the assumption that the tendency of a test taker to engage in RGB is a function of the current fit of (1) the resource demands of the presented items, and (2) the effort capacity of the student. Resource demands are defined as aspects of an item that determine how difficult or mentally taxing it is, such as higher reading demands or complex information. On the other side, test-takers are assumed to have a certain effort capacity that they can invest in solving an item at a specific moment. The DCM is still vague regarding the factors that determine the current status of effort capacity, as the authors propose that many factors have an influence, namely, "test stakes, time pressure, fatigue from answering earlier items, how interesting earlier items were, or a desire to please teachers or parents" (Wise, 2017, p. 53). The DCM further assumes that students compare the current item demands with their current effort capacity. They decide to engage in solution behavior for a given item when their effort capacity is sufficient or, otherwise, to engage in RGB. This explains why test-takers change their response pattern in reaction to different items, as both item demands and effort capacity can easily fluctuate across a test session. Even though RGB is commonly understood as an indicator of a lack of motivation (see e.g., Finn, 2015), building on the DCM, we assume that students might also refuse to work on an item when they lack basic cognitive resources (i.e., as a facet of a lower effort capacity).

Evidence supporting the DCM comes from studies that have investigated correlates of RGB. There are two typical levels of aggregation: the person and the item level. Regarding the student level, the measure of response time effort (RTE<sup>1</sup> ), as introduced by Wise and Kong (2005), is determined as the proportion of solution behavior related to all presented items in a test and provides information concerning the overall level of invested effort per student. The correlations of RTE and person characteristics can provide information concerning factors that go along with higher or lower levels of test-taking effort, respectively. The item-specific counterpart, introduced by Wise (2006), is response time fidelity (RTF). It represents the effort invested in a specific item across all test-takers, namely, the proportion of effortful responses to that item. Thus, RTF is a useful parameter to investigate correlates of effort based on item characteristics. It is also possible to model students' responses by more complex linear or generalized mixed-effects models (e.g., Wise et al., 2009) to jointly investigate student and item characteristics and their connections to RGB.
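Both indices are simple proportions over a person-by-item matrix of classified trials, as the following sketch shows. It assumes that every trial has already been classified as solution behavior (`True`) or RGB (`False`); the function names and the data are hypothetical.

```python
def response_time_effort(solution_flags):
    """RTE per student: share of presented items with solution behavior."""
    return {student: sum(flags.values()) / len(flags)
            for student, flags in solution_flags.items()}


def response_time_fidelity(solution_flags):
    """RTF per item: share of test-takers showing solution behavior on it."""
    per_item = {}
    for flags in solution_flags.values():
        for item, is_solution in flags.items():
            per_item.setdefault(item, []).append(is_solution)
    return {item: sum(values) / len(values) for item, values in per_item.items()}


# Hypothetical classified trials: True = solution behavior, False = RGB.
flags = {
    "s1": {"i1": True, "i2": True, "i3": True, "i4": True},
    "s2": {"i1": True, "i2": True, "i3": False, "i4": False},
    "s3": {"i1": True, "i2": False, "i3": False, "i4": False},
}

rte = response_time_effort(flags)
rtf = response_time_fidelity(flags)
print(rte)
print(rtf)
```

In this invented example RTE falls from 1.0 to 0.5 to 0.25 across the three students, and RTF falls from 1.0 for the first item to 1/3 for the last two, the kind of pattern expected when RGB becomes more frequent at later item positions.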

Building on RTE and RTF and using multilevel approaches, research has shown that higher RGB prevalence at a student level (i.e., RTE) is, for example, often associated with lower academic abilities (e.g., Wise et al., 2009; Lee and Jia, 2014; Goldhammer et al., 2016; Wise and Gao, 2017), male gender (e.g., DeMars et al., 2013; Goldhammer et al., 2016), personality traits, such as lower conscientiousness and agreeableness or higher neuroticism (e.g., DeMars et al., 2013; Barry and Finney, 2016; Lu et al., 2018), and cultural background characteristics (e.g., Goldhammer et al., 2016). However, the findings are not consistent across studies. Especially the relation of test effort and academic ability levels needs to be discussed and investigated more as the results are mixed and of high practical importance (see e.g., Wise and DeMars, 2005; Wise and Kong, 2005; Wise et al., 2006b, 2009; Kong et al., 2007; Lee and Jia, 2014; Goldhammer et al., 2016; Wise and Gao, 2017). Overall, previous findings align with the DCM as they suggest that academic and motivational resources as well as sociocultural aspects play a role in test-takers' effort capacity, which is assumed to be responsible for their decisions to show solution behavior or to engage in RGB instead.

Again in line with DCM assumptions, there is evidence that item characteristics (i.e., item demands) influence students' tendency to engage in RGB. Especially surface characteristics, such as shorter texts and the presence of pictures, have been shown to be related to lower RGB rates (Wise et al., 2009; Lindner et al., 2017a). However, deep item characteristics that are not easily traceable at first sight, such as item difficulty or the content area of an item, did not have a significant impact on RGB rates, as shown by Wise et al. (2009). From a logical point of view, this is not surprising because the short time frame in which students look at an item before they engage in RGB is not long enough to analyze deeper item characteristics. Thus, the item appearance seems to be more important for the perception of item demands and the decision to engage in RGB or not.

<sup>1</sup>Wise and Gao (2017) recently proposed broader measures of test-taking effort, which they refer to as response behavior effort (RBE) and response behavior fidelity (RBF), and which make it possible to identify rapid omits and rapid perfunctory answers to constructed response items in addition to RGB.

Furthermore, the circumstances of the test situation have been connected to test-taking effort and RGB rates. For example, although different seasons or weekdays did not influence students' test-taking effort, a later testing time in a day (e.g., testing in the afternoon) was linked to lower RTE measures (i.e., more RGB; Wise et al., 2010). This suggests that physical and/or mental fatigue plays a role in reduced test-taking effort (Lindner et al., 2018), which may also explain why the most important predictor of RGB is the elapsed testing time (see e.g., Wise et al., 2009). There is compelling evidence across studies that items presented in later positions in a test are typically solved with lower accuracy (item position effect; e.g., List et al., 2017; Weirich et al., 2017; Nagy et al., 2018a), less motivational effort (e.g., Barry and Finney, 2016; Penk and Richter, 2017) and are substantially more prone to RGB (e.g., Wise et al., 2009; Setzer et al., 2013; Goldhammer et al., 2016).

Consequently, because test-item demands change neither with the time of day nor with the test duration, the existing findings indicate that the reported increase of RGB over the course of testing time is mostly related to changes at the level of test takers' resources. Overall, there is reason to assume that both motivational and cognitive capacities become exhausted over the course of testing time due to the effort that has already been invested in solving previous items. Specifically, students need to build a new situational mental model for every single item and cognitively switch between tasks and solution strategies in a short time frame (Lindner et al., 2017a). Such operations are demanding and require working-memory capacity (i.e., executive attention; Engle, 2002) and self-control (Lindner et al., 2017a). Following Inzlicht et al. (2014), investing self-control to focus attention on cognitive tasks becomes more and more aversive over time, leading to a motivational disengagement from effortful tasks while attentional disruptions increase. This is also presumed to go along with a negative influence on students' affect over the course of a test session, which may cause a reduction in motivational effort (e.g., Ackerman and Kanfer, 2009; Ackerman et al., 2010; Inzlicht et al., 2014). As a consequence, individuals' performance typically decreases over the course of the test (e.g., Penk and Richter, 2017; Nagy et al., 2018b).

In this study, based on the DCM, we assumed that increasing exhaustion and negative emotions would be more pronounced for students who have lower cognitive capacities (i.e., academic abilities, general cognitive abilities, and working-memory capacity) and lower motivational capacities (i.e., low task enjoyment). Thus, we expected that students with lower cognitive and motivational resources suffer from an earlier depletion of their effort capacity and, thus, start to engage in RGB at an earlier point in the testing time.

#### The Present Research

Although different studies have investigated the correlates of RGB, they mainly considered the frequencies or proportions of RGB (i.e., RTE or RTF; e.g., Wise and Kong, 2005; Wise, 2006) and, to the best of our knowledge, no studies have yet focused on correlates for the RGB onset in a test session. Furthermore, the question remains open of whether a lower level of test motivation at the beginning of the test and a (faster) loss of motivation over the course of the test session are associated with an earlier RGB onset. The present study aimed to answer these questions by investigating the measures of student characteristics (i.e., gender, school type, general cognitive abilities, and working-memory capacity) as well as data from a micro-longitudinal design with 36 repeated ratings of students' task enjoyment over the course of testing. Our main goal was to investigate the relations of these cognitive and motivational measures to students' individual risk of early RGB onset during a test, in order to enhance the theoretical understanding of the RGB phenomenon.

Parts of the underlying data set have been previously published with a different focus, namely the effects of representational pictures in testing (see Lindner et al., 2017b). RGB was one of three dependent variables in the investigation of the effects of pictures as an item design characteristic. We do not report the respective findings in this study but, rather, directly build on the prior insights regarding students' RGB development across time, which we summarize here very briefly. In line with the literature (e.g., Goldhammer et al., 2016), the data showed a substantial RGB increase over the course of the test session, indicated by a significant main effect of item position (see Lindner et al., 2017b). However, this increase was substantially smaller in items that contained a representational picture (significant main effect of picture). There was no significant interaction between the factors picture and item position. Pictures mainly induced a shift in RGB frequency. Both text-only and text-picture items were subject to an increase in RGB across time, but the probability of RGB was smaller throughout the test for items that contained a picture. In the current analyses, we took the systematic variation of picture presence into account as a control factor, but did not specifically investigate this characteristic.

In line with the literature, we assumed in the present research that RGB is a type of behavior that, similar to other phenomena in the testing context (e.g., item position effects, performance decline; e.g., Hartig and Buchholz, 2012; Debeer et al., 2014; Jin and Wang, 2014; List et al., 2017; Weirich et al., 2017; Wise and Gao, 2017; Nagy et al., 2018b), has a high probability of being maintained (at a student level) over the course of a test session, once it has begun. This means that once individuals engage in RGB, they have a high probability of showing this behavior in the subsequent items of the test. This assumption is also in line with insights from raw data of individuals' RGB development as well as with the DCM (Wise, 2017), according to which a depletion of students' effort capacity across time goes along with a higher probability of engaging in RGB. This hypothesis also formed the base of our attempt to model the data in a latent class approach to investigate the correlates of students' RGB onset, which will be explained in detail in the Methods section. Specifically, drawing on the empirical and theoretical background in the field as outlined above, we formulated the following hypotheses:

Hypothesis 1: We expected to find a higher probability of earlier RGB onset in (a) male students, (b) students from non-academic-track schools, (c) students with lower cognitive resources in terms of general cognitive abilities, and (d) students with lower cognitive resources in terms of working-memory capacity.

Hypothesis 2: We expected that both the initial level of students' task enjoyment and its (negative) development over the course of testing would be predictive of RGB. Specifically, we expected that both (a) lower initial enjoyment ratings (intercept) and (b) a stronger decrease (slope) would be associated with the RGB variable and predict earlier RGB onset.

#### METHODS

As mentioned above, the current data set has been subject to investigations before. To avoid unnecessary repetition, we only report the measures that are relevant for the present analyses. Please consult the report by Lindner et al. (2017b) for further details.

#### Sample, Material, and Study Design

The analyzed sample comprised N = 401 students in the fifth and sixth grades in northern Germany (53.4% female, 51.4% fifth grade, Mage = 10.74, SDage = 0.76; n = 247 academic track [i.e., Gymnasium]; n = 154 non-academic track [i.e., regional school]) who took a computerized science test in an experimental classroom setting. Students were informed that their individual participation was completely voluntary and that they would not face any negative consequences if they did not participate or if they canceled their participation. Thus, all students were fully aware of the low-stakes testing environment, but they were also informed about the relevance of investing good effort to ensure reliable research results.

The scientific literacy test was constructed based on the science framework and items of the Trends in International Mathematics and Science Study (TIMSS; see e.g., Mullis et al., 2009; International Association for the Evaluation of Educational Achievement [IEA], 2013), which assess students' basic science achievement. The 36 items confronted students with realistic situations, forcing them to apply their declarative science knowledge from biology, physics, and chemistry to everyday phenomena and problems. It was essential that the students correctly understood the situation in the item stem for them to be able to solve the problem correctly. The items had a mean word count of M = 74.9 words (SDwords = 24.2). All items were presented in a multiple-choice format with a short item stem, a separate one-sentence question, and four answer options (one correct option). The items were randomly assigned to one of three test blocks (12 items per block), which were presented either with or without representational pictures (i.e., experimental manipulation of test items), resulting in six booklet constellations. A randomization check confirmed that the item difficulty did not differ between the blocks, F(2,33) = 0.05; p = 0.95; η² = 0.003. The systematic variation of presenting a representational picture (or not) in the items was balanced across booklets and realized in a within-subject multi-matrix design. To investigate RGB over the course of the test (i.e., in different item positions), items were presented in a random order within test blocks to avoid presenting certain items in certain positions. The six booklets were randomly assigned to the students and equally distributed in the sample (including school types). The marginal EAP/PV reliability of the science test was estimated as Rel. = 0.83.

#### Measures

#### Background Variables

We used a short questionnaire to assess background information, such as students' age, gender, grade level (fifth vs. sixth grade) and the attended school type (academic and non-academic track).

#### General Cognitive Abilities

The subtest N2 (Figural Analogies; adjusted according to students' grade level; α = 0.93/0.89) of the Kognitiver Fähigkeitstest (KFT 4–12+R; Heller and Perleth, 2000) was applied to measure spatial reasoning skills as an indicator of students' general cognitive abilities and resources. The subscale consists of 25 items, each of which presents students with one pair of meaningfully related figures and another single figure, for which the appropriate counterpart has to be selected from five answer options in order to create a similar pair of related figures.

#### Working-Memory Capacity

A self-programmed, computerized version of a reversed digit span test (see e.g., HAWIK-IV; Petermann and Petermann, 2010) served as an indicator of students' working-memory capacity. Students listened through headphones to an increasing number of digits (i.e., two up to eight) that were read out at a slow pace by a male voice. During the digit presentation, the keyboard was locked. After hearing each row (e.g., 3–5–8–7), students were asked to type the digit row in reverse order (e.g., 7–8–5–3) into a box that appeared on the screen. After logging in the response, the screen went white and the next digit row followed. The test contained 14 trials. The sum of correct answers determined the test score. Reliability was just sufficient (α = 0.64).

#### Task Enjoyment Ratings

As an indicator of students' current motivational level, we repeatedly measured students' task enjoyment while working on the items. We did so with a one-item measure (see Lindner et al., 2016), asking students how much fun they had solving the current item (i.e., "Working on this item was fun for me"). We assumed that lower enjoyment ratings would indicate lower motivational resources.

#### Rapid-Guessing Behavior

Students' response time was measured per item (in seconds), which served as the base for classifying RGB trials. Extreme response times more than two standard deviations (SD) above the item mean (0.3% of the data) were trimmed by replacing them with the value of two SD above the item mean (e.g., Goldhammer et al., 2014) to prevent bias in the means. Afterwards, the mean time on task for each item served as a base for setting RGB thresholds, following the normative threshold (NT) method proposed by Wise and Ma (2012). Using this method, item-specific threshold percentages can be defined, which means that response times shorter than, for example, 10%, 15%, or 20% of the average solution time of an item are classified as rapid guesses. To achieve a balance between identifying as many non-effortful responses as possible and avoiding the classification of effortful responses as RGB (e.g., Wise and Kong, 2005; Lee and Jia, 2014), we used a mixed approach to evaluate potential thresholds by different validation methods (i.e., absolute thresholds, visual inspection and guessing probability in RGB trials; e.g., Goldhammer et al., 2016; see also section Motivation and Test-Taking Behavior). Taking all validation criteria into account (for a detailed evaluation, see Lindner et al., 2017b), the NT15 criterion turned out to deliver the best fit and was thus used for the RGB definition. This resulted in an average item-specific threshold of M = 5.6 s (SD = 1.4).
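The trimming and threshold steps described above can be sketched as follows. This is a minimal illustration in Python, not the authors' code: the function names (`trim_upper`, `nt_threshold`, `flag_rgb`), the data layout (a list of response times in seconds for a single item), the use of the sample standard deviation, and the example values are all assumptions.

```python
from statistics import mean, stdev

def trim_upper(times, n_sd=2.0):
    """Replace response times above mean + n_sd * SD with that cutoff value."""
    cutoff = mean(times) + n_sd * stdev(times)
    return [min(t, cutoff) for t in times]

def nt_threshold(times, percent=15.0):
    """Normative threshold: a percentage of the item's mean (trimmed) time."""
    return mean(trim_upper(times)) * percent / 100.0

def flag_rgb(times, percent=15.0):
    """Return 1 for responses faster than the item's threshold, else 0."""
    threshold = nt_threshold(times, percent)
    return [1 if t < threshold else 0 for t in times]

# Hypothetical response times (seconds) of eight students on one item
item_times = [40.0, 35.0, 3.0, 50.0, 42.0, 38.0, 45.0, 200.0]
print(nt_threshold(item_times))  # item-specific NT15 threshold in seconds
print(flag_rgb(item_times))      # 1 marks a response classified as a rapid guess
```

Because the threshold is computed per item, a long and a short item can have very different cutoffs even though both use the same percentage.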

#### Apparatus and Procedure

Experienced test administrators conducted the study at schools during lesson time. All sessions were attended by a teacher and lasted up to 90 minutes. A laptop, headphones, and a mouse were prepared for each student. The science items were presented on 28 identical Lenovo® laptops, using the software flexSURVEY 2.0 (Hartenstein, 2012). Students answered a short background questionnaire, worked on the KFT, and took the working-memory test. Afterwards, they worked on the science test. It was ensured that students knew that they would not be able to return to an earlier question after choosing an answer and that they always needed to provide a response in order to progress to the next item. Following each item, students rated their item-solving valence. Providing an answer automatically forwarded the student to the next task. Students were repeatedly encouraged to take all the time they needed to solve each item but to work in a focused way through the test. This was done to ensure that the science test was worked on as a power test. There was no time limit for completing the test. Responses, response times (per item), and the item presentation sequence (i.e., item positions) were recorded in a log file for each student.

#### Data Analyses

RGB is a low-frequency behavior that is not exhibited by each student. As such, statistical modeling approaches for RGB should divide the total sample into at least two groups (or latent classes): One class that does not show RGB at all, and a second class of respondents who show at least some RGB responses. Within the class of individuals showing some RGB, the representation of the distribution of RGB can be challenging, especially in samples of modest size.

As a solution to this problem, we modeled RGB by means of a categorical latent variable (i.e., a latent class analysis; LCA). Our LCA model distinguished between a latent class that showed no RGB at all (i.e., no-RGB class) and three other classes that differed in the onset points of RGB (i.e., early, intermediate, and late onset points). In addition, we assumed the existence of a latent class consisting of students who had a rather low but constant probability of RGB at any point in the test (i.e., constantly low RGB class). To achieve this goal, we modeled the logits of the probability of RGB indicators y<sub>ip</sub> [y<sub>ip</sub> = 1 if individual i (i = 1, 2, . . . , N) showed RGB in position p (p = 1, 2, . . . , 36), and y<sub>ip</sub> = 0 otherwise] conditional on class membership C<sub>i</sub> = k (k = 1, 2, 3, 4, 5):

$$\text{logit}\left[P\left(y_{ip} = 1 \mid C_i = k\right)\right] = \gamma_k w_{ip} + \tau_{0k} + \frac{\theta_k}{1 + \exp\left[-\alpha_k\left(\beta_k - p\right)\right]} \tag{1}$$

In Equation (1), w<sub>ip</sub> is a variable indicating whether the item presented in position p to individual i is a text-only (w<sub>ip</sub> = 0) or a text-picture item (w<sub>ip</sub> = 1), and γ<sub>k</sub> is a logistic regression weight accounting for the fact that text-picture items are less likely to be associated with RGB (see Wise et al., 2009; Lindner et al., 2017b). The parameter γ<sub>k</sub> was specified to be invariant across classes reflecting RGB (i.e., C = 1–4), but was constrained to zero in the no-RGB class (C = 5). The last two terms on the right-hand side of Equation (1) capture the development of RGB across item positions. τ<sub>0k</sub> is a lower asymptote parameter, and θ<sub>k</sub> describes the upper asymptote of the probability of RGB in class C = k. The parameter α<sub>k</sub> (α<sub>k</sub> ≥ 0) reflects the rate of change in RGB probabilities, whereas β<sub>k</sub> stands for the position in which the inflection point of the logistic function occurs in class C = k.

In order to provide an interpretable solution, the LCA parameters of Equation (1) were subjected to further constraints. The first three classes (C ≤ 3) were specified to reflect students with different onset points of RGB (parameters β<sub>k</sub>). Here, we specified the β<sub>k</sub> parameters to be ordered (i.e., β<sub>1</sub> < β<sub>2</sub> < β<sub>3</sub>) and equally spaced, and the lower and upper asymptotes, τ<sub>0k</sub> and θ<sub>k</sub>, to be equal across these three classes. In order to provide an interpretable asymptote parameter, we constrained the rate-of-change parameter α<sub>k</sub> in such a way that the RGB probability in p = 1 (i.e., first item position) in the late-RGB-onset class (C = 3) solely reflected the lower asymptote τ<sub>0k</sub>. To this end, we constrained the last term of Equation (1) to be very close to zero in p = 1 by imposing the constraint α<sub>k</sub> = logit(0.001)/(β<sub>3</sub> − 1). The constantly low RGB class (C = 4) was assumed to have the same τ<sub>0k</sub> and α<sub>k</sub> parameters as the classes C = 1 to 3, but θ<sub>4</sub> was allowed to take a different value. In this class, β<sub>4</sub> was set to be equal to the inflection point of the early-RGB-onset class (C = 1), β<sub>1</sub>. Finally, in the no-RGB class (C = 5), the parameters γ<sub>5</sub>, θ<sub>5</sub>, α<sub>5</sub>, and β<sub>5</sub> were fixed to zero, and τ<sub>05</sub> was fixed to −15. Taken together, our basic LCA model estimated only six measurement parameters (Equation 1), and four latent class proportions π<sub>1</sub> to π<sub>4</sub> (π<sub>5</sub> = 1 − Σ<sub>k=1</sub><sup>K−1</sup> π<sub>k</sub>).
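To make the role of the constraint on the rate-of-change parameter concrete, the following Python sketch evaluates Equation (1) for one class. It follows the equation and the stated constraint literally (under which the rate-of-change parameter takes a negative value); the function names and all parameter values are hypothetical and chosen only for illustration, not taken from the fitted model.

```python
import math

def logit(prob):
    """Log-odds of a probability."""
    return math.log(prob / (1.0 - prob))

def inv_logit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))

def rgb_probability(p, w, gamma, tau0, theta, alpha, beta):
    """P(y_ip = 1 | C_i = k) from Equation (1) for position p and picture flag w."""
    growth = theta / (1.0 + math.exp(-alpha * (beta - p)))
    return inv_logit(gamma * w + tau0 + growth)

# Hypothetical parameter values (assumptions, not estimates from the study)
beta_late = 30.0                          # inflection point of the late-onset class
alpha = logit(0.001) / (beta_late - 1.0)  # constraint on the rate-of-change parameter
tau0, theta, gamma = -6.0, 7.0, -1.0

# At p = 1 the position-dependent term equals 0.001 * theta,
# so the logit is essentially the lower asymptote tau0.
first = rgb_probability(1, 0, gamma, tau0, theta, alpha, beta_late)
last = rgb_probability(36, 0, gamma, tau0, theta, alpha, beta_late)
print(first, last)
```

With these illustrative values, the probability of RGB stays near its floor in the first position and rises toward the end of the test, and a text-picture item (w = 1) yields a lower probability than a text-only item in the same position.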

The LCA model was extended by the inclusion of covariates predicting class membership. This was accomplished by means of a multinomial logit model so that:

$$P\left(C_i = k \mid \mathbf{x}_i\right) = \frac{\exp\left(\omega_{0k} + \sum_{j=1}^{J} \omega_{1kj} x_{ij}\right)}{\sum_{l=1}^{5} \exp\left(\omega_{0l} + \sum_{j=1}^{J} \omega_{1lj} x_{ij}\right)}, \tag{2}$$

with x<sub>i</sub> being the individual i's J × 1 vector of covariate values with entries x<sub>ij</sub> for covariates j = 1, 2, . . . , J, and ω parameters standing for multinomial intercepts and weights that were fixed to zero for the no-RGB class C = 5. Based on the estimates of the ω-parameters, RGB probability curves, expected at specific values of the covariate x<sub>i</sub>, were derived by combining Equations (1) and (2) to:

$$P\left(y_{ip} = 1 \mid \mathbf{x}_i\right) = \sum_{k=1}^{K} P\left(C_i = k \mid \mathbf{x}_i\right) P\left(y_{ip} = 1 \mid C_i = k\right). \tag{3}$$
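The combination of the multinomial model for class membership with the class-specific RGB probabilities can be sketched in a few lines of Python. This is an illustration under assumptions: the function names, the single standardized covariate, and all numerical values are hypothetical, and the no-RGB class is represented by intercepts and weights of zero with an RGB probability of (approximately) zero.

```python
import math

def class_probabilities(x, intercepts, weights):
    """Equation (2): P(C_i = k | x_i) for all classes, given covariate values x."""
    linear = [a + sum(b * xj for b, xj in zip(row, x))
              for a, row in zip(intercepts, weights)]
    peak = max(linear)  # subtract the maximum for numerical stability
    expo = [math.exp(v - peak) for v in linear]
    total = sum(expo)
    return [e / total for e in expo]

def marginal_rgb(class_probs, rgb_given_class):
    """Equation (3): mixture of class-specific RGB probabilities at one position."""
    return sum(pc * pr for pc, pr in zip(class_probs, rgb_given_class))

# Hypothetical values: five classes, one standardized covariate
intercepts = [-2.5, -2.4, -0.9, -2.0, 0.0]         # last entry: no-RGB reference class
weights = [[-0.8], [-0.6], [-0.2], [-0.7], [0.0]]
rgb_at_position = [0.60, 0.30, 0.05, 0.10, 0.0]    # class-specific P(RGB) at one position

low = marginal_rgb(class_probabilities([-1.0], intercepts, weights), rgb_at_position)
high = marginal_rgb(class_probabilities([1.0], intercepts, weights), rgb_at_position)
print(low, high)
```

Repeating the last step for every item position traces out an expected RGB probability curve for a given covariate value.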

Most covariates were observed but, in the case of task enjoyment, we used latent variables that were derived from a linear growth model specified as:

$$z_{ip} = \delta w_{ip} + \eta_{0i} + \frac{p-1}{36-1}\eta_{1i} + \varepsilon_{ip}, \tag{4}$$

where z<sub>ip</sub> is the individual i's enjoyment score in position p, w<sub>ip</sub> stands for the values of the item-level covariate as defined before, and δ is a corresponding regression weight. The latent variables η<sub>0i</sub> and η<sub>1i</sub> represent the individual's initial enjoyment value and the rate of change, while ε<sub>ip</sub> is a random disturbance. The η-variables were assumed to follow a bivariate normal distribution. Disturbances were assumed to have zero means, to be normally distributed, and to be uncorrelated with each other as well as with any other variable in the system. The variances of disturbances were set to be equal across positions, but were allowed to be different for text-only and text-picture items. The η-variables were entered into the LCA models similar to x-variables (Equation 2), where all growth and LCA parameters were jointly estimated.

All estimations were carried out with the Mplus 8.0 program (Muthén and Muthén, 2017) using marginal maximum likelihood estimation. Parameter estimates were accompanied by robust standard errors adjusted for non-normality. As LCA models are known to be prone to local minima, we used multiple random start values to check whether the best log-likelihood could be replicated. Model-data fit was evaluated by information theoretic indices including the Akaike information criterion (AIC), Bayesian information criterion (BIC), and the sample-size-adjusted BIC (sBIC). These indices take model complexity (i.e., the number of parameters) into account and penalize highly parametrized models.

In order to test whether variables were associated with RGB, we performed multivariate Wald tests of multinomial logit regression weights (Equation 2). The first test served as a test of no association (NA), in which we simultaneously tested all weights attached to a covariate x<sub>j</sub> against zero (i.e., ω<sub>11j</sub> = ω<sub>12j</sub> = ω<sub>13j</sub> = ω<sub>14j</sub> = 0). The second test was a test of constant associations (CA) and examined the equality of logistic regression weights (i.e., ω<sub>11j</sub> = ω<sub>12j</sub> = ω<sub>13j</sub> = ω<sub>14j</sub>). The CA test is interesting because it indicates whether the effects of covariates on RGB differ between regions (i.e., item positions) in the test. For example, if a covariate is significantly related to RGB (i.e., significant NA test), but the covariate's effects do not differ from each other (i.e., non-significant CA test), it implies that the covariate's effects on RGB constantly increase across item positions (i.e., the curves expected for two values of the covariates have similar shapes but different gradients). In contrast, a significant CA test indicates that the effects of a covariate do not constantly increase across positions, which means that the probability curves predicted at different values of the covariate differ in their shapes. For example, it might turn out that the effect of a covariate is limited to the first latent class (C = 1), whereas its effects on classes C = 2 and C = 3 are near to zero. Imagining this case, differences in RGB probabilities at different levels of the covariate would already arise early in the test session and would then remain constant across subsequent item positions. Alternatively, if the covariate's effects turn out to be stronger on class C = 3 and close to zero on classes C = 2 and C = 1, it means that the covariate's effects emerge only in the last section of the test. Hence, a significant CA test does not indicate a certain type of relationship. Instead, it indicates a non-constant pattern of relationships.
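Both tests can be expressed as Wald tests of linear restrictions on the four class-specific weights of a covariate j from Equation (2). The formulation below is the standard textbook form and is given only as a sketch; the symbols W, R, and V̂ (the restriction matrix and the estimated covariance matrix of the weights) are notation introduced here and do not appear in the original analysis.

$$W = \left(\mathbf{R}\hat{\boldsymbol{\omega}}_j\right)^{\top}\left(\mathbf{R}\hat{\mathbf{V}}_j\mathbf{R}^{\top}\right)^{-1}\left(\mathbf{R}\hat{\boldsymbol{\omega}}_j\right), \qquad \mathbf{R}_{\text{NA}} = \mathbf{I}_4, \qquad \mathbf{R}_{\text{CA}} = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}$$

Under the respective null hypothesis, W is asymptotically χ²-distributed with degrees of freedom equal to the number of rows of R, that is, four for the NA test and three for the CA test of a single covariate.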

### RESULTS

### Unconditional LCA Models

In a first step we employed LCA models that did not include any covariates. The analyses served mainly descriptive purposes and were further used to evaluate the model's ability to depict the marginal RGB probabilities. Our proposed LCA model fitted the data better than a comparison model that assumed two classes (students with no or some RGB) in which the thresholds of all RGB indicators were unconstrained in the RGB class and estimated differently for text-only and text-picture items (unconstrained two-class model: #Parameters = 71, Log Likelihood = −2,313.5, AIC = 4,769.1, BIC = 5,052.3, sBIC = 4,827.0; present model: #Parameters = 10, Log Likelihood = −1,963.1, AIC = 3,946.2, BIC = 3,986.0, sBIC = 3,954.3). This result indicates that our LCA model provided a good description of RGB. **Figure 1** presents the class-specific RGB probabilities by item position, uncovered by our LCA model, whereas the model fitted and observed RGB proportions are presented in the first panel of **Figure 2**. In line with previous results, the LCA model indicated that text-only items were more strongly affected by RGB (γ̂ = −1.05, SE = 0.17, p < 0.001). Furthermore, the LCA model categorized 56.6% of respondents as not engaging in RGB (observed data: 63.9%).
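The information criteria reported above can be reproduced approximately from the log-likelihood, the number of parameters, and the sample size. The sketch below assumes the standard definitions (AIC = −2 log L + 2q, BIC = −2 log L + q ln N, and a sample-size-adjusted BIC in which N is replaced by (N + 2)/24); the function name is hypothetical, and N = 401 is assumed to be the relevant sample size. Small deviations from the reported values are expected because the log-likelihood is rounded to one decimal place.

```python
import math

def information_criteria(log_lik, n_params, n_obs):
    """Return (AIC, BIC, sample-size-adjusted BIC) for a fitted model."""
    deviance = -2.0 * log_lik
    aic = deviance + 2.0 * n_params
    bic = deviance + n_params * math.log(n_obs)
    sbic = deviance + n_params * math.log((n_obs + 2.0) / 24.0)
    return aic, bic, sbic

# Values as reported in the text; N = 401 is an assumption
two_class = information_criteria(-2313.5, 71, 401)
present = information_criteria(-1963.1, 10, 401)
print(two_class)
print(present)
```

All three indices are smaller for the present model, which is the basis for preferring it over the unconstrained two-class model.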

With respect to the onset of RGB, the LCA indicated that most students started to switch to this behavior in the later part of the test (23.8% in Class 3). The remaining classes had quite similar proportions, ranging between 5.0 and 8.0% (**Figure 1**). As can be seen in **Figure 2**, the five classes were sufficient for describing the marginal distribution of RGB for both text-only and text-picture items. Hence, the model appeared to be a solid starting point for assessing the predictors of RGB.

Next, we investigated changes in students' enjoyment ratings over the course of the test. We started with a linear growth curve model that was fitted to the data without considering the remaining variables. The model indicated that text-picture items were associated with higher enjoyment ratings throughout the test-taking session (δ̂ = 0.15, SE = 0.02, p < 0.001), and that enjoyment ratings were, on average, high at the beginning of the test (μ̂<sub>η0</sub> = 2.94, SE = 0.04, p < 0.001) but decreased on average across positions (μ̂<sub>η1</sub> = −0.29, SE = 0.04, p < 0.001). The results provide evidence for the existence of individual differences in initial enjoyment levels (σ̂²<sub>η0</sub> = 0.39, SE = 0.03, p < 0.001) and changes in enjoyment (σ̂²<sub>η1</sub> = 0.52, SE = 0.05, p < 0.001), with the two components being only weakly related (ρ̂<sub>η0,η1</sub> = −0.12, SE = 0.06, p = 0.049). Hence, the growth curve model indicated that, regardless of their initial enjoyment level, students exhibited relatively large individual differences in enjoyment declines. This aspect is visualized in **Figure 2B**, where the model-predicted average declines are depicted together with the observed means and the distribution of model-predicted scores (10th−90th percentiles of the distribution) that document increasing individual differences in enjoyment due to individual differences in the trajectories.

FIGURE 1 | RGB probability for latent Class 1 (early onset point), Class 2 (intermediate onset point), Class 3 (late onset point), and Class 4 (constantly low RGB) with results for text-only (left) and text-picture items (right).
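The growth-model estimates reported above can be turned into model-implied values with a short Python sketch. It plugs the reported point estimates into Equation (4); the function names are hypothetical, and treating the reported association between the two growth components as a correlation is an assumption.

```python
import math

def mean_enjoyment(p, w, mu0, mu1, delta, n_positions=36):
    """Model-implied mean rating at position p (Equation 4 without disturbance)."""
    t = (p - 1) / (n_positions - 1)
    return delta * w + mu0 + t * mu1

def sd_true_scores(p, var0, var1, rho, n_positions=36):
    """SD across students of eta_0 + t * eta_1 at position p."""
    t = (p - 1) / (n_positions - 1)
    cov = rho * math.sqrt(var0 * var1)
    return math.sqrt(var0 + t ** 2 * var1 + 2 * t * cov)

# Point estimates as reported in the text
mu0, mu1, delta = 2.94, -0.29, 0.15
var0, var1, rho = 0.39, 0.52, -0.12

start = mean_enjoyment(1, 0, mu0, mu1, delta)   # text-only item, first position
end = mean_enjoyment(36, 0, mu0, mu1, delta)    # text-only item, last position
spread_start = sd_true_scores(1, var0, var1, rho)
spread_end = sd_true_scores(36, var0, var1, rho)
print(start, end, spread_start, spread_end)
```

Under these assumptions, the average rating for text-only items drops from 2.94 to 2.65 across the 36 positions, while the spread of the model-implied individual scores grows, which matches the widening percentile band described for the figure.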

#### Conditional LCA Models

To study the correlates of RGB, we started by employing conditional LCA models in which we used each predictor in isolation without considering the remaining covariates. The exceptions were the two latent variables of the growth curve model applied to the enjoyment variables that were investigated simultaneously. **Table 1** presents multinomial regression weights determined for each variable and the corresponding tests for no association (row NA) and constant associations (row CA) with RGB.

TABLE 1 | Multinomial logistic regression weights determined separately for each covariate, and corresponding Wald-χ² tests of no association (NA) and of constant associations (CA).

Gender: 0 = male, 1 = female; school type: 0 = academic track (Gymnasium), 1 = non-academic track (i.e., regional school); Measures of general cognitive abilities (KFT) and working-memory were standardized prior to the analysis. \*p ≤ 0.05; \*\*p ≤ 0.01.

As can be seen in **Table 1**, almost all variables were significantly related to RGB. The exception was gender. The pattern of gender differences was in line with previous results but did not reach the significance threshold (p = 0.113). Judged on the value of the Wald-χ² statistic, school type was most strongly related to RGB, whereas the initial level of and change in enjoyment had the weakest relationships to RGB. Furthermore, the four multinomial logistic regression weights belonging to each variable appeared to differ from each other. For example, the regression weights associated with school type indicated that the chances of academic-track students belonging to classes C = 1, 2, or 4 vs. class C = 5 were much smaller than the corresponding chances of non-academic-track students. In contrast, school-type differences in the relative chance of belonging to class C = 3 (i.e., the late RGB onset class) were less pronounced (i.e., the regression weight was closer to zero).

As can be seen in the CA row in **Table 1**, school type was differentially related to the onset point of RGB, whereas gender and initial enjoyment were not. Academic-track students were least likely to have an early RGB onset (i.e., membership in classes C = 1, 2, or 4). Similar relationships were found with the continuous covariates, general cognitive abilities, working-memory capacity, and change in enjoyment, so that students with higher scores on these variables were least likely to have an early RGB onset.

In order to get an impression of the pattern of relationships, the model-predicted probabilities of RGB at selected values of the covariates are plotted in **Figure 3**. As suggested by the non-significant overall effect (NA test, **Table 1**) and the non-significant CA test, gender differences were rather small, but showed a relatively constant (albeit non-significant, p = 0.113) increase across positions. In contrast, differences between school types were clearly larger and showed a strong increase across item positions, whereby the increase was largest in the first two thirds of the test. A similar picture was revealed for the continuous measures of general cognitive abilities and working-memory capacity. In the case of these variables, it appeared that above average scores did not have a meaningful effect on RGB. Rather, students who scored well below average on these tests had a higher probability of engaging in RGB.

The relationship of RGB with the repeatedly measured enjoyment variable is shown in **Figure 4**. In order to account for the initial level and the change component in the enjoyment ratings, the figure contains three line plots for low (10th percentile), average, and high levels (90th percentile) of initial enjoyment, which each contain RGB probability curves for low (10th percentile), average, and high levels (90th percentile) of change in enjoyment. As shown in **Figure 4**, lower initial levels of enjoyment were associated with constantly increasing levels of RGB across positions (non-significant CA test). As further shown in **Figure 4**, the RGB probability curves differed at each level of initial enjoyment, depending on the change in enjoyment, so that steeper decreases in enjoyment were associated with steeper increases in RGB (see also NA row in **Table 1**).

All results presented up to this point pertain to the models in which each covariate was investigated in isolation. However, the majority of student characteristics employed were correlated among each other, as can be taken from **Table 2**. Even though the correlations were not so high that they could cause collinearity problems, the question about each variable's unique contribution to the prediction of RGB emerged. We approached this question by using all covariates simultaneously as predictors of latent class membership. The results are presented in **Table 3**.

The (non-significant) relationship of gender with RGB was not affected by the inclusion of the other covariates (see **Table 1**). A similar result was found for school type; RGB was still significantly related to this variable and also strongly related to an early RGB onset. The relationship of general cognitive abilities with RGB was clearly reduced after all covariates were included in the model, although the relationship with RGB and RGB onset remained significant. In contrast, the relationship of working memory with RGB was similar to that of the previous model (see **Table 1**), which means that it continued to be significantly related to RGB and its onset. Initial enjoyment also remained significantly related to RGB, but the regression weights for the different latent classes did not differ significantly (CA test; see **Table 3**). Finally, changes in enjoyment continued to be significantly related to RGB, but the CA test was no longer significant on the p < 0.05 level (p = 0.054). This weakens the evidence of a strong relation between students' enjoyment decline and early RGB onset.

### DISCUSSION

The present study examined the correlates of RGB onset and its temporal dynamic over the course of testing as a between-student factor with regard to motivational and cognitive student characteristics, using a latent class approach as a base for our analyses. Specifically, we investigated the extent to which different patterns of (early) RGB onset were related to cognitive and motivational covariates in order to gain deeper insights into the processes that may underlie disengaged test-taking behavior in low-stakes assessment. In the following sections, we discuss the key results of the study with regard to our hypotheses, the theoretical assumptions, and earlier research. Finally, we reflect on the study's limitations, consider future research suggestions, and close the article with an overall conclusion and a consideration of the practical significance of our findings.

#### Student Characteristics

Testing our hypothesis regarding the relation of RGB or RGB onset and students' gender (H1a), we did not find a significant relation, contrary to our expectation. However, this is not entirely surprising, as the findings in the literature are also inconsistent. Several studies indicate that male students have lower levels of test-taking motivation and also tend to show disengaged behavior, such as RGB, more often (for a review see e.g., DeMars et al., 2013). Still, not all studies find a significant relation between gender and RGB (e.g., Wise et al., 2009). In the present study, as can be seen in **Figure 3A**, the descriptive pattern was in line with the expectation that male students would engage in RGB earlier than female students, but the coefficient did not reach significance. This result seems to be primarily related to a power issue, as the present sample may not have been large enough to significantly show the effect. Generally, the relationship between RGB and gender appeared to be of lower practical importance considering the marginal effect sizes in vast representative samples, such as in the study by Goldhammer et al. (2016). However, gender differences in RGB may be more pronounced in younger students, which seemed to be reflected at a descriptive level in our data. The moderating role of students' age would, thus, be an interesting factor for future research.

FIGURE 4 | RGB probabilities for text-only items by item position, expected for different combinations of initial enjoyment and change in enjoyment over the course of the test (i.e., item positions).

TABLE 2 | Predictor correlations.

\*p ≤ 0.05; \*\*p ≤ 0.01.

Confirming our hypothesis regarding students' school-type attendance (H1b), we found a remarkably higher risk of an earlier RGB onset and a stronger increase of RGB probabilities in students from non-academic-track schools (see **Figure 3B**). This effect remained significant when all predictors were included in one model; moreover, school type was the strongest predictor of early RGB onset. In the German school system, which assigns students to different secondary school tracks based on their performance in elementary school, school type is strongly related to students' academic abilities (e.g., Prenzel et al., 2013). In addition, school type has been shown to be connected to differences in students' motivation to work in an effortful way in low-stakes assessments (e.g., Baumert and Demmrich, 2001; Nagy et al., 2018b). Thus, both factors, academic ability and motivation, are probably reflected in the substantial RGB differences between school tracks. Earlier studies have shown similar relations of RGB (e.g., Lee and Jia, 2014; Goldhammer et al., 2016; Wise and Gao, 2017) or item position effects (e.g., Nagy et al., 2016, 2018a) with students' academic ability level (e.g., SAT scores; Wise et al., 2009) or school-type attendance (Nagy et al., 2016). Nevertheless, some studies did not find ability-related differences in students' response effort (e.g., Wise and DeMars, 2005; Wise and Kong, 2005; Wise et al., 2006a). These mixed results might be attributed to the different sample characteristics, test situations, and criteria used to judge students' academic ability (e.g., scores from the investigated test vs. external criteria, such as SAT scores). In this study, we used a criterion that is independent of students' test achievement and known to be a solid indicator of academic abilities. 
However, while the investigated data set included students from academic- and non-academic-track schools, it did not reflect the full range of German non-academic-track schools (i.e., no lower secondary schools). Our findings might therefore not fully represent school-type differences, as students from lower non-academic schools might further contribute to the unfavorable picture of school-type differences in RGB.

In line with our hypotheses regarding students' general cognitive abilities (H1c) and working-memory capacity (H1d), we found substantial evidence that both factors are significantly related to RGB and predict an earlier RGB onset and a stronger increase in RGB. However, this only applied to students with relatively low cognitive capacities (see **Figures 3C,D**). This indicates that a lack of cognitive resources raises students' risk of engaging in RGB early on and of showing a stronger RGB increase. Building on expectancy-value models (e.g., Eccles and Wigfield, 2002) and the DCM assumptions (Wise, 2017), this is not really surprising. However, we are not aware of any empirical studies that have so far investigated standardized cognitive ability tests as predictors of RGB development. While both of our measures were clearly related to RGB as isolated predictors, it is especially interesting that working-memory capacity seemed to be more predictive of both RGB and RGB onset than general cognitive abilities. This became evident when we integrated all indicators into one full competitive model, where the general cognitive ability covariate lost a substantial part of its explanatory power but the working-memory factor remained basically unaffected. This might be explained as follows: Whereas general cognitive abilities are assumed to be more or less stable across situations and time (i.e., fluid intelligence as a trait), working-memory capacity is known to be subject to stronger situational fluctuations (see e.g., Hofmann et al., 2011) and can also be subject to mental fatigue effects that undermine attentional control (Schmeichel, 2007). However, executive attention is a key factor in self-controlled behavior, which is also needed in any test situation in order for students to focus on the posed problems and to solve them with effort. This demand tends to become aversive over the course of testing time (Inzlicht et al., 2014). This relation could help to explain why working-memory capacity seems to be the more important cognitive resource required for engaged test-taking behavior over the course of a test session.

TABLE 3 | Multinomial logistic regression weights determined jointly for all covariates, and corresponding Wald-χ<sup>2</sup> tests of no association (NA) and of constant associations (CA).

Gender: 0 = male, 1 = female; school type: 0 = academic track (Gymnasium), 1 = non-academic track (i.e., regional school); measures of general cognitive abilities (KFT) and working memory were standardized prior to the analysis. \*p ≤ 0.05; \*\*p ≤ 0.01.

#### Task Enjoyment Over the Course of Testing

RGB is typically interpreted as an indicator of student motivation. In our study, we examined the extent to which RGB was related to students' perceived motivation level as an open question. By modeling the intercept of students' multiple enjoyment ratings across the test session as a latent covariate in the LCA, we tested Hypothesis H2a. Although there was evidence for a relation between students' initial enjoyment (i.e., rating of the first item) and RGB, we did not find a significant relation to RGB onset (**Figure 4**). This was true for both the isolated analysis of initial enjoyment as a single predictor and the full model with all predictors. The observed and model-fitted data of students' enjoyment ratings (**Figure 2B**) showed a decrease over the course of the test session, as expected, though the mean level of students' enjoyment remained relatively high. The figure also shows that there was a lot of inter-individual variance; we investigated this variance by integrating students' estimated slopes as a latent covariate into our LCA to test Hypothesis H2b. This provided tentative evidence that a negative enjoyment trajectory over time predicted both RGB and RGB onset in the isolated model. However, the relation with RGB onset did not remain significant when competitive covariates were added to the model, which weakens the evidence for Hypothesis H2b to some extent.

Overall, students' enjoyment ratings were not strongly related to their RGB tendency when compared to the cognitive covariates. This relatively weak relation could be due to the young age of the students in the current sample, who might not yet be able to correctly reflect on their current enjoyment, but it could also indicate that test-takers in general have problems accurately evaluating their motivational state. However, this question cannot be answered based on the present findings. Penk and Richter (2017) recently applied a comparable approach of modeling ninth-graders' test-taking motivation across a test session to investigate item position effects. They found that initial test-taking motivation was a better predictor of the item position effect than changes in motivation. This pattern is the opposite of our results and is somewhat surprising; it indicates that there are interesting questions to be answered in future research on test-taking motivation.

#### Limitations and Future Directions

Some limitations need to be taken into account when interpreting the present findings. First, the current sample cannot be considered representative, which constrains the generalizability. The effects of school type might be biased because we did not include all German school tracks and we tested only students in the fifth and sixth grades. Compared to typical large-scale assessments, the current sample was rather small but seemed to be sufficient, except for determining the relation between RGB and gender, which may have been underpowered. As an unusual advantage, however, the data included important measures, such as the repeated enjoyment rating and the indicators of students' general cognitive abilities and working-memory capacity, which were at the core of the present analyses. The test circumstances were highly comparable to typical computer-based low-stakes testing programs. Nevertheless, future studies should challenge our research and try to replicate the current findings in larger data sets. In particular, transferring our latent class approach to other samples would be desirable in order to evaluate the extent to which the presumptions and findings of our study (e.g., the proportion of student assignments to the five individual LCA classes) are robust. As such, the proposed analysis could be a fruitful base for future research on the determinants of RGB onset and its dynamics across testing time.

Second, the reliability of our working-memory test (i.e., reversed digit span) was, unfortunately, not very high (α = 0.65). However, a tradeoff has to be made with a view to the challenge of measuring working-memory indicators in group sessions, as individual test sessions can better ensure that the test is administered in the best way possible. It would therefore be advantageous to reexamine the current issue by assessing other or additional working-memory capacity indicators that have a higher test reliability.

A third potential limitation pertains to the fact that both the science test and the cognitive tests (KFT N2 and reverse digit span) were administered in the same test session. The results might therefore share common variance due to a general tendency of students to work seriously on test items in a low-stakes situation (i.e., in terms of a latent trait) and also due to their current overall compliance with the test-taking situation (i.e., in terms of a current state during the specific test administration). However, the cognitive tests were presented before the science test. The risk that students were already working without effort at the beginning of the test session is therefore rather low. This assumption is supported by the observation that only a small number of RGB trials occurred in the first items of the science test, indicating that most students were still prepared to make an effort to work on the test items at the beginning of the test. Nevertheless, test scores from standardized cognitive tests that were assessed in a separate session on another day would have been preferable.

#### Conclusion and Implications for Educational Practice

Drawing on a theory-driven latent class model, standardized measures of students' cognitive abilities, and repeated ratings of their current item-solving enjoyment, this study was able to extend previous work and widen the understanding of RGB. The main strength of our investigation is that our LCA approach made it possible to study the dynamics of RGB in connection with several indicators of cognitive and motivational resources at a student level. In brief, we found evidence that students' item-solving enjoyment, academic ability, and cognitive capacities are (closely) related to the RGB onset point and the dynamics of RGB across a low-stakes test session. Students from non-academic-track schools, students with low general cognitive abilities and low working-memory capacity, as well as students with a stronger decline in their task enjoyment over the course of the test were substantially more likely to engage in RGB earlier in the test and to progress with that behavior. All of these findings are in line with the theoretical assumptions from expectancy-value models (e.g., Eccles and Wigfield, 2002) as well as those of the DCM by Wise and Smith (2011). However, future research should also focus on non-cognitive factors, such as coping strategies, test anxiety, or the well-being of students, and on the relations of these factors to test-taking behavior. In addition, characteristics of students' home environment, such as the socio-economic status of their parents, and school culture, including the school climate, the ethnic composition, and the value teachers, parents, and peers attribute to learning and testing efforts, should be taken into account in order to better understand RGB from a broader perspective.

Alongside the new support they provide for the theoretical models concerning the psychological determinants of RGB, our results also have practical implications. The substantial relation of RGB to students' academic and cognitive abilities suggests that students' test engagement is a serious confounding factor (in terms of true competences) for a valid interpretation of school-type comparisons of low-stakes test performances (see also Wise et al., 2009; Nagy et al., 2018b). This is a problem because such comparisons are often an important goal of large-scale testing programs. Furthermore, all motivation-filtering procedures rely on the theoretical assumption that student motivation is unrelated to true proficiency. If this criterion is not fulfilled, however, the filtering procedure induces bias. In particular, filtering students with low proficiency out of the data would provide an overly positive picture of the performance in the investigated sample, leading to an overestimation of true proficiency. In addition to attempts to apply statistical correction procedures, this problem should also be discussed at the level of test characteristics. For example, applying shorter tests, using items with a more appealing design (see e.g., Lindner et al., 2017b; Wise et al., 2009), and possibly having longer breaks between different test blocks may foster students' test-taking motivation and could allow them to refresh exhausted cognitive resources before continuing to focus their attention on further tasks (see also Lindner et al., 2018). In the light of the current results, such considerations seem to be particularly relevant for students from non-academic-track schools and for students with low working-memory capacity. However, the extent to which an improvement in assessment conditions would actually contribute to solving the problems that are connected to low test effort is a question for future research.
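The bias mechanism described above can be illustrated with a minimal simulation sketch in Python. It is not part of the analyses reported in this article: the logistic link between proficiency and the probability of being flagged as a rapid guesser, its coefficients, the sample size, and the function name `simulate_filtering_bias` are assumptions chosen purely for illustration.

```python
import math
import random
import statistics


def simulate_filtering_bias(n_students=10_000, seed=1):
    """Return mean true proficiency before and after motivation filtering."""
    rng = random.Random(seed)
    all_scores, kept_scores = [], []
    for _ in range(n_students):
        proficiency = rng.gauss(0.0, 1.0)  # standardized true proficiency
        # Assumption: the lower the proficiency, the higher the chance of
        # being flagged as a rapid guesser (arbitrary logistic coefficients).
        p_flagged = 1.0 / (1.0 + math.exp(1.0 + 1.5 * proficiency))
        flagged = rng.random() < p_flagged
        all_scores.append(proficiency)
        if not flagged:  # motivation filtering drops flagged students
            kept_scores.append(proficiency)
    return statistics.mean(all_scores), statistics.mean(kept_scores)


full_mean, filtered_mean = simulate_filtering_bias()
print(f"mean proficiency, full sample:     {full_mean:.3f}")
print(f"mean proficiency, filtered sample: {filtered_mean:.3f}")
```

Under these assumptions, the filtered sample has a higher mean proficiency than the full sample, because low-proficiency students are removed more often. If the flagging probability were set to a constant instead, the two means would differ only by sampling error.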

### ETHICS STATEMENT

This study was carried out in accordance with the Declaration of Helsinki and the ethical guidelines for experimental research with human participants as proposed by the German Psychological Society (DGPs). Prior to the test session, we obtained written informed consent from all students and their legal guardians.

### AUTHOR CONTRIBUTIONS

All authors have made a substantial intellectual contribution to the article and approved it for publication. ML developed the study design, performed the data collection, prepared the data for analyses and wrote the manuscript. GN developed the conception of the analyses, performed the analyses and co-wrote the statistical analyses paragraph. GN and OL reviewed and edited the article. All authors contributed to the theoretical conception, the interpretation of the results and manuscript revisions.

### REFERENCES


### ACKNOWLEDGMENTS

This work was funded by the Leibniz Institute for Science and Mathematics Education (IPN) in Kiel.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Lindner, Lüdtke and Nagy. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The u-can-act Platform: A Tool to Study Intra-individual Processes of Early School Leaving and Its Prevention Using Multiple Informants

Frank J. Blaauw<sup>1,2</sup>\*<sup>†</sup>, Mandy A. E. van der Gaag<sup>1†</sup>, Nick R. Snell<sup>1</sup>, Ando C. Emerencia<sup>1</sup>, E. Saskia Kunnen<sup>1</sup> and Peter de Jonge<sup>1</sup>

<sup>1</sup> Department of Developmental Psychology, Faculty Behavioral and Social Sciences, University of Groningen, Groningen, Netherlands, <sup>2</sup> Distributed Systems Group, Bernoulli Institute for Mathematics, Computer Science, and Artificial Intelligence, University of Groningen, Groningen, Netherlands

#### Edited by:

Frank Goldhammer, German Institute for International Educational Research (LG), Germany

#### Reviewed by:

Kathleen Scalise, University of Oregon, United States

Olga Kunina-Habenicht, Pädagogische Hochschule Karlsruhe, Germany

\*Correspondence: Frank J. Blaauw, f.j.blaauw@rug.nl

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology

Received: 22 November 2018 Accepted: 22 July 2019 Published: 20 September 2019

#### Citation:

Blaauw FJ, van der Gaag MAE, Snell NR, Emerencia AC, Kunnen ES and de Jonge P (2019) The u-can-act Platform: A Tool to Study Intra-individual Processes of Early School Leaving and Its Prevention Using Multiple Informants. Front. Psychol. 10:1808. doi: 10.3389/fpsyg.2019.01808

We present the u-can-act platform, a tool that we developed to study the individual processes of early school leaving and the preventative actions that mentors take to steer these processes in the right direction. Early school leaving is a significant problem, particularly in vocational education, and can have severe consequences for both the individual and society. However, the prevention of early school leaving is hampered by a mismatch between research and practice: research tends to focus on identifying risk factors using group averages and cross-sectional studies, while practitioners focus on intervening in individual processes. We aim to help solve this mismatch with our project u-can-act. In this project we have developed a platform that helps to gain insight into both the individual processes that precede early school leaving as well as the actions that mentors take to prevent it. In this paper we introduce the u-can-act platform, which consists of three technology-based, reusable methodological innovations. Specifically, our innovations concern: (i) an open source web application for longitudinal personalized data-collection, (ii) an automated study protocol that optimizes adherence in a difficult target group (adolescents at risk for early school leaving), and (iii) a technologically assisted coupling between mentor and student that allows us to study dyadic interactions over time. We present performance results of our platform, including participant adherence, the behavior of the questionnaire items over time, and the way that our web application is experienced by the participants.
We conclude that our innovative platform is successful in collecting multi-informant time-series data on intervention processes among students in vocational education, both for at-risk students and control students, and for their mentors. Moreover, our platform is suitable for broader applications: it can be used to study any malleable individual process including the efforts of a second individual who aims to influence this process. Because of the unique insights that the u-can-act platform is able to generate, the platform may ultimately contribute to solving the mismatch between research and practice, and to more effective interventions in individual processes.

Keywords: early school leaving, ecological momentary assessments, web application, vocational education, motivation, open source

### 1. INTRODUCTION

Each year, many adolescents and young adults leave school early<sup>1</sup>. In Europe alone, 5.5 million individuals left school early in 2012 (European Commission, 2013). Early school leaving is a particularly large problem in vocational education and training (VET), where approximately two thirds of all European early school leaving takes place (Cedefop, 2016). This is alarming, as early school leaving has severe consequences for both the individual and society as a whole. For example, compared to individuals who obtain a starting qualification, early school leavers have a weaker position on the labor market (e.g., a higher risk of unemployment, lower income, more precarious work conditions) and experience poorer health, a lower life expectancy, and less life satisfaction (Cedefop, 2016).

Thus it comes as no surprise that, on the one hand, scientists have spent much effort investigating the causes and consequences of early school leaving and, on the other hand, practitioners have spent much effort trying to prevent early school leaving. However, the efforts of the practitioners are not always optimally informed by science. This sub-optimal information may be due to a mismatch in focus: while social scientists have a long tradition of generating knowledge on the between-individual level (e.g., finding general trends based on data retrieved from groups), practitioners tend to focus on the within-individual level (i.e., individual change processes). This mismatch has two important consequences.

Firstly, our body of between-individual scientific knowledge has facilitated the identification of individuals at risk for early school leaving but has hardly informed prevention strategies. For instance, it has been shown that early school leaving is more likely to occur among males, individuals with a migration background, and individuals with a low socioeconomic status (Rosenthal, 1998). Although this general information is valuable for identifying at-risk individuals, it is of little use in steering a practitioner's interventions, as it is impossible for a practitioner to adjust these factors. Other, more malleable factors have also been demonstrated to be risk factors for dropout, such as problem behavior or negative attitudes toward school (Rumberger and Lim, 2008). Even though a focus on malleable factors is already more useful to the practitioner, merely focusing on malleable factors is still too limited, as reducing risk factors is not the same as promoting graduation and positive youth development (Zaff et al., 2017). In order to perform such promotion, more knowledge is needed on how within-individual processes of positive, malleable factors that are known to promote graduation, such as motivation and engagement (Zaff et al., 2017), can be directly affected by practitioners who work with adolescents.

Secondly, it is fundamentally ill-advised to use between-individual knowledge to inform within-individual processes. Although research on the between-individual level can provide general information about group characteristics, it provides knowledge that is true on average but that might not hold true for any specific individual (e.g., the non-existent average individual; Allport, 1937; Blaauw, 2018). Moreover, between-individual knowledge may obfuscate the relations on the individual level, meaning that findings on the between-individual level may not exist on the within-individual level, and can indeed even be opposite (e.g., Simpson's paradox, Simpson, 1951; Blyth, 1972; the ecological fallacy, Piantadosi et al., 1988; and non-ergodicity, Molenaar, 2004; Hamaker, 2012). These problems with translating between-individual findings to within-individual processes are thought to be relevant for the majority of psychological processes (Molenaar, 2004; Kievit et al., 2013). As such, in order to inform practitioners on the individual processes of early school leaving, and how to steer these in the right direction, within-individual research is a necessity.
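To make the possible reversal of sign concrete, the following Python sketch uses made-up repeated scores on two generic variables for three hypothetical individuals. The data, the variable names, and the helper `pearson` are illustrative assumptions and are not taken from any study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up repeated measurements (x, y) for three hypothetical individuals.
data = {
    "A": ([1, 2, 3], [3, 2, 1]),
    "B": ([4, 5, 6], [6, 5, 4]),
    "C": ([7, 8, 9], [9, 8, 7]),
}

# Within-individual level: one correlation per person.
within = {person: pearson(xs, ys) for person, (xs, ys) in data.items()}

# Group level: pool all observations and ignore who produced them.
pooled_x = [x for xs, _ in data.values() for x in xs]
pooled_y = [y for _, ys in data.values() for y in ys]
pooled = pearson(pooled_x, pooled_y)

print(within)  # negative for every individual
print(pooled)  # positive for the pooled data
```

In this constructed example, individuals with higher average x also have higher average y, so the pooled relation is positive, while within each individual a higher x goes together with a lower y. A conclusion drawn from the pooled data would therefore point in the wrong direction for every single person.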

Fortunately, technological developments have made it increasingly feasible to study within-individual processes. A prominent method to do this is the Ecological Momentary Assessment (EMA) methodology, also known as the experience sampling methodology (ESM), or diary studies (Csikszentmihalyi and Larson, 1987; Shiffman and Stone, 1998). EMA is a methodology widely used in psychopathology research and behavioral research (e.g., Bolger et al., 2003; Trull and Ebner-Priemer, 2009; van der Krieke et al., 2016a). In an EMA study, a participant completes the same questionnaire for a long period of time, possibly multiple times per day, resulting in a large number of measurements of multiple (psychological) variables within one individual. This type of high resolution dataset can provide insight into the processes of the measured variables over time, and the relations between them, within a specific individual. Moreover, the data about an individual can be used to shed light on intra-individual variability, which would be unknown (or assumed non-existent) in a cross-sectional study.
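As a minimal sketch of what such a dataset makes visible, the Python example below summarizes made-up EMA records in long format (one row per participant and measurement occasion). The participant labels, scores, and the function name `per_person_summary` are hypothetical and serve only as an illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical long-format EMA records: (participant, occasion, score).
records = [
    ("p1", 1, 60), ("p1", 2, 40), ("p1", 3, 80), ("p1", 4, 60),
    ("p2", 1, 59), ("p2", 2, 61), ("p2", 3, 60), ("p2", 4, 60),
]


def per_person_summary(rows):
    """Return {participant: (mean, standard deviation)} over occasions."""
    by_person = defaultdict(list)
    for participant, _occasion, score in rows:
        by_person[participant].append(score)
    return {
        participant: (statistics.mean(scores), statistics.stdev(scores))
        for participant, scores in by_person.items()
    }


summary = per_person_summary(records)
for participant, (mean, sd) in summary.items():
    print(f"{participant}: mean = {mean:.1f}, within-person SD = {sd:.2f}")
```

Both hypothetical participants have the same mean score, so a summary based on averages would treat them as identical. Only the repeated measurements reveal that the first participant fluctuates strongly from occasion to occasion while the second is nearly constant.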

In this paper we present the open source EMA platform of the u-can-act research project that we use to study the developmental processes of early school leaving in students, their micro-level interactions with their mentors, and the prevention of early school leaving within individuals. The platform aims to help researchers to effectively study dynamic within-individual processes from multiple informants, even among difficult-to-reach target groups. It does so by providing an automated way of collecting longitudinal questionnaire data and managing the connections between different informants. The platform can be reused and adapted by other researchers because it is fully open source. The platform is shaped by the aims and theoretical foundations of the u-can-act project, which we present in section 2. The platform encompasses three technological innovations that we present in depth in section 3. These concern (i) the development of an open source EMA application, (ii) the development of an automated EMA protocol that aims to maximize adherence, and (iii) an innovative coupled multi-informant setup that enables us to investigate dyadic interactions as dynamic processes over time. We collected data among students and their mentors, described in section 4, and use these data to present findings on the performance of the platform. In particular, we focus on its ability to capture within-individual dynamics, the ease of participation for both mentors and students (including early school leavers), and the usability of the platform (section 5). We conclude that our platform is successful in achieving its aim and provide directions for future studies in section 6.

<sup>1</sup>Early school leaving is defined as individuals aged 18–24 who completed at most lower secondary education (International Standard Classification of Education level 2) and who are not involved in further education or training (European Commission, 2013).

## 2. THEORETICAL FOUNDATIONS AND AIMS OF U-CAN-ACT

The u-can-act platform and its technological innovations (see also section 3) have their current form because of the aims and the underlying theory that we use in the u-can-act project to study early school leaving and its prevention. U-can-act focuses on (i) malleable, dynamic factors that are relevant to early school leaving (section 2.1) and (ii) dynamic within-individual processes and dyadic interactions on a weekly, micro-level timescale (section 2.2). This allows us to clarify processes that precede early school leaving and determine the effects of the mentors' preventative actions on the development of the student, and ultimately, on the students' early school leaving intentions. With this information we aim to inform practitioners on a very practical and detailed level on the actions to take and when to take them, and help policy makers to choose preventative strategies that seem beneficial in reducing early school leaving. We have translated these aims into a theoretical model that reflects our main assumptions (section 2.3). This theoretical model forms the basis of our u-can-act platform.

### 2.1. A Focus On Malleable Factors

We focus specifically on malleable factors that are expected to vary over time within individuals, and that have the potential to not only prevent early school leaving, but to also promote positive development. A central theory we use for this is the self-determination theory. Self-determination is an important aspect of the process of early school leaving, while at the same time it is also an important means to promote positive development and intervene in the process of early school leaving (Vallerand et al., 1997; Zaff et al., 2017). The self-determination theory, as proposed and investigated by Deci and Ryan (2012), is primarily a theory of motivation. It postulates the existence of three basic psychological needs, which are autonomy, relatedness, and competence. The fulfillment of basic psychological needs fosters intrinsic motivation, but has recently also been ascribed a broader function: Deci and Ryan (2012) describe that the fulfillment of these needs is "essential for optimal development and functioning" (p. 417). Indeed, need fulfillment has empirically been related to many indicators of well-being and growth, while the frustration of needs is related to ill-being and maladaptive functioning (Vansteenkiste and Ryan, 2013), and of course, early school leaving (e.g., Hardre and Reeve, 2003; Alivernini and Lucidi, 2011).

The interesting characteristic about psychological needs is that they are changeable and can be supported (Hardre and Reeve, 2003; Ntoumanis, 2005; Mouratidis et al., 2011)—thus they form a particularly interesting source of information for practitioners. In fact, in a Dutch study that investigated fifteen early school leaving prevention and intervention projects it was found that the large variety of approaches could be uniformly characterized as aiming to support the autonomy, competence, and relatedness of the students (Heemskerk et al., 2018).

Besides need fulfillment, we focus on two other malleable variables relevant to early school leaving: engagement and expected success. Engagement is an important, malleable factor in the process of early school leaving (Fredricks et al., 2004) and can be defined in several ways (Nielsen, 2016); we chose to focus on two of these. The first is behavioral engagement, a form of engagement that emphasizes involvement in activities and is considered crucial in attaining positive academic outcomes and preventing dropout (Fredricks et al., 2004). This is perhaps the most commonly studied form of engagement, but it has also been criticized as one-sided and behavioristic (Nielsen, 2016). Therefore we also study emotional engagement, which, besides behavioral engagement, has been found to be an important predictor of early school leaving (Wang and Fredricks, 2014). In addition to engagement, we focus on the expectations that the students have about the academic success that they will obtain during the school year, as such expectations have also turned out to be malleable yet important predictors of persistence and school success (Zaff et al., 2017).

### 2.2. A Focus On Individual Processes On a Micro-Level

Much is still unknown about psychological need fulfillment and engagement as part of within-individual, micro-level processes that may change over a short time span, such as weeks or even days. However, some first steps have been made, for example by van der Kaap-Deeder et al. (2017). They found that a sense of autonomy satisfaction or frustration was directly influenced by daily interactions. Moreover, they found that the social context of these interactions matters, as each of the three social contexts they studied (interactions with mothers, teachers, and siblings) uniquely contributed to whether autonomy satisfaction or frustration was experienced.

Thus, experiences in different contexts have the potential to either fulfill or frustrate psychological needs, and a within-individual approach is necessary to understand the long-term consequences that this may have for early school leaving. For example, Aelterman et al. (2016) propose that need-fulfilling activities result in a pull on the individual, attracting the individual to spend energy on the target activity, while need-thwarting activities push the individual away. Extending this hypothesis, we can imagine that in some individual cases need fulfillment may in fact increase the chance of dropout: when individuals spend all their time outside of school because of the need-fulfilling context there, their engagement with school may decrease and dropout may eventually follow. Such a hypothetical process contradicts the common group-level finding that need fulfillment is generally beneficial (Vansteenkiste and Ryan, 2013) and remains unexplored in studies so far because of their inter-individual focus (see also section 1). We can only gain insight into the existence of such hypothetical individual processes by taking a within-individual approach.

Moreover, a micro-level, within-individual approach is necessary in order to learn more about the role that mentors play in influencing students' development and preventing dropout. Individual guidance has proven to be quite effective in preventing early school leaving in many independent intervention and prevention programs (Heemskerk et al., 2018), but much is still unknown about the ingredients of such guidance. Which concrete actions do mentors take in their guidance of adolescents, and what goals do they strive for? How effective are they in supporting the basic psychological needs of their students from week to week? Such questions can only be answered by studying the within-individual guidance processes of mentors and the micro-level interactions between students and mentors.

### 2.3. The Theoretical Foundations of the u-can-act Platform

We placed the malleable factors relevant for early school leaving in a hypothetical model that reflects our within-individual process approach (see **Figure 1**) and have used this model as the foundation for the u-can-act platform. The interplay between the student and different contexts is at the heart of our theoretical model. Indeed, our within-individual approach has led us to hypothesize that the interplay of need fulfillment inside and outside of the school is an important process underlying early school leaving, while it is at the same time a process that a mentor can potentially influence in order to prevent early school leaving. Because we are particularly interested in informing mentors on what they can do to help prevent dropout, the student-mentor interaction is central in our model.

**Figure 1** schematically represents the hypothetical model. It reflects the main theoretical assumptions that have driven the development and innovations of our dual-informant EMA platform, and includes the measures that we have employed. These measures cover different aspects of individuals' experience, mental state, and behavior that are hypothesized to be relevant for the process of early school leaving and interventions in this process. Perhaps the most important assumption that is reflected by this theoretical model is that students continuously interact with several environments, including a school environment, other environments (non-school, such as the home environment), and their mentor. We included the students' experiences of events and need fulfillment in both school and non-school contexts, as well as experiences of mentor need support and quality of the guidance they receive. We operationalize the students' mental state as emotional engagement, current school success expectations, and well-being. We measure the students' behavior by assessing the amount of time they spent on school activities and how open they have been with their mentor. Similar to the students, mentors have experiences, mental states, and behavior as well, which we operationalized solely with variables relevant to the student-mentor interaction. We assume that the mentors can experience various degrees of satisfaction in their interaction with the student. As a mental state, they can also have various degrees of intuitiveness when performing their actions (as opposed to performing planned actions), and have certain goals they want to achieve. Ideally, their goals are reflected in their actions, but also in their support of students' needs and in their time investments in the student. This mentor behavior can be perceived by the student in the quality of the guidance and in the support he or she feels in need fulfillment, with which the student-mentor interaction cycle has come full circle.

To test the relations and processes in our hypothetical model we needed a suitable measurement instrument that met at least three requirements. First and foremost, the instrument needed to repeatedly measure individuals over a period of time in order to gain insight into the within-individual dynamic processes of early school leaving. Secondly, the instrument needed to optimally facilitate easy participation, in order to gather enough data. After all, the processes of motivation that could underlie early school leaving might also influence the motivation of students to partake in this study. Thirdly, the instrument needed to be able to collect measurements for both students and their mentors in order to gain insight into their interaction and into the actions that mentors can take in order to prevent early school leaving. For this, a coupling between the two measurements was necessary. Because there were no applications available that met these requirements, we set out to develop such an application: the u-can-act platform.

### 3. THE U-CAN-ACT PLATFORM

We developed a platform that allows for studying within-individual processes and dyadic interactions within an intervention setting, from a multi-informant perspective. The platform is rooted in three technological innovations.

The first innovation, and the foundation of our data collection, is the development of a web application that applies a fully automated method for scheduling, sending invitations, and hosting EMA questionnaires. This free and open source application provides participants with a web interface to fill out weekly questionnaires. Our second innovation is a study protocol that optimizes participant adherence among a difficult target group, which includes an elaborate reward system and messaging that is automatically adapted to the participation behavior of each individual participant. The third innovation is the development of a multi-informant EMA questionnaire that allows us to study the process of early school leaving and the preventative actions in this process from both the student and mentor perspective. The technology behind our platform handles the multi-informant aspect of our study by automatically coupling the mentors to their students. We will introduce the three innovations in more detail below.

The three innovations are all integrated in one open source software package, which was developed by Emerencia et al. (2017) and is freely available at http://u-can-act.com.

### 3.1. Innovation 1: An Open Source Web-Application

Our first innovation is perhaps most fundamental to our approach: an open source web application that measures the developmental processes of students and their micro-level interactions with their mentors. The application schedules and sends out questionnaire invites automatically, and stores the data inside two separate and secure databases (one containing personal data and one containing the answers to the questionnaires). Screenshots of the application can be found in the Supplementary Material in **Figure S1**.

A schematic overview of the technological infrastructure of the u-can-act platform is provided in **Figure 2**. The platform serves its content by means of a web application implemented in the Ruby on Rails framework. Ruby on Rails is an open source framework that provides a default structure for web applications<sup>2</sup>. In order for other researchers, schools, and agencies to be free to use and adapt its implementation, we released u-can-act as MIT-licensed<sup>3</sup> open source software on https://u-can-act.com. The implementation of u-can-act builds upon our experience in designing architectures for web-based questionnaire platforms, such as the implementation of the HowNutsAreTheDutch web application (Blaauw and Emerencia, 2015; van der Krieke et al., 2016b).

The collected data are stored in two separate databases: one database that holds the questionnaire data, and one database that contains all personal data. The latter database keeps track of the completed questionnaires by storing a reference to the actual questionnaire data. This separation ensures anonymity in case of a breach in one of these databases. The personal information is stored in a relational SQL database, PostgreSQL. The questionnaire data are stored in a MongoDB NoSQL database. The rationale behind the choice for MongoDB is that it provides schemaless document storage, which fits well with storing different types of questionnaire data. Finally, we use a third database, a Redis NoSQL database, that contains the aggregated/analytical data for caching purposes. Data stored in this database are considered volatile, and are mostly used on the researcher dashboard to provide researchers with general statistics about the questionnaire completion percentages and rewards collected. Without this cache, these data would need to be calculated in real time, which would negatively influence the performance of the application.
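
The separation of personal data and questionnaire answers can be illustrated with a minimal sketch. It uses two in-memory dictionaries as stand-ins for the PostgreSQL and MongoDB databases; the function name, the field names, and the use of a random reference are assumptions made for illustration and are not taken from the u-can-act code base.

```python
import uuid

# In-memory stand-ins (assumptions): the platform itself uses PostgreSQL
# for personal data and MongoDB for questionnaire answers.
personal_db = {}
response_db = {}


def store_response(person_id, answers):
    """Store answers without personal data and keep only a reference to them."""
    reference = uuid.uuid4().hex  # opaque key, carries no personal information
    response_db[reference] = dict(answers)
    person = personal_db.setdefault(person_id, {"completed": []})
    person["completed"].append(reference)
    return reference


# Hypothetical participant identifier and item names.
ref = store_response("student-1", {"v1": 63, "v2": 48})
```

In this sketch, a leak of `response_db` alone reveals answers without identities, while a leak of `personal_db` alone reveals identities without answers.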

The traffic to the web application is protected using a 2048-bit RSA (SHA-256) TLS 1.2 Secure Socket Layer (SSL) connection, which ensures private data exchange to and from the u-can-act web application. The interactions with the underlying database infrastructure are protected using SSL as well. Filling out a questionnaire is only possible via the link sent to the participant in a text message sent to their phone or in an email message. These links contain a user identifier and a token. The tokens are stored as hashes created with the bcrypt password-hashing function, which makes it practically impossible to retrieve the clear-text token from its hashed counterpart.
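
The idea of storing only a one-way hash of each link token can be sketched as follows. This is not the platform's implementation: because bcrypt is not part of Python's standard library, the sketch swaps in PBKDF2 (`hashlib.pbkdf2_hmac`) as a stand-in, and the function names and the iteration count are assumptions chosen for illustration.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # illustrative work factor, not taken from the platform


def issue_token():
    """Create a random link token and the salted hash that would be stored."""
    token = secrets.token_urlsafe(16)
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", token.encode(), salt, ITERATIONS)
    return token, (salt, digest)


def verify_token(candidate, stored):
    """Check a token from an incoming link against the stored salted hash."""
    salt, digest = stored
    attempt = hashlib.pbkdf2_hmac("sha256", candidate.encode(), salt, ITERATIONS)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(attempt, digest)


token, stored = issue_token()
```

Only `stored` would be kept on the server; the clear-text `token` exists solely in the link sent to the participant.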

The platform is built as generic, reusable software, such that other research projects could reuse the platform. Areas in which this could be of interest are, for example, psychiatry (e.g., HowNutsAreTheDutch and Leefplezier; Blaauw et al., 2014a,b; van der Krieke et al., 2016a), general health (Nair et al., 2016), pain monitoring (Stone et al., 2003), substance abuse (Shiffman, 2009), and many other fields that benefit from intra-individual measurements.

The u-can-act application automatically schedules questionnaires and invitations for each participant in the system. During the initial setup phase, the u-can-act application is initialized with a definition of the protocol that contains the collection of measurements, the interval at which invitations should be sent, and the actual questionnaire items that need to be completed. Subsequently, all participants can be subscribed to their protocols with any given start and end date. The u-can-act application automatically invites them to complete their questionnaire at the given interval by means of a text message or email.
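
How a protocol definition with a fixed interval translates into invitation moments can be sketched as below. The dictionary layout, the questionnaire name, and the function name are hypothetical; the sketch only illustrates the principle of deriving a schedule from a start date, an end date, and an interval.

```python
from datetime import datetime, timedelta


def invitation_times(start, end, interval):
    """Return all invitation moments from start up to and including end."""
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += interval
    return times


# Hypothetical protocol definition: one questionnaire, invited weekly.
protocol = {"questionnaire": "student_diary", "interval": timedelta(weeks=1)}

start = datetime(2017, 11, 9, 12, 0)  # a Thursday at noon
end = start + timedelta(weeks=3)
schedule = invitation_times(start, end, protocol["interval"])
```

With these example dates, `schedule` holds four weekly invitation moments, all on a Thursday at noon.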

<sup>2</sup>Website: https://rubyonrails.org

<sup>3</sup>Website: https://opensource.org/licenses/MIT

### 3.2. Innovation 2: Optimizing Adherence and Study Load

Most students included in this study have a high risk of early school leaving, which might also be a risk for their participation behavior in the u-can-act study. Hence, optimizing the study adherence and minimizing the study load has been an important priority for u-can-act. As such, we performed three adherence-optimization steps, which were largely informed by our initial pilot study.

Firstly, we determined an EMA schedule that would work optimally for our sample. From our pilot study, we concluded that the optimal measurement interval is once a week for both the students and the mentors. The reasons for selecting this measurement interval are threefold: (i) this interval coincides well with the frequency of the meetings between student and mentor, (ii) this measurement interval did not significantly reduce the variance in the items compared to more frequent intervals that we also included in the pilot study, and (iii) the evaluation results showed that participants expected this study interval to be most sustainable.

Secondly, in our pilot study we performed interviews, observational studies, and a detailed analysis of each questionnaire question to optimize the users' experience and minimize time-investment while using the application. We optimized the questionnaire questions that scored lowest on understandability and incorporated many qualitative recommendations to increase the usability of the app. This involved, for example, reformulating questions to ask about concrete categories (instead of free text, broader categories, or actions), and providing more information about the meaning, context, and purpose of questions.

Thirdly, and perhaps most importantly, the qualitative data from the pilot study and brainstorm sessions within both our team and one of the involved guidance agencies informed our intrinsic and extrinsic motivational strategies, which we will describe in more detail below.

#### 3.2.1. Fostering Intrinsic Motivation: Personalization

The students receive one SMS text message per week for approximately 35 weeks during the study to inform them that the questionnaire is available for them to fill out. The text messages are framed in a positive way, emphasizing the value of their contribution for their mentor and the research project. The contents of the text messages were dynamically constructed and personalized for each user, taking into account the participation figures (see **Figure S3** in the Supplementary Material for an overview of the invite message and personalization procedure). The rationale behind sending different and personalized text messages was that both the variability of the message text and its personalization potentially have a motivating effect on actually filling out the questionnaire (Heerwegh et al., 2005; Muñoz-Leiva et al., 2010).

A second personalization step was performed in the questionnaires themselves. U-can-act uses a system that can automatically tailor questionnaires toward the individual. This means that certain variables are replaced with values relevant to the participant. For example, "your mentor" would be changed to the actual name of the mentor. The options that were available for personalization were: (i) the name of the mentor, (ii) the name of the student, (iii) the gender of the student (different forms), and (iv) the name of the supervisory agency they were affiliated with.
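
Such variable replacement can be illustrated with a short sketch. The `{{...}}` placeholder syntax, the variable names, and the example item are assumptions for illustration; the source does not specify how the platform marks variables in questionnaire text.

```python
import re


def personalize(text, values):
    """Replace {{name}} placeholders with participant-specific values."""
    def substitute(match):
        key = match.group(1)
        # Leave unknown placeholders untouched rather than guessing a value.
        return values.get(key, match.group(0))

    return re.sub(r"\{\{(\w+)\}\}", substitute, text)


# Hypothetical questionnaire item and participant values.
item = "How well did {{mentor_name}} support you this week?"
print(personalize(item, {"mentor_name": "Sam"}))
# prints: How well did Sam support you this week?
```

Leaving unknown placeholders in place makes a missing value visible during testing instead of silently producing an empty gap in the question.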

#### 3.2.2. Fostering Extrinsic Motivation: Monetary Rewards for Students

After the EMA study was completed, students received a monetary reward that reflected the number of questionnaires they had completed. They received a two Euro reward for each questionnaire they completed. If students completed three questionnaires consecutively, they were awarded a so-called "bonus Euro," which was an additional one Euro reward on top of the two Euro reward. This bonus Euro is an example of gamification and aims to motivate the students to complete longer questionnaire series and not leave many gaps, which can be troublesome for certain analyses. The bonus Euro was awarded for each completed questionnaire until one questionnaire was missed, after which the students again needed to complete three consecutive questionnaires. After each completed questionnaire a reward page was displayed to the students. On this page they could see the monetary rewards that they had already earned, the rewards that were still earnable, their progress toward the end goal (the maximum amount of reward), and their bonus streak. All this was displayed using a playful design; see **Figure S1** in the Supplementary Material for a visualization.

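The reward rules can be made concrete with a small sketch. One detail is an assumption: the text does not state whether the bonus Euro is first paid for the third consecutive questionnaire or for the one after it, and the sketch assumes the former. The function name and input format are likewise hypothetical.

```python
def total_reward(completed):
    """Total reward in Euro for a sequence of weekly questionnaires.

    completed: list of booleans in chronological order, True when the
    questionnaire of that week was filled out.
    """
    total = 0
    streak = 0
    for done in completed:
        if not done:
            streak = 0  # a missed questionnaire resets the bonus streak
            continue
        streak += 1
        total += 2  # base reward per completed questionnaire
        if streak >= 3:  # assumption: bonus starts at the third in a row
            total += 1
    return total


print(total_reward([True, True, True, True, False, True]))  # prints 12
```

In the example, five completed questionnaires yield ten Euro, and the third and fourth consecutive questionnaires each add a bonus Euro; the questionnaire after the missed week starts a new streak and earns no bonus.
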
### 3.3. Innovation 3: Multi-Informant EMA to Study Students, Mentors, and Their Interactions

The u-can-act platform maps out the process of early school leaving and preventative actions in this process from two perspectives: students and mentors. On the one hand, u-can-act collects weekly data about students and their own experiences. On the other hand, the platform takes the perspective of the mentors into account, by asking them to complete questionnaires for each of the students that they supervise. The database is set up in such a way that an automatic coupling is made between each student and their mentor, which enables us to study the interactions between them. Moreover, this coupling helps foster personalization (see also section 3.2), as for example, students see their mentor's name when answering questions about the quality of his or her supervision. We provide more detail on the data collection among students and mentors in sections 4.2 and 4.3.

### 4. METHODS FOR EVALUATING THE PLATFORM

We collected empirical data among students and mentors during the u-can-act project that we use to evaluate the performance of the u-can-act platform and its three innovations. For this evaluation, we check whether the platform meets three requirements (see also section 2.3) that we believe are essential in order to measure within-individual dynamic processes among adolescents and the interactions with their mentors: (i) dynamicity of the measures, (ii) easy participation, and (iii) good user experience. We describe our data-collection protocol and measurement instruments for both the student as well as the mentor study and give a brief description of our methods for analyzing the performance of the platform.

### 4.1. Ethics

The u-can-act research protocol was assessed and approved by the ethical committee of the University of Groningen under code 16351-O. All participants provided their informed consent online. No explicit informed consent was collected from the parents/legal guardians of non-adult participants, as all participants were above the age of sixteen.

### 4.2. Student Study

The first students were enrolled in the student study on November 6, 2017. Students and mentors joined the study at six moments (for an overview, see **Figure 3**). The participating students were all enrolled in secondary vocational education in three locations spread throughout the Netherlands. They belonged to one of two subgroups: an at-risk subgroup or a control subgroup. The students in the at-risk subgroup were considered to be at risk of early school leaving by their own educational institution, for example because their grades were low, they attended few classes, experienced stressful situations at home, or showed disruptive behavior in class. Because of this elevated risk, these students were signed up for extra individual guidance. The individual guidance was supplied by mentors from three different supervision agencies (more on this in section 4.3). We approached the students through their mentors: we first asked the mentors to participate, who then asked their students to participate. The students in the control subgroup did not have a mentor, as they were not considered to be at risk for early school leaving, and were approached directly.

The student study comprises three main questionnaires: (i) a general assessment, (ii) an EMA questionnaire, and (iii) a post-assessment. The general assessment collected information about the students' demographics and living situation. The EMA questionnaire collected information on variables that could fluctuate over time and are hypothesized to potentially underlie early school leaving (i.e., autonomy, competence, and relatedness). The post-assessment collected information about their current educational situation, such as whether they were still enrolled in their educational track, and whether they intend to complete the track.

#### 4.2.1. Procedure

In order to participate in the study, a student had to be subscribed to the u-can-act platform and provide online informed consent. The control subgroup students were randomly selected from one educational institution in the Northern part of the Netherlands. In collaboration with this educational institution we sampled several students who were considered not to be at risk of early school leaving, and who had not received additional supervision from within their educational institution to help them with school or private problems. If they agreed to participate and accepted the informed consent, they were enrolled in the study.

All students in this study followed the same assessment protocol. Near the start of the EMA study, students were asked to complete a required general assessment questionnaire. Then, for approximately 35 weeks (or until the beginning of the summer holiday period, whichever was shorter), they received a personalized text message each Thursday at noon, in which they were requested to fill out a questionnaire. Each text message contained a link to the u-can-act web application that provided access to the questionnaire they had to fill out. The application automatically sent a reminder text message 8 h later in case a student did not complete the questionnaire before that time. Questionnaires were available for 30 h after the initial invitation. To facilitate early stopping from the study, students were presented with a button with which they could unsubscribe from the study on June 28, 2018. This button presented them with the question whether their summer holiday had already started, and if it had, they could end their subscription, after which they would receive a final, post-assessment questionnaire.
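
The timing rules of the protocol (reminder after 8 h, availability for 30 h) can be sketched as a small state function. The state names and the function are hypothetical, and treating the questionnaire as closed exactly at the 30 h mark is an assumption.

```python
from datetime import datetime, timedelta

REMINDER_AFTER = timedelta(hours=8)  # reminder delay stated in the protocol
OPEN_FOR = timedelta(hours=30)       # availability stated in the protocol


def questionnaire_state(invited_at, now, completed):
    """Classify a questionnaire at time `now`, given its invitation moment."""
    if completed:
        return "completed"
    if now >= invited_at + OPEN_FOR:
        return "expired"  # assumption: closed from exactly 30 h onward
    if now >= invited_at + REMINDER_AFTER:
        return "reminded"  # still open, reminder has been due
    return "open"


invited = datetime(2018, 1, 11, 12, 0)  # a Thursday at noon
state_next_morning = questionnaire_state(
    invited, invited + timedelta(hours=20), completed=False
)
```

For an invitation on Thursday at noon, the reminder falls on Thursday at 20:00 and the questionnaire closes on Friday at 18:00, so `state_next_morning` is `"reminded"`.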

#### 4.2.2. Student Questionnaire Items

The general assessment consisted of nine questions, with which we collect data about (i) birth year, (ii) nationality, (iii) relationship status, (iv) whether or not they had children (and how many), (v) the name of the school they attend, (vi) the type of education they follow, (vii) the level of education, (viii) how many years of education they followed thus far, and (ix) what the students did before starting their current studies. The full list of questions and corresponding answer options is presented in the Supplementary Material in **Table S1**. We collected data about gender during the sign-up process, along with first name, last name, and mobile phone number.

The weekly student EMA questionnaire consisted of twenty-five questions. These questions were in most cases newly created for the purpose of this study, or adapted from previous questionnaires. All questionnaire items are described in **Table S2** in the Supplementary Material. The questionnaire items were selected to assess experienced autonomy, competence, and relatedness in three contexts (school, outside-of-school, and mentor relationship); behavioral and emotional engagement with school; school success expectations; evaluations of their mentors' actions; their general level of well-being; and the general valence of their school experiences. An interactive example of the web application can be found online<sup>4</sup>. Note that for the control group, all questions related to supervision of a mentor were removed as they were not applicable (questions 18–24).

The visual design of the questionnaire is composed of three different question options: (i) visual analog scales (VAS), (ii) radio buttons, and (iii) checkboxes. Each of the VAS scales provides a continuous value ranging from 0 to 100, and displays a small indicator showing the selected number. The default value of the VAS scale was set to 50 (the center of the scale), and the extremes of the scale had appropriate labels (e.g., "not at all" to "very much," see **Table S2**, "Response range"). The checkboxes and radio buttons were used to create multiple choice questions of which, respectively, multiple or only a single answer could be selected. In some cases, the radio questions had an option which allowed for the input of free text.

The post-assessment questionnaire consisted of at least 11 and at most 14 items (depending on the answers to the questions). The questionnaire focused on (i) whether the student dropped out or not (and when), (ii) the average grade of the student, (iii) if the students dropped out we asked whether they would start a new study and if the students persisted, how certain they are that they will complete this study, (iv) their average grade, (v) the quality of the supervision of the mentor, and (vi) some general questions related to the evaluation of the web application. The full questionnaire is provided in **Table S3** in the Supplementary Material.

### 4.3. Mentor Study

The mentor study started on the same date as the student study, November 6, 2017 (see **Figure 3** for more information), and consisted of three self-report questionnaires: a general assessment, a post-assessment, and a series of EMA questionnaires about the students that they supervise. Each mentor completed diary questionnaires about their mentoring of each of their students separately. As such, the mentors essentially participated in several parallel EMA studies, one for each of their students.

#### 4.3.1. Procedure

The enrollment procedure for mentors was similar to the student enrollment procedure, although mentors could only participate when they were actively involved in the supervision of one or more students. We asked the mentors to provide some general, personal information in a general assessment questionnaire. This general assessment questionnaire consisted of four questions concerning (i) education level, (ii) year of birth, (iii) years of experience in supervising students, and (iv) nationality. The questionnaire and its items are listed in **Table S4** in the Supplementary Material. Note that the gender of each participant was already known upon sign-up.

<sup>4</sup>Website: https://app.u-can-act.nl/dummy/student

Similar to the student study, mentors received a weekly text message on Thursday around noon. In addition to the text message, mentors also received an email. Both the text message and email contained an invitation text and a link to their mentor dashboards, which provided each mentor with an overview of the questionnaires they had completed for the students that week and the adherence of each of their students by means of a heat map. This information could be used by the mentors to intervene if too many measurements were missed by a student. An illustration of this dashboard is provided in **Figure S2** in the Supplementary Material. The mentors explicitly did not have access to the actual questionnaire data provided by the students, in order to provide anonymity to the students.

An interactive version of the mentor dashboard and mentor questionnaire is available online<sup>5</sup>. The system automatically reminded the mentors via email and text message to fill out all the questionnaires if they had not done so 8 h after the initial invite.

At the end of the study, or after the mentors clicked a button telling the system that their holiday had started, a mentor received a post-assessment questionnaire. The post-assessment had a dynamic number of items, depending on the number of students that they supervised. It contained six questions related to their experience with supervising students, general questions related to the web application, and one question for each of the students they supervised, asking whether and by how much the student has improved during the supervision phase. The full list of questionnaire items for the post-assessment questionnaire is provided in **Table S7** in the Supplementary Material.

#### 4.3.2. Mentor Questionnaire Items

The mentor EMA questionnaire was constructed in a bottom-up fashion: we designed the instrument in several brainstorm and focus-group sessions with one of the mentoring initiatives. An important outcome of these sessions was a categorization of the actions and goals that mentors frequently take in their guidance of students. In this way we aimed to measure variables that are highly relevant to the mentoring process. All mentor questionnaire items are listed in **Table S6** in the Supplementary Material.

The mentor questionnaire was different from the student questionnaire in the sense that it was partly dynamic based on the needs of the mentor and could consist of a varying number of questions. By default, the questionnaire contained 24 questions, which could dynamically be extended to a maximum of 69 questions depending on the information a mentor wanted to provide. The dynamic part of this questionnaire resides in its third question, which reads "Add another action (or series of actions)." This question allowed the mentors to add up to ten new action clusters (see **Table S6** in the Supplementary Material) to record actions they had performed for the current student.

### 4.4. Analysis of the Platform

We show whether our platform indeed captures within-individual dynamics by calculating the root mean squared successive difference (RMSSD) for each of the questions in the separate questionnaires. The RMSSD is a measure of instability, and provides insight into the fluctuation of a variable over time (von Neumann et al., 1941). Fluctuation or variability is important for questions to be meaningful in an EMA (if a question does not fluctuate, there is no value in repeatedly collecting it) and is necessary to capture in order to gain more insight into within-individual processes over time. We calculate the RMSSD of each of the continuous variables for each participant separately and then report the average.
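
The RMSSD computation can be sketched as follows, using the standard definition: the square root of the mean of the squared differences between successive measurements. The function names are hypothetical, and the sketch assumes that each series contains only the observed values of one variable for one participant in chronological order; how missed weeks were handled in the actual analysis is not specified here.

```python
import math


def rmssd(series):
    """Root mean squared successive difference of one participant's series."""
    if len(series) < 2:
        return None  # undefined with fewer than two measurements
    diffs = [later - earlier for earlier, later in zip(series, series[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def mean_rmssd(series_by_participant):
    """Average the per-participant RMSSD values, skipping undefined ones."""
    values = [rmssd(s) for s in series_by_participant.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


# Hypothetical VAS scores (0 to 100) for one item.
example = {"a": [50, 60, 40], "b": [50, 50, 50], "c": [70]}
average = mean_rmssd(example)
```

In this example, participant `a` has an RMSSD of approximately 15.81, participant `b` has an RMSSD of 0 because the answers never change, and participant `c` is skipped because a single measurement has no successive difference.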

Next, we provide insight into the ease of participation by firstly providing a detailed overview of the adherence to the study over time for all the followed subgroups. We zoom into the adherence among students who dropped out of their educational trajectory. Secondly, we show how long it takes to fill in the questionnaire. We have implemented a questionnaire system that records the difference in time between subsequent questions in the questionnaire that allows us to do so. These timings provide a general insight into the ease of answering questions, and into which questions take more time than others and might be candidates for revision in future research.
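
Deriving per-question timings from recorded answer moments can be sketched as below. The event format (question identifier plus seconds since the page was opened) and the function name are assumptions for illustration; the source only states that the time between subsequent questions is recorded.

```python
def question_durations(opened_at, answer_events):
    """Time spent per question, as the gap to the previous recorded moment.

    opened_at: moment the questionnaire was opened, in seconds.
    answer_events: list of (question_id, answered_at) in answering order.
    """
    durations = {}
    previous = opened_at
    for question_id, answered_at in answer_events:
        durations[question_id] = answered_at - previous
        previous = answered_at
    return durations


# Hypothetical answer moments in seconds after opening the questionnaire.
events = [("v1", 4.5), ("v2", 7.0), ("v3", 15.0)]
timings = question_durations(0.0, events)
slowest = max(timings, key=timings.get)
```

Here `v3` takes 8 seconds against 4.5 and 2.5 seconds for the other two questions, which would mark it as a candidate for revision.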

Finally, we give some preliminary insight into whether our platform is able to take successful measurements among both students and mentors. We do so by reporting on how our participants have experienced the use of our platform using quality indicators from the post-assessment. Specifically, we asked all participants to grade the application on a scale from 1 to 10 (in steps of 0.5). Furthermore, we asked how difficult they found it to keep participating in the study on a scale from 0 to 100, where 0 denotes that it was very difficult to participate and 100 denotes that it was easy to participate.

### 5. RESULTS

Before we evaluate the platform we first provide some characteristics of our sample. We then evaluate the performance by demonstrating the dynamics of the items, the ease of participation and the user experience of the platform.

### 5.1. Sample Characteristics

On July 27, 2018, the data collection in the u-can-act project was completed. The data set comprises a total of 40 mentors from three supervisory agencies that participated in u-can-act, and 181 students, of whom 50 are in the control group. We excluded one participant from the dataset because of seemingly unrealistic answer patterns; this individual had left all the sliders on their default value, without manually placing them there. Moreover, we found that individuals with older browsers did not see some questions (hidden questions that were toggled by other questions). This error occurred in 4.9% of the data, which were also excluded. The application was fixed to resolve this error in future studies.

The mean age of the mentors was 33.09 years (median = 28, range 20–49, standard deviation [SD] = 12.62) and 67.44% were women. The mentors had on average 4.46 years of experience (median = 2, range 0–25, SD = 5.96). Most mentors (95.35%) had the Dutch nationality. 83.33% of the mentors had at least finished intermediate vocational education.

<sup>5</sup>Website: https://app.u-can-act.nl/dummy/mentor

In the at-risk student sample, the mean age was 20.59 years (median = 20, range 16–33, SD = 2.63) and 54.74% were women. The control student sample had similar characteristics: its members were on average 19.17 years old (median = 19, range 17–25, SD = 1.92) and it contained slightly more women (66%). The students started their current study after: high school (at-risk: 39.2%, control: 70.83%), another secondary vocational education trajectory (at-risk: 45.6%, control: 20.83%), working (at-risk: 4.8%, control: 6.25%), or something else (at-risk: 10.4%, control: 2.08%).

There were 17 students who dropped out of their educational trajectory during our study. All of the dropouts were in the at-risk group and none were in the control group. Of these dropouts, 11 left the educational system entirely ("school leavers"), while 6 students had plans to switch to a new educational trajectory ("switchers"). Most of the students, particularly the switchers, dropped out near the end of the academic year in the Netherlands (which coincided with the end of our measurements). For an overview of the dropout moments see **Figure 4**.

### 5.2. Dynamics of the Questionnaire Items

We calculated RMSSDs to investigate whether our instrument was capable of capturing the dynamics of within-individual processes. Over all groups and items, the average RMSSD was 16.22 (median = 13.72, range 7.85–68.47). We also calculated RMSSDs for each of the items separately; these are listed in the tables describing the questionnaire items (**Table S5** for the mentor questionnaire and **Table S2** for the student at-risk and control questionnaire). The RMSSDs indicate that most items showed, on average, reasonable variation, and that none of the questions had drastically more variation than the others. The outlier of question 4 in **Table S5** can be attributed to the fact that this question asked for the time spent on the supervision of a student, and is therefore scaled differently than the other questions, which range from 0 to 100.

### 5.3. Ease of Participation

Ease of participation was measured by both global adherence numbers (i.e., the number of filled out questionnaires) and the time it took for each questionnaire to be completed.

#### 5.3.1. Adherence to the Study Protocol

Across all agencies, the participants completed a total of 6659 assessments. On average, each at-risk student who started the diary study<sup>6</sup> completed approximately 68.25% of their possible diary questionnaires. The control group completed approximately 83% of their possible diary questionnaires. The completion rate of the mentors was lower at 52.28%.
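The completion rates reported above can be read as per-participant percentages (completed questionnaires divided by possible questionnaires) that are then averaged within each group. The sketch below illustrates that reading; whether the study averaged per participant or pooled all questionnaires is an assumption here, and the function names and example counts are hypothetical.

```python
from statistics import mean


def completion_rate(completed, possible):
    """Percentage of the possible questionnaires that one participant completed."""
    if possible <= 0:
        raise ValueError("a participant needs at least one possible questionnaire")
    return 100.0 * completed / possible


def group_adherence(records):
    """Average per-participant completion rate for each group.

    records: iterable of (group, completed, possible) tuples.
    """
    rates_by_group = {}
    for group, completed, possible in records:
        rates_by_group.setdefault(group, []).append(
            completion_rate(completed, possible)
        )
    return {group: mean(rates) for group, rates in rates_by_group.items()}


# Hypothetical participants with 30 possible questionnaires each.
records = [
    ("at-risk", 20, 30),
    ("at-risk", 15, 30),
    ("control", 25, 30),
]
adherence = group_adherence(records)
```

Averaging per participant weights every participant equally, whereas pooling would give more weight to participants with more possible questionnaires.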

The adherence to the study over time is depicted in **Figure 5**. Here, **Figure 5A** shows the general (normalized) adherence to the study over time, in terms of completed questionnaires per group (control students, mentors, and at-risk students). Participation dropped rapidly after week 25, probably because we provided the participants with the option to finish their participation and fill out the post-assessment questionnaire, as summer vacations started for many of them. In **Figure 5B**, we show the distribution of the percentage of completed questionnaires for each of the subgroups. It is interesting to note that most of the at-risk students and the students in the control group completed at least 90% or even 100% of the questionnaires, while only a small part of the mentors showed such consistent adherence.

Additionally, we zoomed in on the adherence behavior of the students who dropped out of their educational trajectory (see also **Figure 6**). We can see two distinct patterns for the two types of dropouts: the school leavers and the switchers. School leavers show a completion percentage over time that is similar to that of the larger at-risk student group, although, perhaps surprisingly, they seem to complete more questionnaires than average in the beginning of the year. It is also interesting to see that the majority of school leavers tend to keep participating in our study even after the moment of school dropout. This is different for the switchers: they participate a little less than the larger at-risk group in the beginning of the year, and their participation in our study declines sharply in the 15 weeks before the school dropout moment.

#### 5.3.2. Questionnaire Completion Times

We investigated the time it takes to complete each question, and the questionnaire as a whole. The average completion time for each of the questions for both students and mentors is shown in **Tables S2**, **S5**. **Figure 7A** shows the distribution of completion times as measured over the whole study (i.e., the time it takes to fill out a questionnaire). Very often (in 93.65% of the cases), the questionnaire was completed within 5 min. Since there is a bimodality in the completion times, we calculated the mode for both peaks in the histogram. The first mode is 7 s, which can be explained by a mentor answering that he or she had not seen the student that week. The second mode in the data is 67 s. Mentors and at-risk students had similar completion times, while the control group generally spent less time on the questionnaire. This can be explained by the fact that the control group usually had a questionnaire with fewer questions (control = 19 vs. at-risk = 25, see also section 4.2). **Figure 7B** shows how the time to complete a questionnaire fluctuates over time. There is a steep decline in completion time in the first 2 to 7 weeks, perhaps indicative of a learning curve. After this the completion times become more stable, although they do mildly and gradually decline even further.

### 5.4. User Experience

As part of the post-questionnaire, we asked both of the student groups and the mentor group to evaluate the u-can-act platform. This questionnaire was completed by 59 at-risk students, 26 control students, and 12 mentors. The control group graded the platform high with an 8 (median = 8.5, range = 5–10, SD = 1.27), as did the at-risk students with a 7.84 (median = 8, range = 3–10, SD = 1.4) and the mentors with a 7.08 (median = 7, range = 5.5–9, SD = 0.93). The control group judged it to be easy to adhere to the protocol with a mean score of 79.42 (median = 83, range = 0–100, SD = 24.34), as did the at-risk students with a mean score of 73.27 (median = 81, range = 0–100, SD = 28.58), and the mentors found it more difficult than the students with a mean score of 45.67 (median = 45, range = 23–78, SD = 16.93).

<sup>6</sup>Thus the participants that provided informed consent.

(B) Cross-sectional questionnaire completion in percentages.

### 6. DISCUSSION

The platform that we have developed within u-can-act seems to be successful in collecting multi-informant and dynamic time-series data on within-individual processes among students in vocational education (both regular students and students at risk for early school leaving) and their mentors. This is firstly evidenced by the satisfactory results of the dynamics of our EMA items: their sufficient fluctuation over time, which was on average 16.22 (in terms of RMSSD). The success of our set of innovations is furthermore evidenced by the high participant adherence among a presumably difficult target group: 68.25%. For the control group, the adherence was even higher (83%), which underlines how difficult the at-risk subgroup is to engage, while adherence was lowest among the mentors (52.28%). Interestingly, among students who dropped out of their educational trajectory, the school leavers in our sample participated at the same level as or even more than the at-risk group, while the switchers participated less. In addition, the questionnaire items took a relatively short time to answer, which was generally less than 5 min for the whole questionnaire. Moreover, the participants were satisfied with the user experience of the app, and indicated that it was easy to adhere to the protocol for an extended period of time, although the mentors experienced more difficulties in this. We will argue that all our (technological) innovations have contributed to these successes.

First of all, the development of the web-based platform and its innovations was essential for participation. This platform resulted in a flexible data-collection application that can be incorporated in students' daily lives by using their own smartphones. The use of a responsive web application had three major advantages: (i) the questionnaires could be filled out on any smartphone (independent of its operating system), (ii) participants did not need to download an app, and (iii) it gave mentors the option to fill out the questionnaires on a PC or tablet. Our platform was designed in such a way that it can automatically remind participants to fill in their questionnaires, to further facilitate easy participation and improve adherence. Another facilitating feature of our platform was the use of identification tokens, which meant that the participants did not need to log in (and thus did not need to remember their credentials).
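To make the idea of identification tokens concrete, the sketch below shows one generic way such a scheme can work: each participant receives an unguessable token that is embedded in the invitation link and resolved back to the participant when the link is opened. This is not the u-can-act implementation; the class, the function, and the `https://example.org/questionnaire` URL are hypothetical, and a real deployment would also need token expiry and persistent storage.

```python
import secrets


class TokenRegistry:
    """In-memory mapping from identification tokens to participant ids."""

    def __init__(self):
        self._participant_by_token = {}

    def issue(self, participant_id):
        # token_urlsafe returns a random, URL-safe string that is hard to guess.
        token = secrets.token_urlsafe(16)
        self._participant_by_token[token] = participant_id
        return token

    def resolve(self, token):
        # Returns None for unknown tokens instead of raising.
        return self._participant_by_token.get(token)


def invitation_link(base_url, token):
    """Build the link sent in a reminder, so no login is needed."""
    return f"{base_url}?token={token}"


registry = TokenRegistry()
token = registry.issue("participant-42")
link = invitation_link("https://example.org/questionnaire", token)
```

Because the token itself identifies the participant, anyone holding the link can answer on that participant's behalf, which is the usual trade-off of login-free access.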

We hypothesize that the high adherence is also largely influenced by our measurement protocol aimed at maximizing adherence. Because we optimized the usability of the web application by performing an elaborate quantitative and qualitative pilot study, irritations with both the technology and the formulation of the questions were discovered and solved. This led not only to high adherence, but also to a pleasant user experience, which we believe helped the users of our platform to participate seriously in our study and improved the validity of their answers. We applied both internal and external motivational strategies to facilitate adherence and generate a pleasant user experience. However, we assumed that it would be unrealistic to rely solely on the intrinsic motivation of the at-risk students for adherence. We mainly dealt with adolescents at risk for early school leaving who, according to the literature and our own theoretical model (see section 2), are likely to have trouble with their intrinsic motivational resources for school-related activities (Hardre and Reeve, 2003), which could affect research participation. We fostered intrinsic motivation as much as we considered possible. We used personalized messages in our invitations that were adapted to their participation behavior, for example by complimenting them on a long streak of filling in the app (fostering experienced competence), and emphasized our gratefulness for their contribution to both us researchers and their mentor (fostering relatedness). We also used the name of their mentor (agency) in the application to increase the personal relevance. Apart from focusing on intrinsic motivation, we also stimulated their extrinsic motivation by designing a monetary reward system that uses gamification and playful design elements in the form of bonus-streaks<sup>7</sup>. As has also been found in the literature (e.g., Cerasoli et al., 2014), the combination of extrinsic motivational strategies with intrinsic motivational strategies may help foster motivation more than relying solely on intrinsic motivational strategies when it concerns simple tasks (such as filling out a questionnaire).

In the mentor study, we did not have any extrinsic motivational strategies in place, and fostered only intrinsic motivation through the same type of personalization as we did for the students. We assumed that their intrinsic motivation would be strong because our research would be of direct importance for the mentoring agency that they were part of, as it would provide them with important information regarding the effectiveness of the actions they take to prevent early school leaving. However, as mentor participation was quite low compared to student participation (mentors: 52.28% vs. at-risk students: 68.25% vs. control students: 83%) and the mentors indicated that they experienced a medium degree of difficulty in adhering to the protocol, we believe relying solely on their intrinsic motivation was insufficient. We suspect that using extrinsic motivation as a supplement (e.g., a reward system similar to that of the students) may have been helpful and consider this a promising avenue to explore in future research. Furthermore, the relatively low mentor participation may also be improved by re-evaluating the content of the mentor questionnaire. This questionnaire has a qualitative part in which mentors fill in the actions that they took in guiding their students and then place these actions in suitable categories. This may have been a relatively hard task for some mentors, and future studies may look into how this measurement can be made easier.

We believe that the mentor-student connection was essential for study adherence, and was key to getting the at-risk students to participate. We approached the at-risk students through their mentor: if students participated, they did so at the behest of their mentor. Furthermore, during the study, the mentors could monitor their students' study adherence, which allowed them to motivate each student individually when needed to increase adherence.

### 6.1. The Innovations Have Produced an Open Source Platform That Collects Multi-Informant Time-Series Data

In order to allow other agencies and researchers to use the u-can-act platform for their own purposes, we released it as open source software. The open source philosophy has several benefits, such as the verifiability of the source code (anyone can inspect the code and verify its logical integrity) and the fact that the software is freely available. The software package includes technical instructions on the use of the software, making it reusable for interested others. This may be interesting for other researchers or practitioners specifically interested in processes of early school leaving and its prevention. However, our platform also serves a broader audience due to its generic implementation. The u-can-act platform can be used by anyone interested in gaining insight into within-individual processes and the dyadic interactions or interventions that influence these processes.

Apart from the software being freely available and verifiable, its open source availability could also attract other developers to work on the platform and maintain it past the span of the u-can-act project itself. Maintenance is crucial in a software project in order for it to remain secure and to incorporate updates of external dependencies.

### 6.2. Limitations

The u-can-act project is an important step to help reduce early school leaving. However, in the present work, we do not yet propose the means to reduce early school leaving. This was not the focus of the present paper. In this paper we aimed to present the platform that we use to collect data about the mentoring process and the process of early school leaving. Our goal with this platform is to generate knowledge that will help reduce early school leaving, but the platform is not by itself meant to directly contribute to this. This may be a direction for future research, however, as the current platform can be augmented with a more elaborate dashboard for mentors, on which they could follow the development of their students and adjust their intervention accordingly. The open source nature of our platform allows for such an augmentation to be developed in the future. By presenting our design, platform and initial findings, we have taken a first step in such a direction. And even if this does not happen, we believe that the data that this platform allows us to collect will foster new insights into the individual processes surrounding early school leaving and will eventually help mentors interact with their students in such a way that early school leaving is reduced.

In u-can-act, we focused on a specific subset of the Dutch educational system: vocational education. The reason for this focus is that most early school leaving takes place in this part of the educational system, and moreover, it comprises the largest number of people in the Netherlands. Because of this specific focus, data collected in this study will only be partly generalizable. On the other hand, we argue that generalizing these data might not be useful regardless of the data collected, as in this paper we strongly advocate for a more personalized approach to dealing with early school dropout.

The software is currently in a state where it requires considerable technical expertise to tailor the platform to the needs of a new research group or mentor agency. Setting up the platform requires a few technical steps, such as setting up a server and hosting a database. We have tried to make this as easy as possible with an elaborate manual<sup>8</sup> that is added to the u-can-act web application (Emerencia et al., 2017). Alternatively, the technical implementation and maintenance could be done by a professional company, but then costs would be involved. Thus, even though it is open source, the current platform is like any other questionnaire platform in the sense that it needs expertise to set up or maintain, or requires costs for external parties to do so. We will leave it up to future researchers to decide. We are currently working on an interface so that researchers can do as much as possible of the set-up themselves to help overcome this limitation.

<sup>7</sup>We did, however, limit the use of gamification elements in order to prevent the application from being seen as "childish."

<sup>8</sup>Available online from: http://u-can-act.com

### 6.3. Future Research

In our future work, we aim to provide a solid understanding of early school leaving and methods to prevent this. We aim to test the theoretical model that served as the foundation of the present work. Our studies will include a mapping and profiling of student processes of dropout and persistence, and mentoring processes over time. Moreover, we will investigate the micro-level dynamics within the student, and between the student and the relevant contexts, such as the mentor, school, and non-school context. Our further research will contribute to a better understanding of the process of early school leaving and the prevention of early school leaving.

### 7. CONCLUSION

The present work set out to describe and evaluate a novel platform, and its technological innovations, that we have developed in our project u-can-act. The platform allows researchers to investigate within-individual processes of early school leaving and interventions in this process. In fact, with some adaptation, the platform can be useful in any situation where insight is needed in within-individual processes and the way that interventions may affect such a process. The rich and unique dataset that we collected with the u-can-act platform allows us to answer many questions related to an individualized perspective on motivation and early school dropout, which were impossible to answer without these data. Moreover, the open source nature of our platform allows other interested agencies or researchers to also collect detailed multi-informant EMA data to better understand within-individual change processes and the effects of interventions.

### ETHICS STATEMENT

The u-can-act research protocol was assessed and approved by the ethical committee of the University of Groningen under code 16351-O. All participants provided their informed consent online. No explicit informed consent was collected from the parents/legal guardians of non-adult participants, as all participants were above the age of sixteen.

### AUTHOR CONTRIBUTIONS

FB: wrote the initial draft of the manuscript, performed the analysis. MG: wrote the initial draft of the manuscript, did the literature study. NS: performed major revisions, contacted the participating agencies/students, did the literature study. AE: performed major revisions on the manuscript, performed the analysis. EK: performed major revisions on the manuscript, supervised the implementation of the study and the u-can-act platform. PJ: performed major revisions on the manuscript, supervised the implementation of the study and the u-can-act platform.

### FUNDING

The u-can-act project was funded by The Netherlands Initiative for Education Research (NRO) grant (no. 405–16–401) received by EK, MG, PJ and Henk Sligte from the Netherlands Organization for Scientific Research (NWO).

### ACKNOWLEDGMENTS

The authors want to thank all participants of u-can-act for participating in this research project. The authors also thank Teun Blijlevens of Umanise who improved the user experience of the application, our collaborator the Kohnstamm Institute who helped set up the study and all agencies involved in this study for their enthusiasm and helpful collaboration: Het Buro, Mijn School, and Plusgroep.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.01808/full#supplementary-material

8th International Conference on Service-Oriented Computing and Applications (SOCA) (Rome: IEEE), 131–138.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Blaauw, van der Gaag, Snell, Emerencia, Kunnen and de Jonge. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# What Technology Can and Cannot Do to Support Assessment of Non-cognitive Skills

*Vanessa R. Simmering\*, Lu Ou and Maria Bolsinova*

*ACTNext by ACT, Inc., Iowa City, IA, United States*

Advances in technology hold great promise for expanding what assessments may achieve across domains. We focus on non-cognitive skills as our domain, but lessons about both the advantages and drawbacks of new technological approaches can be extended to other domains and types of assessments. We first briefly review the limitations of traditional assessments of non-cognitive skills. Next, we discuss specific examples of technological advances, considering whether and how they can address such limitations, followed by remaining and new challenges introduced by incorporating technology into non-cognitive assessments. We conclude by noting that technology will not always improve assessments over traditional methods and that careful consideration must be given to the advantages and limitations of each type of assessment relative to the goals and needs of the assessor. The domain of non-cognitive assessments in particular remains limited by lack of agreement and clarity on some constructs and their relations to observable behavior (e.g., self-control versus -regulation versus -discipline), and these theoretical limitations must be overcome to realize the full benefit of incorporating technology into assessments.

Keywords: non-cognitive, competencies, assessment, construct validity, technological advances, theoretical limitations

### INTRODUCTION

Non-cognitive skills have been increasingly recognized as important contributors to education and workplace success (Levin, 2013). These skills include a wide range of competencies, such as perseverance, collaboration, emotional intelligence, and self-regulation; **Table 1** lists those included in a recent systematic review (Smithers et al., 2018). There is some disagreement on how to define and delineate them, including whether such attributes are fixed traits or malleable skills (for discussion, see Lipnevich et al., 2013; Duckworth and Yeager, 2015; Smithers et al., 2018; Simmering et al., 2019). Although these are important theoretical issues that will inform assessment development, they are beyond the scope of the current paper. Rather, we discuss how advances in technology may change non-cognitive assessments. We aim to provide a high-level overview of advantages gained through technology, along with new and remaining challenges that must be addressed. We focus on non-cognitive skills because many are more contextual and dynamic than academic skills (e.g., delay of gratification, emotional reactivity). Before considering technological advances, we first briefly review the limitations of traditional non-cognitive assessments.

#### *Edited by:*

*Frank Goldhammer, German Institute for International Educational Research (LG), Germany*

#### *Reviewed by:*

*Oliver Luedtke, University of Kiel, Germany René T. Proyer, Martin Luther University of Halle-Wittenberg, Germany*

#### *\*Correspondence:*

*Vanessa R. Simmering vanessa.simmering@act.org*

#### *Specialty section:*

*This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology*

*Received: 07 December 2018 Accepted: 09 September 2019 Published: 25 September 2019*

#### *Citation:*

*Simmering VR, Ou L and Bolsinova M (2019) What Technology Can and Cannot Do to Support Assessment of Non-cognitive Skills. Front. Psychol. 10:2168. doi: 10.3389/fpsyg.2019.02168*

TABLE 1 | Non-cognitive skills included in Smithers et al. (2018) systematic review and meta-analysis.


| Level | Terms |
| --- | --- |
| High-level constructs | Character skills; Executive functions; Personality traits; Socio-emotional skills; Soft skills |
| Specific capabilities | Attention; Cognitive flexibility/control; Conscientiousness; Delay of gratification; Effortful control/self-control/regulation; Emotional stability/reactivity/regulation; Impulsivity; Inhibitory control; Locus of control; Motivation; Perseverance/persistence; Responsibility; Self-esteem; Sociability |

*Smithers et al. did not differentiate terms as high-level versus specific; this has been added to acknowledge the multidimensional nature of the high-level constructs, though we recognize that some specific capabilities may also be multidimensional. We also group terms we viewed as synonymous within specific capabilities, although these views are not universal in the broader literature.*

### COMMON LIMITATIONS IN ASSESSMENTS OF NON-COGNITIVE SKILLS

Duckworth and Yeager (2015) reviewed concerns with measurement of non-cognitive skills, outlining limitations of two types of assessments, questionnaires and performance tasks, using the construct self-control for illustration (see Simmering et al., 2019, for related discussion). Questionnaires can be administered to any informant but most commonly use parent- and teacher-report for children and self-report for adolescents and adults. Questionnaires may ask about a subject's behavior in general, in a specified period (e.g., at this moment, in the past week, month, or year), or in a hypothetical situation [as in situational judgment tests (SJTs)]. Responses may be ratings of frequency (e.g., "almost never" ranging to "almost always"), how well a description fits the subject (e.g., more or less true or like the individual), or choices of specific behaviors in SJTs. The limitations Duckworth and Yeager described were misinterpretation of items, lack of insight or information, insensitivity at different time scales, and reference or social desirability bias. Simmering et al. (2019) also noted context insensitivity as a limitation, as behaviors may occur in some contexts but not others, and questionnaires do not differentiate these contexts (e.g., perseverance in school work versus hobbies, or different academic subjects). Furthermore, some studies suggest that self-reports in response to hypothetical situations diverge from actual behavior in analogous experiences (Woodzicka and LaFrance, 2001; Bostyn et al., 2018). Limitations of questionnaires have been extensively studied (e.g., Furnham, 1986), with numerous remedies developed (e.g., Krosnick and Presser, 2009).

An alternative approach is to observe behavior directly rather than eliciting informants' reflection and interpretation. Performance tasks are designed to compel behavior in relevant contexts, with the advantage of creating controlled situations in which all subjects are observed (for discussion, see Cronbach, 1970). For example, objective personality tests assess personality traits through behavioral indicators from performance tasks rather than self-reports (Ortner and Schmitt, 2014). Although performance tasks offer advantages over questionnaires (avoiding subjective judgments by informants, less opportunity for social desirability, reference, and acquiescence biases, more temporal sensitivity), they have serious limitations (see Duckworth and Yeager, 2015; Simmering et al., 2019, for further discussion). For example, lab-based performance tasks such as the Balloon Analogue Risk Task (Lejuez et al., 2002) typically assess single constructs (i.e., risk-taking) and lack the diversity needed to form a complete personality profile. Performance tasks are generally designed to elicit one "right" behavior and may conflate "wrong" behaviors that reflect different underlying causes (e.g., Saxler, 2016). Participants' behavior may reflect factors beyond the intended construct, such as compliance with authority or comprehension of instructions. This is a particular concern when participants' prior experiences differ substantially from those of the people designing, administering, and interpreting the tasks; behavior considered maladaptive in the task may be more appropriate to participants' experience. Furthermore, task artificiality could create inauthentic motivations and constraints, leading to unnatural behaviors. Tasks with scenarios created in real time can also lead to error in task implementation, recording of behavior, or participant responses.

To overcome these types of limitations, Duckworth and Yeager (2015) recommended using multiple measures suited to the assessor's goals while acknowledging and accounting for the limitations of each. They also noted that further innovation in assessment could avoid some limitations, with specific examples including incorporation of technology. In the next section, we review technological advances in non-cognitive assessments and the advantages they offer.

## ADVANTAGES OF TECHNOLOGY-ENHANCED ASSESSMENTS

Technology allows new and expanded ways to collect data and present content. Computerizing assessments has become more common as access to technology has increased, but these implementations often merely reconfigure prior assessments to be presented on a screen without further adaptation. We focus on more substantive changes that expand the scope of the types of measurements and content included in non-cognitive assessments.

First, technology allows for real-time collection of multiple types of data, including self-reports, physiological data, and observed behavior. Traditionally, assessments are presented once or a few times at widely spaced intervals. Continuous, unobtrusive data collection is now possible through devices such as smartphones or fitness trackers. For example, Wang et al. (2014) combined multiple data sources from automated sensors on a smartphone (i.e., accelerometer, microphone, light sensing, global positioning, Bluetooth) with self-report sampling to evaluate how college students' daily activity related to their mental well-being (i.e., depression, stress, loneliness) and academic performance. Sensor data correlated moderately with these outcomes, as well as students' self-reports. These data were then used to infer students' studying and social behavior to predict their GPA (Wang et al., 2015), indicating how sensor data could be used instead of self-reports. Automated sensors are not only less obtrusive to participants but can also provide a more temporally complete record, which avoids relying on narrow sampling and extrapolation to track change over time (cf. Adolph et al., 2008). Such temporal detail is necessary to evaluate dynamic non-cognitive skills, such as self-regulation.

Second, ecologically valid methods allow data collection directly from relevant contexts, avoiding the need for retrospection or generalizations in questionnaires, imagined experiences in SJTs, or contrived scenarios in a lab (see Stone and Shiffman, 1994, for related discussion). Experience sampling methods, such as ecological momentary assessments and daily diaries, ask participants to report thoughts, feelings, behaviors, and environment at regular intervals over time or around target events. They have been widely used to track emotions in natural contexts, allowing assessment of emotion regulation (Silk et al., 2003; Tan et al., 2012). When contextual variation is also recorded, these assessments can tally how frequently a subject encounters specific contexts and whether behavior varies across those contexts.

Third, some devices allow data collection not attainable without technology. For example, during computerized activities, participants' eye movements can be continuously recorded using eye-trackers, and mouse movements or touchscreen selections can be collected using specialized software. Such data were inaccessible before technological solutions were developed, and they provide the opportunity for more holistic analysis of behavior. Assessments that provide these and other types of process data during participation, such as item-level response latencies (e.g., Ranger and Ortner, 2011), allow researchers to use more than just final responses to improve measurement. For example, pupillometry and reaction times can differentiate whether participants were controlling attention proactively (i.e., mentally preparing for target actions) versus reactively (i.e., adjusting action following external signals) even when target actions (i.e., identifying a stimulus sequence) did not differ (Chatham et al., 2009). Log files of online game-based assessments include time and event information that can be used to track participants' collaboration during the game (Hao et al., 2016; Hao and Mislevy, 2018). Process data may provide insight into responses that would not be possible without technology, and analyzing such data can support assessment validation (Lee et al., 2019).

Beyond data collection, technology enables presentation of content in ways not possible with traditional assessments. Computerized adaptive testing draws items from a large pool of items with varying difficulty to present them adaptively based on test-takers' previous responses and estimated ability (Segall, 2005). This allows more sensitivity to student ability levels and reduces the influence of small mistakes and lucky guesses on the final estimated ability. While computerized adaptive testing is most often used to measure cognitive abilities, it can also improve the measurement of other constructs, like personality (Makransky et al., 2013) and mental health (Becker et al., 2008; Stochl et al., 2016). Because adaptivity is an important facet of non-cognitive skills, test design and administration organizations such as the National Center for Education Statistics recommend adaptive tests in collaborative problem solving and other future assessments (Fiore et al., 2017).

Beyond contingent item presentation, interventions can also be integrated into computerized assessments. Based on assessment results, personalized feedback and recommended learning materials can be provided to respondents to improve individual development. Such systems have gained popularity in assessments of cognitive skills (e.g., Klinkenberg et al., 2011) but can also support non-cognitive skills. For example, Hutt et al. (2017) developed an eye-tracking application to monitor students' mind-wandering in real time during a computerized learning task. When mind-wandering is detected, the application intervenes to repeat the recent material, redirect the student's attention, or ask a question to allow self-reflection in the student. Although the goal was to improve students' learning of the material, feedback on the frequency of mind-wandering could also teach students to monitor and regulate their mental engagement.

The nature of the material going into assessment items can also be expanded by technology. Rather than presenting text questionnaires, researchers can create multi-modal vignettes to present scenarios like SJTs. Audio-visual presentation is preferable to text for students with limited reading comprehension and can increase the validity for such groups (e.g., Chan and Schmitt, 1997). Through interactive technology like digital games and virtual or augmented reality, more complex content can be created to simulate real-life contexts that may be difficult to observe naturally. These environments can include "stealth" assessments in which students' capabilities are evaluated without explicit queries. For example, in a role-playing game comprising quests that require creative problem solving, players' actions may be scored for evidence of both cognitive (e.g., reading comprehension) and non-cognitive (e.g., persistence) competencies (Shute, 2011). Embedding target constructs in naturalistic interactions allows participants to respond with authentic behaviors rather than reporting imagined behavior in response to a hypothetical scenario. This can increase motivation and engagement when properly designed (Moreno-Ger et al., 2008), which in turn could reduce measurement error.

Technological advances can also facilitate generation of new content with reduced human effort, a vital feature for delivering assessments at scale. Machine learning and artificial intelligence have been developed for generating traditional assessment content (i.e., item stems and response options), although much work remains to achieve wide adoption (Gierl et al., 2012). One potential advantage to automated content generation, beyond the efficiency, is the expanded ability to personalize material for students. For example, research on motivation and engagement suggests that integrating students' social and cultural identities into instructional and assessment design can improve outcomes for students from marginalized groups (Haslam, 2017). More work is needed to identify the best ways to design non-cognitive assessments to align with students' identities, but technology provides a promising avenue to realize this level of personalization.

## CHALLENGES IN ADOPTING TECHNOLOGY-ENHANCED ASSESSMENTS

Technology-enhanced assessments are not without challenges and limitations. First, construct validity remains a significant concern, and adapting previous assessments to incorporate new technology may affect validity positively or negatively. As noted above, video vignettes in SJTs increased validity by decreasing the influence of reading comprehension (Chan and Schmitt, 1997). Conversely, more complex scenarios could introduce variation in interpretations or decision processes by participants. Such complexity likely reflects real-life contexts more closely but introduces challenges for standardization, especially when content presentation is contingent on participant performance. Standardized items and tasks, as well as scoring rubrics, for virtual performance assessments must be developed and validated in pilot studies (Hao et al., 2017).

The collection of more extensive, ecologically valid, and objective measures of behavior, whether during natural experience or games and simulations, still requires interpretation of how behaviors relate to underlying constructs (an important facet of construct validity; Borsboom et al., 2004). For example, although Hutt et al. (2017) related pupillometry and saccade duration to mind-wandering, these behaviors could be driven by external factors rather than internal processes. Similarly, data from automated sensors (as in Wang et al., 2014) cannot directly address whether variation in recorded activities reflects internal differences (i.e., participants' self-regulation abilities) versus external forces. It is also possible that behaviors measured in these ways are not representative: knowing one is being observed in daily life may lead to atypical behavior, especially when a device is first introduced (cf. Alvero and Austin, 2004), or participants may be more willing to act "out of character" in a simulation.

Second, one must consider both ethical issues shared with traditional assessments (e.g., how data will be stored, used, and potentially shared; proper training for those administering and interpreting assessments) and new issues that arise with technology. Technological requirements can contribute to inequity, as not all communities have access to necessary infrastructure (e.g., internet bandwidth, devices meeting specifications) or funding to adopt high-tech assessments, and participants may be unaccustomed to using technology. Automated or continuous recordings may invade the privacy of participants or non-participants who have not consented to have their data collected (e.g., conversation partners in audio recordings); although these concerns would be addressed through human subjects protections for research, such protections do not extend to assessments in non-research settings. Ethical concerns for developing technological assessments are conceptually similar to traditional assessments but may be practically different. For example, machine learning algorithms may be biased due to the training sets used to develop them (Springer et al., 2018) similar to how questionnaires may be biased by validation with unrepresentative samples (Clark and Watson, 2019).

Third, collection of more varied and continuous data introduces challenges in compliance and data management. Participants may find continuous or frequent sampling intrusive and therefore be less willing to complete an assessment. Imperfections in devices and software can lead to lost data, with some sources of loss relating to constructs of interest (e.g., losing track of eye gaze if posture changes as interest wanes). The multitude of possible reasons underlying data loss across different types of sensors and devices, combined with reasons shared with traditional assessments (e.g., selectively omitting responses, attrition), makes addressing missing data both practically and theoretically complex.

How we make use of more and different types of data across sources also presents new challenges. Connecting multiple assessments to the same individual profile requires complex data management solutions to ensure both privacy for individuals and accessibility for those using assessment results. If multiple sources are used simultaneously in real time, the data streams must be synchronized and at compatible granularity. Intensive longitudinal datasets require developing identifiable statistical models that can accommodate irregularly spaced, high-dimensional, noisy, dynamic data, as well as related robust and efficient computing software to make use of them (Chow et al., 2018).

Lastly, there can be a strong temptation to apply new technology to assessment as it becomes available without fully evaluating the potential costs and benefits of its adoption. It is important not to let technological capabilities be the driving factors in assessment development but rather to focus on the need the assessment is serving and whether that need can be better met by technology. New technological applications must be carefully designed and validated even when they seem to be only a minor change from previous methods. For example, moving from text to audio-visual presentation of SJTs introduces decisions for how each character looks and sounds. Participants may interpret or respond to characters' behavior differently based on demographic features (cf. Renno and Shutts, 2015) or voice intonation, which can unintentionally alter the content from the text version. Each new development will bring in new considerations for how the method reaches assessment goals.

## CONCLUSION

Advances in technology have expanded the horizons of what types of assessments are possible and achievable. These expansions can contribute to our understanding of non-cognitive capabilities as well as traditional academic content. The advantages of technology-enhanced assessments include how and what data can be collected, as well as the content that can be presented. With these advantages come some new challenges in the implementation and analysis of assessments, as well as the familiar challenges of construct and predictive validity that all assessments must address. Whether technology can improve an assessment will depend on details of the construct, the target group, the aims of the assessment, and the desired implementation. Assessment methods should be tailored to the specific conditions at hand. In the context of non-cognitive assessments in particular, more work is needed to arrive at well-defined constructs with clear connections to behavior as we also work to capitalize on the advantages technology offers.


### AUTHOR CONTRIBUTIONS

VS conceptualized the topic and all three authors contributed equally to development of the ideas. VS drafted the manuscript, then LO and MB provided critical additions and revisions.

### ACKNOWLEDGMENTS

The authors would like to thank Chelsea Andrews for help finding and discussing relevant articles for background research.


**Conflict of Interest:** VS, LO, and MB were employed by the non-profit company ACT, Inc.

*Copyright © 2019 Simmering, Ou and Bolsinova. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Combining Text Mining of Long Constructed Responses and Item-Based Measures: A Hybrid Test Design to Screen for Posttraumatic Stress Disorder (PTSD)

#### Qiwei He<sup>1</sup> \*, Bernard P. Veldkamp<sup>2</sup> , Cees A. W. Glas<sup>2</sup> and Stéphanie M. van den Berg<sup>2</sup>

<sup>1</sup> Educational Testing Service, Princeton, NJ, United States, <sup>2</sup> Department of Research Methodology, Measurement and Data Analysis, Faculty of Behavioural, Management and Social Sciences, University of Twente, Enschede, Netherlands

#### Edited by:

Frank Goldhammer, German Institute for International Educational Research (LG), Germany

#### Reviewed by:

Alexander Robitzsch, University of Kiel, Germany Margot Mieskes, Darmstadt University of Applied Sciences, Germany

> \*Correspondence: Qiwei He qhe@ets.org

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 20 December 2018 Accepted: 03 October 2019 Published: 22 October 2019

#### Citation:

He Q, Veldkamp BP, Glas CAW and van den Berg SM (2019) Combining Text Mining of Long Constructed Responses and Item-Based Measures: A Hybrid Test Design to Screen for Posttraumatic Stress Disorder (PTSD). Front. Psychol. 10:2358. doi: 10.3389/fpsyg.2019.02358

This article introduces a new hybrid intake procedure developed for posttraumatic stress disorder (PTSD) screening, which combines an automated textual assessment of respondents' self-narratives and item-based measures that are administered consequently. Text mining technique and item response modeling were used to analyze long constructed response (i.e., self-narratives) and responses to standardized questionnaires (i.e., multiple choices), respectively. The whole procedure is combined in a Bayesian framework where the textual assessment functions as prior information for the estimation of the PTSD latent trait. The purpose of this study is twofold: first, to investigate whether the combination model of textual analysis and item-based scaling could enhance the classification accuracy of PTSD, and second, to examine whether the standard error of estimates could be reduced through the use of the narrative as a sort of routing test. With the sample at hand, the combination model resulted in a reduction in the misclassification rate, as well as a decrease of standard error of latent trait estimation. These findings highlight the benefits of combining textual assessment and item-based measures in a psychiatric screening process. We conclude that the hybrid test design is a promising approach to increase test efficiency and is expected to be applicable in a broader scope of educational and psychological measurement in the future.

Keywords: posttraumatic stress disorder, text mining, item response theory, Bayesian framework, self-narratives

## INTRODUCTION

Epidemiological research on mental illnesses such as posttraumatic stress disorder (PTSD) requires efficient methods to identify cases in large population-based samples (Shrout and Yager, 1989) because the diagnosis of the disorder is difficult to make and can involve expensive testing. A two-phase design can help on both accounts. The first phase involves a screening measure, meaning a more detailed diagnostic procedure needs to be administered solely to a selected subsample (Diamond and Lilienfeld, 1962; Shrout et al., 1986).

Item-based self-report instruments are often considered efficient for PTSD screening, as they usually require short administration time and do not require the presence of a clinician (Wohlfarth et al., 2003). Questionnaires such as the Trauma Assessment of Adults (Gray et al., 2009), the Brief Trauma Questionnaire (Schnurr et al., 2002), the Life Events Checklist (Gray et al., 2004), and the Trauma Life Events Questionnaire (Kubany et al., 2000) all have psychometric support for evaluating exposure to potentially traumatic events. In addition to trauma exposure screeners, abbreviated PTSD symptom screeners are frequently used to determine the need for more in-depth clinical interviews (Lancaster et al., 2016). These include the Primary Care PTSD Screen (PC-PTSD; Prins and Ouimette, 2004), the Short Form of the PTSD Checklist-Civilian Version (Lang and Stein, 2005), the Trauma Screening Questionnaire (TSQ; Brewin et al., 2000), and the Short Post-Traumatic Stress Disorder Rating Interview (SPRINT; Connor and Davidson, 2001). These instruments ideally contain the minimal number of items necessary for accurate case identification, have simple decision rules to determine who passes and fails the screening, and are applicable to populations with varying prevalence of PTSD and experiencing different traumas (see more in reviews by Brewin, 2005; Lancaster et al., 2016).

As an alternative to such questionnaire-based screening, He et al. (2012) developed a computerized textual assessment system using text mining techniques, which proved effective in analyzing open-ended writings regarding participants' trauma history and physical symptoms. The main idea was to analyze the respondents' textual input (the self-narratives describing traumatic experiences and their impact on personal life) to predict the risk of developing PTSD. In their study, the textual screening procedure showed good agreement (82%) with a structured clinical interview in identifying the presence and absence of PTSD and yielded a higher sensitivity and positive predictive power than an itemized screening instrument.

With a growing body of research in learning patterns of language usage in psychiatric patients, textual input became recognized as an important additional source in the prediction of mental health (Pennebaker et al., 2003). For instance, Pennebaker (2001) found that linguistic markers, such as the use of negative-emotion words, cognition words, and insight words, predicted the future mental health of college students who wrote about traumatic events. Alvarez-Conrad et al. (2001) defined the presence of words relating to death and dying as an indicator of treatment-resistant PTSD. Consequently, the analysis of respondents' textual input and linguistic elements might provide crucial information for understanding cognitive mechanisms associated with trauma and hold valuable potential to screen for and predict PTSD symptoms and subtypes. Properly developed technologies such as text mining are expected to help individuals to self-test and public health organizations to screen for possible mental health conditions and prompt further evaluation when warranted, potentially preventing disorders from becoming chronic, debilitating, and difficult to treat (Todorov et al., 2018).

The focus of this study is to assess to what extent text mining techniques can be applied in the PTSD screening phase and to establish the extent to which they result in better estimates and better prediction of true diagnosis compared to the use of a questionnaire alone. Specifically, we propose a two-stage hybrid test design using a Bayesian approach to combine text mining and item response modeling in one systematic framework, where an automated score based on textual analysis serves as input for a prior distribution of a latent trait associated with PTSD that is measured by a number of questionnaire items using an item response theory (IRT) model (Rasch, 1960; Lord, 1980). Bayesian methods are especially useful for the estimation of a hierarchical structure (refer to Mislevy, 1986; Zwinderman, 1991), which allows extra prior information to be added into the measurement with the aim to increase prediction accuracy. Models developed in the Bayesian framework have been applied broadly in psychological and educational assessments. For instance, Matteucci and Veldkamp (2013) integrated students' background variables, such as scores obtained by the examinees from other tests, socioeconomic variables, and demographic variables, as prior information to improve the accuracy of students' ability estimates, and van den Berg et al. (2013) combined self-report and clinical interview data in a Bayesian approach to increase measurement precision in identifying schizotypal symptoms. However, the inclusion of textual assessments as prior information has rarely been described in the literature.
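The idea of letting a text-based score inform the prior of the latent trait can be illustrated with a small numerical sketch. The code below is not the authors' model: it assumes a two-parameter logistic item response function, a normal prior whose mean and standard deviation are set by hand to stand in for a text-derived score, and made-up item parameters and responses. It only shows how a more informative prior can shift the posterior mean and shrink the posterior standard deviation of the latent trait.

```python
import math

def p_yes(theta, alpha, beta):
    """Two-parameter logistic probability of a 'yes' response."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def posterior_summary(responses, items, prior_mean=0.0, prior_sd=1.0):
    """Posterior mean and SD of theta on a grid, given 0/1 responses and a normal prior."""
    grid = [-4.0 + 0.05 * k for k in range(161)]  # theta from -4 to 4
    weights = []
    for theta in grid:
        # log prior (up to a constant) plus log likelihood of the responses
        log_w = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
        for x, (alpha, beta) in zip(responses, items):
            p = p_yes(theta, alpha, beta)
            log_w += math.log(p if x == 1 else 1.0 - p)
        weights.append(math.exp(log_w))
    norm = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / norm
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / norm
    return mean, math.sqrt(var)

# Hypothetical (alpha, beta) item parameters and responses, for illustration only.
ITEMS = [(1.2, -0.5), (1.0, 0.0), (1.5, 0.5), (0.8, 1.0)]
RESPONSES = [1, 1, 0, 0]

# Questionnaire alone: standard normal prior.
vague = posterior_summary(RESPONSES, ITEMS, prior_mean=0.0, prior_sd=1.0)
# Questionnaire plus a hypothetical text score that suggests elevated risk.
informed = posterior_summary(RESPONSES, ITEMS, prior_mean=1.0, prior_sd=0.5)
```

Comparing `vague` and `informed` shows the qualitative effect the study sets out to test: the second element of each tuple (the posterior SD) is smaller under the informative prior. How the text score is actually mapped onto the prior in the hybrid design is not specified in this passage, so the prior settings here are placeholders.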

The purpose of this study is twofold: first, to investigate whether the combination model of textual analysis and item-based scaling can enhance the classification accuracy of PTSD, and second, to examine whether the standard error of estimates could be reduced through use of the narrative as a kind of routing test. To examine the performance of our proposed method, we conducted a study to compare the estimates for a latent trait associated with PTSD with and without the use of a text mining score by means of three approaches: (1) an IRT-based test only, (2) textual analysis only, and (3) a combination of textual analysis and an IRT-based itemized test. The third approach included both using the whole range of IRT-based items at one time and adding items adaptively, starting from the one with the highest information, which is similar to the item selection procedure used in computerized adaptive testing (van der Linden and Glas, 2000).

## MATERIALS AND METHODS

### Sample and Instrument

Data used in the current study were collected from 105 trauma survivors via an online survey embedded in an open forum that is dedicated to people with mental health issues. Before administering items from the survey, all the participants were asked to report whether they had been diagnosed as PTSD or non-PTSD by psychiatrists via structured interviews with standardized instruments. Cases with missing diagnoses were discarded in the present study. Participants were also informed that the objective of the research was to develop a more flexible intake procedure for PTSD diagnosis and were requested to give responses to all the questions following the instructions.

The online survey consisted of two parts: self-narrative writing and administration of dichotomous questions regarding PTSD symptoms. In the writing section, respondents were asked to write about their traumatic events and briefly describe the symptoms related to these experiences. The recommended text length was over 150 words, which was found to be the average length of self-narratives entered by PTSD patients in a previous study (He et al., 2012). In the item-based section, respondents were required to answer 21 items identical to those used in the PTSD screening section of the National Comorbidity Study-Replication (NCS-R; Kessler et al., 2004). The NCS-R, conducted between February 2001 and April 2003 in the United States, is a nationally representative community household survey of the prevalence and correlates of mental disorders. These 21 dichotomous items (i.e., "yes" = 1, "no" = 0) correspond one-to-one to the PTSD symptoms defined in the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV; American Psychiatric Association, 2000). The first two columns in **Table 1** show the PTSD diagnostic criteria in the DSM-IV and their corresponding items that were used in the NCS-R as well as in this study.

Six of the 105 participants were excluded: two reported they had never experienced any of the traumatic events listed in the NCS-R, and four responded only to the item section and skipped the writing section. This resulted in a total of 99 participants for the final set, among whom 34 were diagnosed as PTSD and 65 as non-PTSD. The sample had an age range between 19 and 63, with a mean of 30.06 (SD = 11.30). The majority of participants were female (78.4%). Over 90% of participants had a higher educational background (i.e., college/university or above). Of the participants, 52.6% were reported as single, 40.2% as married, and 6.2% as divorced.

### Procedure

TABLE 1 | Item Parameters of 21 Questions Related to PTSD in NCS-R (calibrated with n = 880).

The item parameters were estimated from unidimensional 2PL model on a sample of 880 respondents in the NCS-R. SE indicates the standard error of item parameter estimation. r indicates validity coefficients that are calculated as the correlation of total score with each criterion item.

To examine the performance of the hybrid test design, we estimated individuals' PTSD latent traits via three approaches: (1) an IRT-based test only, (2) text classification of self-narratives, and (3) combining textual analysis and IRT in a Bayesian framework. There were two analytic paths involved in the third approach: In one path, we combined the textual analysis with the whole set of 21 IRT-based items at a single time. In the other, we combined the textual analysis and the IRT latent scale in an adaptive way, that is, we added the 21 items into the analysis one by one in descending order of item information available. We will illustrate each approach in detail in the following subsections. All analyses in the Bayesian framework were conducted using the software WinBUGS 1.4.3 (Lunn et al., 2000).
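The ordering of items by information in the adaptive path can be sketched as follows. This is a minimal illustration, assuming the two-parameter logistic item response function that the article gives in Eq. 1 and the standard Fisher information of a 2PL item, $\alpha_i^2 P_i(\theta)\,[1 - P_i(\theta)]$. The item labels and parameter values are hypothetical and are not the calibrated NCS-R values in Table 1, and the trait level at which information is evaluated is an assumption, since the passage does not state it.

```python
import math

def p_yes(theta, alpha, beta):
    """2PL probability of a 'yes' response at trait level theta."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def item_information(theta, alpha, beta):
    """Fisher information of a 2PL item at trait level theta."""
    p = p_yes(theta, alpha, beta)
    return alpha ** 2 * p * (1.0 - p)

def order_by_information(items, theta):
    """Return item labels sorted from most to least informative at theta."""
    return sorted(
        items,
        key=lambda label: item_information(theta, *items[label]),
        reverse=True,
    )

# Hypothetical (alpha, beta) pairs, NOT the calibrated values from Table 1.
ITEMS = {
    "item_a": (1.8, 0.0),  # discriminating and located at theta = 0
    "item_b": (0.9, 0.0),  # same location, weaker discrimination
    "item_c": (1.8, 2.5),  # discriminating but located far from theta = 0
}

# Assumed evaluation point: theta = 0 (e.g., a provisional trait estimate).
order = order_by_information(ITEMS, theta=0.0)
```

Items that are both highly discriminating and located near the evaluation point come first, so in this toy example `item_a` precedes `item_b`, and `item_c` comes last.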

#### Approach 1: Using an IRT-Based Test Only

The IRT framework has been increasingly applied in psychiatric assessments in recent decades (e.g., van Groen et al., 2010; Weisscher et al., 2010; He et al., 2014b). In contrast to the classical sum score methods, IRT models (Rasch, 1960; Lord, 1980) provide improvement and flexibility by scaling the difficulty of items and the latent trait level of people on the same metric. Namely, the severity of prescribed symptoms and the latent degree of individuals' mental illness are set on a common scale, and thus can be meaningfully compared.

In the first approach, we focused on applying an IRT model to responses to the 21 PTSD diagnostic items in the NCS-R without adding any prior information. We employed a set of fixed item parameters that were previously calibrated using a larger sample of 880 respondents collected in the NCS-R (He et al., 2014b). Note, however, that these 880 respondents responded to the questionnaire only and provided no self-narratives. Given the objective of this study, namely examining the role of textual information in latent trait estimation to screen for PTSD, we collected a new sample of 99 respondents who provided both textual self-narratives and responses to an itemized questionnaire, thus making it possible to combine structured and unstructured data analysis in one framework.

In He et al. (2014b), given that symptom domains defined by the DSM-IV were used to index a general level of PTSD severity, we first considered a unidimensional two-parameter logistic (2PL) model underlying responses to the 21 symptoms (i.e., all 21 items on a single dimension). Next, given that the major 17 symptoms (in criteria domains B, C, and D) are placed a priori into three separate criterion domains, we also considered a three-dimensional IRT model where each domain was associated with a separate dimension. In addition, a special version of the 2PL model – the Rasch model or one-parameter logistic (1PL) model (Rasch, 1960) where the item discrimination parameter is simply fixed as one – was also considered, since such a model is often used in clinical applications as well (e.g., Wong et al., 2007; Elhai et al., 2011).

In the unidimensional 2PL model, the probability of a score in category "yes" ($X_{ni} = 1$) on item $i$ is given by the item response function

$$P(X_{ni} = 1 | \theta_n) = \frac{\exp\left[\alpha_i(\theta_n - \beta_i)\right]}{1 + \exp\left[\alpha_i(\theta_n - \beta_i)\right]},\tag{1}$$

where $\theta_n$ is the latent PTSD level of person $n$, $\beta_i$ is an item difficulty parameter representing the severity level of each diagnostic symptom, and $\alpha_i$ is an item discrimination parameter indicating the extent to which the item response is related to the latent $\theta$-scale. Note that in the Rasch model, the discrimination parameter $\alpha_i$ is fixed as 1. In the multidimensional version of the 2PL model, the probability of a positive response depends on $M$ latent variables, say $\theta_{n1}, \ldots, \theta_{nm}, \ldots, \theta_{nM}$. In the multidimensional case, the product $\alpha_i\theta_n$ in Eq. 1 is replaced by $\sum_{m} \alpha_{im}\theta_{nm}$.

The dimensionality and model fit were examined in two steps: a likelihood-ratio test and an item-oriented Lagrange multiplier (LM) test. First, the likelihood-ratio test of the 2PL model against the Rasch model yielded a value of the test statistic $\chi^2 = 78.53$, df = 16, p < 0.001, while the multidimensional model against the unidimensional 2PL model yielded a value of $\chi^2 = 37.41$, df = 3, p < 0.001. It was concluded that the multidimensional model fit the data best, and the 2PL fit the data significantly better than the Rasch model. However, although using a more complex model generally results in better model fit, using a more parsimonious model might still lead to adequate data description.
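As a quick arithmetic check on the two likelihood-ratio tests reported above, the upper-tail probability of a chi-square statistic can be computed directly. The sketch below uses the textbook closed-form series for integer degrees of freedom and only the Python standard library; the helper name `chi2_sf` is our own, and in practice a statistics package would normally be used instead.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series of correction terms
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(math.sqrt(half)) + total

# Statistics and degrees of freedom as reported in the text.
p_2pl_vs_rasch = chi2_sf(78.53, 16)    # 2PL against the Rasch model
p_multi_vs_2pl = chi2_sf(37.41, 3)     # multidimensional against unidimensional 2PL
```

Both tail probabilities fall well below 0.001, in line with the reported p-values.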

To investigate this, a second approach was used. Under each model, item fit was evaluated using an LM item fit statistic (Glas, 1998, 1999). These statistics can be used to evaluate the fit of the expected item response function given by Formula (1) to the observed item responses. Item fit was tested with a significance level of 0.01. For the Rasch model, the test was significant for six items, while no tests were significant for either the 2PL model or the multidimensional model. Further, the LM test statistic is accompanied by an effect size that measures the difference in observed and expected average item responses. For the 2PL model and the multidimensional model, these differences had the same magnitude. Hence, although a multidimensional IRT model fit the data better than 2PL in terms of the likelihood ratio test, it was not clearly superior in item fit. Therefore, the simpler unidimensional 2PL model was preferred over the more complicated multidimensional one. Consequently, the item calibration in the NCS-R was undertaken with the unidimensional 2PL model by marginal maximum likelihood (Bock and Aitkin, 1981) on a sample of 880 respondents in He et al. (2014b).

Further, we calculated validity coefficients r to examine how strongly each criterion weighed on the general trait of PTSD and to check whether these external criteria matched the discrimination parameters derived from the 2PL model, which indicate the extent to which the item response is related to the latent θ-scale. The validity coefficient is a statistical index used to report evidence of validity for intended interpretations of test scores and is defined as the magnitude of the correlation between test scores and a criterion variable. We calculated the validity coefficients as the correlations between the NCS-R test results and each criterion variable and reported the results in the last column of **Table 1**. The larger the validity coefficient, the more confidence we can have in predictions made from the PTSD test scores. As shown in **Table 1**, the discrimination parameters in the third column showed high agreement with the validity coefficients in the last column: for instance, the highest discrimination parameter was found for criterion C6, which also had the top validity coefficient of 0.58. Similar agreement was found for the lowest values of these two variables, such as in criteria E1 and C3. The evidence demonstrated that the item weighting from the 2PL could provide similar conclusions based on external criteria (i.e., validity coefficients) to get consistent results in identifying strong (weak) factors in the test.

To maintain consistency with the previous study (He et al., 2014b), we fixed the calibrated item parameters in the current study. The fixed parameters and their standard errors are reported in the third through sixth columns of **Table 1**. As shown there, the discrimination parameters varied in the interval [0.78, 1.86], with a mean value around 1.32. The difficulty parameters were in the range [−4.45, 1.22], with a mean of −0.99. The respondents' latent traits were estimated by the expected a posteriori (EAP) method, assuming a normally distributed latent trait.
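To make the scoring step concrete, the following is a minimal sketch of EAP estimation under the 2PL model of Eq. 1 with fixed item parameters. It assumes a standard normal latent distribution and a simple equally spaced quadrature grid; the function names, the grid settings, and the three item parameter pairs are illustrative assumptions (the values are only chosen to lie within the reported ranges and are not the entries of Table 1).

```python
import math

def p_yes(theta, alpha, beta):
    """2PL item response function (Eq. 1): probability of a "yes" response."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def eap(responses, alphas, betas, n_points=81, lo=-4.0, hi=4.0):
    """EAP estimate and posterior SD of theta under a N(0, 1) latent distribution."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) density
        for x, a, b in zip(responses, alphas, betas):
            p = p_yes(theta, a, b)
            w *= p if x == 1 else 1.0 - p   # likelihood of the observed response
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

# Hypothetical item parameters, for illustration only.
alphas = [1.86, 1.32, 0.78]
betas = [-0.99, 0.0, 1.22]
theta_hat, posterior_sd = eap([1, 1, 0], alphas, betas)
print(round(theta_hat, 2), round(posterior_sd, 2))
```

The posterior standard deviation returned alongside the EAP is the quantity that later sections refer to as the standard error of the θ-estimate, under the assumptions stated here.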

#### Approach 2: Text Classification of Self-Narratives

Text classification is a special approach in the field of text mining, aiming to assign textual objects from a universe to two or more classes (Manning and Schütze, 1999). Supervised text classification generally involves two phases: a training phase and a testing phase. During the training phase, the most discriminative keywords to determine the presence or absence of PTSD are extracted and the relationship between the keywords and class labels is learned. The testing phase involves checking how well the trained classification model performs on a new dataset. In the testing procedure, each new input is scanned for the keywords that were extracted from training, and the most likely label for each new self-narrative is predicted. He et al. (2012) developed a supervised text classification model for PTSD screening. In this study, 300 self-narratives, consisting of 150 written by PTSD respondents and 150 written by non-PTSD respondents, were used to develop a screening system. In a follow-up study (He et al., 2017), four machine learning algorithms – including Decision Tree (DT), Naïve Bayes (NB), Support Vector Machine (SVM), and a self-developed alternative, the product score model (PSM) – were employed in conjunction with five data representations – unigrams, bigrams, trigrams, a combination of uni- and bigrams, and a mixture of n-grams. Unigram is the simplest and most commonly used data representation model where each word in a document collection acts as a distinct feature. N-gram considers the interaction effect of two, three, or more consecutive words (Manning and Schütze, 1999).
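The data representations described above can be illustrated with a short sketch. The helper names `ngrams` and `features` are assumptions made for this example, and the token list is invented; the code only shows how unigram, bigram, and trigram features are formed from consecutive words.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features(tokens, orders=(1, 2, 3)):
    """Combine n-grams of several orders into one feature list (a mixture of n-grams)."""
    feats = []
    for n in orders:
        feats.extend(ngrams(tokens, n))
    return feats

tokens = ["avoid", "crowd", "after", "incident"]
print(ngrams(tokens, 1))          # unigrams: each word is a distinct feature
print(ngrams(tokens, 2))          # bigrams: pairs of consecutive words
print(features(tokens, (1, 2)))   # combination of uni- and bigrams
```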

In He et al. (2017), it was found that narrative classification accuracy was maximized with the PSM in conjunction with unigrams. Although the addition of n-grams (i.e., bigrams and trigrams) did not significantly enhance overall classification accuracy, it did help balance the performance metrics of text classification and improve the reliability of prediction. Furthermore, slight prevalence effects were found in the overall accuracy of all four machine learning algorithms; however, a substantial increase in positive predictive value (PPV) was noticed as the prevalence of PTSD increased. When the prevalence of PTSD was low, the SVM and PSM had good sensitivity and high negative predictive power. This suggested that these two models could perform well in excluding the individuals identified as non-PTSD from the follow-up tests. Further, in a comparison with the mean performance of traditional screening measures reviewed by Brewin (2005), the SVM and PSM were shown to be more sensitive in detecting PTSD than the traditional screening measures, but their ability to detect non-PTSD was slightly lower than the benchmark in clinical practice.

Because the PSM in conjunction with unigrams resulted in the highest agreement with the psychiatrists' diagnoses in clinical practice in the previous study (He et al., 2017), we applied this approach in the present study. We used the top 1,000 unigrams that were identified as the most robust classifiers to distinguish PTSD from the non-PTSD in He et al. (2012, 2017). Among the 1,000 unigrams, in descending order of word frequency, the 10 unique words most used by the PTSD patients were "rape," "flashback," "fire," "involve," "avoid," "incident," "date," "tower," "men," and "fault." The words "test," "hardly," "tumor," "tight," "excite," "evil," "pleasure," "vision," "frantic," and "funny" were found to be the top 10 in the non-PTSD corpus (He et al., 2012). Analogous to the results obtained by Orsillo et al. (2004) in the research regarding emotion expressions of PTSD patients, the words favored by PTSD patients had relatively stronger negative semantic tendency no matter the lexical form: adjective, noun, or verb (He et al., 2012).

A preprocessing routine was implemented to standardize the n-grams for textual analysis, consistent with the previous studies (He et al., 2012, 2017). This involved removing digits, non-informative "stop words"<sup>1</sup> (e.g., "I," "to"), common punctuation marks (e.g., "," ":"), and frequently used contractions (e.g., "isn't," "I'm"), and "stemming" the remaining words, using the Porter algorithm (Porter, 1980), to remove common morphological endings. For example, the terms "nightmares," "nightmaring," and "nightmare" were normalized to the identical stem "nightmar"<sup>2</sup> by removing the suffixes and linguistic rule-based indicators (for more preprocessing rules refer to Manning and Schütze, 1999; He et al., 2012, 2017).
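The preprocessing steps can be sketched as follows. This is a deliberately simplified stand-in, not the routine used in the study: the stop-word set is a tiny illustrative list rather than the NLTK list, and `crude_stem` is a hypothetical suffix stripper that only mimics the Porter algorithm on the "nightmare" example; a real analysis would use a full Porter implementation.

```python
import re

# Illustrative stop words only; the study used the NLTK English stop word list.
STOP_WORDS = {"i", "to", "a", "and", "in", "the", "of"}

def crude_stem(word):
    """Very rough suffix stripping (NOT the Porter algorithm)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word

def preprocess(text):
    """Lowercase, drop digits/punctuation, contractions, and stop words, then stem."""
    text = text.lower().replace("\u2019", "'")    # normalize curly apostrophes
    tokens = re.findall(r"[a-z']+", text)         # digits and punctuation are skipped
    kept = [t for t in tokens if "'" not in t and t not in STOP_WORDS]
    return [crude_stem(t) for t in kept]

print(preprocess("I'm 3 days in; nightmares, nightmaring and a nightmare."))
```

In this toy input the three lexical forms of "nightmare" collapse to the same stem, which is the behavior the paragraph above describes.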

The PSM is an alternative machine learning algorithm that addresses the smoothing issue of NB using a form of Laplace's law (Laplace, 1995). This model was validated in previous studies (He et al., 2012, 2017). Under the same independence assumption as the NB model, the PSM assigns two weights to each keyword (in binary classification) to indicate how popular the keyword is in the corpora of self-narratives written by either PTSD patients (corpus<sup>3</sup> C<sub>1</sub>) or non-PTSD patients (corpus C<sub>2</sub>). The name product score comes from a product operation to compute scores for each class, that is, S<sub>1</sub> and S<sub>2</sub>, for each input text based on the term weights. To be consistent with the previous studies, we used the smoothing constant a = 0.5, which was added to the word frequency to account for words that did not occur in the training set but might occur in new texts (for more smoothing rules refer to Manning and Schütze, 1999; Jurafsky and Martin, 2009). The equation is

$$\begin{cases} \text{S}\_1 = P(\text{C}\_1) \cdot \prod\_{w=1}^k \left[ (u\_w + a) / len(\text{C}\_1) \right] \\ \text{S}\_2 = P(\text{C}\_2) \cdot \prod\_{w=1}^k \left[ (v\_w + a) / len(\text{C}\_2) \right], \end{cases} \tag{2}$$

<sup>1</sup>The current study used the standard "English Stop Word List" (127 words) in the Python Natural Language Toolkit (NLTK; Perkins, 2010) to remove the non-informative words.

<sup>2</sup>The stemming algorithm is used to normalize lexical forms of words, which may generate stems without an authentic word meaning, such as "nightmar."

<sup>3</sup>A body of texts is usually called a text corpus.

where u<sub>w</sub> and v<sub>w</sub> are the numbers of occurrences of keyword w in corpora C<sub>1</sub> (i.e., the PTSD corpus) and C<sub>2</sub> (i.e., the non-PTSD corpus), respectively. len(C) is the corpus length, namely, the sum of the word occurrences in each corpus. P(C) is the prior probability of a certain class in the whole corpus collection. The classification rule is defined as:

$$\text{choose } \begin{cases} C = 1 & \text{if } \log(\text{S}\_1/\text{S}\_2) > b \\ C = 2 & \text{else} \end{cases},\tag{3}$$

where b is a constant set to zero in this study. The reason was that in the previous study (He et al., 2012) it was found during the PTSD textual screening procedure that the largest number of positive cases could be captured without unduly sacrificing specificity when the threshold was set at zero. The value of log(S<sub>1</sub>/S<sub>2</sub>) was defined as the text score for each self-narrative (see also He and Veldkamp, 2012; He et al., 2012). For an easy comparison with the IRT scales, we standardized the text scores as Z ∼ N(0, 1).<sup>4</sup>
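Equations 2 and 3 can be sketched in a few lines. The version below works in log space to avoid numerical underflow from multiplying many small weights; the keyword counts, corpus lengths, and priors are invented toy values, and two details are assumptions of this sketch rather than statements about the original implementation: only tokens in the keyword vocabulary contribute, and a keyword that appears several times in a text is counted each time.

```python
import math
from statistics import mean, pstdev

def text_score(tokens, u, v, len_c1, len_c2, p_c1=0.5, p_c2=0.5, a=0.5):
    """Text score log(S1/S2) from Eq. 2, accumulated as a sum of logs."""
    log_s1, log_s2 = math.log(p_c1), math.log(p_c2)
    keywords = set(u) | set(v)
    for w in tokens:
        if w in keywords:  # assumption: non-keywords are ignored
            log_s1 += math.log((u.get(w, 0) + a) / len_c1)
            log_s2 += math.log((v.get(w, 0) + a) / len_c2)
    return log_s1 - log_s2

def classify(score, b=0.0):
    """Eq. 3: class 1 (PTSD) if the text score exceeds threshold b, else class 2."""
    return 1 if score > b else 2

def standardize(scores):
    """Convert raw text scores to z-scores for comparison with the IRT scale."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

# Toy keyword counts in the PTSD corpus (u) and the non-PTSD corpus (v).
u = {"flashback": 8, "avoid": 5, "funni": 1}
v = {"flashback": 1, "avoid": 2, "funni": 9}
scores = [text_score(t, u, v, len_c1=100, len_c2=120)
          for t in (["flashback", "avoid"], ["funni"], ["avoid", "funni"])]
print([classify(s) for s in scores], [round(z, 2) for z in standardize(scores)])
```

Raising or lowering `b` trades sensitivity against specificity, which is the adjustment of the threshold discussed in the text.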

#### Approach 3: Combining Textual Analysis and IRT in a Bayesian Framework

Textual analysis and item response modeling were combined in a Bayesian framework, where the text score of each self-narrative obtained in approach 2 was used as prior information. The posterior distribution of the latent PTSD level is proportional to the product of the prior and the likelihood, that is,

$$P(\theta|\mathbf{x}, y) \propto p(\mathbf{x}|\theta, \boldsymbol{\alpha}, \boldsymbol{\beta})\, g(\theta|y),\tag{4}$$

where x is the vector of responses to the questionnaire, y is the text score for each individual, g(θ|y) is the prior given the covariate of textual assessments, α and β are the fixed discrimination and difficulty parameters of the items, and p(x|θ, α, β) is the likelihood function of the IRT model. The relation between the PTSD latent trait θ of individual n and the text score y<sub>n</sub> is given by the linear regression

$$
\theta\_n = b\_0 + b\_1 y\_n + \varepsilon\_n,\tag{5}
$$

where b<sub>0</sub> and b<sub>1</sub> are the regression coefficients. The error terms are assumed to be independent and normally distributed as ε<sub>n</sub> ∼ N(0, σ<sup>2</sup>) with n = 1, ..., N individuals. The assumption of a linear regression model is translated into a normal conditional distribution of θ<sub>n</sub> given the text covariate as

$$
\theta\_n | y\_n \sim N(b\_0 + b\_1 y\_n, \ \sigma^2) \tag{6}
$$

Formula (6) represents an informative prior distribution of the PTSD latent trait. For each individual, the latent trait was estimated using 5,000 Markov chain Monte Carlo (MCMC) iterations with a burn-in of 1,000 iterations.
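One way to draw from the posterior in Eq. 4 is a random-walk Metropolis sampler, sketched below. This is an assumption about the sampler type (the text specifies only the number of MCMC iterations and the burn-in), and the item parameters, responses, proposal step, and seed are illustrative. The default regression coefficients are those reported in the Results section of this study.

```python
import math
import random

def log_posterior(theta, responses, alphas, betas, prior_mean, prior_var):
    """Log of Eq. 4 up to a constant: 2PL log-likelihood plus normal log-prior (Eq. 6)."""
    log_p = -0.5 * (theta - prior_mean) ** 2 / prior_var
    for x, a, b in zip(responses, alphas, betas):
        z = a * (theta - b)
        # log P(x=1) = -log(1 + exp(-z)); log P(x=0) = -log(1 + exp(z))
        log_p -= math.log1p(math.exp(-z if x == 1 else z))
    return log_p

def sample_theta(responses, alphas, betas, y, b0=-0.41, b1=1.44, sigma2=3.57,
                 n_iter=5000, burn_in=1000, step=1.0, seed=0):
    """Random-walk Metropolis draws of theta; returns the post-burn-in samples."""
    rng = random.Random(seed)
    prior_mean = b0 + b1 * y               # text-based prior mean from Eq. 5
    theta = prior_mean
    current = log_posterior(theta, responses, alphas, betas, prior_mean, sigma2)
    draws = []
    for i in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, responses, alphas, betas, prior_mean, sigma2)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:
            draws.append(theta)
    return draws

# Hypothetical responses and item parameters, for illustration only.
alphas = [1.86, 1.5, 1.3, 1.0, 0.78]
betas = [-0.5, -1.0, 0.0, -2.0, 1.22]
draws = sample_theta([1, 1, 0, 1, 0], alphas, betas, y=0.5)
post_mean = sum(draws) / len(draws)
post_sd = (sum((d - post_mean) ** 2 for d in draws) / len(draws)) ** 0.5
print(len(draws), round(post_mean, 2), round(post_sd, 2))
```

The mean and standard deviation of the retained draws serve as the θ-estimate and its standard error under the combined model.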

To determine whether the introduction of the prior distribution was effective, we compared the posterior distribution of θ<sub>n</sub> in the combination model with the estimate from the IRT-based test only. Because the item parameters in the IRT model were fixed, the θ-estimates resulting from both the IRT-based test and the combination model (which uses textual information as a prior) were on a common scale and thus could be compared.

Two investigations were conducted to analyze the efficiency of the combination model. The first was to combine the textual assessments with the full range of 21 items of the NCS-R questionnaire. The main purpose was to explore whether adding the text prior would significantly impact the accuracy of PTSD detection. The second investigation pursued the question of whether adding textual assessments to the questionnaire could result in a reduction of the number of items administered without sacrificing precision of the θ-estimates. Those items that provide peak information around the cutoff threshold are ideal for a shorter version of a mastery test (Thomas, 2011). Since the target of screening is to make classification decisions, a natural choice would be to maximize information at the chosen diagnostic cutoff (for more about item information refer to Lord, 1980). In the current study, we employed the same cutoff point at θ = −0.15 that was derived from He et al. (2014b) to distinguish PTSD and non-PTSD using a larger sample size of 880 respondents collected in the NCS-R. As mentioned above, this study shared the same questionnaire scale and item parameters as He et al. (2014b). This ensured the value of the cutoff point was comparable in these two studies. Further, the cutoff point derived based on a larger sample size was shown to be more reliable than a smaller sample size, so we kept the cutoff value consistent.

In He et al. (2014b), three approaches were used to set the standard (i.e., obtain a cutoff point on the latent scale) to distinguish PTSD and non-PTSD. The first approach entailed finding the midpoint between the medians of the two distributions (Cizek and Bunch, 2007). The second was the contrasting-groups method (Brandon, 2002), which uses logistic regression to determine the latent score point at which the probability of category membership is 50%. Setting the respondent status as a dichotomous variable coded 0 = non-PTSD and 1 = PTSD, we entered the latent scores of all the respondents into a general logistic regression equation; that is, y<sup>∗</sup> = a + bx, where y<sup>∗</sup> is the predicted value of the outcome variable (respondent status) for a respondent and x is the respondent's observed score. Given y<sup>∗</sup> = 0.5, the classification cutoff point for PTSD and non-PTSD groups could be obtained simply. The third approach used the Bayesian discrimination function, which minimizes expected risk. Using the zero-one loss function, the discriminant function becomes g<sub>i</sub>(x) = P(C<sub>i</sub>|x) = P(C<sub>i</sub>)p(x|C<sub>i</sub>)/p(x), where P(C<sub>i</sub>) is the prior probability (i.e., the prevalence of PTSD or non-PTSD in the total sample); p(x|C<sub>i</sub>) represents the class likelihood (we assumed the latent trait scores have a normal distribution); and p(x) indicates the marginal probability of observation x. Given the assumption of a normal distribution in both the PTSD and non-PTSD groups, we could derive the cutoff point. Finally, we calculated the average of these three cutoff points based on the 21 items in the NCS-R and obtained −0.15 as the cutoff point on the latent scale.
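The first and third standard-setting approaches lend themselves to a short sketch. The code below is a hedged illustration only: the function names are hypothetical, the scores and group parameters are invented, and the Bayesian boundary is located by bisection under the assumption that the two weighted normal densities cross exactly once between the group means. The logistic-regression approach is not sketched here because it requires a fitted model.

```python
import math
from statistics import median

def midpoint_of_medians(scores_neg, scores_pos):
    """Approach 1: midpoint between the medians of the two score distributions."""
    return (median(scores_neg) + median(scores_pos)) / 2.0

def normal_logpdf(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))

def bayes_cutoff(mu_neg, sd_neg, p_neg, mu_pos, sd_pos, p_pos, tol=1e-8):
    """Approach 3: point where P(C_pos) p(x|C_pos) equals P(C_neg) p(x|C_neg)."""
    def diff(x):
        return (math.log(p_pos) + normal_logpdf(x, mu_pos, sd_pos)
                - math.log(p_neg) - normal_logpdf(x, mu_neg, sd_neg))
    lo, hi = mu_neg, mu_pos
    if not (diff(lo) < 0.0 < diff(hi)):
        raise ValueError("no single crossing between the group means")
    while hi - lo > tol:                    # bisection on the sign change
        mid = (lo + hi) / 2.0
        if diff(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Invented latent scores and group parameters, for illustration only.
c1 = midpoint_of_medians([-2.0, -1.0, 0.0], [0.0, 1.0, 2.0])
c3 = bayes_cutoff(mu_neg=-1.0, sd_neg=1.0, p_neg=0.7, mu_pos=1.0, sd_pos=1.0, p_pos=0.3)
print(round(c1, 3), round(c3, 3))
```

With equal variances, the Bayesian boundary shifts toward the less prevalent group as the priors become more unequal, which is why the cutoff depends on the assumed prevalence.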

<sup>4</sup> In He et al. (2014b), the IRT parameters were calculated by the marginal maximum likelihood method with the assumption that ability was in a standard normal distribution. The original ability scale was therefore also in a standard normal distribution. In other words, after fixing the IRT parameters, the resulting ability scores are on a standard normally distributed scale. Therefore we can normalize the text score on the same scale.

Consequently, in the present study, we calculated the item information for all 21 items at this derived cutoff point and ranked the items in descending order, namely, from the item with the highest information to the item with the least information (see **Figure 1**). The items were ordered as follows: C6, B5, C4, B3, C5, C2, D5, B2, B4, D3, D2, F2, C7, D4, D1, C1, B1, C3, F1, E1, A2. We started by examining the performance of a combination of the text prior and the most informative item – text prior with item C6 (i.e., "did you have trouble feeling normal feelings like love, happiness, or warmth toward other people?") versus using item C6 alone. The second most informative item (B5) was then added for the comparison of the next pattern. The procedure continued until all 21 items were included. Both test information and the standard error of the θ-estimates were calculated for each pattern (i.e., with and without text prior) with an increasing number of informative items. Since textual assessment was suggested as a sort of complementary information to predict people's physical and mental health (e.g., Gottschalk and Gleser, 1969; Rosenberg and Tucker, 1979; Smyth, 1998; Franklin and Thompson, 2005), the test information was expected to increase, and the standard errors were expected to decrease, when text priors were added.
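The ranking step can be sketched with the standard 2PL item information function, I(θ) = α² P(θ)(1 − P(θ)), evaluated at the cutoff. The item labels and parameter values below are hypothetical placeholders, not the entries of Table 1, and the standard error shown is the usual asymptotic approximation 1/√(test information), which may differ from the posterior standard errors used in the study.

```python
import math

def item_information(theta, alpha, beta):
    """2PL item information: alpha^2 * P * (1 - P) at the given theta."""
    p = 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))
    return alpha ** 2 * p * (1.0 - p)

def rank_items(items, cutoff=-0.15):
    """Sort item labels by information at the cutoff, most informative first."""
    return sorted(items, key=lambda k: item_information(cutoff, *items[k]), reverse=True)

def standard_error(theta, params):
    """Asymptotic SE of theta: 1 / sqrt(sum of item information)."""
    return 1.0 / math.sqrt(sum(item_information(theta, a, b) for a, b in params))

# Hypothetical (alpha, beta) pairs, for illustration only.
items = {"item_a": (1.86, -0.2), "item_b": (1.32, -0.99), "item_c": (0.78, -4.45)}
order = rank_items(items)
print(order)
for n in range(1, len(order) + 1):
    subset = [items[k] for k in order[:n]]
    print(n, round(standard_error(-0.15, subset), 2))  # SE shrinks as items are added
```

Items with high discrimination and a difficulty close to the cutoff rank first, which matches the intent of selecting items that provide peak information around the diagnostic threshold.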

The performance of the three approaches was compared on five metrics: accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The diagnoses made in the structured interviews by psychiatrists were used as the true standard in the comparison. Accuracy, the main metric used in classification, is the percentage of correctly classified individuals. Sensitivity and specificity are the proportions of actual positives and actual negatives that are correctly identified, respectively. These two indicators do not depend on the prevalence in the sample (i.e., the proportions of "PTSD" and "non-PTSD" in the total), and hence are indicative of real-world performance. The predictive values, PPV and NPV, are estimators of the confidence in predicting correct classification; that is, the higher the predictive values are, the more reliable the prediction is.
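The five metrics follow directly from the cells of a two-by-two confusion matrix, as the sketch below shows. The counts are invented for illustration and are not the results of this study; the function name is an assumption.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute the five screening metrics from confusion-matrix counts.

    tp/fn: PTSD cases classified as PTSD / non-PTSD.
    tn/fp: non-PTSD cases classified as non-PTSD / PTSD.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of actual positives detected
        "specificity": tn / (tn + fp),   # share of actual negatives detected
        "ppv": tp / (tp + fp),           # share of positive calls that are correct
        "npv": tn / (tn + fn),           # share of negative calls that are correct
    }

# Hypothetical counts, for illustration only.
metrics = screening_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 2) for name, value in metrics.items()})
```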

### RESULTS

For the sample of 99 participants, the latent trait estimation via approach 1 resulted in a normal distribution of latent trait levels θ<sub>n</sub>, with a mean value of −0.39 and variance of 2.31. The standardized text scores obtained from approach 2 resulted in a range [−2.92, 4.22]. In approach 3, the latent linear regression model given by Formulas (4) and (5) was estimated using the item responses and the textual covariates. The intercept and slope coefficients were obtained as −0.41 and 1.44, respectively. The error term in the prior information (textual covariates) had a normal distribution with a mean value of zero and variance of 3.57. Hence, the informative prior distribution of the PTSD latent trait was defined as θ<sub>n</sub>|y<sub>n</sub> ∼ N(−0.41 + 1.44y<sub>n</sub>, 3.57).

The correlations among the estimates from the three approaches are presented in **Table 2**. The correlation between the EAP θ-estimates via approach 1 and the text scores estimated via approach 2 was 0.56, suggesting a positive and moderate relation between the self-narrative writing and the responses to the itemized questionnaire in the structured interview. This result reiterated the findings of earlier studies that words and expressions are capable of predicting one's mental health status.

**Table 3** shows the performance metrics of the three approaches. As we expected, the diagnostic accuracy rate was fairly high – 0.94 – when using the 21-item questionnaire by the IRT alone, and was improved to 0.97 with the addition of textual assessment. This means that 6 out of 99 respondents were misclassified using the IRT scale alone, while the number of misclassified respondents decreased to 3 out of 99 when adding the textual analysis as prior information. Using a 95% confidence interval, the paired sample t-test showed that the mean of the latent trait estimates (t = 3.86, df = 98, p < 0.01) and the standard deviation of the latent trait distribution (t = 3.70, df = 98, p < 0.01) both differed significantly with and without the text prior. That is, the extra information gained from the textual analysis helped locate the latent trait estimates closer to their true values, which helped decrease the misclassification rate by 50%. Given concerns about using only keywords as predictors to make the classification, the accuracy rate (0.84) produced by the textual assessment was satisfactorily high, although it was somewhat lower than that of the other two approaches. The sensitivity and NPV were perfect for all three approaches, implying that both the IRT and the textual assessments were sensitive for identifying PTSD patients. With the introduction of textual assessment, the specificity and PPV rose to 0.95 and 0.92, respectively. This suggested that the textual assessment played an effective role in detecting non-PTSD and strengthened the power in identifying PTSD in the population.

TABLE 2 | Correlations among estimates from three approaches: IRT, TX, and a combination of TX and IRT (21-item).

TX indicates the textual assessments. Correlation is significant at the 0.01 level (2-tailed).

TABLE 3 | Performance metrics compared among IRT, TX, and a combination of TX and IRT (21-item).

TX indicates textual assessment. PPV and NPV represent the positive predictive value and negative predictive value, respectively.

We further examined the relationship between the standard error of the estimate of θ and the number of items with the presence or absence of the text prior. We added items into the analysis one by one in an adaptive manner, in descending order of item information, which was derived at the cutoff point (see **Figure 1**). In **Figure 2**, the horizontal axis indicates the number of items in the IRT model and the vertical axis indicates the average standard error of the latent trait estimation. The curve of standard error without using the text prior (i.e., the dashed line), that is, using the IRT model alone via approach 1, starts around 1.6 and drops gradually to 0.68 when all 21 items are included. The curve of standard error using a text prior (i.e., the solid line) follows a similar pattern but stays at a lower level than the dashed curve. It starts around 1.4 (when the first item with the highest information was included) and ends around 0.65 (when all 21 items were included). Using a 95% confidence interval, the paired sample t-test showed that the standard error of estimation with the text prior was significantly lower than that without the text prior (t = 3.86, df = 98, p < 0.01) when including the whole range of 21 items. With the increasing number of items, the difference between these two curves decreased from 0.20 to 0.03. This suggested that the textual assessment did have an impact on the latent trait estimation, and the effect was more apparent when using fewer items. The red dotted line highlights the standard error when using 21 items without the text prior. It crosses the solid curve at 17 items, implying that with the introduction of the text prior, 17 items would be enough to make the estimation as precise as using the whole range of 21 items. That is, by using the text priors, the questionnaire length can be shortened by 4 items without sacrificing precision.

### DISCUSSION

In this study, a new intake procedure for PTSD screening was developed that combined an automated textual assessment of patients' self-narratives and an itemized questionnaire. To determine whether the introduction of text information is effective, we identified PTSD cases via three approaches: (1) we estimated PTSD latent trait by using IRT on a standardized questionnaire, (2) classified patients' self-narratives into PTSD and non-PTSD groups by using a text mining technique, and (3) estimated the posterior distribution of PTSD latent trait by combining textual assessments and IRT in a Bayesian framework by both a linear and adaptive method. With the sample at hand, the results showed that the combination model enhanced the accuracy of PTSD detection from 0.94 to 0.97, reduced the standard error of latent trait estimation, and could shorten the questionnaire length by four items without sacrificing accuracy.

In the current study, the diagnostic accuracy was already high (0.94) when using the itemized questionnaire alone (approach 1). However, a structured interview that generally employs questionnaires is time consuming in daily practice. The computerized textual assessment proposed in this study is relatively easy to conduct via the internet. The highly satisfactory detection accuracy rate (0.84) is promising for real application. Note that the threshold in textual analysis could be adjusted according to the requirements of the practitioner, for instance, using a relatively lower threshold to include the maximum number of potential PTSD patients for the second step in an itemized questionnaire, or increasing the threshold to a higher value in order to precisely detect PTSD patients by the textual assessment alone (He et al., 2012). Given concerns about the cost-effectiveness of screening at an initial stage, it would be interesting to combine these two approaches in a two-phase framework to reduce clinical expense and improve the accuracy rate.

Further, according to the results in the previous study of He et al. (2012), the NPV of the textual assessments was satisfactorily high – 0.85 – when the text classification algorithm PSM was applied in conjunction with unigrams. This meant that the textual screening tool was helpful in excluding the non-PTSD respondents from the follow-up tests. For the sample of 99 in the present study, taking the 85% confidence interval, 53 respondents could be excluded from further tests.

It is also worthwhile to discuss the cost-effectiveness of the hybrid test design that combined the textual analysis and item-based test. The results showed that using textual information helped save follow-up items. However, in weighing the benefits of the text prior, we should also take into account the amount of time it takes to write self-narratives. On the one hand, from the respondents' perspective, writing self-narratives provides flexibility to express the individual's inner world and prevents being passively triggered by sensitive questions, even if the process might take longer than directly responding to the itemized questionnaire. On the other hand, from the practitioners' perspective, the procedure for item development is often time consuming and involves multiple steps (e.g., data collection, data cleaning, field trial, item parameter calibration, and examination of reliability and validity of a scale). Comparatively, textual analysis could substantially shorten scale-development time and simplify the procedure once the model is successfully trained and refined with different textual contexts.

In addition, structured textual analysis that usually involves tight structures from existing software, such as Linguistic Inquiry and Word Count (LIWC; Pennebaker et al., 2001), is a good supplement to the text mining-based techniques. LIWC is a textual analysis software program that looks for words and counts them in categories relevant to psychology across multiple text files, for instance, essays, emails, blogs, novels, and so on. It has two central features – the processing component and dictionaries. During processing, the program goes through each file word by word. Each word in a given text file is compared with the dictionary file. A dictionary refers to the collection of words that define a particular category such as "family," "positive emotion," and "work." In a pilot study based on 50 self-narratives, half written by a PTSD group and half by a non-PTSD group, it was found that the PTSD respondents used significantly more emotional words and expressions related to family. These results are interesting enough to be addressed in another paper in the future.

Some limitations in the present study also merit discussion. First, the sample size was rather small at only 99 participants. Second, it was notable that female respondents represented the majority (approximately 78%) in the sample, which was consistent with the proportion of females in the target sample of PTSD<sup>5</sup> in the NCS-R. Further, evidence has shown that females are at higher risk for PTSD (e.g., Lancaster et al., 2016). It would be interesting to examine whether the screening method (with text priors) plays an equal role in detecting PTSD in males and females, especially given concerns about the potential differences in their writing habits. Third, those in the sample had an unusually high level of education. This was probably caused largely by data collection being conducted on an internet platform. People with a higher educational background are possibly more easily reached via a web-based test than a less educated group (Naglieri et al., 2004). It would be interesting to conduct a comparative study in the future to investigate whether demographic variables (e.g., age, gender, and education) affect the textual assessment and hybrid model.

Last but not least, since the data used in this study were collected via an online platform, special caution is needed regarding the potential risk of fake information. We invited at least two psychiatrists to check each self-narrative entry to ensure the input was reasonable and authentic and could be used in this study. However, how to validate internet data before they enter data processing remains an important topic. For instance, He et al. (2014a) introduced an approach to detect potential fake information in social media (i.e., Facebook) data collection via statistical models of person and item fit.

Prevalence of a condition is an important indicator when reporting the performance metrics of a screening method. Whereas sensitivity and specificity are independent of the prevalence of the disorder in the population, positive and negative predictive power are sensitive to population prevalence (Brewin, 2005). In our previous study (He et al., 2014b), we reported the possible prevalence as ranging from 5 to 50% and noticed that there was little difference in the accuracy of screening for PTSD using the PSM model when the range of prevalence was so large. It was also noticed that when the prevalence of PTSD in the sample was increased, the PPV increased as well. It meant that the confidence of correctly identifying PTSD also increased. In the current study, we note that both specificity and PPV increased when we used the hybrid model.
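The dependence of the predictive values on prevalence follows from Bayes' theorem and can be shown with a small worked example. The sensitivity and specificity values below are hypothetical and chosen only to illustrate the pattern; the function name is an assumption.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given population prevalence."""
    tp = sensitivity * prevalence                  # true positives per unit population
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false positives
    tn = specificity * (1.0 - prevalence)          # true negatives
    fn = (1.0 - sensitivity) * prevalence          # false negatives
    return tp / (tp + fp), tn / (tn + fn)

# Hypothetical sensitivity and specificity held fixed while prevalence varies.
for prevalence in (0.05, 0.20, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

With sensitivity and specificity unchanged, the PPV rises and the NPV falls as prevalence increases from 5 to 50%, which is the pattern described above.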

In summary, the current study presented a new trial in developing a hybrid model to combine textual assessment of patients' self-narratives and an itemized questionnaire in detecting mental illness. Its aim was to reduce the respondents' burden and clinicians' workload. By adding textual prior information, detection accuracy could be enhanced and test length could be shortened. The results demonstrated that the combination of a textual assessment and an IRT-based questionnaire is a promising approach to increase cost-effectiveness in PTSD diagnosis and is expected to be applicable in a broader scope of both (online) screening and psychiatric diagnosis as well as other psychological and educational assessments in the future. Further, with the rapid development of computer-based assessments, more data could be captured during the assessment process. The use of timing data as well as action sequences, keystrokes (e.g., type in and delete), and other process-related information holds promise for contributions to the advancement of screening methods in future research.

### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of Code of Ethics for Research in the Social and Behavioural Sciences Involving Human Participants used as the guidelines by the Faculty of Behavioural, Management and Social Sciences (BMS) Ethics Committee, University of Twente with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the BMS Ethics Committee, University of Twente.

### AUTHOR CONTRIBUTIONS

QH contributed to the development of the methodological framework and the model estimation procedures, conduction of the data analysis, and the drafting and revision of the manuscript. BV contributed to providing suggestions on the methodological framework and the model estimation procedures, and the reviewing and revision of the manuscript. CG contributed to providing suggestions on the methodological framework, and the reviewing of the manuscript. SB contributed to providing suggestions on the model estimation procedures and conduction of the data analysis, and the reviewing of the manuscript.

### FUNDING

This study was partially supported by the Stichting Achmea Slachtoffer en Samenleving, Netherlands.

### ACKNOWLEDGMENTS

This manuscript was included in the first author's doctoral thesis "Text Mining and IRT for Psychiatric and Psychological Assessment" (He, 2013). The authors confirmed this is the only medium it has appeared in and is in line with the authors' university policy. The thesis can be accessed online at https://research.utwente.nl/en/publications/text-mining-andirt-for-psychiatric-and-psychological-assessment. The authors would like to thank Larry Hanover for his help in reviewing this manuscript.

<sup>5</sup>Only people who had mental health problems or were screened in round 1 as having a high potential for mental health problems were included as the target sample for PTSD in the NCS-R.

### REFERENCES



Novel Screening Method? PsyArXiv (Preprints). Available at: https://psyarxiv.com/y68fx/ (accessed October 10, 2019).


scales among crime victims. Psychol. Assess. 15, 101–109. doi: 10.1037/1040-3590.15.1.101


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 He, Veldkamp, Glas and van den Berg. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Predictive Feature Generation and Selection Using Process Data From PISA Interactive Problem-Solving Items: An Application of Random Forests

#### Zhuangzhuang Han<sup>1</sup> \*, Qiwei He<sup>2</sup> \* and Matthias von Davier<sup>3</sup> \*

<sup>1</sup> Teachers College, Columbia University, New York, NY, United States, <sup>2</sup> Educational Testing Service, Princeton, NJ, United States, <sup>3</sup> National Board of Medical Examiners, Philadelphia, PA, United States

#### Edited by:

Samuel Greiff, University of Luxembourg, Luxembourg

#### Reviewed by:

Timothy R. Brick, Pennsylvania State University, United States Daniel W. Heck, University of Marburg, Germany

#### \*Correspondence:

Zhuangzhuang Han zh2198@tc.columbia.edu Qiwei He qhe@ets.org Matthias von Davier MvonDavier@nbme.org

#### Specialty section:

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

Received: 11 January 2019 Accepted: 17 October 2019 Published: 21 November 2019

#### Citation:

Han Z, He Q and von Davier M (2019) Predictive Feature Generation and Selection Using Process Data From PISA Interactive Problem-Solving Items: An Application of Random Forests. Front. Psychol. 10:2461. doi: 10.3389/fpsyg.2019.02461

The Programme for International Student Assessment (PISA) introduced the measurement of problem-solving skills in the 2012 cycle. The items in this new domain employ scenario-based environments in terms of students interacting with computers. Process data collected from log files are a record of students' interactions with the testing platform. This study suggests a two-stage approach for generating features from process data and selecting the features that predict students' responses using a released problem-solving item, the Climate Control Task. The primary objectives of the study are (1) introducing an approach for generating features from the process data and using them to predict the response to this item, and (2) finding out which features have the most predictive value. To achieve these goals, a tree-based ensemble method, the random forest algorithm, is used to explore the association between response data and predictive features. Also, features can be ranked by importance in terms of predictive performance. This study can be considered as providing an alternative way to analyze process data having a pedagogical purpose.

Keywords: process data, interactive items, feature generation, feature selection, random forests, problem-solving, PISA

### INTRODUCTION

Computer-based assessments (CBAs) are used for more than increasing construct validity (e.g., Sireci and Zenisky, 2006) and improving test design (e.g., van der Linden, 2005) through inclusion of adaptive features. They also provide new insights into behavioral processes related to task completion that cannot be easily observed using paper-based instruments (Goldhammer et al., 2013). In CBAs, a variety of timing and process data accompany test performance. This means that much more data from the response process of an answer is available in addition to correctness or incorrectness.

Along with assessing the core domains of Math, Reading, and Science, the Programme for International Student Assessment (PISA) introduced a problem-solving domain in the 2012 cycle, with fundamental technical support from computer delivery. It enabled interactive problems – problems in which exploration is required to uncover undisclosed information (Ramalingam et al., 2014) – to be included in a large-scale international assessment for the first time (Organisation for Economic Co-operation and Development [OECD], 2014b). Dynamic records of actions generated during the item-response process form a distinct sequence representing participants' input and the internal state of the assessment platform. Analyzing these sequences can facilitate understanding of how individuals plan, evaluate, and select operations to achieve the problem-solving goal (e.g., Goldhammer et al., 2014; Hao et al., 2015; He and von Davier, 2016; Liao et al., 2019).

The problem-solving items in this new domain were typically designed as interactive tasks. The contents of these items cover a broad scope, from choosing an optimal geographic path between departure and destination points to purchasing metro tickets via a vending machine. Both the students' responses and the whole sequence of actions by which they solved the problem (e.g., clicking buttons, drawing lines, dragging on a scale, and performing keystrokes to respond to open-ended items) were captured in log files. The data contained in log files, referred to as process data in the present study, provide information beyond response data (i.e., whether the final response was correct or not).

While process data are expected to provide a broader range of information, the complex embedded structure demands an extension of existing analysis methods. These demands entail efforts to apply techniques used in other disciplines such as data mining, machine learning, natural language processing (NLP), social networking, and sequence data mining. These techniques serve two purposes: (1) creating predictive features/variables<sup>1</sup> associated with an outcome variable (i.e., feature generation) and (2) determining which features are the most predictive (i.e., feature selection).

The present study analyzed process data from a released PISA 2012 item (Organisation for Economic Co-operation and Development [OECD], 2014a), the Climate Control Task, that is intended to measure problem-solving skills of participants in science. The purpose of this study was twofold: first, to use process data obtained in a simulation-based environment to generate predictive features; and second, to identify the most important predictive features associated with success or failure on the task. The present study employed one of the tree-based ensemble methods – random forests – to select the most predictive features when considering students as the target of inferences.

The remainder of this paper is organized as follows. First, a brief overview of the methods is provided for generating features from process data and selecting important classifiers. The random forest algorithm is introduced and its potential use in analyzing process data is discussed. In the subsequent section, an integrated approach for generating features from process data and selecting features by the algorithm is introduced using a case study from the PISA 2012 problem-solving item. Results obtained from the introduced approach and their interpretations are then presented. Lastly, thoughts on the limitations and implications of the suggested approach are given.

### OVERVIEW OF FEATURE GENERATION AND SELECTION USING PROCESS DATA

### Generating Features Using Process Data

The principle of predictive feature generation is to maximize information exploration generated solely from timing and process data. This information may be indicative of respondents' problem-solving processes, which are associated with the problem-solving skills targeted in the assessment. As summarized in He et al. (2018), the features collected in log files can be roughly categorized into three groups: (1) behavioral indicators that represent respondents' problem-solving strategies and interactions with the computer, (2) sequences of actions and mini-actions that are directly extracted from test takers' process data, and (3) timing data such as total time on task, duration of respondent actions in the simulation environment, and time until first actions are taken by the respondent when solving a digital task.

#### Behavioral Indicators

Behavioral indicators are typically recorded at a higher, aggregated level. Although human-computer interactions are often accomplished through simple gestures or movements, in most cases, they are not automated actions but involve case-based reasoning and self-regulatory processes (Shapiro and Niederhauser, 2004; Azevedo, 2005; Lazonder and Rouet, 2008; Zimmerman, 2008; Brand-Gruwel et al., 2009; Bouchet et al., 2013; Winne and Baker, 2013). Therefore, to perform well on computer-based problem-solving tasks, one needs to have essential skills in using information and communication technology tools and higher-level skills in problem solving. Respondents have to decode and understand menu names or graphical icons in order to follow the appropriate chain of actions to reach a goal. Meanwhile, problem-solving tasks also require higher-order thinking, finding new solutions, and interacting with a dynamic environment (Mayer, 1994; Klieme, 2004; Mislevy et al., 2012; Goldhammer et al., 2014).

A typical example is the strategy indicator "vary one thing at a time (VOTAT)" studied in Greiff et al. (2015). This study showed that VOTAT was highly correlated with student performance. Note that solving complex, interactive tasks requires developing a plan consisting of a set of properly arranged subgoals and performing corresponding actions to attain the final goal. This differs from solving logical or mathematical problems, where complexity is determined by reasoning requirements but not primarily by the information that needs to be accessed and used (Goldhammer et al., 2013). In this sense, one could argue that the indicators of user actions should in some systematic way map onto the subgoals a user develops and applies to achieve a successful completion of the learning or assessment task.

Another example of a strategy indicator was derived from the problem-solving path and pace of examinees as studied in Lee and Haberman (2016). In this study, it was found that test takers adopted different strategies in solving reading tasks in an international language assessment and that these strategies were highly related to respondents' country, language, and cultural background. For example, the typical strategy of test takers from two Asian countries was to skip the passage and view the questions first. Based on what the item's instructions requested, those test takers went back to read the passage and locate the information needed. Conversely, participants from two European countries were found to follow what was intended, that is, read the stimuli passage first and then answer the questions. These two strategies did not have a significant relationship to performance of test takers, although substantial performance differences and completion rates were found in the low-performing group.

<sup>1</sup>Predictor variables and covariates are also used interchangeably without being specifically mentioned in sections that follow.

#### Sequences of Actions From Process Data

The importance of sequence data in education has been recognized for decades. Agrawal and Srikant (1995) said "the primary task, as applied in a variety of domains including education, is to discover patterns that are found in many of the sequences in a dataset." Actions or mini-sequences that can be represented as n-grams (He and von Davier, 2015, 2016) are typical indicators to describe respondents' behavioral patterns. For instance, actions related to "cancel" (e.g., clicking on a cancel button in order to go back and change or check entries again) are sequence indicators, which are associated with test takers' cognitive processes and may indicate hesitation or uncertainty about next steps. Mini-sequences can show not only the actions adjacent to each other, but also the strategy link between the actions. For example, in He and von Davier (2016), strategy changes between the searching and sorting functions were successfully identified through analysis of bigrams and trigrams. More details on the use of n-grams for analyzing action sequences are given in the section "Materials and Methods".
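As a minimal sketch of how n-grams can be extracted from an action sequence, the snippet below counts contiguous bigrams and trigrams. The helper `action_ngrams` and the action labels are hypothetical; real log files use task-specific event codes.

```python
from collections import Counter


def action_ngrams(actions, n):
    """Count the contiguous n-grams in one respondent's action sequence."""
    return Counter(tuple(actions[i:i + n]) for i in range(len(actions) - n + 1))


# Hypothetical action labels, chosen only for illustration.
sequence = ["start", "search", "sort", "search", "sort", "cancel", "submit"]
bigrams = action_ngrams(sequence, 2)
trigrams = action_ngrams(sequence, 3)

# The most frequent bigrams hint at recurring transitions between functions.
print(bigrams.most_common(2))
```

The resulting counts can be used directly as features, one column per n-gram, or weighted before being passed to a feature selection method.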

Some researchers have employed sequential pattern mining to inform student models for customizing learning to individual students (e.g., Corbett and Anderson, 1995; Amershi and Conati, 2009). Other researchers have employed sequential pattern mining to better understand groups' learning behaviors in designed conditions (e.g., Baker and Yacef, 2009; Zhou et al., 2010; Martinez et al., 2011; Anderson et al., 2013). As Kinnebrew et al. (2013) summarized, "identifying sequential patterns in learning activity data can be useful for discovering, understanding, and, ultimately, scaffolding student learning behaviors." Ideally, these patterns provide a basis for generating models and insights about how students learn, solve problems, and interact with the environment. Algorithms for mining sequential patterns generally associate some measures of frequency to rank identified patterns. The frequency of a pattern along the problem-solving process timeline can provide additional information for interpretation. Further, the observed variability across action-sequence patterns may play an important role in identifying behavioral patterns that occur only during a certain span of time or become more or less frequent than ones occurring frequently but uniformly over time, thus allowing us to explore what conditions lead to such changes (Kinnebrew et al., 2013).

Sequential pattern mining can be conducted via various approaches. For instance, Biswas et al. (2010) used hidden Markov models (HMMs; Rabiner, 1989; Fink, 2008) as a direct probabilistic representation of the internal states and strategies. This methodology facilitated identification, interpretation, and comparison of student learning behaviors at an aggregate level. As with students' mental processes, the states of an HMM are hidden, meaning they cannot be directly observed but produce observable output (e.g., actions in a learning environment).

As Fink (2008) pointed out, the development and spread in use of sequential models was closely related to the statistical modeling of texts as well as the restriction of possible sequences of word hypotheses in automatic speech recognition. Motivated by the methodologies and applications in NLP and text mining (e.g., He et al., 2012; Sukkarieh et al., 2012), a number of methods from these fields can be borrowed for application in process data analysis. For instance, the longest common subsequence introduced by Sukkarieh et al. (2012) to educational measurement for scoring computer-based Program for the International Assessment of Adult Competencies items was used in He et al. (2019) to extract the most likely strategy by respondent in each item by calculating the distance between individual sequences and reference ones. This approach allowed comparisons of respondents' behavior across multiple items in an assessment.
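To make the longest-common-subsequence idea concrete, the following sketch computes the LCS length between an observed action sequence and a reference sequence with standard dynamic programming. The function names, the example actions, and the normalisation by the longer sequence are assumptions made for illustration, not the procedure used in He et al. (2019).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length divided by the longer length (an assumed normalisation)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


# Hypothetical observed and reference sequences.
observed = ["open", "search", "sort", "select", "submit"]
reference = ["open", "sort", "select", "submit"]
print(lcs_similarity(observed, reference))
```

Comparing each respondent's sequence against several reference sequences and keeping the closest one is one way to assign a most likely strategy per item.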

#### Features Generated From Timing Data

In addition to sequential data on actions taken by respondents during the problem-solving process, CBAs provide rich data on response latency or timing data. Each action log entry is associated not only with data on what a respondent did, but also when the action took place. These timestamps can be aggregated to an overall time measure for the survey, which is specific to the task, or measures that are specific to certain types of interactions such as keystrokes, navigation behavior, or time taken for reading instructions. Timing data at this level of resolution has led to renewed interest in how latency can be used in modeling response processes (e.g., DeMars, 2007; van der Linden et al., 2010; Weeks et al., 2016). In addition, timing data information is expected to be valuable in conjunction with the types of actions observed in the sequence data and to help us derive features that allow predicting cognitive outcomes such as test performance as well as background variables (Liao et al., 2019).
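A minimal sketch of aggregating timestamps into timing features is shown below. The log format (a time-ordered list of `(seconds, action)` pairs whose first entry marks the start of the item) and the feature names are assumptions for illustration only.

```python
def timing_features(events):
    """Derive simple timing features from time-ordered (seconds, action) pairs.

    The first event is assumed to mark the start of the item.
    """
    times = [t for t, _ in events]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "total_time": times[-1] - times[0],
        "time_to_first_action": gaps[0] if gaps else 0.0,
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        "longest_pause": max(gaps) if gaps else 0.0,
    }


# Hypothetical log entries for one respondent.
log = [(0.0, "START_ITEM"), (12.5, "move_slider"), (15.0, "apply"), (40.0, "END_ITEM")]
features = timing_features(log)
print(features)
```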

### Predictive Feature Selection

Feature selection models play an essential role in identifying predictive indicators that can distinguish different groups, such as the correct and incorrect groups at the item level in problem-solving processes. A variety of models that have been developed in "big data" fields that relate to information retrieval, NLP, and data mining are also applicable to process data analysis. Here, we discuss some feature selection models that are commonly used in similar settings, ultimately focusing on one tree-based ensemble method – the random forest method – which is applied in this study.

As reviewed by Forman (2003) as well as Guyon and Elisseeff (2003), feature selection approaches are essentially divided into wrappers, filters, and embedded methods. Wrappers utilize the learning machine of interest as a black box to score subsets of variables according to their predictive power. Filters select subsets of variables as a preprocessing step, independent of the chosen predictor. Embedded methods perform variable selection in the process of training and are usually specific to given learning machines. We introduce these three methods in detail in the following subsections. Among the embedded methods, the random forests method used in this study is highlighted.

#### Wrapper Methods


These methods, popularized by Kohavi and John (1997), offer a simple and powerful way to address the problem of variable selection, regardless of the chosen machine learning approach. In their most general formulation, they consist of using the prediction performance of a given approach to assess the relative usefulness of subsets of variables. Commonly used wrapper strategies, such as sequential forward selection or genetic search, explore the space of possible feature subsets, "repeatedly calling the induction algorithm as a subroutine to evaluate various subsets of features" (Guyon and Elisseeff, 2003). These methods are practical for low-dimensional data but often not for more complex large-scale problems because the computations become intractable (Forman, 2003).
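The wrapper idea can be sketched as a greedy sequential forward selection in which the learner is hidden behind a scoring function. Everything named here (`forward_selection`, `toy_score`, the feature names and their values) is hypothetical; in practice the score would be something like the cross-validated accuracy of the chosen learner.

```python
def forward_selection(features, score, max_features):
    """Greedy sequential forward selection.

    features: iterable of candidate feature names
    score: callable that maps a tuple of feature names to a number
           (higher is better); it stands in for the learning machine
    """
    selected, remaining = [], list(features)
    best_score = float("-inf")
    while remaining and len(selected) < max_features:
        scored = [(score(tuple(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(scored)
        if top_score <= best_score:
            break  # no remaining candidate improves the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Toy additive score with assumed feature contributions (illustration only).
toy_value = {"votat": 0.30, "n_resets": 0.10, "time_on_task": 0.05, "noise": -0.02}


def toy_score(subset):
    return 0.5 + sum(toy_value[f] for f in subset)


selected, best = forward_selection(toy_value, toy_score, max_features=3)
print(selected, round(best, 2))
```

Each pass calls the scoring function once per remaining candidate, which is why the cost of wrappers grows quickly with the number of features.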

#### Filter Methods

These methods apply an intuitive approach in that the associations of each predictor variable with the response variable are individually evaluated, and those most associated with it are selected. For nominal response variables (the case considered in this study), measures of dispersion (also referred to as concentration or impurity depending on the context) such as Gini's impurity index and Shannon's (1948) entropy are employed as the building blocks for measures of association between response variables and features (Haberman, 1982; Gilula and Haberman, 1995). In cases where response and features are both categorical, Goodman and Kruskal (1954) measure the association using the proportional reduction of concentration when a predictor variable is involved. Other examples of measures of association can be found in Theil (1970), Light and Margolin (1971), and Efron (1978).

Practices in areas such as NLP implement an even more simplified approach by comparing the value of test statistics of association such as the chi-square statistic for the nominal response and categorical independent variable (Nigam et al., 2000; Oakes et al., 2001; He et al., 2012, 2014). Though some have raised concerns that this approach lacks statistical significance and soundness, its practical effectiveness in ordering the importance of categorical features makes it broadly accepted by certain audiences (Manning and Schütze, 1999; Forman, 2003). Applications can be found in the recent literature about feature selection in large-scale assessment (He and von Davier, 2015, 2016; Liao et al., 2019).
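As a sketch of the filter approach with the chi-square statistic, the snippet below scores each binary feature against a binary response from its 2 × 2 contingency table and ranks the features by the statistic. The function name and all counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table.

                 feature present   feature absent
    correct            a                 b
    incorrect          c                 d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical counts of students whose sequences contain each feature.
counts = {
    "feature_A": (40, 10, 15, 35),
    "feature_B": (26, 24, 25, 25),
}
ranked = sorted(counts, key=lambda f: chi_square_2x2(*counts[f]), reverse=True)
print(ranked)
```

Here the statistic is used only to order features, in line with the practical use described above, not as a formal significance test.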

#### Embedded Methods

These methods incorporate variable selection as part of the model training process. Compared with wrapper methods, they are more efficient and reach a faster solution by avoiding retraining a predictor from scratch for every variable subset investigated (Guyon and Elisseeff, 2003). For instance, the classification and regression tree (CART; Breiman et al., 1984) algorithm can be redesigned to serve this purpose. The random forest algorithm (Breiman, 2001), as an extension of CART that is a random ensemble of multiple trees, belongs to the family of embedded methods and is the method chosen for the current study. The random forest algorithm increasingly adjusts itself by randomly combining a predetermined number of single tree algorithms (shortened to trees in later sections). By aggregating the prediction results obtained from individual trees, the forest reduces prediction variance and improves overall prediction accuracy (Dietterich, 2000).

Some basic ideas about tree algorithms are reviewed here to facilitate understanding of the random forest algorithm. Let $X = \{X_1, \ldots, X_p\}$ denote the covariates and $Y$ the outcome variable. Instead of establishing an analytical form for predicting $Y$ from $X$, a decision tree grows by recursively splitting the space of covariates spanned by the set $X$ in a greedy way such that the segments (nodes) created have the least possible impurity (for classification) or mean squared error (for regression) and are thus used to predict $Y$. Binary splitting, that is, splitting a parent node into two child nodes, is conventionally employed and guided by the splitting rules. For classification, one of the rules is the Gini impurity index (Breiman et al., 1984; Breiman, 2001),

$$I_G(s,t) = 1 - \sum_k p_k^2(s,t),$$

where $t$ denotes the current node, $p_k(s, t)$ is the frequency of class $k$ in the samples of node $t$, and split $s$ represents a certain numeric value or class label of a covariate $X_j$. If $Y$ is binary, the above expression simplifies to $1 - p_0^2(s, t) - p_1^2(s, t)$. It is intuitive that the index is a measure of dispersion: 0 stands for the most extreme concentration, and larger values indicate greater dispersion, with an upper bound of $1 - 1/K$ for $K$ classes that approaches 1 as $K$ grows. In other fields such as ecology, the index used to measure diversity is known as the Simpson-Gini Index due to its similarity to the Simpson Index (Peet, 1974). It should be noted that the estimate of $I_G(s, t)$ is biased for small samples if the sample frequencies $f_k(s, t) = n_k(s, t)/n(s, t)$ are directly used. This is because the unbiased estimate of $p_k^2(s, t)$ is $\frac{n_k(s,t)[1 - n_k(s,t)]}{n(s,t)[1 - n(s,t)]}$. A simple modification can be implemented to correct this bias.
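The index and its small-sample correction can be sketched in a few lines. The function names are hypothetical; the corrected version simply replaces each squared sample frequency with the unbiased estimate of the squared class probability given above.

```python
from collections import Counter


def gini_impurity(labels):
    """Plug-in Gini impurity, 1 - sum_k f_k^2, from sample frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_impurity_unbiased(labels):
    """Gini impurity using n_k(n_k - 1) / (n(n - 1)) in place of f_k^2."""
    n = len(labels)
    if n < 2:
        return 0.0
    pairs = sum(count * (count - 1) for count in Counter(labels).values())
    return 1.0 - pairs / (n * (n - 1))


node = [1, 1, 0, 0]  # two correct and two incorrect responses in a node
print(gini_impurity(node), gini_impurity_unbiased(node))
```

For a balanced binary node the plug-in value is 0.5, the largest value possible with two classes, while a pure node gives 0.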

The optimal split is determined by seeking the s that maximizes

$$
\Delta I_G(s, t) = I_G(s, t) - \frac{1}{N_t}\left[ N_{t_l} I_G(s, t_l) + N_{t_r} I_G(s, t_r) \right]
$$

through the given predictors in set $X$. The quantity above indicates the decrease of impurity resulting from splitting the parent node $t$ at $s$ into the left child node $t_l$ and the right child node $t_r$. Sample sizes ($N_{t_l}$ and $N_{t_r}$) of child nodes are used to obtain the weighted impurity. For regression, the mean squared error is applied as the splitting rule (Breiman et al., 1984; Breiman, 2001).
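A minimal sketch of this split search for a single numeric covariate is given below. It assumes the convention that values less than or equal to the threshold go to the left child; the function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Plug-in Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted


def best_split(values, labels):
    """Threshold with the largest impurity decrease, or None if no split exists."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split
    if not candidates:
        return None
    return max(candidates, key=lambda s: impurity_decrease(values, labels, s))


values = [1, 2, 3, 4, 5, 6]   # hypothetical covariate, e.g. number of trials
labels = [0, 0, 0, 1, 1, 1]   # hypothetical incorrect/correct responses
print(best_split(values, labels), impurity_decrease(values, labels, 3))
```

A full tree repeats this search over every available covariate at each node and then recurses into the two children.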

Random forests ensemble individual decision trees through the following steps. First, subsets of samples are randomly drawn from the whole sample dataset and individual trees are grown based on each subset of samples. Note that data entries not chosen in each random draw are called "out of bag" data and kept for validation purposes. Second, each individual decision tree in the random forest grows by recursively splitting a parent node into two or more child nodes with respect to a set of predictor variables as previously discussed. Rather than seeking the "best" cut point through all available predictor variables, a tree in a random forest examines only a set of m randomly chosen variables at each split. An individual tree stops growing when a preset number of leaf nodes (nodes at the end of the tree that have no child nodes) or a threshold in terms of impurity of child nodes is reached. Third, final predicted responses are obtained by aggregating the prediction results over these fitted individual trees constructed using different subsets of covariates.
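The randomization and aggregation steps can be sketched without growing actual trees. The helper names are hypothetical, and drawing the sample with replacement (a bootstrap sample) is an assumption based on the usual formulation of random forests (Breiman, 2001).

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; unused rows are 'out of bag'."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_features(p, m, rng):
    """Choose m of the p covariate indices to examine at one split."""
    return rng.sample(range(p), m)


def majority_vote(tree_predictions):
    """Aggregate the class predictions of all trees for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_indices(10, rng)   # step 1: sample rows
split_candidates = candidate_features(8, 3, rng)  # step 2: m = 3 of p = 8
forest_prediction = majority_vote([1, 0, 1])      # step 3: aggregate trees
```

In a full implementation each tree would be grown on its in-bag rows, restricted to fresh split candidates at every node, and evaluated on its out-of-bag rows.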

Even though the stability of an individual tree in terms of prediction is still not quite comparable with a typical logistic regression model fitted using all covariates, Breiman et al. (1984) argued that the variance is reduced because of the aggregation, which further enhances the overall prediction performance. Lin and Jeon (2006) showed that the random forest outperforms other less model-based predictive methods in cases with moderate sample sizes. In addition to the improvement on prediction performance, random forests also have other advantages in practice. As introduced above, only a certain number of covariates are selected to conduct each split when growing a decision tree. Such a feature allows the random forest algorithm to fit with a relatively larger number of predictor variables (especially for categorical variables) on a given sample size compared to other predictive methods such as linear models (e.g., generalized linear models), for which fitting with an extensive number of predictors may create data sparsity and reduce the numerical robustness.

In addition, two built-in variable selection methods of random forests, using two types of variable importance measures (VIMs), namely (1) impurity importance and (2) permutation importance, have been successfully applied in fields such as gene expression and genome-wide association studies (Díaz-Uriarte and Alvarez de Andrés, 2006; Goldstein et al., 2011). The current study utilizes the permutation importance to select the most important variables extracted from the process data.

Impurity importance is quantified by accumulating $\Delta I_G(s, t)$ for each covariate over nodes of all trees. The accumulation is weighted by the sample sizes of nodes. While the importance measure enjoys all the computational convenience of the random forest algorithm, the splitting mechanism – just by chance – favors variables with many possible split points (e.g., categorical variables with many levels), resulting in a biased variable selection (Breiman et al., 1984; White and Liu, 1994). Much statistical literature further investigated this issue and proposed practical solutions (Kim and Loh, 2001; Hothorn et al., 2006; Strobl et al., 2007; Sandri and Zuccolotto, 2008). For instance, Strobl et al. (2007) reimplemented the random forest method based on Hothorn et al.'s (2006) conditional inference tree-structural algorithms (ctrees) to provide unbiased estimation of impurity importance. Instead of altering the algorithm, Sandri and Zuccolotto (2008) proposed a heuristic procedure to directly correct the bias of impurity measure by differentiating the "importance" resulting from characteristics of variables from the importance due to the association with the outcome variable.

As another built-in VIM of the random forest algorithm, the measure of permutation importance is free from this undesirable bias. Although it has been criticized for its computational inconvenience, the simple nature of the permutation importance measure becomes attractive as computation speed increases. The rationale of the permutation importance measure is as follows: First, a predictor variable, say $X_j$, is permuted in terms of the order of samples. Second, together with the other unaltered variables, another random forest algorithm is fit to compare with the algorithm constructed using unaltered samples. Permutation breaks the original association between $X_j$ and $Y$, resulting in a drop of prediction accuracy for the testing data. Lastly, the rank of predictor variables can be established after applying this procedure to each covariate. In the present study, the permutation importance measure, also known as the mean decrease accuracy (Breiman, 2001), was implemented to conduct variable selection.
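The drop-in-accuracy logic can be sketched in a model-agnostic way. Note the assumption: the sketch scores one fixed, already-fitted predictor on permuted held-out data, a common shortcut, instead of refitting a forest as described above. The function names, the toy predictor, and the data are all hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the predictor returns the observed label."""
    return sum(predict(row) == y for row, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, j, n_repeats=20, seed=0):
    """Mean decrease in accuracy after shuffling column j across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


def toy_predict(row):
    """Stand-in for a fitted model that only uses the first covariate."""
    return int(row[0] > 0.5)


rows = [[0.1, 5], [0.9, 5], [0.2, 7], [0.8, 7], [0.3, 5], [0.7, 7], [0.4, 7], [0.6, 5]]
labels = [0, 1, 0, 1, 0, 1, 0, 1]
importance = [permutation_importance(toy_predict, rows, labels, j) for j in range(2)]
print(importance)
```

Because the toy predictor ignores the second covariate, shuffling that column cannot change any prediction and its importance is exactly zero.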

Tree-based ensemble algorithms also include bagging (Breiman, 1996) and boosting (Freund and Schapire, 1997). Bagging-tree algorithms are similar to random forests but are more straightforward in terms of randomizing the data and growing individual trees. Boosting-tree algorithms grow a sequence of single trees in a way that the latter grown tree fits the variation not explained by the former grown tree. Bayesian additive regression tree (BART; Chipman et al., 2010) is a tree ensemble method established in the Bayesian approach, offering a straightforward means of handling model selection by specifying a prior for the tuning parameter controlling the complexity of trees. Meanwhile, BART considers the uncertainty of parameter estimation with that of model selection. In addition, this method provides a flexible way to address the missing data issue by allowing for directly modeling the missing mechanism.

### MATERIALS AND METHODS

### Item Description and Data Processing

This study analyzed process data from a computer-based problem-solving item from PISA 2012 – Climate Control Task 1 (item code CP02501). The full-sample data has been made publicly available by the OECD<sup>2</sup> . The dataset for this item includes responses from 30,224 15-year-old in-school students from 42 countries and economies. Sample sizes of countries and economies are shown in **Table 1**.

<sup>2</sup>The dataset is available at http://www.oecd.org/pisa/data/pisa2012databasedownloadabledata.htm.

This item (a snapshot of the item is shown in **Figure 1**) asked test takers to determine which of the three sliders controls temperature and which controls humidity, respectively. To obtain the correct answer, test takers were permitted to manipulate the sliders and monitor changes through the display. The answer to the task was given by drawing lines linking the diagrams to indicate the association between the inputs (sliders) and outputs. The correct solution is shown in **Figure 1**. The "reset" button undid previous simulations by clearing the display and resetting the sliders to their initial status. No limit was imposed on the number of steps of manipulation or rounds of exploration. Also, no time constraint was imposed on each item; however, the total test time of a test cluster (problem-solving items) was limited to 20 min. Either one or two clusters were randomly given to a participant depending on different assessment designs (Organisation for Economic Co-operation and Development [OECD], 2014b). The order of items in a cluster was fixed, and a former item could not be resumed once the next item had begun. According to different assignments of clusters, the position of Climate Control Task 1 varied across test forms. For this item, the average time spent by students was 125.5 s and the median time was 114.5 s; 95% of examinees spent from 22.2 s to 290.2 s on the item; only 1,149 participants (about 3.8% of the total sample) finished the task in 30 s or less, with a 5.1% rate of correctness. Given these results, later sections of the paper assume that the item is generally not speeded for this sample and that position effects, if any, are negligible. However, the analysis in the current study was conducted without considering speededness, which should be noted as a limitation and investigated further in future research.

Items like Climate Control Task 1 are constructed using the MicroDYN approach (Greiff et al., 2012), which combines the theoretical framework of linear structural equation models for systematically constructing tasks (Funke, 2001) with multiple independent tasks to increase reliability. Briefly speaking, a system of causal relations (e.g., the first slider controls temperature) is embedded in a scenario that allows participants to explore input variables and observe the corresponding changes of output variables through a graphical representation. No specific prior domain knowledge is required for this type of task in general. However, examinees need to gain and have command of the knowledge by exploring and experimenting before providing appropriate answers. For such tasks, strategic knowledge for effective exploration is crucially important (Greiff et al., 2015), that is, the VOTAT (vary one thing at a time; Tschirgi, 1980) strategy, which is also known as the control-of-variables strategy (Chen and Klahr, 1999) in developmental psychology.

In PISA 2012, a partial credit assignment – 0 for incorrect, 1 for partially correct, and 2 for correct – was used to score the responses of Climate Control Task 1. Partial credit was given if a student explored the simulation by using the VOTAT strategy efficiently – only varying one control at a time when trying to change the status of each control individually at least once, regardless of actions being in adjacent attempts or in a round before resetting – but failed to correctly represent the association in a diagram.

To show that the VOTAT strategy is strongly related to performance on the item, Greiff et al. (2015) recoded the polytomous responses as dichotomous by treating partially correct as incorrect and then investigated the association between the dichotomous responses and the indicator of applying the VOTAT strategy efficiently alongside other covariates. Following the same settings, the present study explored the association between the binary responses and the indicator of the use of the VOTAT strategy together with other covariates created from the process data to find out (1) whether the current partial scoring rubric was still supported by the prediction model (i.e., random forests), namely, whether the VOTAT variable was still the factor most strongly associated with responses while interacting with other covariates, and (2) whether the rubric was still sufficient compared with the new predictor features extracted from the process data. It should be noted that this restriction of the response variable may not be applicable to items that are intended to measure a construct other than interactive complex problem-solving skills (Cheng and Holyoak, 1985; Funke, 2001) or that were constructed without using the MicroDYN approach.

**Table 2** shows a section of the postprocessed log file, that is, a readable process dataset whose entries are actions listed in chronological order. The even number indicates the actions belong to a certain test taker. The type of action, as well as the corresponding timestamp, was recorded for each action. Among the action types, "apply" represents actions related to manipulation of sliders because, after setting sliders, a test taker needed to hit the "apply" box, as shown in **Figure 1**, to see the changed value of temperature and humidity displayed. The changed status of sliders was recorded in the columns "top slider," "central slider," and "bottom slider." The value of status ranges from −2 to 2. Similarly, the action type "diagram" represents drawing a line to link diagrams, as shown at the bottom right of **Figure 1**. The six-digit binary string shown in the table was used to record the association among diagrams that has been established. For example, "100101" indicates that the top slider controls temperature, whereas the central and bottom sliders control humidity.

To facilitate the analysis, observed sequences of actions were collapsed into respective strings. To obtain such a string, each type of action is abbreviated using a single capital letter: "S" for "start," "E" for "end," "R" for "reset," "A" for "apply," and "D" for "diagram." It should be noted that consecutive "D" actions were collapsed into a single "D" action because information related to drawing lines to connect the diagrams is not of central interest in the present study. For the sequence of actions shown in **Table 2**, it can be simplified as "SRAAAAARDE."
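The encoding step can be sketched in a few lines of Python. The action names, the `ABBREVIATIONS` mapping, and the example log are hypothetical stand-ins for the entries of the postprocessed log file, not the authors' code.

```python
ABBREVIATIONS = {"start": "S", "end": "E", "reset": "R", "apply": "A", "diagram": "D"}

def encode_actions(actions):
    """Map action names to letters and merge runs of consecutive 'D'."""
    letters = []
    for action in actions:
        letter = ABBREVIATIONS[action]
        if letter == "D" and letters and letters[-1] == "D":
            continue  # consecutive diagram actions collapse into one 'D'
        letters.append(letter)
    return "".join(letters)

# Hypothetical log with two adjacent diagram actions.
log = ["start", "reset"] + ["apply"] * 5 + ["reset", "diagram", "diagram", "end"]
sequence = encode_actions(log)
```

For this illustrative log, the two adjacent diagram actions are merged and the resulting string is "SRAAAAARDE".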

### Feature Generation

In this study, features (predictor variables) extracted from the process data can be summarized in three categories: variables extracted from action sequences using n-gram methods, behavior indicators, and time-related variables.

"Event" and "event\_type" indicate the type of the current action. "Time" and "event\_num" show the time point and order of the current action. "Top\_slider," "central\_slider," and "bottom\_slider" provide information about the status of each control. "Temp\_value" and "humid\_value" give the simulated results. "diag\_state" gives information on the linking among diagrams. Each type of event is abbreviated using a single capital letter: "S" for "start," "E" for "end," "R" for "reset," "A" for "apply," and "D" for "diagram." Data source: This table is extracted from "Log-file databases for released PISA 2012 computer-based items data for problem solving" at http://www.oecd.org/pisa/pisaproducts/database-cbapisa2012.htm.

N-gram methods decode a sequence of actions into mini-sequences (e.g., a string of n letters in length where the letters remain in the same order as the original sequence of actions) and document the number of occurrences of each mini-sequence. Unigrams, analogous to the language sequences in NLP, are defined as "bags of actions," where each single action in a sequence collection represents a distinct feature. However, unigrams are not informative in terms of transitions between actions. Bigrams and trigrams are therefore also considered in this study, with action sequences broken down into mini-sequences containing two and three ordered adjacent actions. Note that the n-gram method is productive in creating features based on sequence data without losing much information about the order of the sequence. This class of methods has become widely accepted for feature engineering in fields such as NLP and genomic sequence analysis and was recently applied to analyze process data in large-scale assessment (He and von Davier, 2015, 2016). For example, an n-gram method can break the action string "SRAAAAARDE" into "S(1), A(5), R(2), D(1), E(1)" for unigrams, "SR(1), AR(1), AA(4), RA(1), RD(1), DE(1)" for bigrams, and "SRA(1), RAA(1), AAA(3), AAR(1), ARD(1), RDE(1)" for trigrams, where the numerals in brackets represent the number of occurrences.
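The n-gram counts for an action string can be reproduced with a short Python sketch. The function name `ngram_counts` is an assumption for illustration; only the standard library is used.

```python
from collections import Counter

def ngram_counts(sequence, n):
    """Count every run of n adjacent actions in an action string."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

unigrams = ngram_counts("SRAAAAARDE", 1)
bigrams = ngram_counts("SRAAAAARDE", 2)
trigrams = ngram_counts("SRAAAAARDE", 3)
```

Each counter maps a mini-sequence to its number of occurrences, so one row of the feature matrix is obtained by looking up every n-gram in a fixed vocabulary and recording 0 for those that do not occur.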

Behavior indicators can also be generated from sequences of actions. Changes to input variables (the positions of controls) shed light on participants' problem-solving strategies and behaviors. As discussed earlier, partial credit was given to students who explored the connection between the inputs and outputs by utilizing the VOTAT (vary one thing at a time) strategy across the three controls at least once. Greiff et al. (2015) treated this scoring rubric as an indicator variable (i.e., VOTAT) and showed that it was highly associated with the probability of answering this item correctly and overall performance on the test.

This study created an ordinal categorical variable with four levels – from 0 to 3 – each number indicating on how many controls a student has used the VOTAT strategy. This ordinal variable was referred to as "VOTAT group" in the analysis. Another variable named "VOTAT num" was created to count the number of times that a student used the VOTAT strategy regardless of which control he or she applied the strategy to. Additionally, the order of "A" and "D" in a sequence of actions could convey information about examinees' decisiveness or hesitancy of their decision-making process. For example, the action string "SRAAAAARDE" can be categorized as a meta-strategy "AD sequence," implying the examinee "draws" the diagrams right after "applying" the simulations on sliders rather than jumping back and forth between applying sliders and drawing diagram lines.
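A rough Python sketch of how the two VOTAT variables might be derived is given below. The operational rule is an assumption made for illustration: an "apply" action is counted as a VOTAT move when exactly one slider differs from a neutral position of 0. The scoring rule actually used in PISA may differ in detail.

```python
def votat_features(applies):
    """applies: list of (top, central, bottom) slider states, one per 'apply'.

    Assumption: an apply counts as a VOTAT move when exactly one slider
    differs from the neutral position 0.
    """
    varied_alone = []
    for state in applies:
        moved = [i for i, value in enumerate(state) if value != 0]
        if len(moved) == 1:
            varied_alone.append(moved[0])
    votat_num = len(varied_alone)         # total number of VOTAT moves
    votat_group = len(set(varied_alone))  # distinct sliders covered, 0 to 3
    return votat_group, votat_num

# Hypothetical slider states for four 'apply' actions.
group, num = votat_features([(1, 0, 0), (0, -2, 0), (1, 1, 0), (2, 0, 0)])
```

In this hypothetical example the third action varies two sliders at once and is not counted, so the student used VOTAT three times on two different controls.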

**Table 3** shows the distribution of the AD sequence variable, where N indicates the cases in which participants did not conduct an experiment or generate diagrams. Note that the large number of levels of the AD sequence variable not only hindered interpretation but also caused data sparsity in the subsequent analysis. Thus a "compact" version of AD sequence with fewer levels was created, as shown in **Table 4**. **Figure 2** uses a tree-like diagram to illustrate how the contracted levels in **Table 4** were created.

Process data also provide rich information related to time. Process data include timestamps of actions, allowing the time spent on a specific action to be calculated by taking the difference between the times of two adjacent actions. Several time-related predictor variables can be generated as follows. "A time" and "D time" indicate the accumulated time spent on manipulating controls and drawing diagrams, respectively. For example, for an action sequence "SADRE," "A time" is the time used after hitting the "start" box and before hitting the "apply" box; "D time" is the time spent after hitting the "apply" box and before drawing a line among diagrams. Similarly, "E time" records the time spent after conducting the last action and before hitting the "end" box. A special case is "R time," which represents the time spent after hitting the "reset" box but before conducting the next action. "time\_bf\_action" records the time span between "start" and the first action after "start," which can be considered as the time spent on reading and perceiving the task.
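One possible reading of these definitions is sketched below in Python. The crediting rules are assumptions: the gap after a reset goes to "R time," every other gap goes to the action that closes it, and a gap that ends in a reset is left unassigned because the text does not define it. The timestamps are made up.

```python
def time_features(events):
    """events: chronological list of (letter, timestamp_in_seconds)."""
    totals = {"A": 0.0, "D": 0.0, "R": 0.0, "E": 0.0}
    for (prev, t0), (cur, t1) in zip(events, events[1:]):
        if prev == "R":
            key = "R"   # assumption: time after a reset counts as "R time"
        elif cur in ("A", "D", "E"):
            key = cur   # assumption: a gap belongs to the action closing it
        else:
            continue    # gap ending in a reset: not defined in the text
        totals[key] += t1 - t0
    features = {f"{k} time": v for k, v in totals.items()}
    features["total time"] = events[-1][1] - events[0][1]
    features["time_bf_action"] = events[1][1] - events[0][1]
    return features

# Hypothetical timestamps for the sequence "SADRE".
feats = time_features([("S", 0.0), ("A", 12.0), ("D", 20.0), ("R", 23.0), ("E", 30.0)])
```

Under these assumptions the example yields 12 s of "A time," 8 s of "D time," and 7 s of "R time"; "E time" is 0 because the last action before "end" is a reset.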

TABLE 4 | All contracted levels of AD sequence with sample size and percentage of correctness.

Given the feature generation method described above, 77 variables were created from the process data (a snapshot of the process data is presented as **Table 2**), as presented in **Table 5**. Note that time-related features in this study were binned with equal percentiles in terms of their frequencies; the frequency of each bin ranges from 10 to 25% of the sample depending on the variable. This was done essentially due to the nature of the tree models: continuous variables are discretized to find the best "split" point, as discussed in previous sections. This inherent discretization mechanism tends to create data sparsity when the distribution of a continuous variable is "discontinued" (i.e., has extremely low density in the area between modes), which increases the chance of encountering a computation failure. Therefore, to reduce this chance, practitioners "stabilize" the distributions of these "discontinued" variables by binning them before feeding them to the algorithm. In this study, binning was also applied to n-gram features with levels having sparse sample sizes. However, it should be noted that binning entails a risk of losing information about these variables.
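Equal-percentile binning can be sketched as follows. This is a minimal standard-library illustration, not the procedure used in the study; the function name, the number of bins, and the example values are assumptions.

```python
from bisect import bisect_right

def percentile_bins(values, n_bins):
    """Assign each value an ordinal bin so that bins hold roughly equal counts."""
    ordered = sorted(values)
    # Cut points at the 1/n_bins, 2/n_bins, ... empirical quantiles.
    cuts = [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]
    # A value equal to a cut point falls into the upper bin.
    return [bisect_right(cuts, v) for v in values]

# Hypothetical times in seconds, binned into quartiles (25% per bin).
bins = percentile_bins([5, 1, 9, 3, 7, 2, 8, 4], n_bins=4)
```

Heavily tied values can leave some bins empty or unbalanced, which is one reason the bin frequencies may vary from variable to variable.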

### Feature Selection

Feature selection was conducted using the R package randomForest (Liaw and Wiener, 2002). The selection began by seeking the random forest algorithm with the optimal complexity to fit the dataset. In this study, the complexity of the random forest algorithm is characterized by combinations of the number of trees (ntree) and the number of predictor variables randomly sampled as candidates at each split (mtry). Empirical studies (Breiman, 2001; Mitchell, 2011; Janitza and Hornung, 2018) showed that mtry and ntree are more influential than other factors in controlling the complexity of the random forest algorithm. In this study, the size of a tree (i.e., the number of generations or the total number of nodes) was not restricted, and the number of branches used at each split was fixed at 2. The present study focused on exploring the combinations of mtry and ntree, where ntree = 100, 300, 500, and mtry = 4, 6, 8, 10, 12.

#### Cross-Validation

A typical way to find the optimal model complexity (i.e., the combination of tuning parameters) is to compare the fitted models by their validation error. The validation error is obtained by holding out a subset of the sample (validation set), using the retained sample (training set) to fit the classification algorithm, and then estimating the prediction error by applying the fitted algorithm to the validation set. To efficiently utilize data with a limited size, practitioners (Breiman and Spector, 1992; Kohavi, 1995) have recommended five- or ten-fold cross-validation. In the case of five-fold cross-validation, the data is split into five roughly equal parts. A loop of validations is then conducted: each part is labeled as the validation set once to estimate the prediction error of the random forest model fitted using the other four parts. In a data-rich situation, Hastie et al. (2009) recommended isolating an additional set (the test set) from the sample before conducting cross-validation. This set is used to compute the prediction error for the final chosen model. It can also be considered as an assessment of the generalization performance of the chosen model on independent data. The present study randomly selected roughly 10% of the sample (3,000 students) as the test set; the rest was separated into five folds for the training-validation purposes.
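The partition into a held-out test set and five training-validation folds can be sketched in Python as below. The sample size and test-set size follow the text; the seed, the function name, and the interleaved fold assignment are illustrative assumptions.

```python
import random

def split_sample(n, n_test, n_folds=5, seed=0):
    """Return held-out test indices and a list of disjoint validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    test, rest = idx[:n_test], idx[n_test:]
    folds = [rest[k::n_folds] for k in range(n_folds)]
    return test, folds

test_idx, folds = split_sample(n=30224, n_test=3000)
for k, validation in enumerate(folds):
    training = [i for j, fold in enumerate(folds) if j != k for i in fold]
    # fit on `training`, then estimate the prediction error on `validation`
```

The test indices are never touched inside the loop, so they remain available for assessing the finally chosen model.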

#### Variable Selection and Backward Elimination

The core idea of validation is to keep the validation sample from being "seen" by the model training process. Such a principle must also be obeyed when variable selection is involved. An example of violating this rule would be to conduct variable selection based on the whole sample before tuning model parameters based on cross-validation (Hastie et al., 2009).

TABLE 5 | Variables generated from process data of Climate Control Task 1.

| Category | Total | Generated features |
|---|---|---|
| Unigram | 3 | D, R, A |
| Bigram | 16 | DD, AA, RA, AR, AD, DA, AE, SD, SA, DR, DE, RD, RE, RR, SR, SE |
| Trigram | 48 | ADD, AAR, SRD, DDR, AAE, DRE, AAA, ARD, SDR, ADE, RAA, RRE, DDD, DAR, ARR, DAA, RDA, RRA, DAD, SDA, RRR, AAD, RAD, RRD, ADR, ARE, DRR, RDE, DRR, SRA, ADA, SAR, SRE, ARA, RAR, SDE, DRA, RDD, RDR, SDD, DAE, SAR, DDA, DRD, SRR, SAA, SAD, RAE |
| Behavioral indicators | 4 | AD sequence, VOTAT group, VOTAT num, n\_actions |
| Time-related features | 6 | D time, A time, R time, E time, total time, time\_bf\_action |
| Total | 77 | |

The variable selection implemented in the current study is based on the recursive feature elimination in Guyon et al. (2002) that iteratively rules out features at the lower end of the ranking criterion. Together with random forests, recursive feature elimination has been successfully employed in genome-wide association studies (e.g., Jiang et al., 2009). The variable selection approach suggested in the present study is not just an application of recursive feature elimination using the random forest algorithm with a specific focus on the process data, but a modification with an emphasis on end-to-end cross-validation.

**Box 1** outlines the suggested backward elimination algorithm for variable selection. Note that to prevent variable selection (i.e., ranking) from seeing the data used for model training (i.e., parameter tuning in this study), the training-validation dataset was divided into five disjoint subsets in this recursive selection process. At each backward elimination step, parameter tuning can thus be conducted using four of the subsets while variable ranking is performed separately on the remaining subset. This suggested approach follows the principles of variable selection for study design recommended by Brick et al. (2017).

As indicated by **Box 1**, the backward elimination also documents how the validation performance of the fitted model changes as the number of features decreases, as illustrated in **Figure 3**. The number of selected features was decided by drawing a cutoff line around where the first large drop in prediction performance begins (i.e., 49 in **Figure 3**). Setting this cutoff line here is like selecting the number of factors using the scree plot (Cattell, 1966). Given this threshold number (i.e., 77 − 49 = 28), five sets with 28 selected features were obtained, and their intersection gives the final selected set of features (26 features).
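The overall logic of eliminating features, recording the performance path, and intersecting the selected sets can be sketched generically in Python. This is not a transcription of Box 1: the `rank` and `score` callables are hypothetical placeholders for ranking by permutation importance and for the validation performance of a fitted random forest, and the toy weights are made up.

```python
def backward_elimination(features, rank, score, step=1):
    """Drop the `step` lowest-ranked features per round; log the score path.

    rank(features)  -> features ordered from most to least important
    score(features) -> validation performance of a model using them
    """
    current, path = list(features), []
    while current:
        path.append((len(current), score(current), list(current)))
        current = rank(current)[:-step]
    return path

def select(path, n_keep):
    """Feature set recorded when exactly n_keep features remained."""
    return next(set(f) for n, _, f in path if n == n_keep)

# Toy stand-ins: importance is a fixed weight, performance is their sum.
weights = {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.0}
rank = lambda fs: sorted(fs, key=weights.get, reverse=True)
score = lambda fs: sum(weights[f] for f in fs)

runs = [backward_elimination(weights, rank, score) for _ in range(5)]
final = set.intersection(*(select(p, 2) for p in runs))
```

In the study, each of the five runs ranks features on a different subset of the data, so the five selected sets can differ and their intersection may be smaller than any single set; in this toy example the runs are identical by construction.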

The backward elimination in **Box 1** has five separate iterative variable ranking processes, which can be regarded to some extent as an implicit self-validation. However, the determination of the cutoff line shared by the five ranking processes (i.e., the feature screening) should be further validated if data are rich enough. Instead of having one training-validation set, five disjoint training-validation sets (notice that these differ from the five subsets shown in **Box 1**) were established after the test set was held out. The backward elimination shown in **Box 1** was conducted for each of the five sets. Accordingly, five sets of final selected features were obtained. **Table 6** shows the intersection of these five sets of selected features.

TABLE 6 | Features selected through the five-fold validated backward elimination.

Boldfaced cases indicate features considered redundant. Such features are removed from the set of selected features for analysis that follows.

The backward elimination in **Box 1** was structured using a nested loop that might cause inefficiency. Practitioners can increase the number of features eliminated in each round to reduce the computational burden. In addition, as noted by Breiman (2001), the value of mtry set around the square root of the number of predictors seems to have minimal effect on validation performance; to increase computational efficiency, one can utilize this deterministic way to adapt the value of mtry. To further increase algorithmic efficiency, researchers (Breiman, 2001; Nicodemus and Malley, 2009; Zhang et al., 2010; Goldstein et al., 2011; Oliveira et al., 2012) recommended employing out-of-bag error as an alternative to cross-validation error. Simulation studies (Mitchell, 2011; Janitza and Hornung, 2018) showed that although out-of-bag error tends to overestimate the true error rate when "n<<p" (that is, when the sample size is far smaller than the number of predictors), the overall validation performance is not substantially affected when out-of-bag error is used to determine model complexity. The present study also performed a backward elimination accelerated by the above suggestions, which yielded variable selection results consistent with those of the plain approach shown in **Box 1**. Such results were not presented in the manuscript for the sake of simplicity.

### RESULTS

The final set of selected features includes ordinal and binary categorical variables. Pairwise associations among these ordinal variables were measured using the Goodman-Kruskal gamma (γ; Goodman and Kruskal, 1954) with value from −1 (discordant) to 1 (concordant). Given the measure, the final set can be further reduced by removing the redundant features highly related to others.
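For reference, Goodman-Kruskal gamma is the difference between the numbers of concordant and discordant pairs divided by their sum, with tied pairs ignored. The following Python sketch counts pairs directly; it is quadratic in the sample size, so an implementation based on the contingency table would be preferable for large samples. The function name is an assumption.

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """(concordant - discordant) / (concordant + discordant); ties are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

gamma_up = goodman_kruskal_gamma([1, 2, 3], [1, 2, 3])
gamma_down = goodman_kruskal_gamma([1, 2, 3], [3, 2, 1])
```

The function raises a ZeroDivisionError when every pair is tied, a case in which gamma is undefined.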

Among all pairs, "DD" was highly associated with "DDD" (γ = 0.76); "AR" and "RA" were associated with γ = 0.71; other well-associated pairs (γ > 0.6) included "AD sequence" with "AD," "AD sequence" with "DA," "AD" with "ADD," "DRA" with "ADR," "DRA" with "DR," and "DD" with "DDE."<sup>3</sup> It is not surprising that "AD sequence" was highly correlated with "AD" and "DA." "AD sequence" was preferred since it covered more information than "AD" and "DA" do, as discussed earlier. "DDD" was strongly associated with "DD;" the trigram was preferable in this case since it contained more detailed information. "DDE" conveyed trivial information compared to "DD" and "DDD," as did "ADD" compared to "AD." "AR" and "RA" covered similar information, as did "DRA" with "DR" and "ADR;" the one with the higher rank of permutation importance was preferred. In sum, eight features (boldfaced in **Table 6**) were excluded: "AD," "DA," "ADD," "DDE," "DRA," "AR," "DD," and "DR."

<sup>3</sup>As a reminder, "D" refers to drawing the diagram, "A" to applying the simulations on the slider, "S" to start, and "R" to reset.

With the 13 remaining features, a random forest was fitted with the parameter set where ntree = 100 and mtry = 4. The parameter combination was chosen based on validation performance of the test set that had been held out at the beginning. Applying the test set here was necessary since the association measured above was based on the entire training-validation sample, which means that variables selected using γ had already "seen" the validation data. Similarly, another random forest was fitted with 77 features; the parameter set was tuned using the test data, where ntree = 300 and mtry = 9. Here the Goodman-Kruskal tau (τ; Goodman and Kruskal, 1954) was used to measure the proportional reduction of incorrect prediction for the full and the reduced model, respectively, with regard to the random guess based on observed distribution of responses, where τ<sub>77</sub> = 0.810 and τ<sub>13</sub> = 0.797. In this regard, the reduced model performed decently in comparison to the full model.
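As an illustration of the measure, Goodman-Kruskal tau can be computed from predicted and observed classes as the proportional reduction in the error of proportional random guessing. The Python sketch below uses the standard formula; how exactly the study applied the measure is not detailed here, so treating the predicted class as the conditioning variable is an assumption.

```python
from collections import Counter

def goodman_kruskal_tau(predicted, observed):
    """Proportional reduction in error when guessing `observed` given `predicted`."""
    n = len(observed)
    joint = Counter(zip(predicted, observed))
    row = Counter(predicted)
    col = Counter(observed)
    # Error rate of proportional random guessing from the marginal distribution.
    baseline = 1 - sum((c / n) ** 2 for c in col.values())
    # Error rate of the same guessing rule applied within each predicted class.
    conditional = 1 - sum(c * c / row[p] for (p, _), c in joint.items()) / n
    return (baseline - conditional) / baseline

tau_perfect = goodman_kruskal_tau([0, 0, 1, 1], [0, 0, 1, 1])
tau_useless = goodman_kruskal_tau([0, 0, 1, 1], [0, 1, 0, 1])
```

A value of 1 means the predictions remove all guessing error, and a value of 0 means they are no better than guessing from the observed response distribution.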

Features of the simple model ranked by the permutation importance measure are shown in **Table 7**. Unigram "D," "R," and "A" ranked high in the list since they are basic elements constituting action sequences. Furthermore, "D" and "R" are not just fundamental but also imply a student's decisiveness. Using only a few necessary steps of drawing arrows or applying the reset function only a limited number of times might indicate confidence in providing a correct solution. "VOTAT group" and "VOTAT num" are both critical as shown in the list, which is consistent with the results found by Greiff et al. (2015). The top-ranked "AD sequence" indicates that contracting levels shown in **Figure 2** work fine in summarizing students' behaviors on experimenting. Grams such as "AAA," "ADR," and "RA" offer interesting perspectives. For instance, students having a large number of "AAA" tended to show certain patterns in their actions: drawing diagrams right after applying experiments (i.e., the level "AD only" in the feature "AD sequence") and applying the VOTAT scheme across the three sliders. In further investigating these students, we found that they attempted to create an increasing or decreasing slope of the value of temperature or humidity in the display by repeatedly hitting the "apply" box while fixing the sliders at one particular status, indicating a relatively sophisticated behavior of solving the problem. Frequent usage of "ADR" and "RA" indicated participants utilized the reset function to assist their experimenting and exploration on inputs. "D time" and "R time" can be regarded as time spent on deliberation.

### DISCUSSION

TABLE 7 | Features ranked by permutation importance measure (mean decrease accuracy).

The aim of the present study is to pedagogically suggest an integrated approach to analyze action sequences and other information extracted from process data. Feature generation and selection are two essential parts of the suggested approach and should be treated with equal importance. Features in this study were created following both top-down and bottom-up schemes. The former generates features based on hypotheses that might be developed by item designers and content experts. The latter, as an example, extracts features by utilizing n-gram methods and related methods breaking up the action sequences. Thus, n-gram translates the action sequences into mini-sequences along with their frequencies. Features generated by both schemes are presented in the final set of selected predictive features. The random forest algorithm was implemented in the feature selection part, which simultaneously handled (1) a massive number of categorical predictor variables, (2) the complexity of the variable structure, and (3) model/variable selection in a computationally efficient way. The utility of the suggested approach has been illustrated by implementing it in a publicly available dataset.

The suggested approach is not free from limitations. First, the feature generation process involves breaking up action sequences into mini-sequences encoded as n-grams, suggesting that the information contained in the order of the action sequences – for example, the "longer term" dependencies among actions – would not be completely preserved and exploited. As an outcome, only limited amounts of behavioral indicators are generated; information embedded in students' action sequences might not be fully utilized. For example, the range of states of controls explored by a student is a variable likely associated with the response variable. Technically speaking, to preserve more "complete" information when analyzing action sequences, sequence-mining approaches (e.g., SPADE; Zaki, 2001) employed to find common subsequences provide a possible alternative. Also, ideas stemming from cognitive and learning studies offer a theoretical basis of creating features from action sequences; for example, some studies (Jiang et al., 2015, 2018) employed sequential pattern mining to analyze learning skills and performance in immersive virtual environments.

Second, most features, if not all, are ordinal categorical variables representing frequency. As noted in the previous section, some variables present in excessive levels could cause an issue of data sparsity when conducting the random forest algorithm. This study used equal-percentile binning to address this issue at the expense of losing information provided by the original variables. The sensitivity of binning needs to be further investigated.

Third, the CART-based random forest algorithm that uses the Gini impurity index to split nodes (e.g., the randomForest R package used in this study) is generally a suboptimal choice. Strobl et al. (2007) showed that the algorithm tends to favor categorical variables with many levels as well as clusters of variables that are highly correlated. The modified random forest algorithm proposed by Strobl et al. (2007), which uses the conditional inference tree introduced by Hothorn et al. (2006), should be explored in the context of process data in future studies.

Fourth, even though the efficiency of the suggested backward elimination can be increased by using several steps noted in the previous section, the computation burden is still a concern for the present study. Backward elimination with the specifications shown in **Box 1** was validated using a five-fold dataset, which took about 19,872 s in total on a Mac Pro desktop with a 3.5 GHz CPU and 16 GB of RAM.

Fifth, like other data-driven algorithms, the random forest approach is not straightforward regarding model interpretation. For example, hypothesis tests on marginal effects of features are not sustained in random forests; the directions of marginal effects are not directly accessible, either. Friedman (2001) suggested plotting the partial dependence between the feature and the outcome variable (logit is used if the outcome variable is categorical) to access the marginal effects. This display method has been implemented in the R package randomForest as the function partialPlot. It is sensible to apply models with more restricted functional forms, such as linear models, to conduct an ad hoc analysis based on the selected features.

Sixth, the random forest algorithm is a data-driven method that is sensitive to sample characteristics. Meanwhile, PISA is an international large-scale assessment involving mixed-type forms of tests and multistage sampling designs. The question on how the sampling designs affect the analysis using data-driven methods (i.e., random forests) in terms of estimation stability is beyond the scope of this study. It is appealing that future methodological research could provide guidance concerning the correct use of cross-validation in different test designs.

Last, the exploratory nature of the suggested approach comes with the purpose of the study. Although interesting patterns of behaviors have been found by the suggested approach, it is still difficult to test a cognitive or psychometric theory with it.

The suggested method offers an alternative to the generation and selection of informative features from a massive amount of process data, given the increasing attention to exploring the usage of process data along with response data in large-scale assessments. Generalizability of the method can be explored by applying it to multiple tasks constructed using a similar approach such as MicroDYN and comparing it with other variable-selection approaches in terms of practical significance.

### ETHICS STATEMENT

This study is a secondary analysis based on released datasets from PISA 2012 log data files (http://www.oecd.org/pisa/pisaproducts/database-cbapisa2012.htm). No additional data were collected from human subjects for this particular study.

### AUTHOR CONTRIBUTIONS

ZH contributed to the development of methodology exploration, model estimation procedures, conduction of the data analysis, and drafting and revision of the manuscript. QH contributed to the development of the methodological framework, supervision on the model estimation procedures, conduction of the data analysis, and drafting and revision of the manuscript. MD contributed to providing suggestions on the methodological framework and the model estimation procedures, and reviewing and revision of the manuscript.

### FUNDING

This study was partially supported by the Internship Program under Educational Testing Service and the National Science Foundation (NSF – Award #1633353).

### ACKNOWLEDGMENTS

The authors would like to thank Larry Hanover for his help in editing the manuscript and Dr. Xiang Liu for his valuable suggestions.

### REFERENCES

tutoring system fostering self-regulated learning. J. Educ. Data Min. 5, 104–146.

Test Development, eds S. Downing, and T. Haladyna, (Mahwah, NJ: Lawrence Erlbaum), doi: 10.4324/9780203874776.ch14


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Han, He and von Davier. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
