Using Virtual Reality to Assess Reading Fluency in Children

Here we provide a proof-of-concept for the use of virtual reality (VR) goggles to assess reading behavior in beginning readers. Children performed a VR version of a lexical decision task that allowed us to record eye movements. External validity was assessed by comparing the VR measures (lexical decision RT and accuracy, gaze durations and refixation probabilities) to a gold standard reading fluency test, the One-Minute Reading test. We found that the VR measures correlated strongly with the classic fluency measure. We argue that VR-based techniques provide a valid and child-friendly way to study reading behavior in a school environment. Importantly, they not only yield a richer dataset than standard behavioral assessments but also make it possible to tightly control the testing environment.


INTRODUCTION
Virtual reality (VR) techniques are a collection of software and hardware technologies that support the creation of synthetic, highly interactive three-dimensional (3D) spatial environments, in which the user becomes a participant in a "virtually real" world (Psotka, 1995). An essential ingredient of VR technology is a tracked head-mounted display (HMD) that makes it possible for participants to see new views of the visual world as they move their head (Jensen and Konradsen, 2018). The key concept of VR is immersion (Jennett et al., 2008; Howard-Jones et al., 2014), a sense of "being" in the task environment, of being physically present in a non-physical world (Freina and Ott, 2015). The main motivation for using VR in education and training is that it provides the opportunity to experience situations that cannot be accessed physically (for review, see Freina and Ott, 2015; Stuart and Thomas, 1991) because of constraints of time (e.g., visiting different historical periods), distance (e.g., exploring the solar system or the functioning of a cell), danger (e.g., training firefighters to make decisions in life-threatening situations) or ethics (e.g., surgery performed by non-experts). Here, we explore a very different advantage of using VR technology in an educational situation, namely the possibility of assessing reading skills in a potentially noisy and distracting environment (i.e., the classroom). Indeed, running experiments with children in a school environment is often a complex process that sometimes requires controlling or measuring eye movements and attention. We show that VR technology can provide such controls in a user-friendly way.
One of the keys to success in today's world is becoming a skilled reader, and behavioral investigations of the mechanisms involved in achieving this skill are therefore of utmost importance. However, VR has rarely been used to study reading, which is hardly surprising, because reading provides a way to create a virtual reality without the need to use a computer-based system (Nell, 1988; Jacobs, 2015). Reading, quite naturally, allows one to shape events in a person's brain. Much like a VR system, reading bridges gaps of time, space, and acquaintanceship (Pinker, 1994; Ziegler et al., 2020). The feeling of "getting lost in a book" (Nell, 1988) is probably very similar to the immersion in an artificially created virtual world. So why would one want to use VR to study reading? One possible reason has been put forward in the context of cognitive assessment and rehabilitation: "The potential power of VR to create human testing and training environments which allow for precise control of complex stimulus presentations as well as providing accurate records of targeted responses is a cognitive psychologist's dream!" (Rizzo and Buckwalter, 1997). In recent work in our group, we have started to use VR to study reading behavior in adults, and there has been a general rise in the use of VR techniques in cognitive psychology (for a review, see Mirault, 2020).
The goal of the present study was to test to what extent a VR system can provide a valid and reliable reading fluency assessment technique in primary school children. There are several reasons why this is an interesting and potentially important issue. First, recent HMD systems (i.e., VR goggles) allow the recording of head movements (3D location and velocity) and eye movements (i.e., fixation locations, fixation durations). The eye movements recorded during silent reading provide a direct measure of reading fluency and of the impact of linguistic complexity on reading behavior (see Rayner, 1998, for a review of early research on eye movements and reading). Currently, the recording of eye movements requires a rather sophisticated laboratory setup and a rigorous calibration procedure, in which the head must be fixed using a chin rest and/or a bite bar. This complicates the use of eye movement measures in a classroom context. Second, most psycho-educational reading assessments take place in a school setting where children are potentially distracted by environmental factors. The immersive potential of VR technology makes it possible to blend out many of these distracting factors, thus facilitating testing in a classroom setting. Third, the assessment of the visual, orthographic and attentional factors involved in reading (Facoetti et al., 2010; Ziegler et al., 2010; Zorzi et al., 2012; Stein, 2014; Grainger et al., 2016) requires "the precise control of complex stimulus presentations" (Rizzo and Buckwalter, 1997), such as a fixed distance to the screen, which determines visual angle and stimulus size. Fourth, since VR goggles have become affordable in recent years (Ray and Deb, 2016), making their wide use in schools possible, it is crucial to investigate whether the reading and eye movement measures obtained with this technique are robust and externally valid.
Finally, very little research on VR has been conducted with primary school children (Eleftheria et al., 2013) and it remains to be shown that VR systems can reproduce classic laboratory benchmarks of reading, such as effects of lexicality, frequency and length (Grainger and Jacobs, 1996; Coltheart et al., 2001; Perry et al., 2007, 2010). This is important if these systems are to be included in more sophisticated systems, in which participants can interact with letters, words and sentences in a virtual game environment (Pan et al., 2006).
To start simple, in the present experiment, we had children in primary school (grade 2) make lexical decisions about words and pseudowords while wearing VR goggles. This allowed us to measure their reaction times, accuracy, initial fixation durations, total fixation durations, and number of fixations. Besides lexicality (words vs. pseudowords), we varied the length of words and pseudowords to test whether our measures are sensitive to word length, which is an excellent marker for the automatization of reading skills (Ziegler et al., 2003). To test the external validity of our VR test, we compared the VR measures with a One-Minute Reading (OMR) aloud test of words and pseudowords (similar to TOWRE, see Torgesen et al., 2012), which can be seen as the gold standard for measuring reading fluency (Bertrand et al., 2010). We expected to find faster and more accurate responses to words than to pseudowords. With respect to eye movements, we expected to see fewer and shorter fixations to words than to pseudowords. Finally, if our VR-based measures correlate strongly with the OMR gold standard, this could be taken as evidence that VR-based measures obtained during silent reading could potentially replace or complement more classic reading aloud assessments. This is important because the main goal of learning to read is fast, efficient, silent reading for meaning.

METHOD
Participants
A total of 102 children aged between 7 and 9 years were recruited from two schools in Marseille (France). Participants were either native speakers of French or grew up in a French-speaking environment since birth or early childhood. They reported having normal or corrected-to-normal vision and were naïve to the purpose of the experiment. Their parents signed an informed consent form in accordance with the provisions of the World Medical Association Declaration of Helsinki prior to the experiment. Ethics approval was obtained from the Comité de Protection des Personnes SUD-EST IV (No. 17/051).

Apparatus
The VR environment was created using the software Unity (Unity Technologies ApS) and displayed on a WQHD OLED screen (2,560 × 1,440 pixels) covering up to 100° of visual angle with a refresh rate of 70 Hz. Eye movements were recorded using the infra-red eye-tracker in the virtual reality headset Fove 0 HMD (FOVE, Inc.). The headset size was adapted to children with a strap at the back of the device in order to make it comfortable for the children and to achieve good immersion (i.e., no light from the classroom). Children were free to move their head, and the design of the experiment allowed them to continually see the stimulus in front of their head location. Recording was binocular with a high spatial accuracy (<1°) and a sampling rate of 120 Hz (however, we recorded at 70 Hz in order to match the refresh rate of the screen). The position of the head was obtained by combining a USB infra-red position tracking camera with a refresh rate of 100 Hz and an Inertial Measurement Unit (IMU) placed in the headset. A recent graphics card (NVIDIA GeForce GTX 1650) in a laptop computer (ASUS ROG STRIX G) was used to display the VR environment in the Fove headset. The VR environment was also duplicated for the experimenter on the laptop LCD screen, which ran at a high refresh rate (144 Hz). Responses were given by pressing buttons on a gamepad (Thrustmaster Dual Analog 4).
From the eye tracker, we recovered six measures: three for the origins (x, y, and z) and three for the Gaze Intersection Point (GIP). We defined the Origins as the viewer-local coordinates mapped from eye tracker screen coordinates to the near view plane coordinates. The GIP is given by the addition of a scaled offset to the view vector originally defined by the helmet position and central view line in virtual world coordinates (from Eye Tracking Methodology; Duchowski, 2007).

Design and Stimuli
We created 100 items: 50 words and 50 pseudowords (see Stimuli on OSF link at the end of the article) that ranged in length from 4 to 8 characters (10 words and 10 pseudowords for each length). The words had an average frequency of 997.76 parts per million (ppm) (based on the Manulex frequency counts: Lété et al., 2004), which is equivalent to 5.99 Zipf (van Heuven et al., 2014). The pseudowords were constructed to look like real French words and were always pronounceable.
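As a worked check of the frequency conversion, the sketch below applies the Zipf definition of van Heuven et al. (2014), that is, log10 of the frequency per million words plus 3, to the mean frequency reported above. The function name ppm_to_zipf is ours and is not taken from the original analysis. Converting the mean frequency, rather than averaging per-item Zipf values, is an assumption about how the reported value was obtained.

```python
import math


def ppm_to_zipf(freq_per_million: float) -> float:
    """Convert a frequency in occurrences per million words to the Zipf scale.

    Zipf = log10(frequency per billion words) = log10(frequency per million) + 3.
    """
    if freq_per_million <= 0:
        raise ValueError("frequency must be strictly positive")
    return math.log10(freq_per_million) + 3.0


# Mean word frequency reported in the Design and Stimuli section (in ppm).
mean_zipf = ppm_to_zipf(997.76)
print(f"{mean_zipf:.3f}")
```

The computed value is close to 6.0 and agrees with the reported 5.99 up to rounding or truncation.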

Procedure
In this study, children participated in two tasks: a VR lexical decision task (VR-LDT) and a One-Minute Reading (OMR) aloud test. In order to counterbalance task order, half of the children started with the VR-LDT while the other half started with the OMR. For the VR-LDT, they were seated in front of a school desk at 70 cm from the infra-red position detector and were free to move their head and torso. Testing did not occur in the classroom but in a small room right next to the classroom. Two children were tested at the same time. While one was doing the VR-LDT test, the other one did the OMR test. The instructions were explained to the children as a game, in which they had to detect "true" and "false" words. At the beginning of the experiment, the orientation and the position of the headset were tared; then the participant's eye position was calibrated using a 5-dot calibration procedure. Dots appeared on the VR screen in green with a decreasing size. Children were instructed to focus on the center of the dots. They had the possibility to remove the headset at any time during the experiment. The instructions were repeated one more time and the experiment was initiated if the child was ready to start. Each trial started with a fixation dot displayed for 1,000 ms in the center of the screen (here, we use the term screen to refer to the calibrated visual field of the virtual environment). Then, the stimulus was displayed in black in the center of the virtual environment. We displayed the stimulus in a monospaced vector font (no pixelation when zooming in or out), with a font size of 36 (this cannot be compared to a conventional font size because of the depth along the Z axis). The background was a neutral virtual environment with a brown floor and a blue sky; the horizon line was light blue. Participants had to read the stimulus and press the right trigger on the gamepad if the word existed in French or the left trigger if not, as fast and as accurately as possible.
We shuffled the list with all the items (N = 100) in order to create a random stimulus presentation and we used the same shuffled list for all children. Gamepad and desk were cleaned with a bactericidal wipe between participants. There were no practice trials and no feedback during the experiment, but the experimenter gave oral examples and invited the children to press the correct button. The experimenter then provided oral feedback and an explanation for any errors made.
Concerning the OMR test, we used the LUM test ("Lecture en Une Minute") developed by Khomsi (1999). It consists of two reading aloud tests: one with a table of 35 existing words (in French), the other with a table of 30 pseudowords. The tables are presented in five rows. The test is explained to the child and then he or she starts by reading two test words (outside the table) that do not count in the number of words read. Then the timer is started when the child reads the first word in the table. Children were instructed to read the words from left to right and then to move to the next line. Scoring the test first involved counting the number of correctly read words and discarding incorrectly read words (mispronounced or not read after 5 s). After one minute the test was stopped. The number of correctly read words is the fluency value. If a child read all the words correctly in less than one minute, then the time to do so was recorded, and the number of words correctly read per minute calculated from that.
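The scoring rule described above can be sketched as follows. This is a minimal illustration and not the official LUM scoring procedure: the function name omr_fluency and its arguments are hypothetical. It assumes that a child who reaches the one-minute limit is scored by the raw count of correctly read items, and that a child who finishes early is scored by extrapolating that count to one minute.

```python
def omr_fluency(n_correct: int, elapsed_s: float, time_limit_s: float = 60.0) -> float:
    """Return the number of items read correctly per minute.

    n_correct: items read correctly (mispronounced or skipped items excluded).
    elapsed_s: time in seconds from the first table item to the end of reading.
    """
    if n_correct < 0:
        raise ValueError("n_correct must be non-negative")
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    if elapsed_s >= time_limit_s:
        # Test stopped at the limit: the raw count is the fluency value.
        return float(n_correct)
    # Table finished early: convert the count to a per-minute rate.
    return n_correct * time_limit_s / elapsed_s


# Hypothetical children: one stopped at 60 s, one finished all 35 words in 42 s.
stopped_at_limit = omr_fluency(28, 60.0)
finished_early = omr_fluency(35, 42.0)
```

Under these assumptions the first child scores 28 words per minute and the second 50 words per minute.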

Pre-Processing the Eye Movements
We used the emov package (Schwab, 2016) in the R statistical computing environment (Pinheiro et al., 2014). This package implements a dispersion-based algorithm (I-DT) proposed by Salvucci and Goldberg (2000) which measures fixation durations and positions.
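To make the dispersion-based approach concrete, the sketch below gives a minimal Python version of the I-DT idea from Salvucci and Goldberg (2000). It is not the emov implementation and does not reproduce the parameters used in the study. The names idt_fixations, max_dispersion and min_duration_ms, the (time, x, y) sample layout, and the threshold values in the example are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Sequence, Tuple

Sample = Tuple[float, float, float]  # (time_ms, x, y), hypothetical layout


class Fixation(NamedTuple):
    start_ms: float
    end_ms: float
    x: float  # centroid of the samples in the fixation
    y: float


def _dispersion(window: Sequence[Sample]) -> float:
    """Dispersion as (max x - min x) + (max y - min y)."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def idt_fixations(samples: Sequence[Sample], max_dispersion: float,
                  min_duration_ms: float) -> List[Fixation]:
    """Group consecutive gaze samples into fixations with a moving window."""
    fixations: List[Fixation] = []
    n = len(samples)
    i = 0
    while i < n:
        # Initial window: the shortest span covering the minimum duration.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration_ms:
            j += 1
        if j >= n:
            break  # not enough samples left for another fixation
        if _dispersion(samples[i:j + 1]) <= max_dispersion:
            # Grow the window until adding a sample exceeds the threshold.
            while j + 1 < n and _dispersion(samples[i:j + 2]) <= max_dispersion:
                j += 1
            window = samples[i:j + 1]
            fixations.append(Fixation(
                window[0][0], window[-1][0],
                sum(s[1] for s in window) / len(window),
                sum(s[2] for s in window) / len(window)))
            i = j + 1
        else:
            i += 1  # no fixation starts here; slide the window by one sample
    return fixations


# Synthetic gaze trace: ten samples at (0, 0), then ten samples at (10, 0).
samples = ([(k * 10.0, 0.0, 0.0) for k in range(10)]
           + [((10 + k) * 10.0, 10.0, 0.0) for k in range(10)])
fixations = idt_fixations(samples, max_dispersion=1.0, min_duration_ms=50.0)
```

On this synthetic trace the function returns two fixations, one per cluster of samples. Fixation duration can then be taken as end_ms minus start_ms.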

Analysis
We used Linear Mixed-Effects models (LMEs) to analyze our data, with items and participants as crossed random effects, including by-item and by-participant random intercepts (Baayen et al., 2008). Items in these analyses were the words/pseudowords. LMEs were used to analyze response times and fixation durations, while Generalized (logistic) LMEs were used to analyze error and refixation rates. The models were fitted with the lmer (for LMEs) and glmer (for GLMEs) functions from the lme4 package (Bates et al., 2015) in R. We report regression coefficients (b), standard errors (SE) and |t-values| (for LMEs) or |z-values| (for GLMEs) for all factors. Fixed effects were deemed reliable if |t| or |z| > 1.96 (Baayen et al., 2008). All durations were inverse transformed (−1,000/duration) prior to analysis.
Following the main analyses, we will present post-hoc analyses concerning length and frequency effects and cross-task correlations.
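The duration transformation and the reliability criterion described in this section can be illustrated with a short sketch. The function names inverse_transform and is_reliable are ours, and the example values are arbitrary. The sketch does not fit any mixed-effects model; it only shows the transformation applied before fitting and the threshold applied to the resulting statistics.

```python
def inverse_transform(duration_ms: float) -> float:
    """Inverse-transform a duration in ms as -1000 / duration."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    return -1000.0 / duration_ms


def is_reliable(statistic: float, threshold: float = 1.96) -> bool:
    """Apply the |t| or |z| > 1.96 criterion to a fixed-effect statistic."""
    return abs(statistic) > threshold


# Arbitrary example durations (ms): the transform preserves their ordering,
# so a longer duration still maps to a larger transformed value.
fast = inverse_transform(500.0)
slow = inverse_transform(2000.0)
```

Because the transformation is monotonically increasing for positive durations, a positive coefficient on the transformed scale still indicates longer durations in the corresponding condition.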

RESULTS
Prior to analysis we excluded participants who did not finish the experiment (N = 2) and those for whom there was a technical incident during the experiment (N = 10). The remaining group was composed of 90 participants.

Lexical Decision Response Times
Prior to analysis, we excluded 3.32% of data points that were more than 2.5 SD below or above the participant's mean, so that extreme outlier values would not affect the inferential statistics. This is a standard procedure in experimental psychology (Ratcliff, 1993). We observed a significant effect of lexicality in lexical decision response times (b = 0.29; SE = 0.02; z = 13.87), with participants responding more rapidly to words (M = 1,627.87 ms; 95% CI = 197.70) than to pseudowords (M = 2,625.10 ms; 95% CI = 277.65). Figure 1 shows the condition means with response times transformed into reading speed.
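The per-participant trimming rule can be sketched as below. This is an illustration under stated assumptions rather than the original analysis script: the function name trim_by_participant and the (participant, RT) data layout are hypothetical. We also assume that the sample standard deviation is used and that the cutoffs are computed on raw response times.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List, Tuple

Trial = Tuple[str, float]  # (participant_id, rt_ms), hypothetical layout


def trim_by_participant(trials: List[Trial], n_sd: float = 2.5) -> List[Trial]:
    """Drop trials more than n_sd SDs from that participant's own mean."""
    by_participant: Dict[str, List[float]] = defaultdict(list)
    for pid, rt in trials:
        by_participant[pid].append(rt)

    bounds: Dict[str, Tuple[float, float]] = {}
    for pid, rts in by_participant.items():
        if len(rts) < 2:
            # An SD cannot be estimated from one trial; keep it.
            bounds[pid] = (float("-inf"), float("inf"))
            continue
        m, s = mean(rts), stdev(rts)
        bounds[pid] = (m - n_sd * s, m + n_sd * s)

    return [(pid, rt) for pid, rt in trials
            if bounds[pid][0] <= rt <= bounds[pid][1]]


# Synthetic participant: 19 trials at 1,000 ms and one extreme trial.
trials = [("p01", 1000.0)] * 19 + [("p01", 10000.0)]
kept = trim_by_participant(trials)
excluded_pct = 100.0 * (len(trials) - len(kept)) / len(trials)
```

In this synthetic example only the 10,000 ms trial falls outside the participant's cutoffs, so 5% of the trials are excluded.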

One-Minute Reading Test
More words were read aloud correctly per minute than pseudowords (b = 24.84; SE = 1.49; z = 16.67). The condition means are shown in Figure 1.

Fixation Durations
We recorded the first fixation duration (FFD), which is the duration of the first fixation on the word/pseudoword, and gaze duration (GD), which is the sum of all fixations on the word/pseudoword before the eyes left the stimulus. For each measure, we deleted durations beyond 2.5 standard deviations from the grand mean (FFD = 2.78%; GD = 3.18%) prior to statistical analysis. We observed a significant difference between words and pseudowords for FFD (b = 0.82; SE = 0.12; t = 6.55) and for GD (b = 1.19; SE = 0.19; t = 6.03), with longer durations for pseudowords compared to words. Condition means are reported in Figure 2.
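The two fixation measures defined above can be derived from a chronological list of fixations, as in the following sketch. The function name first_pass_measures and the (duration, on_stimulus) layout are hypothetical. The refixation flag, defined here as more than one fixation on the stimulus before the eyes leave it, is our assumption, since the text does not spell out the definition used.

```python
from typing import Dict, List, Optional, Tuple, Union

FixationRecord = Tuple[float, bool]  # (duration_ms, on_stimulus), hypothetical


def first_pass_measures(
    fixations: List[FixationRecord],
) -> Optional[Dict[str, Union[float, bool]]]:
    """Compute FFD, GD and a refixation flag for one trial.

    Returns None if the stimulus was never fixated.
    """
    first_pass: List[float] = []
    for duration_ms, on_stimulus in fixations:
        if on_stimulus:
            first_pass.append(duration_ms)
        elif first_pass:
            break  # the eyes left the stimulus: first pass is over
    if not first_pass:
        return None
    return {
        "ffd": first_pass[0],            # first fixation duration
        "gd": sum(first_pass),           # gaze duration
        "refixated": len(first_pass) > 1,
    }


# Synthetic trial: one fixation off the stimulus, two on it, one off, one on.
trial = [(150.0, False), (220.0, True), (180.0, True), (200.0, False), (300.0, True)]
measures = first_pass_measures(trial)
```

For this synthetic trial FFD is 220 ms and GD is 400 ms; the later 300 ms fixation is not counted because it occurs after the eyes have left the stimulus.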

Effects of Length
The average values for all dependent measures for the different lengths are shown in Tables 1 and 2.

Reading Speed
The comparison of reading speed for words and pseudowords in the lexical decision task and the One-Minute Reading test can be found in Figure 1. We calculated the correlations between average reading speed per child in these two tasks separately for words and pseudowords. The correlations were significant for both words (r = 0.63, p < 0.05) and pseudowords (r = 0.59, p < 0.05). Figure 3 shows the scatter plots of these correlations. We also examined the relation between gaze durations, a gold standard for estimating word reading fluency in eye-movement research (e.g., Rayner, 1998), and Lexical Decision (r = −0.76, p < 0.05) and One-Minute Reading speed (r = −0.62, p < 0.05). The scatter plots of the correlations are shown in Figure 4.
Finally, we examined the complete set of correlations across our different dependent measures. Table 3 provides the matrix of correlation coefficients between all dependent measures. We highlight values of |r| > 0.6. We observed two negative correlations with gaze durations (GD): with the OMR and LDT reading speeds, meaning that faster readers (higher reading speed) had shorter gaze durations. We also noted three positive correlations: one between the OMR and LDT reading speed scores, one between the refixation rate and first fixation durations, and another between refixation rate and gaze duration. The two latter correlations suggest that participants who made longer first fixations tended to refixate more often, hence the longer gaze durations.
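The cross-task correlations can be illustrated with the sketch below, which converts lexical decision response times into a reading speed and computes a Pearson correlation with OMR scores. The conversion of one response time into words per minute as 60,000 divided by the response time in ms is our assumption; the text does not give the formula. The per-child values are invented for illustration only and are not data from the study.

```python
import math
from typing import Sequence


def rt_to_wpm(rt_ms: float) -> float:
    """Assumed conversion: one item per response, expressed per minute."""
    if rt_ms <= 0:
        raise ValueError("response time must be positive")
    return 60000.0 / rt_ms


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


# Invented per-child means, for illustration only.
ldt_rt_ms = [1200.0, 1500.0, 2000.0, 2400.0, 3000.0]
omr_wpm = [62.0, 50.0, 41.0, 30.0, 22.0]

ldt_wpm = [rt_to_wpm(rt) for rt in ldt_rt_ms]
r = pearson_r(ldt_wpm, omr_wpm)
highlighted = abs(r) > 0.6  # same highlighting rule as in Table 3
```

The same pearson_r function can be applied to every pair of dependent measures to build a correlation matrix and flag the cells with |r| > 0.6.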

DISCUSSION
The goal of the present study was to provide a proof-of-concept that a virtual reality set-up can be used to measure reading fluency and eye movements during silent reading in primary school children. The internal validity was assessed using two classic benchmark measures of reading, effects of lexicality and word length. The external validity was assessed by comparing lexical decision performance and eye movement measures obtained in the virtual reality setting to a gold-standard reading fluency measure (OMR test). It is important to note that our research follows the AERA recommendations of "Standards for Educational and Psychological Testing" (American Educational Research Association (AERA), 2014). Concerning internal validity, first of all, there were clear effects of lexicality in all our behavioral and eye movement measures obtained in the virtual reality setting. Children made more errors on pseudowords compared to words and took longer to respond to pseudowords compared to words. The two eye tracking measures (FFD and GD) also showed that children spent significantly more time inspecting pseudowords compared to words, and in line with this, children also re-fixated pseudowords more often than words. Secondly, there were clear effects of length on all our dependent measures except for lexical decision error rates to pseudowords. All other measures provided evidence that longer stimuli were harder to process, with longer response times and more errors in the lexical decision task, and longer fixation durations and more re-fixations in the eye movement measures. Given that the effects of word length are excellent measures of the automatization of reading processes, they could be used to detect children who have not yet fully automatized word recognition procedures, that is, children who still exhibit some form of serial processing that is characteristic of dyslexia (Ziegler et al., 2003).
Concerning external validity, we found that the VR-LDT and OMR tasks produced almost identical effects of lexicality. Moreover, there was a very strong correlation between the VR-LDT and OMR reading speed measures. It is, of course, the case that silent reading and reading aloud measures naturally correlate and this alone should not be taken to suggest that VR methods produce more robust correlations with reading aloud than classic silent reading tasks. Yet, the high correlation is not a trivial result because the OMR task is a reading aloud (production) measure that requires the exact pronunciation of a letter string, while the LDT task is a silent reading/visual word recognition measure that does not require the computation of a word's pronunciation (Grainger and Jacobs, 1996; Dufau et al., 2012). In addition, OMR requires individual and supervised testing (i.e., an adult has to record the number of words read aloud), while the VR-LDT test can be done in an unsupervised, automatized fashion. The fact that these measures correlate so strongly points to a promising avenue for individualized high-quality assessment of reading fluency that does not require the intervention of an expert assessor.
Concerning the strong correlation between reading speed (wpm) in the lexical decision task and gaze durations (r = −0.76) found in the present study, this is in line with one prior study investigating such a relation with standard eye movement recording techniques during sentence reading (Schilling et al., 1998). However, given the results of more recent investigations that have revealed much lower correlations (Kuperman et al., 2013; Dirix et al., 2019), it seems likely that the high correlation found in our study is linked to the fact that the eye movement measures were obtained with isolated stimuli and not for words presented in a sentence context. In particular, the work of Dirix et al. (2019) demonstrates the limits of using lexical decision as a proxy for real-life reading, hence the importance of complementing this measure with eye movement measures as in the present work.
A major limitation of the present study is that we have not yet used any of the typical features of virtual reality environments related to the construction of a highly interactive 3D spatial environment. Also, as children moved their heads in our study, they did not get different views of the word they were looking at. While we fully acknowledge these limitations, it is important to note that these interesting aspects of VR were clearly beyond the scope of the present article. The primary goal of our study was to provide a proof-of-concept that VR technology can provide a reliable, valid, and child-friendly way to measure reading fluency in children, without the intervention of skilled assessors, and in a normal school environment. We successfully demonstrated that one can obtain reliable word recognition and eye movement measures with a procedure that can be applied in noisy school environments.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below. https://osf.io/m8j2z.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Comité de Protection des Personnes SUD-EST IV (No. 17/051). Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.