Assessing goal-directed behavior in virtual reality with the neuropsychological task EPELI: children prefer head-mounted display but flat screen provides a viable performance measure for remote testing

Seesjärvi, Erik; Laine, Matti; Kasteenpohja, Kaisla; Salmi, Juha

doi:10.3389/frvir.2023.1138240

ORIGINAL RESEARCH article

Front. Virtual Real., 26 May 2023

Sec. Virtual Reality and Human Behaviour

Volume 4 - 2023 | https://doi.org/10.3389/frvir.2023.1138240

Erik Seesjärvi¹,²*

Matti Laine³

Kaisla Kasteenpohja¹

Juha Salmi⁴,⁵,⁶

¹Department of Psychology and Logopedics, University of Helsinki, Helsinki, Finland
²Child Neurology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
³Department of Psychology, Åbo Akademi University, Turku, Finland
⁴Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland
⁵MAGICS, Aalto University, Espoo, Finland
⁶Aalto Behavioral Laboratory, AMI-centre, Aalto University, Espoo, Finland

Background and objective: EPELI (Executive Performance in Everyday LIving) is a Virtual Reality (VR) task that was developed to study goal-directed behavior in everyday life contexts in children. In this study, 72 typically developing 9- to 13-year-old children played EPELI with an immersive version implemented with a head-mounted display (HMD) and a non-immersive version employing a flat screen display (FSD) in a counterbalanced order to see if the two versions yield similar results. The children’s everyday executive functions were assessed with the parent-rated Behavior Rating Inventory of Executive Function (BRIEF) questionnaire. To assess the applicability of EPELI for online testing, half of the flat screen display version gameplays were conducted remotely and the rest in the laboratory.

Results: All EPELI performance measures were correlated across the versions. The children’s performance was mostly similar in the two versions, but small effects reflecting higher performance in FSD-EPELI were found in the measures of Total score, Task efficacy, and Time-based prospective memory score. The children engaged in more active time monitoring in FSD-EPELI. While the children evaluated the feeling of presence and usability of both versions favorably, most children preferred HMD-EPELI, and evaluated its environment to be more involving and realistic. Both versions showed only negligible problems with the interface quality. No differences in task performance or subjective evaluations were found between the home-based and laboratory-based assessments of FSD-EPELI. In both EPELI versions, the efficacy measures were correlated with BRIEF on the first assessment, but not on the second. This raises questions about the stability of the associations reported between executive function tasks and questionnaires.

Conclusions: Both the HMD and FSD versions of EPELI are viable tools for the naturalistic assessment of goal-directed behavior in children. While the HMD version provides a more immersive user experience and naturalistic movement tracking, the FSD version can maximize scalability, reachability, and cost-efficiency, as it can be used with common hardware and remotely. Taken together, the findings highlight similarities between the HMD and FSD versions of a cognitively complex VR task, but also underline the specific advantages of these common presentation modes.

1 Introduction

The literature of virtual reality (VR) based cognition research is expanding at a rapid pace, reflecting the increasing availability of VR systems and their technological advancements (Cipresso et al., 2018; Krohn et al., 2020). Notably, VR has been suggested as an ideal way of implementing new naturalistic paradigms that mimic real-life functions and situations (Chan et al., 2008; Parsons, 2015; Parsons et al., 2017), as it offers safe and flexible ways to create various easily-reproducible environments and allows diverse behavioral responses (e.g., movements of the eyes, head, and body) to be measured accurately (see Campbell et al., 2009). Such naturalistic tasks could complement more traditional cognitive laboratory tasks which are often repetitive, contain a limited set of stimuli, and permit only restricted behavioral responses such as a button press (Hatfield, 2002). Naturalistic tasks can also be more sensitive to cognitive impairments in situations where more traditional tasks fail to detect them (Shallice & Burgess, 1991; Cipresso et al., 2014) and could offer better predictive value for everyday functioning (e.g., Burgess et al., 2006; Chan et al., 2008; Parsons et al., 2017; Seesjärvi et al., 2022a; Seesjärvi et al., 2022b). Different VR environments permit the researcher to present dynamic stimuli in a way that allows for both the veridical control of laboratory measures and the verisimilitude of naturalistic observation of real-life situations (Parsons, 2015).

VR can be accomplished through several technical solutions that differ, among other things, in their immersiveness. An immersive VR system can be defined as one that allows the participant to perceive the environment and interact with it through natural sensorimotor contingencies (Slater & Sanchez-Vives, 2016), or as a system that blurs the lines between the physical and virtual worlds (Suh & Prophet, 2018). High immersiveness requires effective sensory substitution, which depends on factors like wide field-of-view, stereo vision/sound, head tracking for changing the field of view, short latency from head movement to display, and high-resolution displays (Slater & Sanchez-Vives, 2016). In broad terms, the systems implemented with head-mounted displays (HMDs) and dedicated position-tracking controllers or camera-based hand tracking can be regarded as immersive VR, and the systems based on flat screen displays (FSDs) and more traditional interaction devices (e.g., keyboards, joysticks, and mice) as non-immersive VR (e.g., Suh & Prophet, 2018; Di Natale et al., 2020). The sense of presence is a subjective correlate of immersion and can be defined as having the illusion of “being there” in the VR environment while being aware of not actually being there (Slater & Sanchez-Vives, 2016). Importantly, the sense of presence can be considered a key aspect of a virtual experience and its ecological validity, as it can be argued that only when the participant has a strong sense of presence in the virtual experience will s/he show the same kind of reactions to it that may be expected under real-life circumstances (Kothgassner & Felnhofer, 2020) and perform the tasks as s/he would in real life (Pan & Hamilton, 2018; Slater, 2018).

HMDs and the related peripherals have several benefits over traditional FSDs and their interaction devices. They can more closely emulate real-life sensory-motor experiences than FSDs by matching the criteria for high immersiveness to a greater extent. For example, turning the head with an HMD alters the view in the virtual world in parallel with the actual physical movements, which cannot be accomplished with common FSDs. Typical hand controllers of the current HMD systems track their rotation and position, so turning and moving the physical controller leads to similar rotations and movements in the controller projected to the virtual space. HMDs offer a stereoscopic visual experience, and with current hardware the field of view (FOV) is markedly larger than that of a typical FSD (Parsons, 2015). Furthermore, HMDs usually block the view of the surrounding physical environment completely, which can further increase immersiveness (see Slater, 2018). These differences can lead to higher perceived presence when using HMDs (Tan et al., 2015; Pallavicini et al., 2018; Makransky et al., 2019; Pallavicini et al., 2019; Pallavicini & Pepe, 2019; Yao & Kim, 2019; Chang et al., 2020; Li et al., 2020; Caroux, 2023) and have behavioral implications, such as greater physical effort with HMDs (Yao & Kim, 2019).

There are also potential disadvantages with HMDs when compared to FSDs. To avoid some of these disadvantages, the implementation of HMD-based neuropsychological tasks calls for special consideration of aspects such as how controls that facilitate naturalistic interactions are achieved, how these controls are learnt by everyone so that gamers will not have an advantage over those participants who do not play regularly or at all, what kind of hardware is required for smooth graphics, how the measurement of targeted cognitive domains or behavior is accomplished, and importantly, how potential cybersickness symptoms like nausea, dizziness, and headache are avoided (Kourtesis et al., 2020). The earlier HMDs were sometimes reported to cause cybersickness symptoms (Bohil et al., 2011), but these have been markedly reduced or have disappeared with the newer generation of HMDs (Kourtesis et al., 2019; Weech et al., 2019). Eradicating cybersickness is vital not only for the comfort of the participant but also for ecological validity, as the sense of presence and cybersickness are negatively associated (Weech et al., 2019). Recent studies have provided insights on how cybersickness is related to display lag in virtual and physical head pose (Palmisano et al., 2020) and how it can be countered by dynamic FOV restriction (Teixeira & Palmisano, 2021). This information helps researchers to design their paradigms in a way that minimizes the risk of these adverse effects. Still, cybersickness symptoms might arise in situations where there is a conflict between perceived and physical movements (Bohil et al., 2011; Palmisano et al., 2020), and some individuals, such as those with autism spectrum disorder, might be especially prone to them (Parsons et al., 2017).
Because of these potential adverse effects, FSDs might be the preferred choice in some situations, for example in wheelchair training (Alapakkam Govindarajan et al., 2022) or in a race driving simulation (Walch et al., 2017). FSDs are widely available, and the related interfaces and operating systems are highly familiar even for less technically oriented users. Using HMD systems might require additional investment, and the users may sometimes need training to use the interfaces and software. Overall, the FSD systems are cost-efficient, easy-to-use, and flexible, especially in certain situations such as remote testing with automated web platforms.

As both HMD- and FSD-based systems provide means for implementing similar tasks but have different advantages as discussed above, it is essential to compare their unique characteristics so that informed decisions can be made when choosing between the two. Making such decisions for naturalistic neuropsychological tasks is currently hampered by the small number of studies that compare the two technologies by implementing such tasks in both. Furthermore, because of the rapid advances in the HMD technology, the results of earlier studies with older HMD models might not apply to the current hardware.

Regarding learning outcomes, some studies have compared HMDs and FSDs with a task that was implemented similarly between the conditions (e.g., Makransky et al., 2019; Ventura et al., 2019; Barrett et al., 2022). In a within-subjects study comparing learning in a science lab simulation in HMD and FSD conditions, Makransky et al. (2019) found that students reported having a stronger sense of presence during the HMD condition, but they also learned less and had significantly higher cognitive load based on electroencephalogram (EEG). Studying category learning, Barrett et al. (2022) found no significant group differences in learning accuracy between HMD and comparison conditions (FSD with 3D and 2D stimuli), although the participants in the HMD group had increased fixation counts. Contrasting these findings, Ventura et al. (2019) found stronger memory performance after an immersive HMD condition than after a non-immersive tablet flat screen condition. Thus, the use of either HMDs or FSDs may result in better memory performance and more effective learning, but this could also depend on the specific task and hardware.

Several FSD-based traditional cognitive tasks, which include only a small set of stimuli and behavioral responses, have also been successfully adapted to HMDs. In their original form, many of these laboratory tasks have limitations such as their two-dimensional environment, non-naturalistic responses (e.g., using a keyboard or response box) and stimulus dynamics, and a substantial divergence from looking realistic (Kourtesis & MacPherson, 2021), which affects their immersiveness. Although the original versions of these tasks have low immersiveness, some of their HMD adaptations have made use of the immersive capabilities of the technology. As an example, Armstrong et al. (2013) compared a Stroop task embedded in an HMD-VR scene with a FSD version and a paper-and-pencil version of the same task. They found the reaction time measures in all three conditions (Word reading, Color naming, and Interference) to be correlated between the VR and the FSD version (r = 0.64–0.75), but between the VR Stroop task and the paper-and-pencil version, reaction time was only correlated in the Interference condition (r = 0.49). Another cognitive task for which several different HMD versions exist is the Continuous Performance Test. However, several of them do not merely aim to be faithful replications of the FSD versions, but also take advantage of HMDs’ extended possibilities, for example, by including extraneous distractors (see the meta-analysis by Parsons et al., 2019). To study the convergent validity of an HMD-based Continuous Performance Test called AULA Nesplora, Díaz-Orueta et al. (2014) compared it to a FSD version (Conners’ Continuous Performance Test) in a group of children aged 6–16 years. They found that all key measures (omissions, commissions, reaction time, reaction time variability) were correlated between the two versions (ρ = 0.36–0.79). Li et al. (2020) implemented FSD and HMD versions of the Posner task in a within-subject design and studied the related attentional processes, which were found to be enhanced in the HMD version, according to both behavioral data and EEG responses. Based on these findings, the authors suggested that the allocation of attentional resources would be more effective with an HMD compared to the FSD condition. Moreover, their participants evaluated that the sense of presence was strong during the HMD but weak during the FSD condition. In sum, these studies suggest that HMD and FSD versions of these cognitive tasks seem to be measuring the same phenomena, but differences between the two platforms can affect participants’ performance and subjective experience to some extent.

Another important methodological issue concerns the pros and cons of laboratory-based versus home-based testing via the Internet. Home-based remote testing can be an attractive and efficient option in many research and clinical settings, such as in large-scale data collection (Feenstra et al., 2017). It is especially well-suited for the FSD systems, as the required hardware (i.e., regular home computers) is widely available. As the COVID-19 pandemic has shown, face-to-face testing might become impossible for reasons that are beyond the researcher’s control (Zuber et al., 2021). However, it is not guaranteed that unsupervised remote testing with varying hardware would produce results as reliable as those from laboratory-based testing with fixed equipment. While some authors have found comparable performance between web- and laboratory-based testing (Germine et al., 2012), others have found some disparity between laboratory and online results (Crump et al., 2013). Backx and others (2020) used a within-subject design to examine the comparability of performance in the Cambridge Neuropsychological Test Automated Battery (CANTAB) under two conditions: an unsupervised web-based test situation and a typical in-person lab-based assessment. The test-retest stability was found to be comparable to previous studies with CANTAB, as the intraclass correlations ranged from 0.23 to 0.67, with high correlations (>0.60) in 3/9 performance indices and 2/5 reaction time measures. Performance indices did not differ between the conditions and generally showed satisfactory agreement, and learning effects were present in 3/9 indices. However, reaction times were slower during web-based assessments, which undermined their equivalence and agreement. This was likely due to variations in computer hardware.
Also using a within-subjects design, Zuber and others (2021) found moderate-to-high correlations (r = 0.56–0.68) between laboratory and online assessment in a prospective memory task called the Geneva Space Cruiser. Overall, while remote testing is an attractive option for various research and clinical settings, each online implementation needs to be studied separately to ensure its applicability and the robustness of the results.

There are several studies on naturalistic VR tasks that simulate daily functions and activities. Some have used FSDs (e.g., Rand et al., 2009; Jovanoski et al., 2012; Raspelli et al., 2012; Cipresso et al., 2014; Ruse et al., 2014) while others have employed HMDs (e.g., Barnett et al., 2021; Chicchi Giglioli et al., 2021; Kourtesis et al., 2021; Ouellet et al., 2018; Parsons & Barnett, 2017; Porffy et al., 2022; see also the reviews by Neguț et al., 2016; Parsons, 2015; and Pieri et al., 2023). Regarding the Multiple Errands Test, which was originally devised to be performed in real-life environments (Shallice & Burgess, 1991; see also Rotenberg et al., 2020), several desktop FSD versions have been implemented (Rand et al., 2009; Jovanoski et al., 2012; Raspelli et al., 2012; Cipresso et al., 2014), as well as a simplified tablet version to serve as a brief screening tool (Webb et al., 2021). These studies have not included any direct comparison between FSDs and HMDs, although the tasks included could be implemented with small adjustments in both. However, some other studies with relatively recent HMD hardware have compared the two technologies directly by implementing the same task with both, although their scenarios were not taken directly from ordinary daily life (Brooks et al., 2017; Chang et al., 2020). Brooks and others (2017) compared the HMD and FSD versions of a military flight simulator in a within-subjects study and found no difference in target detection performance between the two versions, but their participants reported higher mental workload and discomfort when using the HMD. Contrasting these findings, Chang and others (2020) performed a between-subjects study using a driving simulation with an embedded Stroop task to compare HMD and FSD conditions. They found that participants using an HMD performed better in the virtual driving but did not differ in self-reported mental effort and psychophysiological responses compared to the FSD condition.
However, the authors found that users in the FSD condition had a shorter average reaction time on the Stroop trials, which they interpreted as an indication that driving required more selective attention in the HMD condition. This may have led to slower responses in the Stroop task. These two studies, as well as the aforementioned studies of learning outcomes and traditional cognitive tasks, provide an important reference for further studies comparing FSD and HMD platforms but leave open what differences could exist between the FSD and HMD versions of tests with more open-ended naturalistic scenarios.

Recently, we developed EPELI (Executive Performance in Everyday LIving) with HMD to study goal-directed behavior of children in everyday contexts (Seesjärvi et al., 2022a; Seesjärvi et al., 2022b). To our knowledge, EPELI is the first naturalistic VR task for children that requires the participants to carry out multiple tasks from memory by navigating a virtual home and interacting with the relevant target objects, while keeping track of the time and ignoring non-relevant distracting objects and events. Successful performance in such goal-directed actions requires attentional, executive, and memory resources (Seesjärvi et al., 2022a). We have previously shown that the most important measures in HMD-EPELI show acceptable internal consistency, and the measure of task efficacy in particular is associated with parent-rated problems of executive function (Seesjärvi et al., 2022a; Seesjärvi et al., 2022b). The children evaluated HMD-EPELI to be very enjoyable and reported only negligible cybersickness symptoms (Seesjärvi et al., 2022a; Seesjärvi et al., 2022b). In a study using HMD-EPELI and a sample of school-aged children, some measures were associated with age (older children outperforming younger), gender (girls outperforming boys), and verbal encoding ability (children with better ability outperforming those with worse; Seesjärvi et al., 2022a). Notably, there were no significant associations of gaming background, task familiarity, or HMD type (Oculus GO vs. Pico Neo 2 Eye) with the EPELI measures (Seesjärvi et al., 2022a). Even though HMD-EPELI does take advantage of additional benefits of HMDs, such as using natural head movements for looking around the VR environment, the task itself is also well-suited for the FSD systems.

The main aim of the current study was to compare HMD and FSD implementations of a naturalistic VR task, EPELI. To our knowledge, this is the first study to make such a comparison with a naturalistic task that calls for goal-directed behavior in varied but typical everyday scenarios. Therefore, the study was expected to make a valuable contribution to the VR-based literature of cognition research, as these function-led paradigms take full advantage of the new technological possibilities and can be the hallmark of VR-based cognition research (Parsons et al., 2017). For this study, we developed a FSD version of EPELI that enabled us to examine the similarities and differences between the HMD and FSD implementations in a counterbalanced within-subjects design. A successful FSD implementation of an HMD-based naturalistic cognitive task could significantly widen its applicability in various situations, especially in remote testing. Therefore, we also studied the feasibility of parent-supervised remote testing by asking half of the participants to perform FSD-EPELI at home. Furthermore, we wanted to re-examine the associations between EPELI efficacy measures and parent-rated difficulties in executive function, which have previously been reported between HMD-EPELI and BRIEF (Seesjärvi et al., 2022a; Seesjärvi et al., 2022b). Finally, inter-version (FSD/HMD) and test-retest correlations were analyzed, as these provide important insights into the reliability and stability of a task.

The specific research aims were as follows:

1) To examine similarities and differences in task performance measures between the FSD and HMD versions and learning effects between the first and second assessment.

2) To probe similarities and differences in subjective experience ratings between FSD- and HMD-EPELI.

3) To study similarities and differences in FSD-EPELI task performance measures and subjective experience ratings between experimenter-supervised laboratory testing and parent-supervised home testing.

4) To inspect possible associations between FSD- and HMD-EPELI efficacy measures and parent-rated EF difficulties (BRIEF questionnaire; Gioia et al., 2000).

5) To assess the inter-version (FSD vs. HMD) correlations and test-retest stability of EPELI.

2 Materials and methods

2.1 Participants

The study included 101 typically developing children from Kirkkonummi and Espoo, Finland (see Supplementary Material for further information about the recruitment process). The inclusion criteria were a) native language Finnish and b) age of 9–12 years when recruited for the study. The exclusion criteria were a) any psychiatric, behavioral, or neurodevelopmental disorders (F00–F99 in ICD-10; World Health Organization, 1992) and b) decision of special support at school. For 29 children, the EPELI data for one of the two sessions (see 2.3 Procedure) was missing because of dropping out of the study after the first session or due to technical problems. Thus, the final sample comprised 72 typically developing children (29 girls and 43 boys, mean age of all participants 11.0 years and age range 9.4–13.0 years; for descriptive statistics, see Supplementary Table S1), who had successfully taken part in both sessions. The study was approved by the Helsinki University Hospital Ethics Committee, and informed consent according to the Declaration of Helsinki was obtained from children and their parents. Each child received four movie tickets for participating.

2.2 The EPELI task

EPELI (https://aalto.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=3eb4836f-1238-4f27-853a-ad3700745b31; for the original description, see Seesjärvi et al., 2022b) is a naturalistic task of goal-directed behavior. It was designed with equal contributions by ML, JS, and ES, inspired by tasks simulating everyday life requirements, such as the Virtual Week (Rendell & Craik, 2000) and Multiple Errands Test (Shallice & Burgess, 1991). With all 13 scenarios and the practice session, EPELI takes on average approximately 27 min to complete. It was first implemented with HMD technology and then converted to FSD for this study. The key differences between the versions are as follows: a) in the FSD version, the participant uses a mouse/trackpad to change the direction of the view, whereas when using HMD, this can be accomplished by rotating the head; b) in the FSD version, the FOV is markedly smaller (approximately 25–60° versus 101° in the HMD version; see Supplementary Material); c) in the HMD version, the view of the surrounding physical environment is blocked by the goggles, and the technology provides stereoscopic view; d) while in both versions participants interact with objects by pointing at them and clicking a button, in the HMD version this can be done independently from the direction of the view by rotating the hand controller until the ray coming from the virtual hand controller object is pointing at the desired object, whereas in the FSD version the participant is required to turn the direction of the view until the desired object is located in the crosshairs in the middle of the screen (see Figure 1D); e) in the HMD version, the clock is viewed by raising the hand and looking at the virtual hand controller object (see Figure 1C), whereas in the FSD version, there is a white circle at the lower right corner of the screen that reveals a clock when the second mouse/trackpad button is pressed (see Figure 1D).
In the HMD version, the participants used Oculus Go goggles (2560 × 1440 resolution, 60/72 Hz refresh rate, 16:9 aspect ratio, and 101-degree horizontal FOV) and the related hand controller in a sitting position (see Figure 1A). In the FSD version, the participants used typical laptop/desktop computers and a web browser (see Figure 1B). For further details on the version differences, see Supplementary Material.

FIGURE 1. Pictures and screenshots from an EPELI session. (A), a participant performing HMD-EPELI. (B), the same participant during FSD-EPELI. (C), a screenshot from HMD-EPELI showing the virtual hand controller with the clock. (D), a screenshot from FSD-EPELI showing the clock in the lower right corner of the screen and the crosshairs in the middle of the screen.

For both versions, the eight EPELI performance measures (Total score, Task efficacy, Navigation efficacy, Controller motion, Total actions, Time-based prospective memory score = TBPM, Clock checks, and Event-based prospective memory score = EBPM) described in an earlier study (Seesjärvi et al., 2022a) were included in the analyses. The only difference in the descriptions concerns the measure of Controller motion in FSD-EPELI. In the FSD version, rotating the view needs to be done with a mouse/trackpad as opposed to the natural head movement utilized in the HMD version. Therefore, this measure is likely to tap somewhat different aspects of behavior in the two versions.

2.3 Procedure

The study included two assessment sessions, one with HMD-EPELI and the other with FSD-EPELI, performed in a counterbalanced order 3.3–10.5 months apart (Figure 2). After both EPELI versions, the children orally answered a translated version of the Simulator Sickness Questionnaire (Kennedy et al., 1993; see also Seesjärvi et al., 2022a) and a shortened version of the Presence Questionnaire 3.0 (Witmer et al., 2005; see also Seesjärvi et al., 2022a). After HMD-EPELI, they also answered a gaming experience questionnaire (Seesjärvi et al., 2022b). To probe their familiarity with the task contents, the children were also asked “On a scale of 1 (not at all) to 7 (very much), how much have you performed similar tasks in real life?”. After FSD-EPELI performed at home, the family also filled out a hardware questionnaire (see Supplementary Table S2). The parents filled out the Behavior Rating Inventory of Executive Function questionnaire (BRIEF; Gioia et al., 2000), from which the raw score of Global Executive Composite (GEC) was used. There was no difference in the average time between the sessions between the groups who performed the HMD part and the FSD part first, but for both groups, the delay between the sessions was longer than planned owing to the restrictions imposed by the global COVID-19 pandemic. All participants performed the HMD-EPELI session in the laboratory, while the FSD-EPELI session was performed in the laboratory or an equivalent dedicated school room by 37 children and at home by 35 children. The children were assisted and supervised in the laboratory by one of the researchers or by a trained research assistant, and at home by a parent. After performing EPELI (either HMD or FSD) and the related questionnaires in the second session, the children were also asked which version was more realistic, preferable, and easier to play, with response alternatives HMD/FSD/“I don’t know”. For detailed information about the procedure, see Supplementary Material.

FIGURE 2. The study design.

2.4 Statistical analyses

All statistical analyses and data visualization were done in R version 4.0.3 (R Core Team, 2020) with the additional packages data.table (Dowle & Srinivasan, 2021), stringr (Wickham, 2019), stringi (Gagolewski, 2020), lme4 (Bates et al., 2015), lmerTest (Kuznetsova et al., 2017), effectsize (Ben-Shachar et al., 2020), tidyverse (Wickham et al., 2019), ppcor (Kim, 2015), dplyr (Wickham et al., 2021), ggplot2 (Wickham, 2016), gridExtra (Auguie, 2017), patchwork (Pedersen, 2020), and psych (Revelle, 2020).

First, the data were inspected for missing values, data handling errors, and possible outliers. The questionnaires to be filled out after FSD-EPELI (see 2.3 Procedure) were missing for six participants in the home group. Also, one parent had not answered the BRIEF questionnaire after FSD-EPELI. Univariate outliers in EPELI, BRIEF, and the presence questionnaire were first identified visually and confirmed numerically using a cutoff of three standard deviations above or below the mean. For FSD-EPELI, this was done separately for the lab and home groups. As a result, three HMD-EPELI gameplays, two FSD-EPELI gameplays, and two BRIEF questionnaires were removed from the data, as at least one variable was confirmed to be an outlier. The observations removed comprised 3.2% of the total data. The data were then checked for multivariate outliers using the same cutoff, but none were found. The average administration time did not differ between the versions (on average 27.5 min for the FSD version and 27.8 min for the HMD version, t(122) = −0.59, p = 0.55) and was very close to what had been observed in previous studies (Seesjärvi et al., 2022a; Seesjärvi et al., 2022b).
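
The numerical confirmation of univariate outliers with a three-standard-deviation cutoff can be illustrated with a short sketch. The analyses in this study were run in R; the Python code below is only an illustration under stated assumptions. The function name `flag_univariate_outliers`, the use of the sample standard deviation, and the example values are hypothetical and are not taken from the study data.

```python
from statistics import mean, stdev


def flag_univariate_outliers(values, cutoff=3.0):
    """Return one boolean per value: True if the value lies more than
    `cutoff` sample standard deviations above or below the mean.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the source does not state which estimator was applied.
    """
    center = mean(values)
    spread = stdev(values)
    if spread == 0:
        # All values identical: nothing can be flagged.
        return [False] * len(values)
    return [abs(v - center) > cutoff * spread for v in values]


# Hypothetical scores for one measure within one group (not study data).
scores = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
          10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 40]
flags = flag_univariate_outliers(scores)
print([s for s, f in zip(scores, flags) if f])
```

In a design like the one described, such a check would be run per variable, and for FSD-EPELI separately within the lab and home groups, so that each group is compared against its own mean and spread.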

Similarities and differences in task performance between FSD- and HMD-EPELI and between the first and second sessions were evaluated with general linear mixed models (LMM) with each EPELI variable except Controller motion, one at a time, as the dependent variable, EPELI version (FSD/HMD) and time (first/second session) as fixed factors, and participant as a random factor. This analysis was not performed for Controller motion, as it measures somewhat different aspects in the FSD and HMD versions due to the differences in the control interfaces. In the models with Total actions and Clock checks as the dependent variable, the error term distributions did not follow a normal distribution. Therefore, additional generalized LMMs using a Poisson distribution were fitted for these variables. These models yielded very similar results, and therefore only the general LMMs are reported. The lmer function from the lme4 package was used for the LMMs, and the effect sizes were estimated with the t_to_d function from the effectsize package. Effect sizes were estimated as Cohen’s d and interpreted as suggested by Conner et al. (2022) as small (>0.20), medium (>0.50), or large (>0.80).
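As a rough illustration of how a t statistic maps onto Cohen’s d, the sketch below uses the paired-samples conversion d = t / √df. The text does not state which conversion option was used, so this choice is an assumption; it does, however, appear to reproduce several of the effect sizes reported in Section 3.1 after rounding. The function name is hypothetical and the Python code is not the R function itself.

```python
import math


def t_to_d_paired(t, df):
    """Approximate Cohen's d from a t statistic and its error degrees of
    freedom using the paired-samples conversion d = t / sqrt(df).
    Assumption: this conversion is not stated in the text."""
    return t / math.sqrt(df)


# (t, df, reported d) triples taken from Section 3.1.
reported = [
    (4.544, 69.155, 0.55),  # clock-viewing duration, version effect
    (2.642, 67.40, 0.32),   # Total score without TBPM, version effect
    (6.786, 67.38, 0.83),   # Total score without TBPM, time effect
]

for t, df, d_reported in reported:
    print(t, df, round(t_to_d_paired(t, df), 2), d_reported)
```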

Similarities and differences in subjective experience between the FSD and HMD versions were assessed as follows. First, LMMs were fitted with each Presence questionnaire item as the dependent variable, version (FSD/HMD) and time (first/second session) as fixed factors, and participant as a random factor. As the error term distributions in the models of questions 5, 6, 7, 8, and 12 were not normal, the main effects of version and time on these questions were confirmed with Wilcoxon signed-rank tests with continuity correction. These results are in line with those obtained with LMMs and are not shown. Second, the difference in Simulator Sickness Questionnaire between the versions was tested with a Wilcoxon signed-rank test with continuity correction. Third, any possible differences in the three questions regarding head-to-head comparison of the versions (FSD/HMD) were tested with exact binomial tests.
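The logic of the exact binomial test can be sketched as follows. The sketch assumes that the test compares the number of children choosing HMD against a chance level of 0.5 among those who chose one of the two versions, and that the two-sided p-value sums the probabilities of all outcomes no more likely than the observed one; neither detail is spelled out in the text. The function name is hypothetical, and the counts are those reported in Section 3.2.

```python
from math import comb


def binom_test_two_sided(successes, n, p=0.5):
    """Two-sided exact binomial p-value: the summed probability of every
    outcome that is no more likely than the observed one under H0."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    # Small tolerance so that ties in probability are counted despite
    # floating-point rounding.
    threshold = pmf[successes] * (1 + 1e-7)
    return min(1.0, sum(q for q in pmf if q <= threshold))


# Counts of children choosing HMD over FSD, from Section 3.2.
for label, k, n in [("realistic", 48, 51), ("preferable", 36, 48), ("easier", 31, 49)]:
    print(label, k, n, binom_test_two_sided(k, n))
```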

Similarities and differences in FSD-EPELI performance and subjective experiences between laboratory and home testing were assessed with LMMs using each EPELI measure and Presence questionnaire item, one at a time, as the dependent variable, place of the assessment (lab/home) and time (first/second session) as fixed factors, and participant as a random factor.

The associations between EPELI efficacy measures and BRIEF were examined with bivariate correlations. Based on visual inspection, all distributions were near to normal and thus Pearson’s correlation coefficients were used. The correlations were calculated both for each EPELI version (HMD/FSD) and its corresponding BRIEF questionnaire, and for each EPELI session (first/second) and its corresponding BRIEF questionnaire.
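For readers who wish to relate a Pearson correlation to its test statistic, the following minimal sketch shows the standard relations t = r·√(df / (1 − r²)) and r = t / √(t² + df), with df = n − 2. The function names are hypothetical. As a consistency check, the value t (67) = 9.905 reported in Section 3.4 for the BRIEF GEC correlation between sessions converts back to r = 0.77 after rounding.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_to_t(r, n):
    """t statistic for testing r against zero, with df = n - 2."""
    df = n - 2
    return r * math.sqrt(df / (1 - r * r))


def t_to_r(t, df):
    """Recover the correlation from a t statistic and its df."""
    return t / math.sqrt(t * t + df)


print(round(t_to_r(9.905, 67), 2))
```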

The inter-version (FSD/HMD) and test-retest stabilities of EPELI were first assessed with bivariate correlation coefficients to allow the comparison with earlier literature. Then, intraclass correlations (ICCs) were calculated with single-rating, absolute agreement, two-way random effect models (ICC 2,1 in Martel et al., 2015), to account not only for the within-subject change but also for the differences in the group means between the versions. For ICCs, function ICC from package psych was used. To assess the effect of one factor (version or time) while controlling for the other but without accounting for the within-subject variation, partial correlations were also provided for both inter-version and test-retest correlations with the other factor as a covariate. The partial correlations were calculated with function pcor from package ppcor and were chosen as the primary correlation measures. All distributions in both EPELI versions were visually evaluated to be close to normal and Pearson’s correlation coefficients were used, except those of Total actions, which were strongly skewed to the right. To evaluate the effect of this skewness on the results, these distributions were successfully normalized with logarithmic transformations, and the inter-version and test-retest correlations were recalculated. As these results were practically identical (i.e., within ±0.01 units) to those obtained with the original measure, only the results with the original measures are reported.
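To make the difference between these stability indices concrete, the sketch below computes a single-rating, absolute-agreement, two-way random effects ICC from the usual ANOVA mean squares, and a first-order partial correlation from three bivariate correlations. It is an illustrative Python re-expression under textbook formulas, not the psych or ppcor code used in the study; the function names and the toy data are hypothetical.

```python
import math


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.
    `ratings` holds one row per participant and one column per measurement
    occasion (e.g., the two EPELI versions)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-participant mean square
    msc = ss_cols / (k - 1)              # between-occasion mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y with one covariate z partialled out."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Toy data: identical rank order, but the second occasion is shifted by +2.
same = [[1, 1], [2, 2], [3, 3], [4, 4]]
shifted = [[1, 3], [2, 4], [3, 5], [4, 6]]
print(icc_2_1(same), icc_2_1(shifted))
```

In the toy data the bivariate correlation between the two occasions is perfect in both cases, but the absolute-agreement ICC drops once the group means differ, which is the property that motivated its use here.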

3 Results

3.1 Task performance in FSD/HMD and learning effects

Table 1 shows the effects of version (FSD/HMD) and time (first/second) on EPELI task performance measures and related descriptive statistics. Children achieved higher Total scores, TBPM scores, and Task efficacies in the FSD version with small effect sizes. They made almost twice as many clock checks in the FSD version compared to the HMD version, which is in line with their better TBPM performance in the FSD version. To inspect this phenomenon further, we reran the analysis by using clock-viewing duration (i.e., the total duration of clock-viewing in seconds) as the dependent variable and found a medium-sized version effect (t (69.155) = 4.544, p < .001, d = 0.55). As Total score also includes the TBPM tasks, we did a post hoc analysis for Total score without the TBPM tasks. This analysis found both effects of version (t (67.40) = 2.642, p < 0.01, d = 0.32) and time (t (67.38) = 6.786, p < 0.001, d = 0.83), which suggests that the difference in Total score between the versions is driven not only by a better TBPM performance.

TABLE 1. The effects of version (FSD/HMD) and time (1st/2nd session) on EPELI task performance measures and related descriptive statistics.

In the second session, the children achieved higher Total scores (large effect size), higher TBPM scores (medium effect size), and higher EBPM scores (small effect size). They also performed more actions and navigated more efficiently, for which the effect sizes were small. However, Task efficacy did not change, indicating that they also did more irrelevant actions during the second session compared to the first. This is reflected in the fact that the number of irrelevant actions (i.e., actions that do not work towards given goals) as analyzed separately also increased from the first session to the second (t (70) = 3.501, p < 0.001, d = 0.40). Because learning effects were found in five variables, we checked with post hoc analyses if their magnitude was different depending on which version was performed first. The learning effect was larger after the HMD version than after the FSD version for Total score (mean change: after HMD 8.42, after FSD 3.18; t (63.72) = 3.477, p < 0.001, d = 0.44) and TBPM (mean change: after HMD 3.55, after FSD 1.00; t (64.50) = 3.395, p = 0.001, d = 0.42). For other measures, the learning effect was not affected by the version used in the first measurement.

3.2 Subjective experiences in FSD/HMD

The results of the Presence questionnaire, which was used to examine differences in subjective experience between the EPELI versions, are displayed in Table 2. The children evaluated that in the HMD version, the environment involved them more (small effect size), their experiences felt more consistent with the real world (medium effect size), they could concentrate better on the assigned tasks (small effect size), and the task seemed more interesting compared to the FSD version (small effect size). They also reported more problems in the display quality after the HMD than the FSD version (small effect size), but the problems were minor in both versions (HMD mean 2.22 and FSD mean 1.70 on a scale of 1–7). There were no differences between the two sessions, except that the children evaluated the tasks as appearing more interesting after the first session than the second (small effect size). The children reported very few potential cybersickness symptoms after both the HMD (mean sum 0.83 on a 0–14 scale) and FSD version (mean sum 0.56), and there was no difference between the versions (V = 358.5, p = 0.07). When asked to compare the two versions after the second EPELI session, most children evaluated the HMD version as being more realistic (48 out of 51, exact binomial test, p < 0.001) and preferable (36 out of 48, exact binomial test, p < 0.001) than the FSD version. The majority of the children (31 out of 49) also evaluated the HMD version as being easier to play, but this difference was not significant (exact binomial test, p = 0.09).

TABLE 2. The linear mixed models with the Presence questionnaire items as dependent variables and EPELI version (HMD/FSD) and time (1st/2nd session) as fixed factors.

3.3 Similarities and differences between experimenter-supervised laboratory testing and parent-supervised home testing

The groups who performed FSD-EPELI supervised either by an experimenter in the laboratory or by a parent at home displayed very similar results, as there were no group differences in task performance (Supplementary Table S3) or perceived presence (Supplementary Table S4). There were no differences regarding age, handedness, gender, parental education, or family income between the laboratory testing and home testing groups either (Supplementary Table S1).

3.4 Associations between EPELI efficacy measures and BRIEF

The correlations between EPELI efficacy measures and BRIEF across EPELI versions (FSD/HMD) and sessions (first/second) are shown in Table 3. BRIEF GEC correlates with both Task efficacy (r = -0.37) and Navigation efficacy (r = -0.33) in the first session, but not in the second. To interpret this result, we computed the correlation between BRIEF GEC in the two sessions and found that association to be strong (r = 0.77, t (67) = 9.905, p < 0.001). To evaluate how carefully the parents had considered their answers in each test session, we also compared the completion times and found that the parents had spent less time on the questionnaire in the second test session (median time between opening and closing the questionnaire: 9.00 min for the 1st session and 7.13 min for the 2nd session, U = 3166, p = 0.014). When the correlations are inspected one version at a time but including both assessment sessions, only Navigation efficacy is associated with BRIEF.

TABLE 3. Correlations of EPELI efficacy measures and BRIEF.

3.5 Inter-version correlations and test-retest stability

Inter-version and test-retest correlations for the eight EPELI measures are presented in Table 4, and distributions of the EPELI variables in both versions are shown in Figure 3. Regarding partial correlations across EPELI versions, the highest were found in Total score, Task efficacy, and Total actions (0.43–0.52), followed by Navigation efficacy, Controller motion, TBPM, EBPM, and Clock checks (0.29–0.40). The highest partial correlations across test sessions were obtained in Total score, Task efficacy, and Total actions (0.43–0.54), followed by Navigation efficacy, TBPM, EBPM, Controller motion (0.31–0.39). Clock checks was not correlated across test sessions. As the effects of version (HMD/FSD) and session (first/second) were also analysed for clock-viewing duration (see 3.1), we also calculated the inter-version and test-retest correlations for this measure and found it to be correlated both between test versions (partial r = 0.46, p < 0.001) and test sessions (partial r = 0.28, p < 0.05).

TABLE 4. EPELI measure intercorrelations between the HMD- and FSD-versions and the first and second sessions.

FIGURE 3. Distributions of EPELI variables in HMD and FSD. TBPM, Time-based prospective memory score. EBPM, Event-based prospective memory score.

4 Discussion

Rapid advances in VR display technology now allow researchers and clinicians to choose from a wider set of technical platforms than before. This has created a need to inspect the strengths and weaknesses of each platform and to compare the results they yield. To this end, the current study set out to compare the HMD and FSD versions of EPELI, a naturalistic task of goal-directed behavior.

Overall, the results attest to the viability of both hardware implementations, as task performance and subjective experience ratings were to a large extent comparable, and all task performance measures were correlated across the two versions. We also found some differences between the versions but with mostly small effect sizes. Most notably, children’s performance was somewhat better on FSD, while the HMD version was preferred and received better evaluations on several questions related to user experience. There were no differences between the parent-supervised home group and the examiner-supervised laboratory group in FSD-EPELI, which supports its feasibility for remote testing. Both versions are associated with parent-evaluated problems of executive function on the first assessment, but interestingly, not on the second one that took place several months later. All in all, both versions have their own benefits, such as the more sophisticated body movement tracking and higher immersiveness in the HMD version, and opportunities to improve cost-effectiveness and reachability via home-based testing with the FSD version. Below, we discuss each key finding in greater detail.

4.1 Task performance in the FSD and HMD versions of EPELI

Although the level of task performance was similar in the two versions, some modest but noticeable differences emerged. The children achieved higher Total and TBPM scores and task efficacies in the FSD version but with small effect sizes. This is in line with some previous research suggesting that even though HMDs might produce a superior feeling of presence, the use of FSDs can in some cases lead to better performance outcomes (Makransky et al., 2019; Barrett et al., 2022). This raises interesting questions regarding the role of immersiveness in the measurement of cognitive performance. Considering the case with EPELI, it is important to note that the task instructions are given orally, and the audio is delivered in a similar way between the versions (i.e., using stereo headphones, adequate volume, and by placing each sound source in the stereo image in the place that corresponds to its location in virtual space). The children reported that they experienced no problems in hearing the sounds in either version. Even though the dragon character can be seen talking, the facial expressions and mouth movements are not synchronized with the words, which means that looking at the dragon does not necessarily help in memorizing the instructions. Nevertheless, the children can look around (but not walk around) in the VR environment while listening to the instructions and might be more tempted to do so with HMD, as they report the HMD version to be more involving. This could mean that in the HMD version, they focus less on listening to the instructions and more on irrelevant but appealing visual stimuli in the environment, which would lead to better performance with FSD. During the execution phase that follows the instructions, as well as the instruction phase itself, the more immersive experience of the HMD version could lead them to be drawn more strongly to the irrelevant stimuli. 
Supported by their eye tracking data, Barrett and others (2022) speculate that just looking around might be more fun with HMD as compared to FSD, which would be consistent with this explanation. As there was no difference in reported effort between the versions, and as the children reported the tasks as more interesting when using HMD, lack of motivation during HMD performance is unlikely to explain the better performance in FSD. Therefore, a logical explanation for these inter-version differences could be that the higher immersiveness of the HMD version more easily disengages the participants from listening to instructions for the given tasks and performing them, which leads to worse performance.

Another potential reason for the differences in Total score, TBPM score, and Task efficacy between the versions lies in the differences in the control interfaces (i.e., using head movements and a hand controller vs. traditional devices, mouse/trackpad). However, several pieces of evidence render this explanation unlikely. First, the children reported very few problems with the control devices, with no differences between the two versions. Second, the children familiarized themselves with the controls during the demo section of EPELI, and it was ensured that all participants could perform the required actions as needed. Third, EPELI does not place heavy time constraints on the participant, as there is sufficient time to perform all required actions even at a relaxed pace, if one keeps focused on them and avoids getting into task-irrelevant behaviors that the naturalistic environment allows. This means that no quick actions or particularly skillful use of the control devices are needed to perform well in EPELI. Fourth, although inter-version differences favored FSD-EPELI in these three measures, most of the children evaluated that HMD-EPELI was easier to play. Thus, it is unlikely that the differences in the control interfaces would have played a major role in the results.

The previous studies comparing mental load between FSD and HMD conditions provide ground for speculating upon possible cognitive processes behind the inter-version differences. At least two studies that compared FSD and HMD conditions found higher mental load when using an HMD, either based on self-report (Brooks et al., 2017) or EEG responses (Makransky et al., 2019). In contrast with these findings, other studies have found no differences in self-reported mental effort and psychophysiological responses (Chang et al., 2020) or total EEG activation (Li et al., 2020). It should be noted that Li and others (2020) used quite narrow FOV (<20°) that was the same for both conditions, which might explain the lack of differences. In our study, a possible higher cognitive load with the HMD could have been induced by extraneous visual information due to its larger FOV and stereoscopic view. In EPELI, most visual stimuli in the environment are irrelevant to the tasks at hand and have the potential to distract the participant from performing these tasks. Therefore, in the HMD version the load on the bottom-up visual processes could be higher and thus might cause more interference with the top-down cognitive processes (e.g., working memory) required to perform the instructed tasks (see, e.g., Repovš & Baddeley, 2006). This line of thought is compatible with the interpretations given above.

The differences between the two hardware versions were particularly prominent in time monitoring. The number of clock checks in FSD-EPELI was almost double that in HMD-EPELI, which corresponds to a large effect size. At least three explanations could account for this finding. First, if the suggestion above regarding the higher bottom-up visual processing load with HMD is correct, fewer cognitive resources could be available for time monitoring in the HMD version. This explanation is compatible with previous research showing that increasing the cognitive demands of the ongoing task in a prospective memory paradigm can result in less active time monitoring (Khan et al., 2008). Second, it takes less effort to check the time in FSD-EPELI, as the watch can be viewed in the lower right corner of the screen with a single click. In HMD-EPELI, the participant needs to raise or turn their arm slightly and look towards the hand controller in virtual space to see the watch, like checking the time from a wristwatch. This might reduce the tendency to check the time. Third, while in HMD-EPELI the hand controller does not display the watch or a white circle until purposefully raised and then kept still for a second, in FSD-EPELI a white circle where the watch appears is displayed in the lower right corner of the screen even when the clock is not shown, which might serve as a cue for time monitoring (see Figure 1D). These three explanations do not rule each other out, and all might play some role behind the differential time monitoring in the two versions. Future research should establish which factors contribute most to the observed differences between the HMD and FSD versions.

One should note here that the inter-version effect of Total score remained even when TBPM score was subtracted from the Total score. Thus, the inter-version differences in Total score and Task efficacy are unlikely to be driven only by the more accurate time monitoring performance in FSD.

4.2 Learning effects between 1st and 2nd sessions

There was a large learning effect on the Total score and TBPM score and a small one on the EBPM score from the first session to the second, even though the interval between the sessions was long, over 7 months on average. Given that verbal word list or story learning tests are often reported to yield at least moderate learning effects when the same material is used in both sessions (e.g., Wechsler, 1997; Woods et al., 2006), it is not surprising to find some learning effects here. The learning effect might have been amplified by the fact that EPELI involves tasks that are acted upon rather than just orally repeated, as practice effects in neuropsychological tasks have been hypothesized to be related not only to declarative (e.g., remembering the test items) but also to procedural (e.g., remembering how to perform the test) memory (Duff, 2012). The learning effects observed in the present study did not cause ceiling effects in the second assessment and therefore do not compromise EPELI’s utility for test-retest settings. With a commonly used word list learning task, the California Verbal Learning Test, the learning effects are notably smaller when alternative materials are used in the second assessment (Woods et al., 2006). This suggests that with EPELI, alternative scenarios with different task instructions should be used in retest situations where a minimal learning effect is desired.

The children also performed more actions during the second assessment than in the first. This small effect is partly explained by a better Total score, but the amount of irrelevant behavior also increased between the sessions. It could be that when performing EPELI the second time, the children were more prone to experiment with the environment freely and less compelled to limit themselves only to the instructed tasks. However, this possible change from the first session to the second one was not reflected in their self-reported effort, which stayed constant between the sessions.

The children navigated more efficiently in the second session, which could reflect that they had become more familiar with the apartment through practice and were therefore able to plan their routes more efficiently. Even so, there was no difference in Task efficacy (i.e., the efficacy in interacting with the objects) between the sessions. This is because the children not only performed more of the given tasks in the second session than in the first, they also engaged more in extraneous, task-irrelevant behavior. However, based on more efficient navigation, this extraneous behavior did not include excessive walking around the apartment, but was present only in the interactions with the objects.

4.3 Subjective experiences

We found that when compared with FSD, the HMD environment was perceived as more involving, the experiences felt more consistent with the real world, the children reported that they were able to concentrate better on the given tasks, and the tasks seemed more interesting to them. When asked to compare the two versions directly, the children evaluated HMD-EPELI as being more realistic and preferable. These findings are consistent with studies on commercial games finding that HMD elicits a stronger sense of presence (see Caroux, 2023, for a meta-analysis) and immersion, and a greater arousal of positive emotions (e.g., Tan et al., 2015; Pallavicini et al., 2018; Pallavicini et al., 2019; Pallavicini & Pepe, 2019), as well as user satisfaction (Shelstad et al., 2017) than FSD-based hardware. Also Makransky and others (2019) reported that students felt more present in HMD than in FSD condition during a learning task. Using driving simulation with an embedded Stroop task, Chang and others (2020) found that students reported an HMD to be easier to use than an FSD. This was echoed in our data as most of the children evaluated HMD-EPELI as being easier to play than FSD-EPELI, although this difference was not statistically significant. In case this is a true effect, it could relate to interaction with the environment being more naturalistic in the HMD version than in the FSD version, which could be achieved with the position-tracked hand controller. For example, checking the time in the HMD version took place by looking at the controller similar to looking at a wristwatch and playing the drums by swinging the controller at them like using a drumstick, as opposed to performing these actions merely by clicking the mouse button.

Regarding the two sessions, the only difference in subjective experiences was that the children reported the tasks as being more interesting in the first session than in the second one. Still, their average evaluation as to how interesting the tasks felt remained high in the second session.

The children reported very few sickness symptoms after either version. This is an important finding given the negative association between cybersickness and the sense of presence (Weech et al., 2019), and it is in line with our earlier results, some of them obtained with a different HMD model, Pico Neo 2 Eye (Seesjärvi et al., 2022a; Seesjärvi et al., 2022b). Overall, the current HMD systems seem to be able to offer an enjoyable VR experience without cybersickness symptoms (Kourtesis et al., 2019). This is of course not to say that sickness symptoms would always be absent under all conditions. Using a target detection task involving flying, Brooks and others (2017) reported higher mental workload and discomfort in HMD compared to FSD. Given the results described above, their findings might stem from the fact that somewhat older and less sophisticated HMD hardware (NVISOR ST50) was used. Also, flying simulations might be more prone to cause sickness symptoms as compared to EPELI where the movements are self-paced and walking is done via teleporting. Therefore, when developing new VR tasks, researchers should continue to evaluate any possible sickness symptoms carefully and, if needed, modify their tasks to eradicate these symptoms.

The current study employed Oculus Go HMD hardware released in 2018, and more technically advanced models have since been introduced to the market. In a previous study, we employed both Oculus Go and a more advanced Pico Neo 2 Eye HMD and found no differences on the same Presence questionnaire that was used in this study, except for fewer problems with the hand controller for the Pico (Seesjärvi et al., 2022a). As these problems were on average very few for both models, the findings between the different HMD hardware we have used can be considered very similar. Taken as a whole, we expect the perceived presence to remain quite the same if similar HMD equipment with slightly different specifications is used. This being said, more realistic interaction methods, such as hand tracking-based object manipulation, walking based on a treadmill instead of teleporting, and augmented reality setups could lead to an enhanced feeling of “being there”. Further research on perceived presence is therefore warranted when such advancements are adopted.

Previous studies with different tasks provide useful insights on why different implementations of the same task could prove useful in different situations. The Multiple Errands Test was first developed to be performed in a real-life shopping precinct (Shallice & Burgess, 1991) and later modified for different settings, such as a hospital (see the review by Rotenberg et al., 2020). Later, several desktop FSD variations have been developed (e.g., Rand et al., 2009; Jovanoski et al., 2012; Raspelli et al., 2012; Cipresso et al., 2014); these have the benefit of greater experimental control but could be less ecologically relevant, being less representative of and more removed from everyday environments. Recently, Webb et al. (2021) created a simplified tablet version, OxMET, to be used as a brief screening tool. Even though this tablet version differs more from the original real-world MET than the desktop versions (e.g., it does not include any walking in a three-dimensional environment, but the participant navigates by touching pictures of shops in a cartoon shopping street), it has strong potential for its intended use, that is, as a screening tool for the executive problems that this kind of naturalistic paradigm aims to capture. For EPELI, the FSD version allows, for example, large-scale data collection, even though it might offer less natural sensorimotor contingencies and therefore be less immersive than the HMD version. The development of the VR-EAL, which is a neuropsychological test battery implemented by using immersive HMD-VR (Kourtesis et al., 2021), provides another interesting comparison point on this theme. The VR-EAL, which has been developed for adults, has some qualities that might enable it to provide better sensorimotor contingencies than the current EPELI HMD version, as it uses a combination of physical movement and teleportation as a navigation method, and it is performed in an upright position instead of a sitting position.
For safety reasons (see Seesjärvi et al., 2022a), we have chosen to use only teleportation and a sitting position with school-aged children. The interface system of the VR-EAL reflects some of the more advanced possibilities of immersive VR systems, and a possible FSD version of the same paradigm using a keyboard or mouse as the interaction method might be markedly less immersive. However, such a version could surely be implemented, and would probably provide new use cases for the VR-EAL as well.

4.4 Comparisons between laboratory and home testing

The fact that there were no differences between laboratory and home testing in task performance or subjective presence ratings supports the feasibility of parent-supervised remote testing. In line with this, Zuber et al. (2021) found that the laboratory and online versions of a prospective memory task, Geneva Space Cruiser, yielded similar results in an adult lifespan sample. The findings of Backx et al. (2020) with several tests from Cambridge Neuropsychological Test Automated Battery are also promising for remote testing, as no differences were found in the performance indices between laboratory and home testing. It should be noted that Backx et al. (2020) found the reaction times to be slower at home than in the laboratory, which they suggested was caused by variation in the computer hardware. This should be kept in mind when new measures are developed for EPELI, but as the present EPELI performance variables did not contain any reaction time measures or other indices that would be very sensitive to subtle variations in time measurement, it is not a concern here.

Thus, the present study shows that laboratory and remote testing can produce comparable results also in naturalistic tasks such as EPELI. It is also important to note that the present participants were children who could be more prone than adults to perform differently when not supervised by an experimenter. As the COVID-19 pandemic has shown, unexpected events with tremendous impacts on societies are possible and can challenge the routines of scientific research and clinical work. Hence, the present findings on the feasibility of remote testing are very timely and have broader relevance for cognitive assessment. As remote assessments save time and resources both for the assessor and the assessee, they are very likely to become even more common in future.

4.5 Associations between EPELI efficacy measures and BRIEF

The current study indicates that when administered for the first time, FSD-EPELI efficacy measures are also associated with parent-rated problems of executive function (BRIEF), as previously shown for HMD-EPELI (Seesjärvi et al., 2022a; Seesjärvi et al., 2022b). However, we were surprised to find that these associations disappeared in the second assessment. As BRIEF was strongly correlated between the sessions and parents seem to have taken, on average, adequate time to fill out the questionnaire on both occasions, this change was likely caused mostly by a change in children’s behavior in EPELI from the first session to the second. As children engaged in more actions in the second session, one possible explanation is that in the second session even those children who do not exhibit executive problems in everyday life resort to extraneous behavior more easily, which makes Task efficacy less representative of these problems. Another possibility is that it is the novelty of the task that makes Task efficacy representative of executive function problems in the first session. This relates to the finding that the involvement of executive functions is considered to be highest when the task is new (Rabbit, 2004). Barrett et al. (2022) speculate that with more exposure to a VR interface beyond the first session, the novelty and thereby the initial enthusiasm would diminish and alter the results in the following sessions. Whatever the explanation turns out to be, these results should be kept in mind when using EPELI in test-retest settings or in longitudinal studies.

4.6 Inter-version correlations and test-retest stability

Earlier, we showed that HMD-EPELI has acceptable internal consistency in six out of eight measures (Seesjärvi et al., 2022a). The internal consistency was highest (Cronbach’s α = 0.83–0.88) for the measures with the most data points, i.e., Controller motion, Total actions, and Task efficacy, which is closely related to Total actions. The internal consistency was acceptable for Total score, Navigation efficacy, and Clock checks (α = 0.70–0.74). Here, all six of these measures were associated between HMD- and FSD-EPELI (partial r = 0.29–0.52), and all except the number of clock checks were correlated between sessions (partial r = 0.31–0.54), which attests to their stability across the two versions and across a time interval that on average spanned more than 7 months.
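For readers less familiar with the internal consistency index referred to above, Cronbach’s α can be written as α = k/(k − 1) · (1 − Σ s²ᵢ / s²ₜ), where k is the number of items, s²ᵢ the variance of item i, and s²ₜ the variance of the summed score. The following is a minimal sketch of that formula in Python. The function name and the example data are hypothetical and are not taken from the EPELI analyses, and how the EPELI measures were split into items is not restated here.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items scored on the same n participants.

    `items` is a list of k lists; items[i][j] is participant j's score on
    item i (hypothetical layout, chosen only for this illustration).
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are needed")
    # Summed score per participant across all items.
    totals = [sum(scores) for scores in zip(*items)]
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


if __name__ == "__main__":
    # Made-up scores: three items, five participants.
    example = [
        [3, 4, 5, 2, 4],
        [2, 4, 5, 1, 3],
        [3, 5, 4, 2, 4],
    ]
    print(round(cronbach_alpha(example), 2))
```

By the commonly used convention that the text also follows, values of about 0.70 or higher are read as acceptable internal consistency.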

As a comparison point, Backx et al. (2020) reported correlations of ρ = 0.39–0.73 between laboratory and home assessment sessions 1 week apart with several indices from the Cambridge Neuropsychological Test Automated Battery. Using a prospective memory task with a typical dual-task paradigm, Zuber et al. (2021) found correlations of r = 0.56–0.68 between laboratory and home assessment sessions 1 week apart for three indices (ongoing task score, prospective memory performance, time monitoring), and correlations of r = 0.66–0.78 for another sample where two sessions were done in the laboratory 1 week apart. Using the somewhat more complex prospective memory task of Virtual Week, Mioni et al. (2015) found varied correlations (r = 0.13–0.74) with a time interval of 1 month. These earlier studies indicate that the test-retest correlations can vary considerably based on, among other things, task complexity, type of the measure, and environment (laboratory/home).

When considering test-retest stability, it should be noted that although the two EPELI versions are highly similar, they are not identical, unlike the task versions in some of the other studies referred to here (Backx et al., 2020; Zuber et al., 2021). Also, the average time interval between the sessions was exceptionally long, spanning more than 7 months. Test-retest stability has been found to decrease with increasing test-retest intervals (Duff, 2012), as the correlation might then reflect not only measurement error but also true change. It should also be noted that the complex nature of executive function tasks, which involve multiple cognitive processes, could make them more prone to performance variability (Delis et al., 2004). To acquire an estimate of the test-retest stability of each EPELI version that would be more comparable to the earlier literature, future studies should be conducted using only one version at a time, doing both sessions in the laboratory, and with a markedly shorter time interval between the sessions. Even though the long interval between the sessions makes comparison with the earlier literature more difficult, the current study has the benefit of showing that such test-retest correlations exist even after such a long delay.

Time monitoring, when measured as the number of clock checks, was correlated between the versions, but not between the sessions. It is worth noting here that the clock checking mechanism differs between the versions (raising the controller and looking at it in the HMD condition, pressing a button to view the time in the lower right corner of the screen in the FSD condition), and that only in the FSD version does a white circle remain in the lower right corner of the screen where the clock will appear, which can work as a cue for time checking. Also, it has been found that time perception can be compressed in an HMD condition as compared to an FSD condition (Mullen & Davidenko, 2021). If this phenomenon varies from individual to individual, it can weaken the correlation of time monitoring measures in the two conditions. Children might also engage in different time monitoring strategies in immersive VR environments and conventional FSD conditions. Further test-retest studies using only one of the two versions will reveal to what extent the lack of test-retest correlation in the number of clock checks here is tied to the differences between HMD and FSD conditions. Lastly, it should also be noted that the duration of clock-viewing correlated both between the versions and between the sessions. Therefore, in a naturalistic prospective memory task like EPELI, using clock-viewing duration instead of the number of clock checks might be a more robust way to measure time monitoring behavior.

Regarding prospective memory, both TBPM and EBPM scores were correlated between the versions and sessions (partial r = 0.30–0.32). Thus, even though the internal consistency of these measures was earlier found to be poor (Seesjärvi et al., 2022a), it appears that these measures are relatively stable across time. Using Virtual Week, which is another attempt at a more ecologically valid test for prospective memory, Mioni et al. (2015) reported lower or similar test-retest correlations for EBPM (r = 0.13–0.40) but higher for TBPM (r = 0.58–0.74). Further attempts should be made to improve the psychometric properties of naturalistic prospective memory tasks, as prospective memory is important for self-dependent everyday functioning.
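As an illustration of what a partial correlation of the kind reported in this section expresses, the first-order case can be computed from three Pearson correlations as r_xy·z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)). The sketch below assumes a single covariate. The function names, the choice of age as the covariate, and all data values are hypothetical, as this section does not restate which variables were controlled for in the analyses.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_r(x, y, z):
    """Correlation between x and y after controlling for one covariate z."""
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


if __name__ == "__main__":
    # Made-up data: a score in session 1, the same score in session 2,
    # and age in years as a hypothetical covariate.
    session_1 = [4, 6, 5, 8, 7, 9]
    session_2 = [5, 6, 6, 7, 9, 8]
    age = [9, 10, 10, 12, 12, 13]
    print(round(partial_r(session_1, session_2, age), 2))
```

When the covariate is unrelated to both scores, the partial correlation equals the ordinary Pearson correlation. When both scores increase with the covariate, as scores and age do in the made-up data above, part of the raw association is removed.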

4.7 The future potential of EPELI for healthcare settings

Considering the potential future clinical use of EPELI, some discussion regarding the steps already taken in this direction is in order. In a joint position paper of the American Academy of Clinical Neuropsychology and the National Academy of Neuropsychology, Bauer et al. (2012) identify eight key issues relevant to the development and use of computerized neuropsychological assessment devices in healthcare settings. These eight issues concern marketing and performance claims; end-user requirements; hardware/software/firmware issues; privacy/data security/identity verification/testing environment; reliability and validity; cultural/experiential/disability factors; use of computerized testing and reporting services; and the need to control for response validity and effort. Regarding reliability, we have previously shown with the HMD version that most EPELI measures show acceptable internal consistency (Seesjärvi et al., 2022a). As regards validity, EPELI shows predictive and discriminant validity in differentiating between children with ADHD and typically developing controls (Merzon et al., 2022; Seesjärvi et al., 2022b) and ecological validity (veridicality) by correlating with parent-rated problems of everyday executive function both in children with ADHD (Seesjärvi et al., 2022b) and typically developing children (Seesjärvi et al., 2022a). The children reported only negligible cybersickness symptoms in this and previous studies (Merzon et al., 2022; Seesjärvi et al., 2022a; Seesjärvi et al., 2022b), and all were able to learn the controls and perform the whole task. The current study shows that the FSD version has potential to be used remotely, as the performance and subjective ratings of home and laboratory groups were equal. However, several issues remain to be addressed. The marketing claims of any potential product should be based on solid scientific findings. The end-user (i.e., the assessor) requirements should be defined while considering the required knowledge of psychological assessments and technical competence, and the results should be presented in a clear format that is easy to interpret. Any online data storage needs to be implemented using proven and secure platforms. Regarding this, a fully online study that used an adult version of EPELI and the Microsoft Azure platform has already been successfully performed (Jylkkä et al., 2023). In case EPELI versions for other languages, other age groups (e.g., older adolescents), and clinical diagnoses (e.g., brain injury or dementia) are developed, their feasibility should be examined separately. Kourtesis and MacPherson (2021) have pointed out in their work with VR-EAL and young adults that immersive VR paradigms have the potential to meet all the key criteria mentioned by Bauer et al. (2012). Similar work that considers all eight issues simultaneously should be pursued for any potential healthcare version of EPELI. To our knowledge, such work would be the first to consider all these aspects in a task that can be used to study goal-directed behavior of children in immersive VR. Kourtesis and MacPherson (2021) suggest that a future version of VR-EAL should consider hand and head movement to evaluate whether the examinee is motivated to engage with the given tasks. This should be attempted with EPELI as well, as the effort level has been found to substantially affect performance on neuropsychological tests (Constantinou et al., 2005; Stevens et al., 2008; West et al., 2011).

4.8 Limitations and future directions

As with all research, this study has its limitations. The delay between the two assessment sessions was unusually long, on average over 7 months, owing to restrictions imposed by the COVID-19 pandemic. This hinders the comparison with earlier literature, as the present inter-version and test-retest correlations could be weaker and learning effects more modest than in studies with considerably shorter time intervals. With a shorter delay, the inter-version associations could be evaluated more accurately. On the other hand, the delay employed here comes closer to the typical minimum interval between clinical neuropsychological assessments, which is usually at least a year in children, even though we are not aware of empirical data that would allow the development of guidelines for generally acceptable minimum test-retest intervals in clinical settings (see Heilbronner et al., 2010). Therefore, different time intervals come with different strengths and drawbacks. In the current study, the primary aim was to study the associations between the two versions using a within-subject design and at the same time acquire some estimates of test-retest stability. To acquire true test-retest correlations for each EPELI version, further studies should be conducted with a single version repeatedly taken by each participant. As for many other neuropsychological tests tapping memory, it could be that an optimal version of EPELI for multiple measurements within the same individual would require parallel versions, that is, several task sets of equal difficulty.

In future, the FSD version should also be employed to study clinical groups, such as children with ADHD and autism spectrum disorders. Our previous research has shown robust and distinctive differences between children with ADHD and matched controls in HMD-EPELI (Merzon et al., 2022; Seesjärvi et al., 2022b). As the current study attests to the feasibility of children’s FSD-EPELI for parent-supervised remote testing, an online study with a markedly larger dataset could be pursued. In this first study with the FSD version, all families that performed it remotely at home could do so with the given instructions, and no differences were observed between the groups that performed FSD-EPELI at home (supervised by a parent) and in the laboratory (supervised by a researcher). The children also gave high ratings on questions regarding their enthusiasm, how interesting EPELI was, and how much effort they put into their performance. The ratings concerning the display/control device quality were also favorable for both versions. However, the usability and acceptability of the different EPELI versions were not fully probed, which should be done in future studies.

New technologies continue to emerge rapidly, which calls for continuing the research on human-computer interfaces. This study was limited to two technical configurations, while many possible alternatives remain to be tested. Thus far, we have chosen to ask the children to perform HMD-EPELI in a sitting position, as during the piloting stages of our earlier studies (Seesjärvi et al., 2022a; Seesjärvi et al., 2022b) especially younger children with no prior VR experience had problems playing in a standing position (e.g., they tried to reach for something to lean on) and some reported feeling slightly dizzy (see Seesjärvi et al., 2022a). This decision, made to ensure participant safety, renders performance on the HMD and FSD versions more similar, but a standing position could improve sensorimotor contingency in HMD-EPELI. As even more natural VR technologies and human-computer interfaces emerge, their benefits for naturalistic cognitive tasks should be examined, too. Examples of these technologies include hand position tracking for interacting with the environment without any additional hand controller and various augmented reality (AR) technologies that allow researchers to incorporate real-world and virtual elements into their studies. One of the key rationales for using VR for naturalistic cognitive tasks is to be able to mimic the environments and functions of everyday life as closely as possible while being able to measure behavior accurately. As pointed out by Slater and Sanchez-Vives (2016), “VR is different from other forms of human-computer interfaces since the human participates in the virtual world rather than uses it”, and eventually there will be a paradigm shift with new ways of presenting tasks. Hopefully, these new advancements will be embraced by the research community to develop new task versions with even greater clinical and research utility.

4.9 Conclusion

The current study fills an essential gap in the literature as, to our knowledge, it is the first study to compare FSD and HMD implementations of a naturalistic, open-ended task. This is particularly important, as naturalistic tasks might become the hallmark of VR-based cognition research by taking advantage of the technology’s benefits to the fullest (Parsons et al., 2017). Our results show great similarity between the results acquired with the FSD and HMD versions of EPELI, but also distinctive strengths and benefits associated with each version. This information is beneficial not only for the future use of EPELI, but also for researchers developing other naturalistic VR tasks. The feasibility of FSD-EPELI for remote testing also received support. The issue of remote testing is very timely, as online testing is nowadays common as a cost-effective and flexible alternative to traditional laboratory-based research. We hope that this study will for its part advance naturalistic cognitive research, which has huge potential to broaden our understanding of human goal-directed behavior.

Data availability statement

In compliance with the research permission by the Ethics Committee of the Helsinki University Hospital, supporting data for this study is not available due to patient confidentiality restrictions. Requests to access the datasets should be directed to erik.seesjarvi@helsinki.fi.

Ethics statement

The studies involving human participants were reviewed and approved by the Ethics Committee of the Helsinki University Hospital, Helsinki, Finland. Written informed consent to participate in this study was provided by the participants’ legal guardian/next of kin.

Author contributions

JS, ES, and ML designed the experiment. The EPELI task was designed with equal contribution by JS, ES, and ML. ES and KK recruited the participants and collected and preprocessed the data. ES analyzed the data. ES, JS, and ML wrote the manuscript, which was commented on and complemented by KK and agreed on by all authors.

Funding

The study was supported by the Academy of Finland (grants #325981, #328954, and #353518 to JS, grant #323251 to ML). ES received support from the Finnish Cultural Foundation (grant #00201002), the Arvo and Lea Ylppo Foundation (grant #202010005), and the Instrumentarium Science Foundation (grant #200005).

Acknowledgments

We thank Sascha Zuber for reading an earlier version of the manuscript and providing many beneficial comments on it.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frvir.2023.1138240/full#supplementary-material

References

Alapakkam Govindarajan, M. A., Archambault, P. S., and Laplante-El Haili, Y. (2022). Comparing the usability of a virtual reality manual wheelchair simulator in two display conditions. J. Rehabilitation Assistive Technol. Eng. 9, 205566832110671. doi:10.1177/20556683211067174
