System Immersion in Virtual Reality-Based Rehabilitation of Motor Function in Older Adults: A Systematic Review and Meta-Analysis

Background: As the elderly population continues to grow, so does the demand for new and innovative solutions to tackle age-related chronic diseases and disabilities. Virtual Reality (VR) has been explored as a novel therapeutic tool for numerous health-related applications. Although findings frequently favor VR, methodological shortcomings prevent clinical recommendations. Moreover, the term “VR” is frequently used ambiguously to describe, e.g., video games, and the distinction between immersive VR (IVR) systems and non-immersive VR (NVR) systems remains vague. With no distinct demarcation, results of outcome measures are often pooled in meta-analyses without accounting for the immersiveness of the system.
Objective: This systematic review focuses on virtual reality-based rehabilitation of older adults (+60) in motor rehabilitation programs. The review aims to retrospectively classify previous studies according to the level of immersion, in order to get an overview of the ambiguity-phenomenon, and to utilize meta-analyses and subgroup analyses to evaluate the comparative efficacy of system immersion in VR-based rehabilitation.
Methods: Following PRISMA guidelines, we conducted a systematic search for randomized controlled trials describing virtual rehabilitation or video game interventions for older adults (+60). Main outcomes were pain, motivation, mobility, balance, and adverse events.
Results: We identified 15 studies, which included 743 patients. Only three studies utilized IVR. The rest used various NVR-equipment, ranging from commercial products (e.g., Nintendo Wii) to bespoke systems that combine tracking devices, software, and displays. A random effects meta-analysis of 10 studies analyzed outcome measures of mobility, balance, and pain. Protocols and dosage varied widely. Outcome results favored both immersive and non-immersive interventions; however, dropout rates and adverse events mostly favored the control.
Conclusions: We issue a call to action to distinguish between types of VR-technology, and propose a taxonomy of virtual rehabilitation systems based on our findings. Most interventions use NVR-systems, which cause demonstrably fewer cybersickness symptoms than IVR-systems. Therefore, adverse events may be under-reported in RCT-studies. An increased demand for IVR-systems highlights this challenge. Care should be taken when applying the results of existing NVR tools to new IVR-technologies. Future studies should provide more detail about their interventions, and future reviews should differentiate between NVR and IVR.


INTRODUCTION
By 2050 the world population is projected to reach 9.7 billion people, with older adults ≥65 accounting for approximately one fifth (1.7b) (United Nations, Department of Economic and Social Affairs). Increased life-expectancy implies a higher risk of developing various chronic diseases, including cardiovascular diseases, cancer, dementia, osteoarthritis, and stroke (Christensen et al., 2009; Fontana and Hu, 2014; Kennedy et al., 2014). Consequently, the diagnosis and treatment of these chronic diseases, which often require special care or hospitalizations, lead to rising expenditures for healthcare systems around the world (United Nations, Department of Economic and Social Affairs). To counter these trends, increased physical activity has been suggested, as regular exercise provides multiple health benefits and reduces the risk of developing chronic diseases (Duncan et al., 2010; Anderson and Durstine, 2019; De la Rosa et al., 2020).
Despite evidence for the health benefits of keeping active, low motivation is often a challenge when seeking to counteract physical inactivity and sedentary lifestyles through exercise programs (Teixeira et al., 2012). In the context of rehabilitative interventions, outcomes and recovery often depend on the patient's motivation, and programs consequently suffer from low adherence. This has been identified as a challenge within different fields of rehabilitation, including pulmonary rehabilitation (Bourbeau and Bartlett, 2008; Salinas et al., 2011), acute stroke (Maclean et al., 2000), and diabetes (Rizzo et al., 2011).
Novel technologies such as active video games and virtual reality (VR) technologies, when used appropriately, have the potential to solve some of the challenges with low motivation and adherence. However, implementation into clinical practice has not yet been fully realized. On the other hand, VR has seen a commercial breakthrough within the last 5 years, and has steadily increased the technological awareness of consumers and health professionals (Keshner et al., 2019). Within the field of rehabilitation, the therapeutic effects and value of VR technologies have been evaluated and scrutinized for over two decades, often under the general term of Virtual Rehabilitation (Burdea, 2003; Tieri et al., 2018). Within this highly specialized and diverse field, VR has successfully been applied to rehabilitation for adults with simple phobias (Rothbaum et al., 2006; Parsons and Rizzo, 2008; Powers and Emmelkamp, 2008; Maples-Keller et al., 2017); post-traumatic stress disorder (PTSD) (Rothbaum et al., 2001; Difede et al., 2007; Kothgassner et al., 2019); acute and chronic pain treatment (Gold et al., 2005; Hoffman et al., 2008; Li et al., 2011; Pourmand et al., 2018; Matamala-Gomez et al., 2019; Wittkopf et al., 2020); and post-stroke treatment, brain injury, and various other forms of neurological disorders (Rizzo et al., 2004; Rose et al., 2005; Stewart et al., 2007; Laver et al., 2017; Karamians et al., 2020).
For motor rehabilitation, as an example, advantages of the immersive characteristics of VR include how the sense of presence can induce an illusion of virtual body ownership and agency through multisensory feedback (Kilteni et al., 2015). The sensorimotor loops needed for motor rehabilitation can be strengthened through the introduction of a virtual context, which connects relevant cognitive associations to otherwise isolated repetitive motor tasks (Tieri et al., 2018). This is highly relevant in rehabilitation that aims to re-establish cognitive processes such as motor skills, for instance in stroke patients (de Bruin et al., 2010). For geriatric rehabilitation, virtually augmented exercise is similarly proposed to influence cognitive abilities, for instance in cases including dementia (Garcia-Betances et al., 2015).
Nevertheless, VR remains an umbrella term within the field of rehabilitation, used to describe many and vastly different technologies, from "non-immersive" single desktop displays to "immersive" high fidelity motion-sensing input devices and wearable technologies such as head-mounted displays (HMDs) (Tieri et al., 2018). Hardware aside, the software solutions used to study the efficacy of "VR-based" rehabilitation (VRBR) (Burdea, 2003) are equally diverse. Hence, attempting to define VR entails a certain ambiguity across a large body of research. However, interventions rarely use immersive VR (IVR)-technology as a facilitator (Tieri et al., 2018).

Current Systematic Reviews
In systematic reviews exploring the efficacy of virtual systems, VR is likewise ambiguously defined, and is frequently equated with the use of commercial non-immersive consoles such as Nintendo Wii (Donath et al., 2016). Systematic reviews have explored the use of VR for improving mobility and balance (Donath et al., 2016; Neri et al., 2017; Amorim et al., 2018; Porras et al., 2018), physical functioning (Molina et al., 2014) and, in general, health-related domains (Miller et al., 2014). However, included articles frequently only describe interventions using NVR; again, most commonly using the Nintendo Wii (Miller et al., 2014; Molina et al., 2014; Amorim et al., 2018; Reis et al., 2019). For example, of the 10 articles included in Amorim et al.'s review (Amorim et al., 2018), six used the Nintendo Wii console, while the remaining used Playstation EyeToy (n = 1), Xavis measured step system (n = 1), or bespoke systems with pressure mats or balance boards (n = 2). Likewise, of the 13 articles included in the review by Molina et al. (2014), most used Nintendo Wii (Fit) (n = 8), while the rest used the Balance Rehabilitation Unit from Medicaa (n = 1), or video games or bespoke systems (n = 4). Indeed, in a recent review by Karamians et al. (2020), exploring the effectiveness of VR and gaming-based interventions for upper extremity (UE) post-stroke rehabilitation, only three of the included 38 articles described IVR technology. And while the authors are well aware of the distinguishing features of VR-systems (Karamians et al., 2020), this crucial differentiation may easily be lost if the review is included in future syntheses. While findings frequently demonstrate a significant improvement in favor of virtual rehabilitation (for example Neri et al., 2017, P < 0.01), the quality of the evidence is often low with a high risk-of-bias (RoB) (Laver et al., 2012; Donath et al., 2016; Neri et al., 2017; Amorim et al., 2018).
Therefore, the need remains to explore the efficacy of virtual rehabilitation in larger and better controlled studies.
Previous attempts have sought to delimit VR by simply referring to devices which utilize immersive technology (Iosa et al., 2012; Rizzo and Koenig, 2017; Tieri et al., 2018). However, "immersion" has likewise seen its share of ambiguous usage, and is often confused with related terms, such as presence (Nilsson et al., 2016). VR-systems of high fidelity (e.g., HMDs) are usually referred to as fully immersive VR, or simply immersive VR (IVR). Lower fidelity systems are in these cases mostly referred to as non-immersive VR (NVR). For clarification, we will outline these aspects before describing the review's methodology.

Defining Immersion
VR can be described as a computer-generated interactive virtual environment. The defining feature separating VR from traditional media is arguably VR's ability to give users a compelling illusion of "being there" in virtual environments. This illusion is often referred to as presence or place illusion (Slater, 2009), and has been described as the subjective correlate of immersion (Slater and Sanchez-Vives, 2016). Place illusion describes the subjective experience of a user, whereas immersion relates to objective characteristics of the system used to deliver this experience. The more immersive a system is, the higher degree of presence it can elicit. Immersive systems have been characterized based on the sensorimotor contingencies (SCs) they support (Slater, 2009). SCs are the actions a person can perform in order to perceive the world (e.g., changing one's gaze direction by moving the head or eyes, or kneeling to see underneath something). The level of immersion supported by a given VR-system depends on how well it supports normal SCs. Therefore, in this review, when discussing immersion, we operate with the term "system immersion" (Nilsson et al., 2016). A number of factors related to both displays and tracking can affect system immersion. However, for the purpose of the current review, we adopt a simple dichotomous categorization with respect to immersion. Level of immersion is distinguished between two broad categories of systems: immersive systems and non-immersive systems. Immersive systems allow users to view virtual content in all directions (i.e., they have an unlimited field of regard, FOR), even though the field of view (FOV) usually is smaller than the user's visual field. In contrast, non-immersive systems only offer a limited FOR and a limited FOV (e.g., screen- and projection-based systems).

Specific and Non-specific Systems
When the Nintendo Wii launched in 2006, it quickly became an affordable closed system that supported physical activity with games and entertainment, with researchers soon after applying it to physical therapy programs (Deutsch et al., 2008). This caused a shift from bespoke systems (i.e., software and hardware solutions created for specific users and contexts) toward commercially available solutions (Keshner et al., 2019). A recent systematic review exploring types of VR applications within rehabilitation characterizes these different systems dichotomously as either specific (systems specifically built for rehabilitation) or nonspecific (i.e., computerized systems meant for recreational activities and gaming) (Maier et al., 2019). However, systems can be simultaneously commercial and specific. This is evident from the increasing number of companies developing high-end equipment, where gamification principles are embedded into the therapy (IREX, VRRS-systems, and others; Maier et al., 2019). Systems can also be custom-built from existing hardware and software, tailored for specific needs (i.e., bespoke systems). We argue that a distinction has to be made between commercial and bespoke systems, since low availability and accessibility of certain VR-systems challenge the reproducibility of findings or clinical applications. This is most commonly a trait of bespoke systems, which are usually developed in closed ecosystems, specifically to solve contextual challenges. Conversely, commercially available "off-the-shelf" systems can more reliably reproduce results. This means that cross-study analyses would gain more homogeneous data sets, and that any heterogeneity found in, e.g., meta-analyses could more confidently be attributed to sampling error.
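The three distinctions drawn so far (immersive vs. non-immersive, specific vs. non-specific, commercial vs. bespoke) can be read as independent axes along which a system is described. The following is only a minimal sketch of such a record, not the taxonomy proposed by this review; the class names, field names, and the `label` helper are hypothetical, and the single example reflects how the Nintendo Wii is characterized in the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Immersion(Enum):
    IMMERSIVE = "IVR"       # unlimited field of regard (e.g., HMD)
    NON_IMMERSIVE = "NVR"   # limited field of regard (e.g., screen, projection)


class Purpose(Enum):
    SPECIFIC = "specific"          # built specifically for rehabilitation
    NON_SPECIFIC = "non-specific"  # built for recreation and gaming


class Availability(Enum):
    COMMERCIAL = "commercial"  # off-the-shelf
    BESPOKE = "bespoke"        # custom-built for a given context


@dataclass(frozen=True)
class VRSystem:
    """Hypothetical record describing one system along the three axes."""
    name: str
    immersion: Immersion
    purpose: Purpose
    availability: Availability

    def label(self) -> str:
        # Compact label for tables or subgroup definitions
        return f"{self.immersion.value}, {self.purpose.value}, {self.availability.value}"


wii = VRSystem("Nintendo Wii", Immersion.NON_IMMERSIVE,
               Purpose.NON_SPECIFIC, Availability.COMMERCIAL)
print(wii.label())  # NVR, non-specific, commercial
```

Keeping the axes separate, rather than collapsing them into one category, lets a system be both commercial and specific, which the dichotomy of Maier et al. (2019) alone does not express.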

The Potential Challenges of Ambiguous Classifications
The caveat to IVR, and a main reason why a clear distinction is important for systematic reviews, is that the technology leads to demonstrably higher levels of side-effects when compared to conventional displays (Sharples et al., 2008; Kim et al., 2014; Dennison et al., 2016; Chang et al., 2020). These side effects are also known as VR-sickness, cybersickness, VR-induced symptoms and effects (VRISE) (Sharples et al., 2008), or visually induced motion sickness (Rebenitsch and Owen, 2016). In a study from 2008, Sharples et al. compared side-effects between different display technologies, including HMD, desktop monitor and projection screens (Sharples et al., 2008). The results indicated a significant increase in nausea symptoms when using HMD, compared to desktop and projection screens. Technology has progressed substantially since 2008, including improved frame rates and refresh rates to accommodate human-eye resolution and sensorimotor contingencies (LaViola, 2000). However, cybersickness remains an unsolved problem with IVR technology. A commonly used measure of VR-sickness is the simulator sickness questionnaire (SSQ) (Kennedy et al., 1993). While ironically not developed for IVR, it is a frequently used, standardized and validated measure of the severity of symptoms related to nausea, oculomotor disturbances and disorientation while using a VR-system. It has also previously been used to measure adverse events related to VRBR-use (Dahdah et al., 2017), although it is often reported incorrectly. Additionally, it has been suggested that only IVR-technology should be defined as VR (Tieri et al., 2018). More specifically, solutions utilizing non-immersive technologies to facilitate an immersive and interactive digital environment. We agree with Tieri et al. (2018); therefore, another aim of our work is to propose a model to better classify the use of VR-equipment in clinical contexts.
This review seeks to distinguish between the broader uses of VR, which encompass non-immersive VR (NVR), for example video games and consoles such as Nintendo Wii, and the more discrete use of IVR, where the "immersion" is a property of the technical system (Nilsson et al., 2016), such as with HMDs. Like Tieri et al. (2018), we believe taxonomic consistency is more pertinent now than it was previously, as the availability of commercial IVR-systems will continue to increase the demand for clinical applications. Paradoxically, the evidence in favor of the safety, affordance, feasibility, efficacy, and implementation within clinical use is still in its infancy. Furthermore, a classification of VRBR solutions for clinical application is needed to frame such evidence and to give practitioners suitable awareness before including solutions into daily practice. Since the geriatric population is the largest group with rehabilitation needs, this is a good place to commence.

Research Questions
This review focuses on VRBR of older adults (+60) in motor rehabilitation programs. The aim of this review is to:
1. Retrospectively classify previous studies according to level of immersion, in order to get an overview of the ambiguity-phenomenon.
2. Utilize meta-analyses and subgroup analysis to determine outcome effect variations between IVR and NVR.
3. Evaluate the comparative effectiveness of system immersion in IVR and NVR systems.
4. Analyze comparative risks and adverse events between IVR and NVR systems.

METHODS
The systematic review protocol was registered in PROSPERO (ID: CRD42019121172), and the reporting of the review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines (Liberati et al., 2009).

Information Sources
The systematic search was undertaken on the following databases: PubMed/MEDLINE, Web of Science, CINAHL, Cochrane Library, and EMBASE to find articles describing randomized controlled trials (RCT), published in English, Danish, Swedish, or Norwegian.

Search
The search strategy was developed and approved by all authors. The searches were performed by SMF and SWH, who (1) extracted studies from the databases into EndNote; (2) performed duplicate removal; and (3) uploaded them into Covidence for screening. All databases were searched from inception to April 30, 2020. Search strings were adapted to fit each database individually using boolean search operators, and limited to Human studies and Randomized Controlled Trials whenever possible. Search themes included rehabilitation or physical therapy using virtual reality, "exergames" or video games. Video games were included to keep the search broad, rather than restricting it to interventions explicitly described as VR. A full list of all searches performed can be found in the Supplementary Material.

Study Selection
Title and abstract screening was performed by ERH and TMP; articles were excluded based on the following criteria: no full text available, wrong language, not peer-reviewed, wrong study design, wrong study population (participants were healthy adults, were under the age of 60, or had a principal diagnosis that was neurological), wrong outcomes, or wrong setting.
A blinded full-text screening was performed independently by two reviewers (ERH and TMP). Conflicts were resolved by ERH and TMP through discussion, or with arbitration by a third reviewer (KBJ).

Data Collection Process
Quantitative data were extracted from the included studies by pairwise independent reviewers (ERH, TMP, KBJ, JPE, and NCN) using a standardized data extraction form, which was presented and agreed upon at a joint meeting. Inter-rater conflicts in the data extraction process were resolved by ERH and TMP in consensus.

Assessment of Risk of Bias in Included Studies
We utilized the Cochrane Collaboration's RoB Tool (Liberati et al., 2009) to evaluate the methodological quality of the included studies. RoB assessment was performed independently by pairs of reviewers (ERH, TMP, NCN, JPE, and KBJ). Conflicts were resolved by ERH and TMP.

Data Analysis
Results from the different trials were pooled using RevMan 5.4 (The Nordic Cochrane Centre, The Cochrane Collaboration, 2020). The primary outcomes for balance were either timed tasks such as Timed Up and Go (TUG), or composite scores such as the Berg Balance Scale (BBS). For functional mobility, the outcomes were limited to timed tasks such as the six minute walk test (6MWT) or the 10 meter walk test (10mWT). Pain measurements were limited to standardizable self-reported uni-dimensional measures, such as visual analog scales (VAS) or numerical rating scales (NRS). To estimate effect sizes of outcomes, we used the standardized mean difference (SMD) for different (or variations of the same) instruments, including VAS and BBS. For studies using similar instruments, the mean difference (MD) was used.
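To make the difference between the two effect measures concrete, the sketch below computes an MD and an SMD from group summary statistics. It assumes the small-sample-corrected form of the SMD (Hedges' adjusted g); the function names and all input values are hypothetical and are not taken from any included study.

```python
import math


def mean_difference(mean_exp, mean_ctrl):
    """MD: difference in means, expressed in the instrument's own units."""
    return mean_exp - mean_ctrl


def hedges_g(mean_exp, sd_exp, n_exp, mean_ctrl, sd_ctrl, n_ctrl):
    """SMD as Hedges' adjusted g: the MD in pooled-SD units, bias-corrected."""
    n_total = n_exp + n_ctrl
    pooled_sd = math.sqrt(
        ((n_exp - 1) * sd_exp ** 2 + (n_ctrl - 1) * sd_ctrl ** 2) / (n_total - 2)
    )
    d = (mean_exp - mean_ctrl) / pooled_sd
    correction = 1 - 3 / (4 * n_total - 9)  # small-sample correction factor
    return d * correction


# Hypothetical summary data: two groups of 10, same SD, means 2 units apart
md = mean_difference(10.0, 8.0)
smd = hedges_g(10.0, 2.0, 10, 8.0, 2.0, 10)
print(md, round(smd, 3))
```

Because the SMD is unit-free, it allows pooling of instruments with different scales, whereas the MD keeps the original units and is therefore easier to interpret clinically when all studies share one instrument.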
Random-effects models were used for all analyses, as the included populations were likely not functionally equivalent, since the interventions described different outcomes and patient populations. Variability within studies is reported in forest plots. Subgroup analysis was performed to determine differences between IVR and NVR studies. Due to the small number of IVR-studies, subgroup analysis was only undertaken for pain. Heterogeneity was assessed individually for each outcome and considered insignificant if the I² value was below a moderate level (<50%), as suggested by Higgins and Thompson (2002). Effects are considered statistically significant if p ≤ 0.05. All analyses use End of Treatment (EoT) scores.
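The pooling itself was done in RevMan, but the logic of a random-effects analysis can be sketched in a few lines. The example below assumes the DerSimonian-Laird estimator of the between-study variance with inverse-variance weights; the function name, the returned keys, and the two input studies are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool per-study effects (e.g., SMDs) under a random-effects model."""
    weights = [1.0 / v for v in variances]  # fixed-effect inverse-variance weights
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    # Random-effects weights add tau2 to every study's own variance
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1.0 / sum(re_weights))
    z = pooled / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value from the normal distribution
    return {"pooled": pooled, "se": se, "tau2": tau2, "q": q, "z": z, "p": p}


# Two hypothetical studies with effects 0.0 and 1.0 and equal variances
result = dersimonian_laird([0.0, 1.0], [0.25, 0.25])
print(result)
```

Adding the between-study variance to each study's variance flattens the weights, so small studies count relatively more than under a fixed-effect model and the confidence interval of the pooled estimate widens.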

Study Identification
Through the different databases, we identified 3,507 articles matching the search strategy. No additional records were identified. After removing duplicates, 2,202 articles were screened, and 2,039 articles were excluded based on title and abstract because they did not match the inclusion criteria. The full search strategy is outlined in Figure 1. Many of the excluded articles described interventions that did not include virtual rehabilitation. Among the most frequent occurrences were interventions with cold-water immersion. One hundred and sixty-three articles published between 1999 (Kim et al., 1999) and April 2020 were assessed for eligibility. Of these, 15 articles (n = 743) satisfied the inclusion criteria, and 10 articles (n = 555) reported the outcomes required for conducting a meta-analysis.
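The reported screening counts can be checked against each other with simple subtraction. The sketch below uses only the numbers stated above; the variable names are hypothetical, and the number of duplicates is derived here rather than reported in the text.

```python
# Counts as reported in the text
identified = 3507
screened_after_deduplication = 2202
excluded_on_title_abstract = 2039
included_in_review = 15
included_in_meta_analysis = 10

# Derived values (simple differences)
duplicates_removed = identified - screened_after_deduplication
assessed_full_text = screened_after_deduplication - excluded_on_title_abstract

print(duplicates_removed)  # 1305
print(assessed_full_text)  # 163, matching the number assessed for eligibility
```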
Six studies were included in the meta-analysis for the overall effect of NVR (VFB treadmill, Nintendo Wii Fit, Playstation 2 + EyeToy, NIRVANA VR Interactive System) on BBS scores as a measure of balance (Figure 3). The mean BBS scores ranged from 30.1 (±8.84) to 53.41 (±1.49) across the studies. The MD between experimental and control groups ranged from 0 to -2.49 s. Moderate heterogeneity was found between studies (Tau² = 0.96, Chi² = 9.03, df = 5, I² = 45%). Compared to the control group, the SMD on BBS scores significantly favored the VR group, demonstrating a significant treatment effect (Z = 4.02, p ≤ 0.0001).
Two studies were included in the meta-analysis for the overall effect of NVR (VFB treadmill, Nintendo Wii Fit) on 6MWT scores (Figure 4). The mean 6MWT scores ranged from 323.7 (±25.9) s to 387.8 (±70.8) s across the studies. No heterogeneity was found between studies (Tau² = 0.00, Chi² = 0.73, df = 1, I² = 0%). The SMD on 6MWT scores did not differ significantly between the VR group and the control group (Z = 1.85, p = 0.06).
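The I² values reported for the BBS and 6MWT analyses follow directly from the Chi² statistic and its degrees of freedom, using the definition of Higgins and Thompson (2002). The short sketch below reproduces them; the function name is hypothetical.

```python
def i_squared(chi2, df):
    """I² in percent: share of total variation attributed to heterogeneity."""
    if chi2 <= 0:
        return 0.0
    return max(0.0, (chi2 - df) / chi2) * 100.0


print(round(i_squared(9.03, 5)))  # BBS analysis: 45
print(round(i_squared(0.73, 1)))  # 6MWT analysis: 0
```

Whenever Chi² is smaller than its degrees of freedom, as in the 6MWT analysis, I² is truncated at 0%.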

Motivation
Adherence, enjoyment and motivation were measured in only one study (Oesch et al., 2017). The primary outcome of the study was adherence to exercise, measured as the duration of exercise each day. Motivation and enjoyment were measured after each training session using a five-point Likert scale. All three outcomes (adherence, motivation, and enjoyment) favored the conventional exercise group. Another study measured game satisfaction on a custom dichotomous scale with direct "yes/no" questions, but only for the experimental group (Mugueta-Aguinaga and Garcia-Zapirain, 2017).

Pain
Two studies were included in the meta-analysis for the overall effect of IVR (Khymeia VRRS, HTC Vive) on pain scores for patients following total knee arthroplasty. The SMD between groups on pain scores was significantly greater for the VR group, demonstrating a significant treatment effect (Z = 2.57, p = 0.01). Two studies were included in the meta-analysis for the overall effect of NVR (Nintendo Wii Fit) on pain scores for people with chronic low back pain and balance disorders. There was no significant difference between the NVR and control groups (Z = 0.78, p = 0.44). Comparison of IVR and NVR demonstrated no significant difference in overall treatment effect between types of VR (Z = 1.03, p = 0.02) (Figure 5).
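The p-values of the overall-effect tests can be recovered from the Z statistics, assuming a two-sided test against the standard normal distribution. The sketch below applies this to the IVR and NVR pain analyses; the function name is hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a Z statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


for label, z in [("IVR vs. control", 2.57), ("NVR vs. control", 0.78)]:
    print(label, round(two_sided_p(z), 2))
```

This reproduces p = 0.01 for Z = 2.57 and p = 0.44 for Z = 0.78. The same conversion gives approximately 0.30 for Z = 1.03, so the Z and p values reported for the subgroup comparison do not appear to stem from the same two-sided normal test and should be read with caution.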

Adverse Events and Dropouts
Two studies mentioned that adverse events were observed during the intervention (Laver et al., 2012; Oesch et al., 2017), and five articles mentioned that no adverse events were detected. Dropouts varied across studies and interventions (Figure 6). The absolute risk of the experimental group was 41 vs. 20 for the control groups. The weighted average dropout rate was 10% for experimental groups and 5% for control groups. The highest number of dropouts was seen in Oesch et al. (2017), with 11/28 in the experimental group, of which the authors identified 7 as related to the treatment, due either to dislike (n = 5) or to experience of pain (n = 2).

Risk of Bias Assessment
The risk of bias analysis was performed to assess the methodological quality of the articles included in the quantitative synthesis. The resulting summary can be seen in Figure 7. No additional sources of bias were discovered.

Selection Bias
All studies except two (Kwok and Pua, 2016; Jin et al., 2018) described a random component when allocating participants to groups (sequence generation). However, six of the included studies did not conceal allocation when assigning participants to the intervention groups (allocation concealment), which was assessed as a high risk (Lee and Shin, 2013; Sobral Monteiro-Junior et al., 2015; Tsang and Fu, 2016; Yeşilyaprak et al., 2016; Anson et al., 2018; Jin et al., 2018).

Performance Bias
Blinding of participants and personnel is almost always impossible in physical health research (Karanicolas et al., 2010), which is reflected in all of the studies receiving a high risk assessment.

Detection Bias
While blinding of participants or personnel is impossible, blinding of outcome assessors is still feasible. However, half of the studies (Lee and Shin, 2013; Morone et al., 2016; Tsang and Fu, 2016; Yeşilyaprak et al., 2016; Jin et al., 2018) reported that outcomes were assessed by the same people who performed the experiment, which we rate as a high risk. One study (Kwok and Pua, 2016) did not specify details (unknown risk), and four studies (Laver et al., 2012; Sobral Monteiro-Junior et al., 2015; Anson et al., 2018; Gianola et al., 2020) took steps to blind outcome assessors (low risk).

Attrition Bias
Concerning incomplete outcome data, three studies had no missing data (Morone et al., 2016; Tsang and Fu, 2016; Jin et al., 2018) (low risk, see also Figure 6). Two studies reported dropouts, but outcomes were calculated based on the number of participants, and intention-to-treat was used to account for missing data (Laver et al., 2012; Morone et al., 2016) (low risk). Two studies (Lee and Shin, 2013; Anson et al., 2018) had a low and slightly disproportionate dropout rate in favor of the control group, but performed a per-protocol analysis (uncertain risk). Three studies (Sobral Monteiro-Junior et al., 2015; Yeşilyaprak et al., 2016; Gianola et al., 2020) had moderate dropouts, disproportionately in favor of the control group, and conducted a per-protocol analysis with no attempts at an intention-to-treat analysis (high risk).

Reporting Bias
We did not compare trial protocols with published outcomes; therefore, intervention effects could be overestimated. Selective outcome reporting was assessed based on whether or not the articles made a reference to an existing protocol. Only three studies referenced a prospectively registered trial protocol (Sobral Monteiro-Junior et al., 2015; Anson et al., 2018; Gianola et al., 2020), two studies were retrospectively registered (Laver et al., 2012; Kwok and Pua, 2016), and five studies made no reference to a protocol, receiving a high risk assessment (Lee and Shin, 2013; Morone et al., 2016; Tsang and Fu, 2016; Yeşilyaprak et al., 2016; Jin et al., 2018). We justify an unknown risk for retrospectively registered trials because reported outcomes in the registry technically could reflect findings of the study, which could also indicate overestimated intervention effects.

Frequency and Classification of Publications With IVR Interventions
To gain a further understanding of the inconsistent use of "VR" as a term to describe technological systems, articles excluded in the full text screening (n = 163) were reviewed. After excluding protocols, wrong study designs, and "no full text available" articles, the following information was extracted from the remaining articles (n = 108): intervention description, immersion type, specificity, availability of software/hardware (commercial or bespoke), the equipment used, and how the authors describe the intervention. Over half of the articles (n = 60) used VR to describe the intervention, and three articles used the descriptor "VR" only as a keyword, with no further mention in the paper. Of the 60 articles, 49 (82%) used non-immersive equipment, 10 (17%) used high-immersive equipment (HMD or CAVE systems), and one article did not describe the equipment used in detail, but just referred to "VR-technology" (Cacau et al., 2013), which made classification impossible.
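The reported shares follow from the counts above. The sketch below recomputes them, rounding to whole percentages; the variable names and category labels are hypothetical.

```python
reviewed = 108
described_as_vr = 60
equipment_counts = {"non-immersive": 49, "high-immersive": 10, "unclassifiable": 1}

# Whole-number percentages of the 60 articles that describe the intervention as VR
shares = {kind: round(100 * n / described_as_vr) for kind, n in equipment_counts.items()}
share_described_as_vr = described_as_vr / reviewed

print(shares)
print(round(100 * share_described_as_vr))  # share of the 108 reviewed articles
```

Because each share is rounded separately, the three percentages do not have to sum to exactly 100.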

DISCUSSION
Virtual rehabilitation continues to evolve as an independent field of study (Keshner et al., 2019). However, despite more than two decades of research, the effectiveness of VR-systems remains elusive, whether they are specifically made for rehabilitation purposes or adapted from recreational non-specific games (Maier et al., 2019). Furthermore, the use of the technology for older adults with non-neurological disorders is still scarce. Even more surprisingly, VR remains a "buzzword" used to describe interventions that do not use IVR-equipment. Both the Oculus Rift CV1 and the HTC Vive were released commercially in early 2016, and the Oculus Rift DK1 was available as early as 2012. Yet, even though VR-systems are now of higher quality and lower price than previously, IVR-systems appear to be still under-represented in virtual rehabilitation (Figure 8). We argue that an increasing public awareness of what could constitute a VR-system, paired with a general lack of research consensus on how it should be specifically interpreted and understood, poses a potential health risk. An example is how the assessment of adverse events is generally under-prioritized in RCTs (Bonell et al., 2015). Our findings affirmed this, as we found adverse events to be generally poorly reported. Although the reporting of no events may be due to a lack of occurrence, it may also be due to only serious events being considered and negligible effects being ignored. For example, a slight dizziness could easily go unreported. Meanwhile, there is a complexity to adverse effects evaluation, as negligible symptoms may be ignored for (or by) some patient populations, while the same symptoms could be considered severe for (or by) others.
FIGURE 7 | Risk of bias (RoB) summary of included studies (only quantitative synthesis). Icon explanations: green "+", low risk of bias; red "-", high risk of bias; yellow "?", uncertain risk of bias.
Nevertheless, measuring nausea or other VR-related side-effects using standardized tools is seldom an independent outcome prioritized in randomized trials. However, VR-sickness remains an unsolved challenge, which is more frequently observed in IVR-systems (Sharples et al., 2008; Kim et al., 2014; Dennison et al., 2016; Chang et al., 2020). Therefore, if clinical trials are included in syntheses without accounting for the degree of system immersion, prevalent adverse events may go unnoticed. This has potentially harmful human consequences, as national or international health authorities base their clinical guidelines on these RCT-studies, reviews and meta-analyses, which may not differentiate systems or adverse effects correctly.
The subgroup analysis between IVR and NVR revealed that the dropout rate for IVR-studies was higher than for NVR-studies. While both tended to have higher retention in the control group, the difference in dropout relative to control was significant for the IVR experimental groups (p = 0.02) but not for the NVR experimental groups (p = 0.10). Adverse events were often not properly addressed, except in two studies (Laver et al., 2012; Gianola et al., 2020), which both included a detailed description and discussion. Due to poor reporting, we cannot infer causality between dropouts and adverse events. However, the description from Gianola et al. (2020) does highlight that dropouts might also be connected to the participants feeling "uncomfortable" wearing the HMD, or lacking face-to-face contact with the therapist. A recent study exploring the acceptance of HMDs among older adults concluded that attitudes changed to positive after experiencing the technology with minimal symptoms. However, there are some caveats to the authors' conclusion that negative attitudes or VR-sickness are negligible. Firstly, the results relate to healthy older adults, and are thus not necessarily applicable to more vulnerable users. Secondly, the VR-application used in the experiment (Perfect by nDreams) has the lowest rating on the Oculus comfort spectrum (nDreams, 2016), which implies that related symptoms will be very low. VR content has a significant impact on the amount of symptoms experienced (Saredakis et al., 2020), and symptoms should therefore be evaluated across different content characteristics before generalized use is validated.
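To make the dropout comparison between an experimental and a control group concrete, the following minimal sketch computes a risk ratio for dropout with a Wald-type 95% confidence interval and a two-sided p-value from a 2×2 table. The risk ratio as effect measure, the function name `dropout_risk_ratio`, and the counts in the usage example are illustrative assumptions. They are not the data of this review, nor necessarily the exact method used in it.

```python
import math

def dropout_risk_ratio(drop_exp, n_exp, drop_ctl, n_ctl):
    """Risk ratio of dropout (experimental vs. control).

    Hypothetical helper for illustration. Returns the risk ratio, a 95%
    confidence interval, and a two-sided p-value, all derived on the log
    scale. Requires non-zero dropout counts in both groups.
    """
    rr = (drop_exp / n_exp) / (drop_ctl / n_ctl)
    # Standard error of ln(RR) for two independent binomial samples.
    se = math.sqrt(1 / drop_exp - 1 / n_exp + 1 / drop_ctl - 1 / n_ctl)
    log_rr = math.log(rr)
    ci = (math.exp(log_rr - 1.96 * se), math.exp(log_rr + 1.96 * se))
    # Two-sided p-value from the standard normal distribution.
    z = log_rr / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return rr, ci, p

# Invented counts: 10 of 50 drop out in the experimental group, 5 of 50 in control.
rr, (ci_low, ci_high), p_value = dropout_risk_ratio(10, 50, 5, 50)
print(f"RR = {rr:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}], p = {p_value:.3f}")
```

A risk ratio above 1 indicates more dropout in the experimental group. A confidence interval that spans 1 corresponds to a non-significant difference at the 5% level.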
At least one article describes preliminary steps to delimit adverse events (Sobral Monteiro-Junior et al., 2015), but it would be beneficial if adverse events related to IVR-systems were measured more consistently with standardized instruments (e.g., the SSQ) in future studies. This would allow a more systematic understanding of the potential challenges with VR as a therapeutic tool across different patient populations, age-groups, and systems.

Summary of Main Findings
The studies included in this review varied widely in intervention type and dosage, outcome measures, participant characteristics, and setting. Participants in the included studies ranged from hospital inpatients, to residents in aged care, to people living in the community. This range of settings and the focus on different conditions or diagnoses suggest that participants may have differed at baseline, making comparison difficult. Additionally, all analyses had high heterogeneity, demonstrating large variation across the included studies. While motivation, engagement, and adherence are commonly cited as benefits of the use of VR in the therapy setting, only one study evaluated this outcome (Oesch et al., 2017). This seems paradoxical, since motivation is often a central principle in the reasoning for using the technology in the first place.

Taxonomy of Virtual Rehabilitation Systems
FIGURE 8 | Frequency plot of the articles (n = 60), published between 1999 and 2020, which were excluded in the eligibility assessment and which used "Virtual Reality" to refer to the intervention. The graph shows interventions post-classified as immersive or non-immersive, by the authors, according to the taxonomy of virtual rehabilitation systems (see Figure 9).
Although the intention to classify VR-systems based on level of immersion was pre-specified in the protocol, a Taxonomy of Virtual Rehabilitation Systems was developed a posteriori, based on the findings in this review, to expand upon the different types of VR-systems, both in terms of immersion [non-immersive (NVR) vs. immersive (IVR)] and specificity [specific (S) vs. non-specific (NS)] (see Figure 9). The latter describes systems developed exclusively for rehabilitation purposes (specific), as opposed to recreational and/or off-the-shelf video games, which have simply been applied to rehabilitation interventions (non-specific) (Maier et al., 2019). Furthermore, to account for implications for practical application and availability of the systems, the taxonomy sub-classifies each type of system according to whether it is commercially (C) available as a "closed system," or has been developed as bespoke (B) technology, which presumably makes it less accessible than an off-the-shelf product.
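The three axes of the taxonomy (immersion, specificity, and availability) can be expressed as a small data structure that derives the class label used throughout this section. The following is a minimal sketch only. The class and attribute names are hypothetical, and the two example systems use the classifications given in this review for the Nintendo Wii and Beat Saber.

```python
from dataclasses import dataclass
from enum import Enum

class Immersion(Enum):
    NON_IMMERSIVE = "NVR"
    IMMERSIVE = "IVR"

class Specificity(Enum):
    SPECIFIC = "S"        # developed exclusively for rehabilitation
    NON_SPECIFIC = "NS"   # recreational or off-the-shelf game

class Availability(Enum):
    COMMERCIAL = "C"      # commercially available "closed system"
    BESPOKE = "B"         # developed as bespoke technology

@dataclass(frozen=True)
class VRSystem:
    """Hypothetical record for one intervention system."""
    name: str
    immersion: Immersion
    specificity: Specificity
    availability: Availability

    @property
    def label(self) -> str:
        # Builds labels of the form "NVR-NS(C)" or "IVR-S(B)".
        return (f"{self.immersion.value}-{self.specificity.value}"
                f"({self.availability.value})")

wii = VRSystem("Nintendo Wii", Immersion.NON_IMMERSIVE,
               Specificity.NON_SPECIFIC, Availability.COMMERCIAL)
beat_saber = VRSystem("Beat Saber", Immersion.IMMERSIVE,
                      Specificity.NON_SPECIFIC, Availability.COMMERCIAL)

for system in (wii, beat_saber):
    print(system.name, "->", system.label)
```

Recording the three attributes separately, rather than as a single free-text label, would let a review group or filter studies along any one axis, for example all immersive systems regardless of specificity.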

Non-immersive VR - Non-specific (NVR-NS)
This sub-category most notably entails commercial NVR-NS(C) systems, such as the Nintendo Wii, with studies that are more easily reproducible, due to the consistency and availability of the systems and software. Likewise, the studies are frequently larger, and span a wide spectrum of patient populations. The caveat is that the systems are not developed for the target population, i.e., people with disabilities. Therefore, studies will encounter users who are not able to operate the system, which may introduce frustration and lack of motivation. Bespoke NVR-NS(B) systems within this sub-category will likely be underrepresented. We have not identified any studies using NVR-NS(B) systems.

Non-immersive VR - Specific (NVR-S)
Acknowledging the issues with NS systems, many studies have also utilized specifically designed systems to tackle some of these problems. Issues with commercial NVR-S(C) systems include that they are often expensive to purchase or require renewable licenses. Bespoke NVR-S(B) systems are also frequently represented in the literature, but are often designed specifically for the study and are often not publicly available. Functionalities are sometimes described in great detail but, we argue, mostly not in sufficient detail to reproduce and replicate findings.

Immersive VR - Non-specific (IVR-NS)
Similar to what the Nintendo Wii achieved in 2006, VR-headsets are now an affordable, commercial off-the-shelf solution. We therefore anticipate an increase in studies evaluating IVR-NS(C) applications within rehabilitation contexts in the near future. For example, we identified one recent publication with preliminary results (Erhardsson et al., 2020) using the IVR-NS(C) application Beat Saber (2018). Other potential IVR-NS(C) applications currently available could include Job Simulator (2016) or OhShape (2019). The primary challenge, similar to NVR-NS(C) systems, is that such systems are developed for users with normal function and abilities. Most likely, no specific settings will be provided to allow inclusivity toward "extreme users." Bespoke IVR-NS(B) systems for rehabilitation, while unlikely, could in practice exist.

Immersive VR - Specific (IVR-S)
Commercial IVR-S(C) systems have been available since at least 2010 [Medicaa's Balance Rehabilitation Unit™ (BRU)], but as with NVR-S(C) systems, IVR-S(C) systems are often expensive and are likely to require renewable licenses. More of these will appear as companies with an already established brand in NVR-S(C) systems add "immersive modules" to their existing hardware. We expect them to acknowledge the increasing demand for such systems, for example Khymeia VRRS®. Bespoke IVR-S(B) systems will also likely start to appear more frequently, both clinically and within research, which will likely create a more balanced representation between NVR and IVR systems. However, we argue that a problem with IVR-S(B), similar to NVR-S(B), is that bespoke systems are rarely commercially available, but are developed and maintained in closed research ecosystems. This makes research reproducibility very challenging.
FIGURE 9 | Taxonomy of virtual rehabilitation systems.
Looking at the applicability of this taxonomy, we see that the bulk of the research included in this review adapts the Nintendo Wii, classified as NVR-NS(C). While disheartening from the perspective of a review focused on IVR, the advantages of the Nintendo Wii's (C) classification are clear: availability, technological reliability, and production value. This implies that working within the (C) classification can provide studies with fundamental advantages: they can be prepared quickly, do not face technological inconsistencies, and can be easily and globally reproduced. Whether "NS" is ultimately a serious disadvantage depends on how well the contextual rehabilitation needs converge with the demands and effects of the non-specific solution. What remains to be considered in this case example is the role of the NVR nature of the Wii.
While IVR is a technology with high potential benefits (typically amplified by the sensation of presence), it also entails increased risks, ranging from falls or injuries (e.g., from not being aware of one's surroundings while wearing the headset) to nausea, ocular disturbances, and disorientation, which may be negligible or severe depending on the individual participant. While (B) applications may be aimed at fitting contextual needs more precisely, they may also lack the refinement and additional benefits of some (C) grade products.
IVR-NS(C) products are currently developing rapidly, both in terms of quantity and quality. With products such as Beat Saber and Half-Life Alyx breaking records for IVR software sales, IVR-NS(C) titles are gradually demonstrating potential for usage across entertainment and clinical settings. These titles represent a point for IVR where success likely gains more from effectively utilizing the defining features of IVR than it loses from any adverse effects. Researchers and practitioners should carefully weigh any apprehension about utilizing (C) products as their vehicle to explore the viability of IVR-based rehabilitation.
This does require researchers to find proper interventions for the IVR-NS(C) applications, and to design their studies around those spaces where the applications are therapeutically and methodologically useful. As this was possible for the Nintendo Wii, it should be considered within the range of possibility for current or future IVR applications. We have yet to see the IVR-NS(C) application that achieves the weight and role within IVR-based rehabilitation that the Wii achieved for NVR-NS(C)-based rehabilitation.
Developing this taxonomic classification for VR systems is a starting point. If the distinction between non-immersive and immersive is non-discrete (i.e., lies on a continuum), the current state of RCT descriptions of technologies points to a demarcation problem of immersion. The taxonomy proposed in this review is one layer in addressing this. Acknowledging the placement of an intervention is important, especially in light of the findings of this review, in order to make an initial judgement on the research field in which it should be placed.
Despite the taxonomy proposed in this review, more detailed classification methods are still needed to further distinguish IVR-based interventions, for example based on field of view (FOV) and field of regard (FOR). Currently, the information reported about interventions is seldom sufficient to use those measurements as variables.

Overall Completeness and Applicability of Evidence
Limited detail about the intervention was provided in the included studies, which limits the ability to replicate the research. This is especially true of specific bespoke systems. When not commercially available, and when details about hardware, software and interactions are not described in detail, bespoke systems become exceedingly difficult to include in cross-study evaluations or comparisons. Future research endeavors should carefully consider and attend to this inclusion.

Potential Biases in the Review Process
This systematic review verifies and supports previous suggestions in narrative reviews that the term VR has been used inconsistently when describing interventions. Furthermore, given that a majority of the articles were published after 2016, which correlates with the availability of high-immersive, commercial, and affordable VR-equipment, this review supports the need for development and evaluation of more high-quality interventions, partly to better understand the effectiveness and adverse events of IVR-equipment in motor rehabilitation of older adults, but also in other domains where better evidence exists, such as stroke therapy (Laver et al., 2017).
Many factors contribute to the sense of immersion (see section 1.2); thus, the dichotomous classification applied in this review is quite reductionistic. Although it can be argued that immersion exists on a continuum, the number of interacting elements that nurture it hinders clear demarcations between the different features. Furthermore, classifying virtual rehabilitation systems a posteriori on a continuum would require detailed technical descriptions (e.g., FOV, FOR, and frame-rate), which RCTs do not traditionally supply.

Limitations of This Review
To our knowledge, this is the first systematic review attempting to evaluate differences in treatment effects by differences in the properties of the system, through subgroup analyses. However, the authors acknowledge that there are limitations to this approach. Firstly, the scope of the review spanned a variety of outcomes and potentially heterogeneous populations, provided that the rehabilitation was non-neurological. This raises the question of whether the data from independent studies can be validly pooled. One of the criteria for pooling data in meta-analyses is that treatment effects are investigated for the same fundamental impairments, using similar or identical systems and comparators. While this review does include a very specific population (i.e., older adults), the included studies differ in the purpose of the interventions, as well as in the potential functional capacity of the included participants. Likewise, the comparator offered in the control groups ranged from no activity at all, to a leaflet, to the exact same intervention minus the digital augmentation. Therefore, the substantial heterogeneity observed, for example in TUG (I² = 68%), can be due to differences across participants, study design, and outcomes, rather than sampling error. Furthermore, the small number of included studies describing non-neurological IVR-interventions for older adults (60+), the insufficiently reported reasons for dropping out, as well as a generally poor description of adverse events, do pose severe limitations. As 56% of studies did not report adverse events, we cannot assume that there were no adverse events simply because none were reported. Moreover, since IVR and NVR are usually pooled, the safety and feasibility of the technology may be inflated.
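For readers less familiar with the heterogeneity statistic reported above, the following minimal sketch shows how I² is conventionally derived from Cochran's Q using inverse-variance weights. The function name `cochran_q_and_i2` and the effect sizes and variances in the example are hypothetical. They are not the TUG data of this review and are chosen only so that the arithmetic is easy to follow.

```python
def cochran_q_and_i2(effects, variances):
    """Cochran's Q and the I-squared statistic (in percent).

    Hypothetical helper: `effects` are per-study effect estimates and
    `variances` their sampling variances. I-squared is the share of
    total variation attributed to between-study heterogeneity rather
    than sampling error, truncated at zero.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2

# Invented example: two studies with equal variance and different effects.
q, i2 = cochran_q_and_i2([0.0, 1.0], [0.1, 0.1])
print(f"Q = {q:.2f}, I2 = {i2:.0f}%")
```

In this invented example, Q is 5 on 1 degree of freedom, so I² = (5 - 1) / 5 = 80%. With few studies, I² is itself imprecise and is best read alongside the clinical differences between the trials.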

Future Directions
As VR-systems improve (e.g., wider FOV, higher pixel density, frame-rate, and resolution), the adverse symptoms experienced by many users will likely be mitigated. However, other challenges may also be relevant to consider when implementing IVR in rehabilitation programs. Technological innovations will need to be continuously monitored and deemed appropriate for clinical use, as new barriers may arise when new interfaces are inevitably added as new design standards. For example, mass-market brain-computer interfaces are likely to become embedded in wearable computing devices within the foreseeable future. Although this will definitely be a game-changer for patient monitoring during therapy, it is not unlikely that such interfaces will be considered in violation of personal data protection regulations when placed in off-the-shelf commercial products. For researchers seeking to implement clinical VR, it may be valuable to theorize on the potential harms of the technology, and to evaluate it continuously during the process. One approach to evaluating the potential harmful consequences could be the development of "dark logic models" (Bonell et al., 2015).
In this review we have proposed a taxonomy expanding the previous distinction between specific and non-specific VR (Maier et al., 2019) to include the distinction between immersive and non-immersive VR, as well as a differentiation between commercial and bespoke systems. Admittedly, the field of Virtual Rehabilitation has so far used VR as an umbrella term. However, to avoid confusing consumers, researchers, and healthcare professionals alike, who are leading the change, the recent commercialization of VR should re-establish discussions and lead to a taxonomic consensus on whether (or not) the term "VR" should be reserved exclusively for IVR-systems, as a subcategory of Virtual Rehabilitation.
Finally, the authors encourage the use of similar methods to distinguish between NVR and IVR interventions in more focused reviews, to better understand the differences in treatment effects and related adverse events. Potentially, this task can be undertaken through umbrella reviews that synthesize results from systematic reviews while accounting for a posteriori classifications.

CONCLUSION
The majority of studies included in this review evaluated the use of non-specific, commercially available NVR systems. Three of the 15 studies included in this review evaluated IVR interventions. Two of these studies met the criteria for meta-analysis. Six studies included in the meta-analysis indicated a significant treatment effect of NVR on TUG scores and BBS scores compared to the control intervention. No significant difference in 6MWT scores was found in the meta-analysis of the two studies using NVR interventions. Pain scores were significantly different for the two IVR interventions compared to control for patients following total knee arthroplasty. Yet, no significant difference was found in pain scores between the NVR interventions and control for people with chronic back pain or balance disorders.
We issue a call to action to distinguish between types of VR-technology, and propose a taxonomy of virtual rehabilitation systems based on our findings. Most interventions use NVR systems, which have demonstrably lower VR-sickness than IVR-systems. Therefore, RCT adverse events may be underreported. An increased demand for IVR-systems highlights this challenge. Care should be taken when applying the results of existing NVR tools to new IVR technologies. NVR could improve functional outcomes, and should not be underestimated simply due to the contemporary existence of IVR. Future studies should provide more detail about their interventions, and future reviews should differentiate between NVR and IVR.

Implications for Practice
The heterogeneity in VR intervention, participant type, study setting, and outcome measures across the included studies, along with small sample sizes, limits the ability to draw strong conclusions to support the use of VR in practice. Stakeholders and clinicians should be careful when applying the results of existing NVR interventions to new IVR technologies. While both NVR and IVR can effectively improve functional outcomes, IVR generally causes more adverse events, such as VR-sickness, which can lead to higher dropout rates or, even worse, pose health risks if patients are not properly monitored.

Implications for Research
Future studies should provide more detail about the equipment used in the interventions, and also better monitor, measure and report system-specific side effects through standardized tools. Future reviews should differentiate between NVR and IVR.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
EH, JB-P, TP, and SF conceived and formulated the research questions for the study. EH drafted the a priori protocol, analyzed the data, and drafted the manuscript. SF and SH performed the literature search, identified and removed duplicate records, and prepared files for Covidence. EH and TP performed title and abstract screening as well as full-text screening, and resolved conflicts between raters. EH, TP, JB-P, KH, and NN performed independent data extraction and Risk of Bias assessment. EH, BL, and JB-P interpreted results of the data synthesis. EH, BL, JB-P, NN, SS, TP, SF, and CK edited and contributed to the manuscript. All authors reviewed and accepted the final version.

FUNDING
This systematic review was funded as a joint effort between Aalborg University and VihTek Research and Test Center for Health Technologies. The systematic review was written as part of a Ph.D. study undertaken by Emil Rosenlund Høeg, funded by the municipality of Frederiksberg.

ACKNOWLEDGMENTS
Many thanks to Professor Hanne Tønnesen and colleagues from WHO Collaborating Centre for Evidence-based Health promotion in Hospitals and Health Services, Frederiksberg for letting me attend the already full Ph.D.-Course on Systematic Review Techniques, 2018. Without that course, this review would not have happened.