The Effects of Virtual Reality on Procedural Pain and Anxiety in Pediatrics: A Systematic Review and Meta-Analysis

Distraction and procedural preparation techniques are frequently used to manage pain and anxiety in children undergoing medical procedures. An increasing number of studies have indicated that Virtual Reality (VR) can be used to deliver these interventions, but treatment effects vary greatly. The present study is a systematic review and meta-analysis of studies that have used VR to reduce procedural pain and anxiety in children. It is the first meta-analytic assessment of the potential influence of technical specifications (immersion) and degree of user-system interactivity on treatment effects. A total of 65 studies were identified, of which 42 reported pain outcomes and 35 reported anxiety outcomes. Results indicate large effect sizes in favor of VR for both outcomes. Larger effects were observed in dental studies and studies that used non-interactive VR. No relationship was found between the degree of immersion or participant age and treatment effects. Most studies were found to have a high risk of bias and there are strong indications of publication bias. The results and their implications are discussed in the context of these limitations, and modified effect sizes are suggested. Finally, recommendations for future investigations are provided.


INTRODUCTION
The management of pain and anxiety in children undergoing medical procedures remains suboptimal (Stevens et al., 2011; Birnie et al., 2014; Friedrichsdorf and Goubert, 2020). As well as causing excessive and unnecessary suffering, undertreated procedural distress may have long-term negative effects on child health and development, as well as treatment outcomes (Young, 2005). Current best practice guidelines recommend that non-pharmacological interventions are routinely implemented in treatment plans (Wilson-Smith, 2011). Two common, non-pharmacological approaches are distraction and procedural preparation. Distraction involves the use of distractors like music and television to divert attention away from noxious stimuli, whereas preparation techniques usually entail information about the procedure or exposure to the procedural setting (e.g., a tour of the clinic). Over the last couple of decades, researchers have explored whether virtual reality (VR) can be used to deliver and possibly enhance distraction and preparation interventions in pediatrics.
Previous reviews have indicated the potential of VR in pediatrics (e.g., Indovina et al., 2018; Eijlers et al., 2019a; Georgescu et al., 2020; Lambert et al., 2020). Its immersive, interactive nature is thought to provide particularly captivating distraction, as well as a cost-effective and engaging medium for procedural preparation. However, previous meta-analyses have revealed great heterogeneity in treatment effects and little is known about the underlying mechanisms and factors that determine the effectiveness of VR interventions (Li et al., 2011).
The present study is a systematic review and meta-analysis of studies that have used VR to reduce procedural pain and anxiety in pediatrics. To address the variability of effect sizes that have been observed across studies, the potential influence of various VR, procedural, and participant characteristics will be explored. The main focus will be on characteristics of VR systems, including the technical specifications and degree of user-system interaction. While some evidence suggests that VR characteristics influence treatment effects (e.g., Hoffman et al., 2006; Wender et al., 2009; Johnson and Coxon, 2016), this has not yet been assessed in a meta-analysis.

Virtual Reality in Healthcare
Virtual reality (VR) may be described as an interactive, immersive, computer-generated environment or experience (Gigante, 1993; Pan and Hamilton, 2018). It is typically presented on a head-mounted display (HMD), with screens positioned close to the user's eyes and full or partial occlusion of the physical surroundings. Images are often three-dimensional and continuously adjusted in accordance with the user's head movements (Slater and Sanchez-Vives, 2016). Such features contribute to the sense of being surrounded by or present in the virtual environment that is unique to VR.
Various applications of VR in health have been explored, including in the assessment and treatment of patients. Reviews of the literature have reported significant methodological issues and a need for further research, but nevertheless indicate a considerable potential for VR in various clinical settings. For example, VR interventions have been applied in rehabilitation (Laver et al., 2017), habilitation (Snider et al., 2010), psychiatry (Freeman et al., 2017), geriatrics (Neri et al., 2017), and palliative care (Niki et al., 2019). An increasing number of studies have demonstrated its utility in the management of pain and anxiety caused by medical procedures in adult and pediatric populations (Malloy and Milling, 2010; Chan et al., 2018; Eijlers et al., 2019b; Georgescu et al., 2020).

Procedural Pain and Anxiety in Pediatrics
Children in developed countries undergo an increasing number of potentially painful and anxiety-inducing medical procedures (Curtis et al., 2012). Depending on their age and development, children may experience these procedures as more aversive than adults due to limitations in their ability to communicate their pain and need for pain management, to understand why the procedure is necessary, and to self-regulate (Cohen et al., 2008; Slifer, 2013; McMurtry et al., 2015). While conditions like cancer and burn injuries often require repeated or particularly distressing procedures (Gandhi et al., 2010; Twycross et al., 2015), routine procedures like venipuncture and immunizations are also known to induce considerable pain and anxiety in children (Reid et al., 2014). If poorly managed, procedural pain and anxiety could have detrimental effects on child health and development, as well as treatment outcomes (Mathews, 2011; Wilson-Smith, 2011). For example, painful and frightening medical procedures in childhood have been linked to alterations in pain responses later in life (Pate et al., 1996; Taddio et al., 1997; Kennedy et al., 2008), reduced effects of future pharmacological analgesia (Weisman et al., 1998), and development of needle phobia (McMurtry et al., 2015).
The International Association for the Study of Pain (IASP, 2011) defines pain as "an unpleasant sensory and emotional experience associated with, or resembling that associated with, actual or potential tissue damage". Procedural pain refers to pain associated with medical (or dental) procedures. Procedural anxiety may be described as a response to such procedures characterized by feelings of dread and apprehensiveness, accompanied by physical symptoms such as sweating and increased heart rate (Lavoie, 2013). The relationship between procedural pain and anxiety is intertwined and complex; for example, they frequently co-occur and exacerbate each other (Cohen et al., 2004; McMurtry et al., 2015; Kao and Schwartz, 2019).
The experience of pain is modulated by multiple biological, psychological, and social processes (Bentley, 2014). Some factors known to modulate pain top-down include attention toward painful stimuli, expectation of pain, anxiety, and previous experiences with pain (Linton and Shaw, 2011; Bentley, 2014). Knowledge of these and other pain-modulating mechanisms has informed the development of various non-pharmacological pain management approaches, including distraction and procedural preparation (Curtis et al., 2012). Current best practice guidelines recommend a combination of pharmacological and non-pharmacological interventions in the treatment of procedural pain and anxiety (e.g., The Association of Paediatric Anaesthetists of Great Britain and Ireland, 2012). Over the last couple of decades, researchers have explored whether VR can be used to effectively deliver distraction and preparation interventions in pediatrics.

Distraction and Preparation Techniques
Distraction techniques are commonly used during painful or frightening procedures of shorter durations (DeMore and Cohen, 2005). They involve the use of stimuli such as videos, music, and conversation to divert attention away from noxious stimuli (Schechter et al., 2007). No single theory can fully account for the effects of distraction analgesia (DeMore and Cohen, 2005), but they are often understood in terms of attentional capacities. It is assumed that pain perception requires attention, and that by focusing on distractors, fewer attentional resources are available for pain perception (McCaul and Malott, 1984; Gupta et al., 2017). However, distraction may also work through other mechanisms. For example, pleasant distractors may have inherent positive effects on mood, arousal, and anxiety, all of which have the capacity to alter pain perception (Johnson, 2005). Attention, mood, arousal, and anxiety can all be understood as processes inhibiting nociceptive signals as described in the gate control and neuromatrix theories of pain (Melzack and Wall, 1965; Melzack, 1999). Due to its immersive, interactive, and multisensory properties, VR is thought to be particularly captivating and thus provide superior distraction (Slifer, 2013).
Another common way of reducing pain and anxiety is procedural preparation, often in the form of a verbal briefing, written materials, or a tour of the clinic (Curtis et al., 2012). Such techniques are meant to reduce anxiety (and possibly also pain) by promoting a sense of control and adaptive behaviors, as well as desensitizing the child to the medical procedure and the setting in which it takes place (Jaaniste et al., 2007; Edward et al., 2015). Research on virtual reality exposure therapy (VRET) has established that VR can be used to expose users effectively and ecologically to feared stimuli (Botella et al., 2017; Boeldt et al., 2019). Based on these findings, researchers have recently begun exploring whether VR can be used for procedural preparation (Eijlers et al., 2019a). In addition to exposure to the medical procedure and the environment in which it takes place, VR preparation may involve modeling, instructions, and rehearsal of the procedure (e.g., Ryu et al., 2018; Han et al., 2019; Liszio et al., 2020).

The Influence of Virtual Reality Characteristics
VR systems offer varying degrees of interaction with the user. Less interactive forms of VR include videos converted to a 360/180° format for viewing on a VR headset. While the user may effect changes in perception (i.e., looking around the virtual environment in 360/180° through tracking of head movements), he or she is nevertheless a passive spectator of the virtual environment. On the other hand, VR games or simulations may offer interactivity beyond head tracking, such as navigation in the virtual environment, social interaction with avatars, or manipulation of virtual objects. In the present study, head tracking will be considered an aspect of immersion, and not interactivity.
A potential impact of VR interactivity on procedural pain and anxiety seems plausible. It is generally assumed that active distraction poses greater attentional demands on patients than passive distraction, thus providing superior analgesia (Slifer, 2013). Some studies have reported this pattern for VR specifically (e.g., Dahlquist et al., 2007; Wender et al., 2009; Gutiérrez-Maldonado et al., 2011; Gutiérrez-Martínez et al., 2011). In addition, VR interactivity may augment learning and memory (e.g., James et al., 2002; Tuena et al., 2019), which could be beneficial when used for procedural preparation.
VR systems also vary in terms of technological sophistication, which may be conceptualized as varying degrees of immersion (Nilsson et al., 2016; Agrawal et al., 2019). According to Slater and Wilbur (1997), a highly immersive system should minimize signals from the physical world (e.g., fully occlude the user's physical surroundings), stimulate multiple senses (e.g., visual, auditive, and tactile), visually surround the user (e.g., a wide field of view), provide a vivid representation of the virtual environment (e.g., high screen resolution) and match the actions of the participant with the sensory output of the system (e.g., low latency between head rotation and subsequent change in images displayed). This concept of immersion provides a useful framework for comparison of VR systems, as it can be operationalized and objectively measured (Slater, 2009; Cummings and Bailenson, 2016).
The degree of immersion may have an impact on the effectiveness of VR interventions. According to Slater (2018), higher levels of immersion facilitate the perceptive illusion that the virtual environment is real, which he referred to as presence. Presence is commonly thought to increase the effectiveness of various forms of VR interventions (Cummings and Bailenson, 2016). VR studies have indicated a possible relationship between immersion/presence and the effectiveness of VR distraction analgesia (e.g., Hoffman et al., 2004; Hoffman et al., 2006).
Some previous reviews have employed somewhat vague definitions of VR in their inclusion criteria. For example, some authors have specified that they would only include 'immersive VR' (Chan et al., 2018) or 'fully immersive VR' (Eijlers et al., 2019b), but did not explicitly state their definition of these terms. It is crucial that these terms are clearly defined and consistently applied to avoid confusion. For example, it can be argued that some of the technologies (e.g., the eMagin 3DVisor) included in Eijlers et al. (2019a) are not fully immersive because their users can still see some of their physical surroundings (see Slater and Wilbur, 1997). Perhaps more importantly, unclear definitions of VR and immersion have resulted in an inconsistent inclusion of less advanced technologies that are often referred to as 'audiovisual glasses' (AV-glasses), rather than 'VR'. These often lack features such as stereoscopy and head tracking, and often have a narrower field of view (Wismeijer and Vingerhoets, 2005). However, as review authors do not include 'audiovisual glasses' in their search strategies, many studies using comparable technologies have previously been overlooked. The present review will therefore employ an inclusive definition of VR and a wider search strategy that also includes AV-glasses. The term 'VR' will mostly be used in the current study.

OBJECTIVES
Previous reviews have indicated the potential of VR in pediatrics (e.g., Eijlers et al., 2019a; Iannicelli et al., 2019; Georgescu et al., 2020). However, nearly half of the studies included in the present review were published in 2019 and 2020. As the literature search of the most recent review (Georgescu et al., 2020) was conducted in 2018, an updated review is necessary. Another motivation for the present study is that previous reviews have not quantitatively assessed the differences between VR interventions. Considering the potential impact immersion and interactivity may have on treatment effects, such assessments could have important clinical implications.
Previous reviews have reported much heterogeneity in effect sizes (Eijlers et al., 2019b; Georgescu et al., 2020), which may reflect VR characteristics, but also differences between medical procedures and patients (e.g., age). The increased number of studies gained from also including AV-glasses will provide greater statistical power to explore these variables as potential sources of the heterogeneity. Identifying any such moderators of treatment effects may help inform the process of designing and implementing VR interventions for clinical use. Moreover, the increased number of studies may also provide more accurate estimates of the true effects of using VR during medical procedures.
The present study consists of a systematic literature review and meta-analysis of studies that have used VR to reduce procedural pain and anxiety in pediatrics. It also provides a meta-analytic assessment of the role of VR hardware specifications (i.e., immersion) and the degree of interaction between the patient and the VR system. The different groups of medical procedures and the age of participants will also be explored as potential moderators of treatment effects.
The research questions were as follows:
• Do VR interventions reduce pain and anxiety in pediatric patients undergoing medical/dental procedures more than standard procedures?
• Does effectiveness of VR interventions vary depending on the type of medical procedure, VR characteristics, and the age of patients?

METHODS
The effects of VR interventions on procedural pain and anxiety in children were evaluated through a systematic literature review and meta-analysis. Reporting will follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Moher et al., 2009).

Protocol and Registration
A study protocol (CRD42020155056) was submitted to the Prospective Register of Systematic Reviews (PROSPERO) in May 2019. Some deviations from the protocol were deemed necessary. Firstly, as the differentiation between 'VR' and 'audiovisual glasses' was somewhat inconsistent in the literature, the search strategies were changed to also include 'audiovisual glasses' and variants of this term. Due to the resulting increase in search results, it was necessary to limit the volume of retrieved studies by also adding the terms 'preparation', 'distraction', 'pain', and 'anxiety'. Secondly, it was discovered that the reporting of technical specifications of VR systems was poor and inconsistent, particularly in older studies. Selective reporting of technical specifications by authors and VR manufacturers hindered calculations that are required for accurate quantitative comparison in terms of screen resolution and field of view (see subsections 'screen resolution' and 'field of view'). The screen refresh rate was also rarely disclosed in older studies. Screen resolution, field of view and refresh rate were thus omitted from quantitative analyses.

Study and Publication Characteristics
Studies were considered eligible if a VR intervention was compared experimentally or quasi-experimentally with any non-VR interventions or a no-intervention control group. Single-case studies and studies with pretest-posttest designs without control groups were excluded. Unpublished studies were eligible for inclusion. Only publications in English or one of the Scandinavian languages were considered eligible. No time constraints were applied.

Participant Characteristics
Only pediatric samples were eligible for inclusion. Pediatric patients were defined as 0-21 years of age, in accordance with recommendations issued by the American Academy of Pediatrics (Hardin and Hackell, 2017).

Intervention Characteristics
Studies were considered eligible if an intervention involving VR was used to reduce pain and/or anxiety in pediatric patients associated with medical or dental procedures through distraction or procedural preparation. VR was defined as a computer-generated virtual environment presented on a head-mounted device or other VR system that perceptually surrounds the user (i.e., covers all or most of the field of view). VR presented on conventional screens (with or without 3D-effects) was thus not eligible for inclusion. So-called audiovisual glasses were eligible for inclusion. Augmented reality (AR) technologies render images on a transparent screen that reveals the user's physical surroundings and were thus excluded.

Outcomes
Questionnaire and observational measures of pain and (state) anxiety were considered eligible. Stress and fear measures were accepted as anxiety measures, as these were thought to have a high degree of conceptual overlap with state anxiety (Öhman, 2008). Studies that used measures of procedural distress were excluded, as this concept includes dimensions of both pain and anxiety (McMurtry et al., 2015). Physiological measures and measures of maladaptive behavior were not considered valid pain or anxiety measures for the same reason.

Comparison Groups
Studies were eligible for inclusion if they compared VR interventions with non-VR interventions or no intervention. Non-VR interventions may involve non-VR distraction (e.g., television, videogames), non-VR procedural preparation (e.g., verbal or written information about the procedure), standard of care (SOC) procedures or behavior management techniques (e.g., positive reinforcements, tell-show-do technique). The inclusion of no-intervention, SOC, and other non-VR conditions was deemed necessary as Eijlers et al. (2019a) found that standard of care was often poorly defined, and often involved a variety of both pharmacological and non-pharmacological interventions.
were included to also identify any 'gray literature', such as unpublished studies and theses. Only the first 150 publications were extracted from Google Scholar, due to the diminishing relevance of hits produced by this search engine. Unpublished studies were collected by contacting researchers identified in bibliographies, search results or elsewhere. Article reference lists of included studies were also searched manually.

Search
Databases were searched using the following terms and their synonyms: Virtual reality/audiovisual glasses + pediatrics/child + anxiety/pain/preparation/distraction. Search strategies were adapted for each database. The complete search strategy for PsycINFO is presented in Table 1. The last search was conducted October 1, 2020, but manuscripts were received from contacted authors until November 25, 2020.

Study Selection
Upon completion of the literature search and after removal of duplicates, each publication was screened for potential eligibility by the first author. Researchers identified in trial registries and conference abstracts were contacted if any corresponding, published research articles were not identified in the search results. The resulting list of studies was considered for eligibility by both authors. Reasons for exclusions were recorded at this point. Any disagreements were resolved through discussion.

Data Collection Process
Data extraction was conducted using an Excel spreadsheet. The spreadsheet was piloted with five randomly selected studies that were coded independently by both authors. As coding agreement was deemed satisfactory, the remaining data was collected independently by the first author. Numerical study results were coded by the first author and double checked for accuracy by the second author. Any disagreements were resolved through discussion. If sufficient information was not available in the articles, information was requested from corresponding authors on multiple occasions between May and November 2020. Co-authors were contacted if corresponding authors could not be reached. Efforts were made to locate updated contact information for researchers that did not respond. VR hardware or software specifications were also sourced from direct communication with manufacturers, technical manuals published online or vendors. Specifications sourced directly from articles were preferred, as authors may have reconfigured HMD settings.

Data Items
All data items were extracted as specified in the review protocol. If more than one measure of pain or anxiety was available, retrospective, self-reported measures were prioritized. Self-reported measures were preferred as pain and anxiety are subjective and private experiences, and because observers' ability to accurately describe the patient's distress may be compromised as the VR headsets cover parts of the patient's face. For pain specifically, measures of sensory pain were preferred over measures of the affective or cognitive aspects of pain. Final values were preferred over change scores.
The following information was extracted from each primary study: 1) publication and study details (author(s), year published, study design, sample sizes, description of comparison groups); 2) participant characteristics (average age and a measure of dispersion, gender distribution, other health-related characteristics); 3) details regarding the pain and anxiety measures that were used (name of measures, timing of administration, informant); 4) the procedural setting (clinical context in which the procedure took place, the kind of medical procedure, timing of VR intervention); 5) results (key findings, summary statistics for VR and non-VR groups); 6) VR characteristics (technical specifications, degree and form of interactivity, and descriptions of media displayed). The VR characteristics (immersion and interactivity) are described in further detail below.

Immersion
The variables describing technical specifications are primarily based on Cummings and Bailenson (2016), who compiled a list of VR features that increase the level of immersion and thus the sense of being present in the virtual environment. The list of VR characteristics included in the present study is not exhaustive, but rather focused on the objective, purely technical properties that were deemed realistic to code. For example, the overall level of detail and realism in virtual environments was not included. In addition to hardware specifications, information was extracted regarding the number of senses stimulated, the level of user-system interactivity, and the media displayed to participants.

Screen Resolution
The screen resolution refers to the number of pixels the screen displays per frame (Kourtesis et al., 2019). A screen with a high resolution will be perceived to have greater fidelity, or 'crispness', of images displayed. Resolution is typically reported as horizontal × vertical pixels (e.g., 1,280 × 1,800), or pixels per inch (ppi). However, as pointed out by Hugues (2019), the pixels per degree (ppd) format more truly reflects the fidelity of the display, as it is independent of the field of view. Calculating the ppd requires knowledge of the horizontal field of view, which is rarely disclosed. The screen resolutions were therefore not compared quantitatively.
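To illustrate why pixel counts alone can mislead, the following minimal sketch approximates ppd as the horizontal pixels available to one eye divided by the horizontal field of view. This is a simplifying assumption (it ignores lens distortion and assumes pixels are spread evenly across the field of view) and is not a calculation performed in the present review. The function name and the two example devices are hypothetical.

```python
def pixels_per_degree(horizontal_pixels_per_eye: int,
                      horizontal_fov_deg: float) -> float:
    """Approximate average pixels per degree (ppd) for one eye.

    Assumes pixels are distributed uniformly over the horizontal
    field of view, which real HMD optics only approximate.
    """
    if horizontal_pixels_per_eye <= 0 or horizontal_fov_deg <= 0:
        raise ValueError("pixel count and field of view must be positive")
    return horizontal_pixels_per_eye / horizontal_fov_deg


# Hypothetical devices, for illustration only.
wide_fov_device = pixels_per_degree(1280, 100.0)   # many pixels, wide FoV
narrow_fov_device = pixels_per_degree(800, 40.0)   # fewer pixels, narrow FoV

print(f"wide-FoV device:   {wide_fov_device:.1f} ppd")
print(f"narrow-FoV device: {narrow_fov_device:.1f} ppd")
```

In this hypothetical comparison the device with fewer pixels has the higher ppd, because its pixels cover a much narrower field of view. This is the reason a fair comparison requires the horizontal field of view in addition to the pixel count.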

Field of View
The field of view (FoV) refers to the number of degrees of the VR user's visual field that are occupied by the virtual environment (Cummings and Bailenson, 2016). FoV may be reported as diagonal, horizontal or vertical. Some manufacturers reveal only one measure (diagonal) of the FoV, whereas others withhold this information completely. The FoV may also be artificially increased by reducing the stereo overlap, i.e., the area of the screen in which the user can perceive depth (Hugues, 2019). It was thus decided that the field of view of devices could not be fairly compared quantitatively, and this variable was omitted from quantitative synthesis.

Screen Refresh Rate
The screen refresh rate refers to the rate at which the screens update the images displayed on the screen, based on input generated by the computer (Kourtesis et al., 2019). A low screen refresh rate would be perceived as a lack of fluency in images, or a lag between the user's actions and visual input. The screen refresh rate is either reported in cycles per second (Hz) or frames per second (FPS). As this information was frequently missing, particularly in older studies, the screen refresh rate was not used to compare VR interventions.

Stereoscopy/Three-Dimensional Graphics
Stereoscopy is achieved by presenting separate images to each eye with slight differences in perspective that reflect the interpupillary distance. It provides an illusion of depth in the virtual environment and may increase immersion (Yang et al., 2012).

Head Tracking
Some VR systems track user movements and use this information to adjust images (and sometimes sound) accordingly. All parts of the body can be tracked, but tracking of head movements is the most common. According to Slater (2009), tracking strengthens the illusion of being present in the virtual environment as the participant can perceive through natural sensorimotor contingencies (O'Regan and Noë, 2001). For example, a participant may tilt his or her head to inspect a virtual object from several angles, which is not possible on conventional screens.

Visual Occlusion
This variable refers to whether the VR system fully covered the participant's physical surroundings. HMDs that are not fully occlusive may have a gap between the device and the participant's face that lets light through and allows the participant to see parts of the procedural setting. Minimizing input from the physical reality may strengthen the illusion of being present in the virtual environment (Slater and Wilbur, 1997).

Non-Visual Sensory Stimulation
This variable described whether the VR intervention involved any non-visual, sensory stimulation. This would typically be in the form of auditive stimuli (e.g., music or sound effects from games), but also tactile stimuli (e.g., force feedback or vibration from controllers). Researchers may choose not to include audio to avoid disruption in communication between patients and personnel delivering the medical procedures. However, it is commonly assumed that multisensory stimuli provide greater immersion and sense of presence (Cummings and Bailenson, 2016).

Interactivity
This variable was used to declare whether the VR system offered any user-system interaction beyond control of the field of view (i.e., tracking of head movements). Interactivity may for example include navigation in the virtual environment or manipulation of virtual objects.

Risk of Bias Assessment in Individual Studies
Assessment of study risk of bias was conducted in accordance with the Cochrane Handbook of Systematic Reviews. The effect of interest was the Intention-To-Treat (ITT) effect, i.e., the effect of allocation to intervention. Risk of bias was assessed at outcome level independently by the first author. The ROB 2.0 (Sterne et al., 2019) and ROBINS-I (Sterne et al., 2016) tools were used for RCTs and non-randomized studies, respectively. The RCT characteristics assessed were 1) bias arising from the randomization process, 2) bias due to deviations from intended interventions, 3) bias due to missing outcome data, 4) bias in measurement of the outcome, and 5) bias in selection of the reported result. Additional considerations for cross-over trials were applied (Sterne et al., 2019). However, they were evaluated with the parallel design tool if only data from the first study period was analyzed. Non-randomized studies were evaluated in terms of the following domains: 1) confounding, 2) selection bias, 3) bias in classification of interventions, 4) bias due to deviations from intended interventions, 5) bias due to missing data, 6) bias in measurement of outcomes, and 7) bias in selection of the reported result. The risk of bias judgments for each domain are illustrated in separate figures for randomized and non-randomized studies. Additional bar plots illustrate the overall judgment for each domain across studies, with each study's contribution weighted by its standard error. The figures were constructed using the robvis web application (McGuinness and Higgins, 2020). A separate, additional analysis excluding studies deemed to have a high risk of bias in two or more domains was conducted.

Summary Measures
The differences in mean pain and anxiety scores for the VR and control groups were calculated as Hedges' g (Hedges, 1981).
While similar to d, the Hedges' g includes a correction term that yields a less biased estimate, particularly when sample sizes are small (Borenstein et al., 2009). If a study had multiple VR or non-VR arms, summary statistics were combined by calculating their weighted mean and standard deviations, based on their number of participants. Means and standard deviations were ideally extracted directly from articles or obtained from study authors. If necessary, they were estimated. Sample means were estimated from the median by the method of Shi et al. (2020). Estimation of variance based on the median, interquartile range and sample sizes were based on the method of Wan et al. (2014).
For studies that also reported the minimum and maximum values, the formula proposed by Luo et al. (2018) was used for additional precision. These estimations were performed using an online calculator by Shi et al. (2020). The Campbell Collaboration effect size calculator (Wilson, n.d.) was further used to estimate effect sizes from t-statistics. Cross-over trials were only included for quantitative synthesis if data from the first study period only was available.
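The effect size calculation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the formula for merging two arms is the one given in the Cochrane Handbook (an assumption, since the text only states that weighted means and standard deviations were used), and the sign is chosen so that positive values favor VR, in line with the forest plots.

```python
import math


def combine_arms(n1, m1, sd1, n2, m2, sd2):
    """Merge two arms into a single group (Cochrane Handbook formula; assumed here)."""
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    sum_sq = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2
              + (n1 * n2 / n) * (m1 - m2) ** 2)
    return n, mean, math.sqrt(sum_sq / (n - 1))


def hedges_g(n_ctrl, m_ctrl, sd_ctrl, n_vr, m_vr, sd_vr):
    """Return (g, variance of g); positive g means lower scores in the VR group."""
    df = n_ctrl + n_vr - 2
    sd_pooled = math.sqrt(((n_ctrl - 1) * sd_ctrl**2 + (n_vr - 1) * sd_vr**2) / df)
    d = (m_ctrl - m_vr) / sd_pooled
    j = 1 - 3 / (4 * df - 1)  # small-sample correction factor
    var_d = (n_ctrl + n_vr) / (n_ctrl * n_vr) + d**2 / (2 * (n_ctrl + n_vr))
    return j * d, j**2 * var_d


# Hypothetical study: 30 children per group, pain 6.0 (SD 2.0) vs 5.0 (SD 2.0)
g, var_g = hedges_g(30, 6.0, 2.0, 30, 5.0, 2.0)
```

With 30 participants per group the correction factor is close to 1, so g is only slightly smaller than d; the correction matters more for the smallest studies in the review.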
Several studies reported multiple measures of pain and anxiety. As specified in the review protocol, only one measure for each outcome was used for quantitative synthesis. The selection was based on the following pre-specified criteria: 1) Self-reported measures were preferred over observational measures; 2) measures of sensory pain were preferred over measures of the cognitive or affective aspects of pain. If two or more measures fit the abovementioned criteria, the most frequently used measure was selected.

Synthesis of Results
The methodology was guided by Borenstein et al. (2009) and the Cochrane handbook of systematic reviews. All statistical analyses (except selection models) were conducted using Stata 16 (StataCorp, 2019). Standardized mean differences in pain and anxiety were combined using a random-effects model. The random-effects model assumes that the study effect sizes are drawn from different populations of study effect sizes, i.e., that observed variance consists of both sampling error and differences in true effect sizes (Borenstein et al., 2009). This model was selected as the studies were expected to be diverse in terms of study designs, participant characteristics, medical procedures, and VR characteristics, to name a few. The restricted maximum likelihood estimator of between-studies variance (τ²) was selected based on recommendations by Veroniki et al. (2016). The results of the two meta-analyses are presented in separate forest plots. The magnitude of effect sizes was compared with those of comparable studies, as compiled by Lipsey and Wilson (1993). The standardized mean effect was also expressed as absolute mean differences on the Wong-Baker Faces scale and the Child Fear Scale. These scales were selected as they were the most frequently used one-item scales among the outcomes included in the meta-analysis. The absolute mean difference was calculated by multiplying the standardized mean difference by the combined standard deviation from every study in which these measures were used in the meta-analysis (Schünemann et al., 2020).
Heterogeneity among all included studies was assessed using Cochran's Q test. A significant result suggests that the observed variation in effect sizes reflects true heterogeneity rather than sampling error alone (Borenstein et al., 2009). The I² statistic was then used to quantify the magnitude of heterogeneity. It describes the percentage of total variation that is due to heterogeneity (Higgins et al., 2003), with higher values indicating greater heterogeneity.
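The pooling and heterogeneity statistics described above can be illustrated with a short sketch. Note the substitution: for brevity it uses the closed-form DerSimonian-Laird estimator of τ² rather than the restricted maximum likelihood estimator used in the review, and it computes I² from Q as in Higgins et al. (2003). The function name and the input values are hypothetical, and the results will not necessarily match REML-based output from statistical software.

```python
import math


def random_effects(effects, variances):
    """Random-effects pooling with a DerSimonian-Laird tau^2 (not REML)."""
    k = len(effects)
    w = [1 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                       # between-studies variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_re = [1 / (v + tau2) for v in variances]          # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return {"pooled": pooled, "se": se, "q": q, "df": df, "tau2": tau2, "i2": i2,
            "ci": (pooled - 1.96 * se, pooled + 1.96 * se)}


# Hypothetical effect sizes (Hedges' g) and their sampling variances
result = random_effects([0.2, 0.5, 0.9, 1.4], [0.04, 0.05, 0.06, 0.08])
```

Because τ² is added to every study's sampling variance, the random-effects weights become more similar across studies as heterogeneity grows, which is why small studies carry relatively more weight than under a fixed-effect model.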

Risk of Bias Across Studies
Publication bias compromises the validity of the results of meta-analyses and systematic reviews. The term is typically used to refer to the selective publication of studies with a particular outcome, most often "positive" or statistically significant results (Ferguson and Brannick, 2012; Augusteijn et al., 2019; Vevea et al., 2019). This tendency leads to an over-estimation of the summary effect sizes, in particular when the population of studies being sampled from is characterized by low statistical power (Ioannidis, 2008; Button et al., 2013).
We followed recommendations to assess publication bias using a number of different methods in a sensitivity analysis approach, since no single method alone provides reliable results (Carter et al., 2019; Vevea et al., 2019). Publication bias was assessed visually with a funnel plot (Light and Pillemer, 1984) in which study effect sizes (horizontal axis) were plotted against their inverse standard error (vertical axis). Areas representing three intervals of p-values (contours) were added to facilitate interpretation (Peters et al., 2008). As the standard error is directly related to the number of participants, plot asymmetry may be indicative of small-study effects. Visual inspection was supported by statistically testing for asymmetry using Egger's test, which involves regression analysis of the relationship between effect sizes and their standard error (Egger et al., 1997). If the regression intercept differs from zero, this may indicate publication bias. The trim-and-fill algorithm (Duval and Tweedie, 2000) was used to estimate an effect size adjusted for publication bias. This procedure is conducted in two steps. During the first step, studies that cause funnel plot asymmetry are removed from the mean effect size estimate until symmetry is achieved (iteration step) (Borenstein et al., 2009). An adjusted mean effect size is then estimated. The removed studies are finally re-applied, along with the studies that are assumed to be missing from either side of the funnel plot (pooling step). This final step estimates the variance of the new mean effect size. The trim-and-fill method is widely used, but its performance may vary depending on the presence of substantial heterogeneity or outlying studies, as well as on the combination of models, methods, and estimators used. Researchers are thus encouraged to use various versions of the trim-and-fill method (Shi and Lin, 2019).
In the present study, fixed- and random-effects (restricted maximum likelihood method) models with the linear (L0) and run (R0) estimators were used.
The methods above are all based on assessing funnel plot asymmetry. While publication bias will lead to asymmetry, asymmetry can also occur for a number of other reasons. Furthermore, tests for asymmetry may lead to misleading results under conditions of high between-study heterogeneity. An alternative is to model study selection more directly (Hedges and Vevea, 2005). We adopted the selection model approach proposed by Vevea and Woods (2005), which essentially estimates the robustness of meta-analytic effect size estimates to hypothetical patterns of selection bias. Specifically, we modeled three different selection probabilities (0.2, 0.5, and 0.8) for conventionally non-significant studies, representing severe, moderate and mild publication bias, respectively. Selection models were run using the weightr package for R.
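As an illustration of the asymmetry test mentioned above, the following sketch implements one common formulation of Egger's regression: the standardized effect (g divided by its standard error) is regressed on precision (the inverse standard error) by ordinary least squares, and the intercept is inspected. The text does not specify which variant was run in Stata, so this form, the function name, and the input values are assumptions.

```python
import math


def egger_intercept(effects, std_errors):
    """Return (intercept, SE of intercept, t statistic) for Egger's regression."""
    y = [g / se for g, se in zip(effects, std_errors)]   # standardized effects
    x = [1 / se for se in std_errors]                    # precision
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r**2 for r in residuals) / (n - 2)          # residual variance
    se_intercept = math.sqrt(s2 * (1 / n + mean_x**2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, se_intercept, t_stat


# Hypothetical data in which smaller (less precise) studies report larger effects
b0, se_b0, t = egger_intercept([1.2, 0.9, 0.6, 0.4, 0.3],
                               [0.5, 0.4, 0.3, 0.2, 0.1])
```

The t statistic would be compared against a t distribution with n - 2 degrees of freedom; an intercept far from zero suggests funnel plot asymmetry, although, as noted in the text, asymmetry can arise for reasons other than publication bias.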

Subgroup and Meta-Regression Analyses
Moderator analyses were conducted to explore potential sources of heterogeneity in effect sizes. The differences between subsets of the studies were initially explored with subgroup analyses. Categorical and continuous variables were then used as predictors in a random-effects meta-regression analysis. It is generally recommended that there are approximately ten studies per predictor (Borenstein et al., 2009). As the present study was focused on the differences between VR interventions, these variables were prioritized in the meta-regression analysis rather than the kind of medical procedure.
As previously discussed, the screen refresh rate, resolution and field of view were omitted from quantitative analysis due to insufficient information. After coding the remaining immersion variables, it was discovered that only one study included any non-visual stimuli. This variable was thus also omitted from the composite immersion variable. As information regarding the four remaining immersion variables was lacking for several studies, it was decided to code VR interventions as either highly immersive (included auditive stimuli, head tracking, stereoscopy/three-dimensional images, and full visual occlusion) or less immersive/insufficient information. The VR interventions were also coded as either interactive or passive (i.e., no interactivity beyond head tracking). Medical procedures were categorized as either 'dental', 'needle-related procedures', 'pre-operative', or 'wound care'. The mean study-level age was included as a continuous variable. All potential moderators were pre-specified in the review protocol.
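The dichotomous immersion coding described above amounts to a simple rule, sketched below. The feature names and the function are hypothetical labels introduced for illustration; `None` stands for a feature that was not reported.

```python
FEATURES = ("audio", "head_tracking", "stereoscopy", "full_occlusion")


def code_immersion(specs):
    """Code a VR system from a dict mapping feature name -> True, False, or None."""
    # A system is highly immersive only if all four features are confirmed present.
    if all(specs.get(feature) is True for feature in FEATURES):
        return "highly immersive"
    # A confirmed absence and a missing report both fall into the other category.
    return "less immersive/insufficient information"


# Hypothetical system with unreported stereoscopy
label = code_immersion({"audio": True, "head_tracking": True,
                        "stereoscopy": None, "full_occlusion": True})
```

A consequence of this rule is that a system with one unreported feature is grouped with systems known to lack several features, which limits what the immersion comparison can show.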

Sensitivity Analyses
Sensitivity analyses were conducted to ensure that the summary effect estimates were robust to the removal of the following studies: 1) under-powered studies, 2) non-randomized studies, and 3) studies deemed to have a high risk of bias in two or more domains. Assuming a one-tailed alpha of 0.05 and an 80% power to detect an effect size of 0.50, studies were considered underpowered if they had fewer than 50 participants in each group (Cohen, 1988).
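The cut-off of 50 participants per group can be reproduced with the standard normal-approximation formula for comparing two independent means, n = 2((z_alpha + z_beta) / d)^2. The sketch below is an illustration under that approximation; the function name is hypothetical, and exact t-based calculations can differ slightly.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80, tails=1):
    """Approximate sample size per group for a two-group comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / tails)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)               # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


required = n_per_group(0.50)  # one-tailed alpha = 0.05, power = 0.80
```

Under these assumptions the formula gives 49.5, which rounds up to 50 per group, or 100 participants in total.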

RESULTS
65 primary studies derived from 64 articles published between 2000-2021 were included in qualitative synthesis. 13 studies were not included in the meta-analyses due to missing numerical results (Gershon et al., 2004;Khan et al., 2019), only change from baseline scores being reported (Kipping et al., 2012), or insufficient data to include cross-over trials (Sullivan et al., 2000;Das et al., 2005;Chan et al., 2007;El-Sharkawi et al., 2012;Attar and Baghdadi, 2015;Atzori et al., 2018a;Atzori et al., 2018b;Garrocho-Rangel et al., 2018;Hoffman et al., 2019;Koticha et al., 2019). Two data sets were obtained from contact with authors to calculate the effect size for the first study period only (Schmitt et al., 2011) and summary statistics (Jeffs et al., 2014). Two unpublished studies were acquired by contacting authors identified in the trial registries (Gerceker et al., 2021;Osmanlliu et al., 2021). Another two published manuscripts were received from contacted authors after the final database search was conducted (Buldur and Candan, 2020;Litwin et al., 2020). The process of study selection is illustrated in Figure 1. Qualitative results and study characteristics are presented in Table 2. VR characteristics are listed in Table 3. A narrative synthesis of study and VR characteristics is presented in the following paragraphs.

Study Characteristics
Most of the studies (k = 61) were RCTs, of which 43 employed a parallel-groups design and 18 studies employed a cross-over design. Four non-randomized studies were included.

Participant Characteristics
The total number of participants included in the qualitative review was 4,654, with sample sizes ranging from 5 to 220 and averaging 72 participants. Included participants were between 6 months and 21 years of age, and the mean study-level age was 9.23 years. A total of 4,162 participants were included in the quantitative analysis. Sample sizes ranged from 20 to 220, with an average of 80 participants. The mean study-level age of these children was 9.13 years.

Measures
Self-reported measures of pain were available in all but two studies (Wolitzky et al., 2005;Khadra et al., 2020), whereas observational measures had to be used for 11 of the anxiety studies. The Wong-Baker Faces Scale (Wong-Baker FACES Foundation, 2018) and the (revised) Faces Pain Scale (Hicks et al., 2001) were the most widely used pain measures, followed by visual analogue scales ([VAS], Bailey et al., 2012). VAS scales were also frequently used to measure anxiety. The most used observational measure of anxiety was the modified Yale Preoperative Anxiety Scale (Kain et al., 1997).

Settings and Medical Procedures
Studies were mostly conducted in pediatric hospitals or dental clinics. Most of the procedures were classified as needle-related procedures (k = 25), followed by dental (k = 24), pre-operative (k = 8), and wound care (k = 8).

Intervention Characteristics
Most of the distraction studies (k = 61) used VR as a distraction during the medical procedures. Only Al-Nerabieah et al. (2020) used VR as a distraction before the procedure (i.e., in the waiting room before dental procedures). In one cross-over trial, the effect of receiving VR distraction during the first treatment on preoperative anxiety before the second treatment could be extracted (Fakhruddin et al., 2015).
Four studies (Eijlers et al., 2019a; Ryu et al., 2017; Ryu et al., 2018; Ryu et al., 2019) were categorized as preparation studies. These VR interventions involved virtual tours of the preoperative settings, in which children were exposed to the procedural environment and medical personnel, as well as information about the procedures. Ryu and colleagues incorporated popular cartoon figures that explained and modeled the procedures. Participants in Eijlers et al. (2019b) and Ryu et al. (2018) were also able to interact with virtual medical devices and receive further information about them.

Virtual Reality Characteristics
Head-mounted devices (HMDs) were used in all but three studies (k = 62). In Khadra et al. (2020), patients were placed in front of a wide, curved screen onto which images were projected. This study was included as the screen covered the majority of the patient's field of view and resembled a surrounding, dome-based VR system. Jeffs et al. (2014) and Hoffman et al. (2019) used HMDs that were mounted on either a tripod or a robotic arm to facilitate participation by patients with burn injuries in the head and neck region, or to facilitate use during hydrotherapy. In 28 studies, so-called smartphone-based systems were used in which a smartphone or other device is inserted into the HMD to serve as the screen and tracking device (Fuchs, 2019). The most common combination was the Samsung Gear headset coupled with various Samsung smartphones.
As previously mentioned, information regarding at least some technical specifications was lacking for many studies, particularly in older studies and in studies that used less advanced VR systems. However, it was clear that the quality of the VR equipment varied considerably between studies. 37 of the VR systems offered stereoscopy/three-dimensional graphics, whereas seven did not. Unfortunately, this information was not available for 21 studies. Nearly half of the VR interventions (k = 32) involved head tracking, 17 VR interventions did not, and information was lacking for the remaining 16 studies. Most of the VR devices fully covered the patient's field of view (k = 41), whereas 13 did not. For 11 of the studies, this information was not available. Nearly all of the VR interventions involved auditive stimuli (k = 60), and one study also included tactile feedback from controllers (Gold et al., 2006). Two studies did not include any audio (Aydin and Ozyazicioglu, 2019; Dumoulin et al., 2019), whereas this information could not be confirmed for three studies (Das et al., 2005; Isong et al., 2014; Attar and Baghdadi, 2015). 27 VR systems were classified as interactive, meaning that the system afforded interactivity beyond head tracking. Three studies (Gerceker et al., 2020; Gerceker et al., 2021; Piskorz et al., 2020) included both interactive and non-interactive subgroups. The interactive group of VR interventions was diverse; while some merely involved visual effects as the patient focused his or her gaze on a virtual object (e.g., Aydin and Ozyazicioglu, 2019), others involved more interactivity with virtual objects (e.g., Eijlers et al., 2019a) or more demanding tasks and games (e.g., Piskorz and Czub, 2018).
In most of the studies, patients viewed videos (k = 37), followed by simulations (k = 14), and games (k = 11). Information regarding the VR software was not available for Attar and Baghdadi (2015).

Comparison Groups
Comparison groups were diverse and not always clearly described. They included a range of non-VR distractions (e.g., other electronic devices or conversation) or procedural preparation (e.g., informative videos or verbal briefings), behavior management techniques (e.g., positive reinforcements, tell-show-do technique), or standard of care procedures (SOC). The SOC conditions were also diverse, with some involving no intervention at all and others a combination of several interventions. Three dental studies used sunglasses or protective eyeglasses, either as part of standard care (Hoge et al., 2012), as a behavior management technique (Bagattoni et al., 2018) or as a form of placebo (Buldur and Candan, 2020).

Risk of Bias Within Studies
Risk of bias was assessed per outcome for all included studies. The risk of bias judgements of each domain combined are illustrated in Figure 2 (randomized studies). Contributions from each study toward the combined risk of bias judgements are weighted by standard error of their effect sizes. Individual risk of bias judgements per domain are listed in Figure 3 (pain) and Figure 4 (anxiety) for randomized studies, and Figure 5 for non-randomized studies.
None of the included studies received an overall low risk of bias judgment, and the vast majority were deemed to have an overall high risk of bias. This was partially because it is not possible to blind patients, parents and personnel delivering the VR interventions. Reports of pain and anxiety are highly subjective and may be influenced by beliefs regarding the efficacy of distraction methods. As self-reported measures were prioritized, most of the studies thus received a high risk of bias judgment in domain 4 (bias in measurement of the outcome). Blinding of outcome assessors and personnel conducting the medical procedures was only feasible in studies that applied VR before the medical procedure and only reported observational measures of either pain or anxiety (Al-Nerabieah et al., 2020; Eijlers et al., 2019b; Ryu et al., 2017; Ryu et al., 2018; Ryu et al., 2019). The lack of blinding may also have affected the behavior of patients, parents, carers, and others. Most studies therefore received at least an intermediate risk of bias judgment in domain 2 (bias due to deviations from the intended interventions), and high if data was not analyzed in accordance with intention-to-treat principles.
In addition to issues related to blinding, prospective trial registrations and/or pre-specified data analysis plans were identified for only a few studies. Many studies were thus deemed to have at least an intermediate risk of bias due to selective reporting. Potential issues related to the randomization process were also observed in roughly half of the included studies. Frequently, the methods of randomization and concealment of allocation sequence were not described in sufficient detail or at all. Some studies also performed block randomizations with small, evenly sized blocks or used other methods that might enable prediction of the forthcoming allocation for at least some participants.
All the non-randomized trials (del Castillo et al., 2019;Piskorz and Czub, 2018;Piskorz et al., 2020;Sullivan et al., 2000) were deemed to have a serious risk of bias. Some of the issues observed in randomized trials were also seen in non-randomized trials, such as lacking pre-specified analysis intentions. Perhaps more importantly, the studies were considered to have a serious risk of bias due to confounding. For example, in Sullivan et al. (2000), children that were too anxious to receive VR on the first study day received VR on the second study day instead. In the remaining three studies, allocation was determined by either the timing of admission to the hospital in children that were regularly hospitalized for chronic disease (Piskorz and Czub, 2018;Piskorz et al., 2020), or whether the medical procedure was performed during the day or evening/night shifts (del Castillo et al., 2019). Although it is difficult to ascertain exactly how the timing of hospitalization or the medical procedure may have influenced study results, participants in the VR and non-VR groups may differ systematically in clinically relevant ways.

Results of Individual Studies and Syntheses of Results
Numerical results of each study and results of the meta-analyses are illustrated in forest plots for pain (Figure 6) and anxiety (Figure 7). Positive values (toward the right) indicate that results are in favor of VR. Qualitative results are presented in the study characteristics and results table (Table 2). The results from studies that were not included in the meta-analyses were mixed; six studies reported results in favor of VR, two reported no difference between the groups, two studies did not find any difference in child and parent reported outcomes, and one study found that pain levels were higher in the VR group. The two anxiety studies both reported no difference between the VR and comparison groups.

Table notes: Field of view may vary slightly depending on the size of the smartphone screen. Specifications for two studies using the same VR equipment may differ due to study authors' configurations (e.g., disabling head tracking function or displaying two-dimensional graphics on a headset that is capable of stereoscopy).
Frontiers in Virtual Reality | www.frontiersin.org July 2021 | Volume 2 | Article 699383

Pain
42 studies reporting pain outcomes were synthesized. The overall mean effect (Hedges' g) for pain was estimated at 0.72 (95% CI [0.45, 0.98], z = 5.31, p < 0.001). This effect size may be considered large, compared to effect sizes that have previously been obtained for educational or counseling interventions for medical patients (Lipsey and Wilson, 1993). Expressed in units of the 6-point Wong-Baker Faces scale, this would correspond to a mean difference of 1.76 points. As will be discussed in sub-section 'Risk of bias across studies', the true effect is likely considerably lower than the estimate that was obtained here.
The Q-statistic indicated statistically significant heterogeneity in effect sizes (Q(41) = 400.72, p < 0.001). A large proportion of the observed variation (I² = 92.72%) was found to reflect differences in true effect sizes. Six studies reported results in favor of the control/non-VR group (Hoge et al., 2012; Jeffs et al., 2014; Mitrakul et al., 2015; Bagattoni et al., 2018; Eijlers et al., 2019a; Walther-Larsen et al., 2019). Potential sources of heterogeneity are assessed in the 'Additional Analyses' section.

Anxiety
35 studies reporting anxiety outcomes were synthesized. The mean effect size (Hedges' g) for anxiety was estimated at 0.90 (95% CI [0.55, 1.26], z = 4.98, p < 0.001), which too may be considered a large effect size compared to the effect sizes compiled in Lipsey and Wilson (1993). On the five-point Child Fear Scale (CFS), this would amount to a mean difference of 1.22 points. However, the true effect is likely to be smaller than this estimate (see 'Risk of Bias Across Studies'). As for pain, the Q-statistic indicated statistically significant heterogeneity in effect sizes (Q(34) = 437.69, p < 0.001), with a similarly large proportion (I² = 95.43%) of variation attributable to differences in true effect sizes. Four studies reported results in favor of the control/non-VR treatment (Shah and Bhatia, 2018; Eijlers et al., 2019b; Ryu et al., 2019; Litwin et al., 2020). Potential sources of heterogeneity are further explored in the 'Additional Analyses' sub-section.

Risk of Bias Across Studies
The funnel plots (Figure 8) seem to show a lack of smaller studies reporting statistically non-significant results (i.e., toward the lower left part of the plot) in both the pain and anxiety data sets. The plot asymmetries are confirmed by significant Egger's regression tests (p ≤ 0.01).
The trim-and-fill procedure was conducted with various settings as previously described. For the pain studies, three to six studies were imputed, with adjusted mean effect sizes ranging from 0.41 (95% CI [0.35, 0.48]) (fixed-effects with the L0 estimator) to 0.57 (95% CI [0.25, 0.88]) (random-effects with the R0 estimator). Based on these adjusted estimates, the true mean difference would be closer to 1.00-1.40 points on the Wong-Baker Faces scale. For anxiety, three to eight studies were imputed, which yielded adjusted estimates ranging from 0.40 (95% CI [0.33, 0.47]) (fixed-effects with the L0 estimator) to 0.66 (95% CI [0.21, 1.11]) (random-effects with the R0 estimator). This suggests that the true mean difference is closer to 0.54-0.90 points on the Child Fear Scale. These estimates are thus considerable moderations of the original effect size. For the pain data set, selection models yielded adjusted summary effect sizes of 0.16, 0.48, and 0.64, corresponding to severe, moderate and mild publication bias, respectively.
Adjusted estimates for the anxiety data set are at 0.21, 0.62, and 0.81. Assuming some publication bias is present, summary estimates are probably somewhat lower than those of the naïve random effects models. Taken together, the results of trim and fill and selection modeling suggest somewhere around 0.50 (1.23

Subgroup and Meta-Regression Analysis of the Effects of Virtual Reality on Pain
The subgroup analyses (Table 4) revealed statistically significant differences in mean effects across the groups of medical procedures, most notably between the dental subgroup (Hedges' g = 0.99, 95% CI [0.28, 1.70]) and the pre-operative subgroup (Hedges' g = −0.13, 95% CI [−0.37, 0.12]). In the preoperative and wound care subgroups, the confidence intervals included zero, indicating the possibility of no or minimal differences between the VR and non-VR conditions. These subgroups were also quite small. The mean effects were similar between the immersion subgroups. However, studies using less interactive VR systems reported significantly lower pain levels (Hedges' g = 0.99, 95% CI [0.51, 1.47]) than those using interactive VR systems (Hedges' g = 0.28, 95% CI [0.10, 0.45]). Four studies were not included in the subgroup analysis of interactivity, as they contained both interactive and non-interactive VR interventions (Chaudhary et al., 2020; Gerceker et al., 2020; Gerceker et al., 2021; Piskorz et al., 2020). Participants' age and the level of immersion and interactivity were applied as predictors in a meta-regression analysis (Table 5). Again, the four studies with both interactive and non-interactive interventions were not included. After controlling for the level of immersion and mean age of participants, the difference between interactive and non-interactive VR did not reach statistical significance. No statistically significant relationship was found between the participants' age or level of immersion and mean pain scores.

Subgroup and Meta-Regression Analysis of the Effects of Virtual Reality on Anxiety
Subgroup analyses of studies reporting anxiety outcomes (Table 6) indicate similar patterns as those observed for pain outcomes, with the largest effect sizes reported in the dental subgroup (Hedges' g = 1.41, 95% CI [0.44, 2.37]). However, the difference between the groups of medical procedures was not statistically significant. The difference between the interactivity subgroups was statistically significant, with lower anxiety scores reported in the non-interactive group (Hedges' g = 1.15, 95% CI [0.57, 1.73]). The non-immersive group (Hedges' g = 1.16, 95% CI [0.46, 1.87]) reported markedly lower anxiety levels, but this difference was not statistically significant.
A meta-regression analysis with participants' age, the level of immersion and interactivity as predictors revealed no statistically significant relationships with anxiety scores (Table 7).

Sensitivity Analyses
Sensitivity analyses were conducted to evaluate the robustness in results when removing studies that were not adequately powered (<100 participants), non-randomized studies, and studies with two or more individual domains considered at a high risk of bias. As previously discussed, most studies received an overall high risk of bias judgment due to the prioritization of self-reported measures. Rather than excluding studies based on their overall risk of bias, the sensitivity analysis involved removing studies that received a high risk of bias judgment in more than one domain.

Summary of Evidence
The aim of this systematic review was to evaluate the evidence regarding the effectiveness of VR on procedural pain and anxiety in children. An overview of the characteristics of VR interventions was provided, as well as the settings and ways in which they were used. Meta-analyses of pain and anxiety outcomes were performed, and the kind of medical procedure, mean patient age, interactivity, and immersion were explored as potential moderators. The strength of evidence was assessed through risk of bias judgements, tests for publication bias, and sensitivity analyses.
Although information about the VR interventions was often lacking, it was clear that they were diverse in terms of technical specifications, level of interactivity, and the media that was displayed. While most VR headsets were fully occlusive and offered auditive stimulation, stereoscopic graphics and head tracking were used in only about half of the studies. The screen resolution and field of view also varied greatly. Information regarding the screen refresh rate was often unavailable. Nearly half of the studies used non-interactive simulations or movies, whereas the interactive group consisted of both minimally interactive simulations (e.g., Aydin and Ozyazicioglu, 2019) and more cognitively taxing games (e.g., Piskorz and Czub, 2018). Overall, the evidence was deemed at a high risk of bias using the ROB 2.0 tool. This is not surprising, as blinding patients to their allocation to experimental groups was not possible, and self-reported measures were preferred for inclusion in the meta-analysis. The fact that most studies received a high risk of bias judgment does not in itself suggest low methodological quality of studies. However, most studies were deemed to have at least an intermediate risk of bias in several other domains. This raises serious concerns about the validity of study results and their syntheses. For example, studies conducted with lower methodological quality may overestimate treatment effects (Moher et al., 1998; Hempel et al., 2011).
Other reasons to suspect spuriously large treatment effects are the indications of publication bias. Several studies reporting nonsignificant results are likely lacking from the literature, and there is reason to believe that the true effects are considerably smaller than those observed in the retrieved studies. In conclusion, the meta-analytical findings should thus be interpreted with great caution, and attention should be directed toward the more modest range of estimates suggested by the trim-and-fill and selection model analyses.

Effects of Virtual Reality on Pain and Anxiety
High levels of heterogeneity were observed in both the pain and anxiety studies, but most studies reported results in favor of VR. Large effects were found for both pain (1.76 points on the Wong-Baker Faces Scale [W-BFS]) and anxiety (1.22 points on the Child Fear Scale [CFS]). Based on estimates adjusted for publication bias, there is however strong reason to believe that the true effects of VR on pain and anxiety are considerably lower (approximately 1.23 points on the W-BFS and 0.82 points on the CFS).
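The scale conversions reported in this review follow from multiplying a standardized mean difference by a standard deviation. The combined standard deviations are not stated in this section, so the sketch below back-calculates them from the reported pairs (0.72 and 1.76 points for pain; 0.90 and 1.22 points for anxiety) and applies them to the trim-and-fill ranges. The implied values are therefore derived assumptions rather than reported figures, and the variable names are hypothetical.

```python
def implied_sd(points, g):
    """Standard deviation implied by a reported (scale points, Hedges' g) pair."""
    return points / g


def to_points(g, sd):
    """Convert a standardized mean difference into scale points."""
    return g * sd


sd_wbfs = implied_sd(1.76, 0.72)  # Wong-Baker Faces scale
sd_cfs = implied_sd(1.22, 0.90)   # Child Fear Scale

# Trim-and-fill adjusted ranges of g reported in the Results
pain_points = (to_points(0.41, sd_wbfs), to_points(0.57, sd_wbfs))
anxiety_points = (to_points(0.40, sd_cfs), to_points(0.66, sd_cfs))

# Standardized effects implied by the adjusted point estimates given above
g_pain_adjusted = 1.23 / sd_wbfs
g_anxiety_adjusted = 0.82 / sd_cfs
```

Under these back-calculated standard deviations (about 2.44 and 1.36), the adjusted ranges come out at roughly 1.00 to 1.39 points for pain and 0.54 to 0.89 points for anxiety, in line with the rounded ranges reported in the Results, and the adjusted point estimates correspond to standardized effects of roughly 0.50 and 0.60.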

Moderator Analyses
Studies in which VR was used during dental or needle-related procedures reported larger effects on average. The pain and anxiety scores were also lower in the non-interactive VR subgroup. There was a high degree of overlap between these three groups; all the 24 non-interactive VR interventions were used during dental or needle-related procedures among the pain studies, and 20 out of the 24 among the anxiety studies. It is therefore difficult to establish whether it is the medical procedure or the level of interactivity (or neither) that best explains the differences that were observed.
No statistically significant differences in VR effectiveness were found between systems that were highly immersive (i.e., had head tracking, full visual occlusion of the patient's physical surroundings, stereoscopy, and auditive stimuli) and those that lacked at least some of these features (or in which immersion variables could not be confirmed). It should not be concluded based on these results that there is no effect of immersion on VR effectiveness. The immersion variable used in the analysis was based on only four of the many features known to influence presence. They were selected as information regarding other VR features was lacking for several studies. To maintain an acceptable predictor-study ratio, they were used to create a dichotomous variable that only described whether a VR system possessed all the four features. Consequently, any potential differences between VR systems with none, some, and all the features were ignored. A more sophisticated approach would involve an assessment of the relative influence of several individual immersion variables. The results of the present analysis should thus only be interpreted as an observed mean difference between studies that did and did not have four arbitrary VR features, and that were also heterogenous in many aspects, such as patient characteristics and medical procedures. The same considerations apply to the statistically significant difference that was observed between interactive and less interactive VR systems. For example, the varying degrees and forms of interactivity were not considered. Nevertheless, our findings may contradict the common assumption that highly immersive and interactive interventions are superior, which has been reported in studies on experimentally induced pain in mostly adult volunteers (e.g., Hoffman et al., 2004;Hoffman et al., 2006;Dahlquist et al., 2007;Wender et al., 2009).
No relationship was found between the study-level mean age of participants and the effectiveness of VR. When using aggregate data, rather than individual-level data, only the between-studies variation is analyzed. In this case, it might have concealed any true relationship between a participant's individual age and the effectiveness of VR. This is an example of what is referred to as the ecological fallacy (Thompson and Higgins, 2002). It should therefore not be concluded that the age of participants is not related to the effectiveness of VR on pain.
Subgroup and meta-regression analyses are observational in nature and cannot be used to establish causality (Borenstein et al., 2009;Deeks et al., 2020). They are also based on a limited number of studies and are probably not representative of all medical procedures, VR interventions, and patients in hypothetical studies or a clinical setting. Positive results from subgroup and meta-regression analyses should therefore not be interpreted as conclusive evidence that certain VR systems perform better than others, or that VR is more effective in certain settings and patients. Neither should the opposite be inferred from the failure to identify any such differences. In conclusion, the results of the moderator analyses should not be used to draw any definitive conclusions but may inspire new hypotheses and further research on the importance of interactivity and immersion, as well as variables that were not assessed in this study (e.g., the health status and gender of participants).
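The ecological fallacy described for the age analysis can be illustrated with a small constructed example. All numbers below are invented for illustration and do not come from the included studies. Within each hypothetical study, the benefit of VR declines with age, yet the study-level means show no relationship at all.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Invented individual-level data: (age in years, benefit of VR) per child.
studies = {
    "A": [(4, 1.2), (6, 1.0), (8, 0.8)],
    "B": [(7, 1.2), (9, 1.0), (11, 0.8)],
    "C": [(10, 1.2), (12, 1.0), (14, 0.8)],
}

# Individual-level relationship within each study.
within_slopes = {
    name: ols_slope([age for age, _ in rows], [eff for _, eff in rows])
    for name, rows in studies.items()
}

# Aggregate relationship, which is all a study-level meta-regression can see.
mean_ages = [sum(age for age, _ in rows) / len(rows) for rows in studies.values()]
mean_effects = [sum(eff for _, eff in rows) / len(rows) for rows in studies.values()]
between_slope = ols_slope(mean_ages, mean_effects)

print("within-study slopes:", within_slopes)
print("between-studies slope:", between_slope)
```

In this constructed case, a regression on study means would suggest that age is unrelated to the effect, even though every study shows the same negative individual-level trend.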

Sensitivity Analyses
The overall estimate of the effects on pain was somewhat reduced when inadequately powered studies were removed. Unexpectedly, a slight increase in the effect size estimate for anxiety was observed when inadequately powered studies were removed. This increase was seemingly caused by a group of studies with narrow confidence intervals and between 50 and 96 participants that reported effect sizes slightly smaller than the average of studies that were considered adequately powered. As studies in a random-effects model are assigned weights inversely proportional to their variance (the within-study sampling variance plus the between-studies variance), removing these relatively precise studies likely caused the unexpected increase in the mean effect size estimate. It should also be noted that the power cut-off was based on an arbitrary assumption of a 0.50 effect size.
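The mechanism behind such a shift can be sketched with a toy random-effects model. The example assumes the DerSimonian-Laird estimator of the between-studies variance, which may differ from the estimator used in this review, and all effect sizes and variances are invented. Each study is weighted by the inverse of its total variance, so fairly precise studies with below-average effects pull the pooled estimate down, and removing them raises it.

```python
def random_effects_pool(effects, variances):
    """Pooled estimate under a DerSimonian-Laird random-effects model.

    Returns (pooled effect, tau^2, list of study weights).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q and the method-of-moments estimate of tau^2.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights: inverse of within-study variance plus tau^2.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, tau2, w_star


# Invented data: three studies treated as adequately powered, two mid-sized
# studies with smaller effects, and one small study with a large effect.
effects = [1.2, 1.0, 1.1, 0.6, 0.7, 1.5]
variances = [0.03, 0.03, 0.03, 0.05, 0.05, 0.30]
adequately_powered = [True, True, True, False, False, False]

pooled_all, tau2_all, weights_all = random_effects_pool(effects, variances)

kept_effects = [y for y, keep in zip(effects, adequately_powered) if keep]
kept_variances = [v for v, keep in zip(variances, adequately_powered) if keep]
pooled_kept, tau2_kept, _ = random_effects_pool(kept_effects, kept_variances)

print(f"all studies: {pooled_all:.3f}")
print(f"adequately powered only: {pooled_kept:.3f}")
```

In this toy data set, the two mid-sized studies carry far more weight than the small study, so dropping all three "inadequately powered" studies increases the pooled estimate.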
The summary effect size for both pain and anxiety remained relatively constant after removing the non-randomized studies. This is likely because only a few non-randomized studies were included in each meta-analysis, of which several had wide confidence intervals and thus contributed less to the original summary effect. It should therefore not be concluded that there is no association between the study design and effect sizes.
Only modest changes in the mean effects for pain and anxiety were observed when studies with a high risk of bias judgment in more than one domain were removed. However, the retained studies all had at least an intermediate overall risk of bias. This sensitivity analysis should therefore not be interpreted as evidence that bias did not influence the results.

Limitations
The measures obtained for the quantitative synthesis were subjective and thus carry inherent limitations. As pain and anxiety are private, subjective experiences, self-reports were prioritized over observational measures. However, as pointed out by von Baeyer (2009), they should be interpreted with regard to developmental and social factors. Consciously or not, children may underreport or overreport their pain for reasons such as difficulties with understanding the scales or fear of the consequences of reporting certain scores (e.g., underreporting pain due to a fear of being subjected to more medical procedures) (O'Brien and Root, 2019;von Baeyer, 2009). Furthermore, scales like the Wong-Baker Faces Scale have been criticized for using response options represented by faces that cry, smile, or look angry; if the children themselves do not experience the corresponding emotions, they may avoid selecting these responses even though they most accurately reflect the level of their distress (von Baeyer, 2009). The lack of blinding to the experimental condition may also have introduced bias to the measurement of pain and anxiety. Several other issues related to the measurement of pain and anxiety also apply (see von Baeyer, 2009). An important limitation of the present study is therefore that multiple analyses with reports from several informants or physiological data (e.g., pulse rate) were not conducted.
The validity of results from systematic reviews and meta-analyses is a product of the quality of primary studies (Borenstein et al., 2009). For example, methodological issues of primary studies, like flaws in the randomization process and retrospectively registered trials, are also transferred to any syntheses of study results. Updated reviews should therefore be conducted as more trials with larger sample sizes and greater methodological rigor are being published.
Although efforts were made to locate unpublished studies, no studies were identified that did not get published or were in press before the completion of this review. The failure to include any unpublished studies is a significant limitation of the present study, considering the indications of publication bias. Eligible studies may also have been excluded because of language restrictions.
The risk of bias judgements were conducted by only one person in the present study. Although the ROB 2.0 and ROBINS-I tools contain decision algorithms that guide the overall judgements per domain, scoring individual items nevertheless requires at least some subjective judgements (Higgins et al., 2003). This also applies to the process of study selection, in which only the first author conducted the initial screening of potentially eligible studies.
Another issue to consider at the review level is the categorization of medical procedures. The categories were created with the intention of describing each included study as accurately as possible while also keeping the number of subgroups low to ensure that they were adequately sized for subgroup analyses. However, the medical procedures within each subgroup were certainly not homogeneous. For example, while Eijlers et al. (2019b) measured the effect of procedural preparation on post-operative pain, Walther-Larsen et al. (2019) measured the effect of VR distraction on acute pain from intravenous cannulation before surgery. Another example is the needle-related group, which included both lumbar punctures as part of cancer treatment and routine venipuncture in healthy children. It is possible that a different set of categories would have yielded different results and useful insight.
The present review aimed to assess whether immersion and interactivity could explain some of the heterogeneity that had previously been reported. Although subgroup analyses revealed some statistically significant differences, heterogeneity remained high. Other potential moderators that were not analyzed in the present review should be explored in the future, such as the children's health status and gender, concurrent use of pharmacological interventions, procedure and VR duration, and the timing of the VR procedure and data collection. Furthermore, the comparison groups were diverse and included both no-intervention conditions and various active non-VR interventions, which may also contribute to the observed level of heterogeneity. Other relevant issues that were not addressed in the present review include safety issues and adverse outcomes. Although common symptoms like nausea and vertigo tend to decline quickly after removing the VR headset, more serious concerns have also been expressed (see Nichols and Patel, 2002).

CONCLUSION
The results of the present review indicate that VR has beneficial effects on procedural pain in children, compared to other non-VR interventions or no intervention. The direction of the effects is in accordance with previous meta-analyses, but their magnitudes were lower than those reported in Eijlers, Utens et al. (2019) and Georgescu et al. (2020). The differences likely reflect the various definitions of VR and immersion and the rapidly developing literature, as well as the inclusion of adult samples in some reviews. However, the strength of evidence is considered weak due to a high risk of bias within and across studies, and it is not possible to draw any definitive conclusions.
The results indicated that non-interactive VR interventions were superior, which contradicts the results of some previously cited studies (e.g., Hoffman et al., 2004;Hoffman et al., 2006;Dahlquist et al., 2007;Wender et al., 2009). Although these results should be interpreted with caution, it is possible that children benefit more from less demanding tasks. This would have important implications for VR developers, clinicians, and decision makers. Further research is needed to establish if interactivity could be beneficial, and if so, the optimal level and mode of interactivity for different age groups.
The review has demonstrated the diversity of VR systems in terms of hardware and software. No relationship was found between immersion and treatment effects. However, immersion features were not assessed individually, and their potential role should therefore not be dismissed. VR interventions vary in terms of the content that is displayed. Interestingly, some interventions feature content that is likely to increase arousal (e.g., rollercoaster simulations), whereas some included more relaxing content (e.g., underwater simulations). The effects of these and other software design decisions would be interesting to address in future studies.
Decision makers should be aware of the differences between VR interventions when considering the implementation of VR in clinical settings. Less immersive and non-interactive technologies may also have additional benefits that were not discussed in the present review. For example, larger screens may be impractical during some procedures (e.g., dental procedures), auditive stimuli may disturb communication with medical personnel, and head tracking may encourage movements of the head and body that could be disruptive to the medical procedure.
In conclusion, the review suggests that VR could be beneficial in pediatrics. However, the results must be seen in the context of the limitations of primary studies and the present review. More studies with larger sample sizes and greater methodological rigor are needed, especially on the effects of using VR for procedural preparation. Researchers should explicitly state their definitions of VR and immersion to avoid confusion. It remains unclear whether VR is more effective than all other interventions, such as non-VR, screen-based interventions. Less interactive VR may be preferable in pediatrics, but more research is needed on the potential differences between various forms and degrees of interactivity. Future studies should also focus on individual immersion variables and the content that is displayed on the VR headsets.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://doi.org/10.18710/QA8WBZ, DataverseNO.

AUTHOR CONTRIBUTIONS
RN contributed to the conception and design of the study, search strategy, data coding and statistical analysis (major contributions). TL contributed to design of the study, search strategy, and statistical analysis (supporting contributions). RN wrote the first draft of the manuscript. TL wrote sections of the manuscript. Both authors contributed to manuscript revision, read, and approved the submitted version.

FUNDING
No specific grant apart from institutional support from UiT The Arctic University of Norway.