Detecting Fear of Heights Response to a Virtual Reality Environment Using Functional Near-Infrared Spectroscopy

To enable virtual reality exposure therapy (VRET) that treats anxiety disorders by gradually exposing the patient to fear using virtual reality (VR), it is important to monitor the patient's fear levels during the exposure. Despite the evidence of a fear circuit in the brain as reflected by functional near-infrared spectroscopy (fNIRS), the measurement of fear response in highly immersive VR using fNIRS is limited, especially in combination with a head-mounted display (HMD). In particular, it is unclear to what extent fNIRS can differentiate users with and without anxiety disorders and detect fear response in a highly ecological setting using an HMD. In this study, we investigated fNIRS signals captured from participants with and without a fear of height response. To examine the extent to which fNIRS signals of both groups differ, we conducted an experiment during which participants with moderate fear of heights and participants without it were exposed to VR scenarios involving heights and no heights. The between-group statistical analysis shows that the fNIRS data of the control group and the experimental group are significantly different only in the channel located close to right frontotemporal lobe, where the grand average oxygenated hemoglobin Δ[HbO] contrast signal of the experimental group exceeds that of the control group. The within-group statistical analysis shows significant differences between the grand average Δ[HbO] contrast values during fear responses and those during no-fear responses, where the Δ[HbO] contrast values of the fear responses were significantly higher than those of the no-fear responses in the channels located towards the frontal part of the prefrontal cortex. Also, the channel located close to frontocentral lobe was found to show significant difference for the grand average deoxygenated hemoglobin contrast signals. Support vector machine-based classifier could detect fear responses at an accuracy up to 70% and 74% in subject-dependent and subject-independent classifications, respectively. The results demonstrate that cortical hemodynamic responses of a control group and an experimental group are different to a considerable extent, exhibiting the feasibility and ecological validity of the combination of VR-HMD and fNIRS to elicit and detect fear responses. This research thus paves a way toward the a brain-computer interface to effectively manipulate and control VRET.


INTRODUCTION
Exposure therapy is a form of therapy that treats anxiety disorders by gradually and repeatedly exposing the client to his/her fear (Brinkman et al., 2009) in the absence of harm. This can activate the fear extinction process and was proven to be an effective intervention (Hofmann, 2008). Recently, virtual reality (VR) has been introduced to exposure therapy by the evidence that realistic virtual circumstances can have a significant influence on a person's mental state (Riva et al., 2007;Martens et al., 2019), which can pave a way to a successful exposure therapy. Among a vast variety of VR hardware, head-mounted display (HMD) has been shown to be effective in improving the sense of presence in a virtual environment (VE), which is the key element of effective application of VR in the mental health domain (Jerdan et al., 2018). Realistic immersive VEs enable researchers to ecologically perform experiments and invent therapy methods, leading to effective and highly ecologically valid virtual reality exposure therapy (VRET) systems (Martens et al., 2019). HMD-based VR enables an immersive VRET that makes the exposure therapy more controlled, safer, and in some cases also less expensive than traditional exposure therapy (Teo et al., 2016;Boeldt et al., 2019;Bălan et al., 2020). Furthermore, the exposure protocol can be completely standardized when using VRET, which increases the therapist's control over the stimuli and the duration of the exposure, as opposed to traditional in vivo exposure (Rizzo et al., 2013). Despite the higher level of control that VRET offers to the therapist, it is still a common practice that the therapist monitors the fear responses of the client (Brinkman et al., 2009). One important reason to do this is to ensure that the gradual exposure to the fear-eliciting stimuli do not overwhelm the client. Excessive exposure to situations that induce fear can, for example, cause panic attacks for the client and might therefore worsen the anxiety instead of treating it (Boeldt et al., 2019).
However, monitoring a person's fear responses while using VR has been a big challenge. The traditional option of tracking facial expressions becomes difficult when the user is wearing a VR-HMD. Subjective ratings suffer from the difficulty in verbalizing current mental state indication (Hill and Bohil, 2016) and memory bias (Rodríguez et al., 2015). Neuroimaging techniques have been recently proposed to objectively and unobtrusively measure fear response during virtual fear exposure but are limited to the use of electroencephalogram (EEG) (Hu et al., 2018;Peterson et al., 2018;Bălan et al., 2020). However, the disadvantages of EEG include susceptibility to motion artifacts and electrical signal interference which can be anticipated when a user interacts with VR technology. On the other hand, functional nearinfrared spectroscopy (fNIRS) offers a recording of cortical activity in a natural mobility setting with higher spatial resolution than EEG, less susceptibility to motion artifacts and electrical noises, portability, and lightweight characteristic. These advantages substantiate the great potential for the combination of VR-HMD and fNIRS, which has been recently demonstrated in a bisection task (Seraglia et al., 2011), the assessment of prospective memory (Dong et al., 2017;Dong et al., 2018), the processing of racial stereotypes (Kim et al., 2019), performance monitoring during training (Hudak et al., 2017), and a neurofeedback system to support attention (Aksoy et al., 2019). However, the feasibility and ecological validity of using fNIRS to measure fear response during virtual fear exposure is still unexplored.
The neural mechanisms underpinning the fear circuit have been widely researched. The majority of fNIRS studies on cortical responses to fear-invoking stimuli report an increase in cortical activations in the parietal cortex (Köchel et al., 2013;Zhang et al., 2017) or the prefrontal cortex (PFC) (Glotzbach et al., 2011;Roos et al., 2011;Ma et al., 2013;Landowska, 2018;Rosenbaum et al., 2020) during fearful stimulation. PFC areas in which significant activations were found include the left PFC (Ma et al., 2013), dorsolateral PFC (dlPFC), anterior PFC (Landowska, 2018), left dlPFC, and left ventrolateral PFC (vlPFC) (Rosenbaum et al., 2020). The studies that found activations in the parietal cortex presented subjects to fearful and neutral sounds. Decreased chromophores deoxygenated hemoglobin (HbR) concentration changes (Köchel et al., 2013) and higher oxygenated hemoglobin (HbO) concentration changes (Zhang et al., 2017) were found when subjects were listening to fearful sounds as compared to neutral sounds. The areas with significant activations include the (right) supramarginal gyrus and the right superior temporal gyrus. The studies that found an increased cortical activation in the PFC exposed their subjects to spiders (Rosenbaum et al., 2020), fearful faces (Glotzbach et al., 2011;Roos et al., 2011), or a fear-learning experiment based on shocks (Ma et al., 2013). A recent fNIRS study observed decreased HbO concentration changes in the dlPFC and anterior PFC when participants with moderate acrophobia were exposed to a cave VE that displayed artificial heights . The effect was intense during the first exposure session, but the learning process on coping with fear responses affected the following sessions. In general, the majority of fNIRS studies reported increased HbO concentration changes in the PFC when subjects were exposed to the fearful stimuli as compared to the control situations and occasionally reported the complementary decrease in HbR concentration changes (Glotzbach et al., 2011). The increment of HbO concentration changes is also in line with other neuroimaging studies beyond fNIRS that found increased cortical activity in the PFC as fearful responses (Lange et al., 2003;Nomura et al., 2004) of healthy subjects, while the activity in the amygdala is inversely related (Nomura et al., 2004). In contrast, patients with anxiety disorder show decreased activity in the PFC in response to fearful stimuli and increased activity in the amygdala (Etkin and Wager, 2007;Shin and Liberzon, 2010;Price et al., 2011). It is thus evident that the PFC plays an important role in mediating fear responses  and is known as a major component of the cognitive control network (Rosenbaum et al., 2020).
Despite evidence of PFC activity due to the fear circuit as reflected by fNIRS signals, little is known about fear responses in highly immersive VR, especially when using HMD. The current study investigates the possibility of inducing and detecting fear responses in VR-HMD using fNIRS. Specifically, we are interested in inducing and detecting a fear of heights response, which is one of the most prevailing types of human fear which can be reproduced in VR, alongside (Garcia-Palacios et al., 2002;Miloff et al., 2019;Lindner et al., 2020), fear of flying (Rothbaum et al., 2000;Maltby et al., 2002;Rothbaum et al., 2006), fear of driving (Wald and Taylor, 2000), and even post-traumatic stress disorders (Rothbaum et al., 2001;Difede and Hoffman, 2002;Gerardi et al., 2008;Rothbaum et al., 2014). Brain studies on fear of heights using fNIRS have been done in VE (Emmelkamp et al., 2001;Donker et al., 2018;Freeman et al., 2018;Gromer et al., 2018), but the previous works recruited participants either with or without fear of heights. The study in VR-HMD remains unexplored and is the main objective in this study, where we aimed to recruit participants both with and without fear of heights to allow a comparison between groups. Our first research question is as follows: 1) To what extent do the fNIRS signals captured from participants with a fear of heights response and participants without it differ?
To answer this question, we invited both participants with fear of heights (experimental group) and participants without fear of heights (control group) to participate in our experiment, during which they were exposed to virtual height and virtual ground conditions. It was hypothesized that the virtual heights will cause a fear response for the experimental group but does not cause a fear response for the control group. Furthermore, it was hypothesized that the ground condition does not cause a fear response for any of the groups.
In addition, we aimed to train simple machine learning classifiers to automatically detect fear responses of the experimental group from fNIRS signals, which has not been done in previous works. Our second research question is as follows: 2) To what extent can a person's fear of heights response to a virtual reality environment be detected by a simple machine learning model using fNIRS data?
To answer this question, we trained and tested linear classifiers in subject-dependent and subject-independent ways on the data of the experimental group and evaluated the performance in distinguishing ground-condition and height-condition data. Our first attempt to achieve a successful classification of different fNIRS responses to fear of heights elicited in VR-HMD would exhibit ecological validity of combining both components, serving as a baseline toward a practical and effective VRET in the future improvement.

Participants
Two different groups of participants were recruited and prescreened by the Acrophobia Questionnaire (AQ), consisting of 20 items that are rated on a seven-point Likert scale, ranging from not anxious at all to extremely anxious (Cohen, 1977;Antony, 2001). Only participants with fear of heights who scored higher than 35 were invited to participate as the experimental group. On the other hand, only participants without fear of heights who scored lower than 20 were invited to participate as the control group (Gromer et al., 2018). Accordingly, 20 participants (nine females, age 26.10 ± 10.47 years) in the experimental group reported a high AQ score (52.40 ± 11.47), and 21 other participants (nine females, age 22.95 ± 2.11 years) in the control group reported a low AQ score (9.71 ± 5.89). None of the participants suffered from anxiety disorders.

Tasks and Procedure
The study was approved by the institutional ethics committee of University of Twente (reference number: RP 2020-76). All procedures were in accordance with the Helsinki Declaration. After confirming the eligibility of the participants and obtaining written informed consent, the participants were introduced to the Oculus Rift S 1 , which is a VR-HMD with six degrees of freedom enabling tracking of head rotations and translations (forward/ backward, left/right, up/down). Therefore, the participants were able to look around in the VEs by simply rotating their head and to walk around by moving their body in the physical world. The researcher demonstrated how the VR-HMD should be adjusted to fit the head. Hand-held controllers were given to be held during the experiment to make the tracking of the VR-HMD more reliable, but the participants were not allowed to use the controllers. To familiarize the participants with the VE, the HMD, and holding the controller, a practice round with an example VE similar to the ground condition was included and continued until the participant indicated satisfactory familiarity. Then, the VR-HMD was removed, and the participants were fitted with the fNIRS headset. The fNIRS system was calibrated, and the signal quality was assessed visually by the researcher. Afterwards, the participants were asked to stand in a designated place and to fit the VR-HMD themselves. The straps of the VR-HMD were loosened as much as possible to reduce the risk of optode displacement. Figure 2 shows a participant wearing both the fNIRS headset and the VR-HMD. After preparation, the participants were asked to perform the task. The researcher instructed the participants about the maximum level of movement which they were allowed to perform in order to minimize the motion artifacts in the fNIRS signal. Although the VR-HMD provides the possibility to walk around in the VE, the participants were instructed to refrain. Instead of moving, they were asked to gently look around in the VE, while preventing large head movements. Additionally, they were allowed to bend forward slightly during the height condition but were asked to return to the original position after bending forward.
All participants were tested under the same procedure. There were two conditions of VEs: ground condition and height condition. Each condition was presented alternately for five trials, each of which lasted 30 s and was preceded with a baseline period of 20 s, in which neither visual nor auditory stimuli were presented and the participants were instructed to relax and avoid active thinking. In the ground condition, the participants were virtually standing on a sidewalk or square in the middle of a city in the VE, while in the height condition, the participants were virtually standing by the rooftop of a high building. The VEs were created using the Unity development platform 2 . See Figure 1 for examples of the scenes. Both conditions were accompanied by city sounds, to increase the immersiveness of the experience.
After the experiment, the participants were asked to rate their perceived feelings of distress or fear using the Subjective Units of Distress Scale (SUDS) (Wolpe, 1969) during ground and height conditions on an 11-point Likert scale ranging from 0 (no distress/anxiety) to 100 (worst distress/anxiety that you have ever felt). The SUDS questionnaire is often used to assess exposure settings during cognitive behavioral treatment (Benjamin et al., 2010). Additionally, the participants were asked to fill out 14 items of the IGroup Presence Questionnaire (IPQ), which measures a person's sense of presence in VR (Schubert et al., 2001), to test if the participants felt sufficiently present in the VEs for a fear response to emerge. After that, the participants were asked to fill out the AQ to confirm the group membership; median 0.82 (Cohen, 1977) indicates adequate test-retest reliability, suggesting that pre-and post-experiment AQ scores should be similar. Finally, a structured interview by the researcher was held to ask the participants if and when they felt fearful or any other emotions during the experiment to gather extra feedback.

fNIRS Data Acquisition
Changes in HbO and HbR concentrations were measured using the Artinis Brite 24 3 . The Brite is a wireless continuous wave fNIRS device that can measure up to 27 channels. The near-infrared light is emitted at two nominal wavelengths: 760 and 850 nm. Cortical hemodynamic responses were measured at a sampling rate of 10 Hz. The optodes were arranged to cover a large region of the PFC, including the dlPFC, anterior PFC, and part of the vlPFC. Every emitter-detector pair had a maximum distance of 3 cm between the optodes. Figure 2 shows the positioning of the optodes and channels on the scalp, with an overview of the 10-20 system as a reference. In order to prevent nearinfrared light being absorbed by hair, the researcher used a narrow, oblong tool to move the participant's hair to the side when it fell between an optode and the participant's scalp.
Signal quality was visually validated by confirming the presence of cardiac cycles in the fNIRS signals (Hocke et al., 2018).

fNIRS Pre-processing
The fNIRS data were recorded using the Artinis Oxysoft software 4 . The raw data were converted to Δ(HbO) and Δ(HbR) signals (Chen, 2016;Pinti et al., 2019) using the modified Beer-Lambert law (Delpy et al., 1988;Scholkmann et al., 2014). After that, the data were exported to Matlab using Oxysoft2Matlab script and visually inspected. Channels with severe motion artifacts (usually with an amplitude of 5 μM) and channels that did not show cardiac cycles (evident by the repetitive alternation of around 0.10 μM of the amplitude) were excluded from further analysis. Motion correction was applied to the remaining channels, using the Temporal Derivative Distribution Repair procedure (Fishburn et al., 2019). After that, the correlation coefficients of every channel's Δ(HbO) and Δ(HbR) signals were calculated. Channels with a positive correlation coefficient were removed, following a previous fNIRS study suggesting that a negative correlation can be expected when the amount of motion artifacts in the signals is low (Cui et al., 2010). Then, a third-order Butterworth band-pass filter (Hocke et al., 2018;Pinti et al., 2019) with low cut-off frequency 0.01 Hz and high cut-off frequency 0.1 Hz was applied to remove physiological noise arising from breath cycles (∼0.2-0.3 Hz), cardiac cycles (∼1 Hz), and Mayer waves (∼0.1 Hz) (Naseer and Hong, 2015;Pinti et al., 2019). The filtered signals were separated into trials and adjusted with a baseline, yielding five ground-condition trials and five height-condition trials. The main duration of trials was set from 0 to 30 s after the stimulus presentation onset, covering the entire task period of each trial. The 5-s period preceding the stimulus presentation was used as a baseline period. For each participant and each channel, the signals were grand averaged across trials in each condition.

Data Analysis
A between-group analysis was performed by investigating the significant differences of fNIRS signals between the control group and the experimental group. In order to investigate the effect of height on participants with fear of heights, a within-group analysis was performed on the data of ground-condition trials and height-condition trials of merely the experimental group to investigate significant difference of fNIRS signals in the two conditions. Statistical testing was conducted using Matlab 2020a, and the overview of the statistical analysis performed is illustrated in Figure 2.

Between-Group Analysis
To mitigate the inter-subject variability issue, we computed a contrast between the ground-condition and the height-condition grand average Δ[HbO] signals and Δ[HbR] signals for all channels and all participants. The contrast was computed by subtracting the grand average ground condition signal from the grand average height condition signal. For all of these signals, the mean over the window from 3 to 15 s post-stimulus onset was computed, following the evidences that the hemodynamic response only starts to become visible after 3 s [2.8-s lag was found (Lachert et al., 2017)] and that the hemodynamic response is most intense in the first 5-17 s after the stimulus onset (Khan et al., 2020). For every channel, a permutation test with 50,000 permutations was used to test for significant differences between the contrast signal means of the control group and the experimental group, at the significance level α 0.05. The permutation test was chosen as it is a non-parametric test that can be used on small sample sizes and makes no assumptions about the distribution of the data.

Within-Group Analyses
Similar to the between-group analysis, the grand average ground condition Δ(HbO) and Δ(HbR) signals and the grand average height condition Δ(HbO) and Δ(HbR) signals were averaged over the 3-15 s window. For every channel, a permutation test with 50,000 permutations was used to test for significant differences between the ground-condition trial means and the heightcondition trial means over the 3-15 s window, at the significance level α 0.05.

Correction for Multiple Comparisons
Four statistical analyses were executed on the fNIRS data (between/within group analysis on the Δ[HbO]/Δ[HbR] data) per channel, yielding a total of 4 × 27 108 hypothesis tests from all channels. False discovery rate correction, as suggested by Genovese et al. (2002) for neuroimaging data, was executed on the 108 p-values that resulted from the statistical analyses to correct for multiple comparisons. The rate q was set to 0.05.

Subject-Dependent and Subject-Independent Classification
Subject-dependent classifiers were trained and tested only for the experimental group due to the clear distinction of fear responses between height-condition trials and ground-condition trials, labeled as fear response and no fear response, respectively. All 1-s windowed data from the first six trials (consisting of three ground-condition trials and three height-condition trials) were used as training data, and the remaining windowed data from four trials were the test data. As data were extracted from 3 to 15 s after stimulus onset, this resulted in 12 × 6 72 training data instances and 12 × 4 48 test data instances from each participant. In this study, linear discriminant analysis (LDA) and support vector machines (SVM) with linear kernel, implemented in Matlab 2020a, were trained with the standard hyper-parameter settings. Specifically, a linear coefficient threshold of 0 was used with regularized LDA. Sequential minimal optimization was applied to the linear-SVM and without feature scaling. The performance in the modes of 1-, 3-, and 5-s history was measured by the accuracy. Similarly, subject-independent classifiers were trained and tested with the experimental group using leave-one-subject-out cross-validation. It can be useful in real-life VRET settings to classify unseen data from an unknown participant (Bălan et al., 2020). In order to compare the subject-dependent classification with a random classifier (50% accuracy), the 95% confidence interval is calculated for each classifier. The lower (b l ) and upper bounds (b u ) of the 95% confidence interval are based on the Wilson score interval (Wilson, 1927) and are given by the formula wherep is the estimated performance, n is the number of test samples, and z the value corresponding to the desired confidence interval. In case of the 95% confidence interval, z 1.96. The advantage of the Wilson score interval is that it is asymmetric and has no overshoot or zero-width intervals unlike the traditional normal approximation.

Behavioral Results
Three participants withdrew from the experiment due to motion sickness caused by the VR-HMD. SUDS threshold at 30 was used to distinguish the feeling of relaxation and fear, where participants should report higher than this threshold when feeling fear during the experimental condition. As a result, two participants were excluded from each group due to the mismatch between the reported SUDs score and the expected range. In addition, the threshold of IPQ was set at 3 as the minimum for the feeling of presence in the VR, leaving two participants, whose scores did not surpass the threshold, out from the experimental group. Besides, the AQ scores were used to reconfirm the group membership after the experiment, resulting in one and two participants removed from the control and experimental groups, respectively. Consequently, a total of 15 participants (n c 15) remained to be part of the control group and 14 participants (n e 14) were part of the experimental group. Table 1 shows the mean scores and standard deviations of the questionnaire results for both groups; it suggests a clear distinction between the two groups in terms of AQ scores (pre-experimental as well as post-experimental) and SUDS for the height condition. The two groups scored similarly for SUDS in ground condition and for the experienced presence in the VEs. Figure 3 shows the grand average Δ[HbO] traces of the contrast between the ground condition and the height condition for the two groups for every channel, with the standard error given around every trace. It is apparent that only channel 3 (p 0.000 8) generated a significant difference between the contrast Δ[HbO] means of both groups using statistical testing with the false discovery rate (FDR) correction. Meanwhile, the results from Δ[HbR] traces, as also shown in Figure 3, suggested that there are some channels (i.e., channels 15, 23, 24, and 27), where the grand average trace of the control group has a different pattern than that of the experimental group. However, none of them are significant after the corrected statistical testing. traces, as also shown in Figure 4, are rather flat for both conditions, whereas only the difference in channel 23 (p 0.001 7) is significant after the corrected statistical testing.

Classification
Due to motion artifacts and hardware malfunctions, some channels were excluded from the analyses in some participants; i.e., features were extracted from the remaining uncorrupted channels per participant (see Section 2.4.1). In this study, we trained and tested the subject-independent classifiers with the data from only of the uncorrupted Δ[HbO] channels, as many corrupted Δ[HbR] channels were excluded for many participants, which makes it unfeasible to train and test classifiers on these data. Table 2 shows the accuracies of the subject-dependent classification and the 95% confidence interval, calculated using Eq. 1. The mean accuracy computed across all participants suggests that the SVM on the 3-s history performs best, with a mean accuracy of 69.69% (SD 16.94). However, the mean accuracies of the other classifiers are close to that of the 3-s history SVM, with a maximal difference of roughly 1.4%. Therefore, the amount of history taken into account in the classification seems to have a minimal effect on this metric. LDA and linear SVM achieved similar performance with a maximum of 1.2% in accuracy difference. From Eq. 1 one can easily deduce that if the estimated performance is above 64.5% then the lower bound of the 95% confidence interval is higher than 50%. Recall that a subject-dependent classifier is tested on 48 samples; hence, n 48. This implies that for most types of classifiers considered, only 7 out of the 14 subject-dependent classifiers perform significantly better than random. In order to show that the mean of the different subject-dependent classification is significantly higher than the mean of a random classifier for each participant, we take a somewhat different approach. Since the mean over all subject-dependent classification is not the outcome of a Bernoulli experiment (it is the mean over different Bernoulli experiments) we cannot apply the Wilson score interval. But the mean performance of 14 subject-dependent random classifiers is equivalent to an estimated performance of a Bernoulli experiment with 14 × 48 trials. Since the success rate of a random classifier is known and equal to 50%, we can estimate the 99% confidence interval (z 2.576) using Eq. 1 and is given by (45%, 55%). This means that in 99% of the cases, the observed mean of the random classification will be in this interval. The performance also varies considerably among the different participants; while classification in participants 1, 2, and 9 achieved low accuracy, classification for participants 7 and 10 was almost perfect. In some participants, the accuracy also changed by classification methods and history by an amount of almost 15% (participant 1), while these factors had minimal impact on the accuracy in other participants (e.g., participants 7 and 10). This led us to the analysis of data distribution and its effect on the classification performance.

Subject-Dependent Classification
We investigated the data distribution in feature space of representative participants: participants 2 and 9 with relatively low classification accuracy and participant 7 with high FIGURE 3 | Grand average Δ(HbO) and Δ(HbR) traces of the contrast between ground condition and height condition for the two groups: control group (black traces) and experimental group (red traces for Δ(HbO) and blue traces for Δ(HbR)), with standard deviation. The gray shaded area (3-15 s post-stimulus) is the window over which the means were taken that were used for the permutation tests. The horizontal axis represents time in seconds, ranging from 0 to 30, and the vertical axis represents concentration change in μM, ranging from −0.4 to 0.6. The plots are corresponding to channel labels, which are arranged in accordance with the optode layout that was used during the experiment, as presented in Figure 2. Channel numbers are labeled in every plot. The plots surrounded by the border shows the channel where a significant difference was found between the means of the control group and the experimental group. The graphs are arranged according to the optode layout that was used during the experiment, as presented in Figure 2. classification performance. Specifically, principal component analysis (PCA) was applied to 1-s history data, and we visualized the distribution of data projected on the first and second principal components (PCs) to investigate if the training and test data are identically distributed as shown in Figure 5. Training and test data are represented in different FIGURE 4 | Overview of the grand average Δ(HbO) and Δ(HbR) traces of the ground condition (green traces for Δ(HbO) and blue traces for Δ(HbR)) and height condition (orange traces for Δ(HbO) and purple traces for Δ(HbR)) of the experimental group, with standard deviation. The gray shaded area is the window over which the means were taken that were used for the permutation tests. The horizontal axis represents time in seconds, ranging from 0 to 30, and the vertical axis represents concentration change in μM, ranging from −0.4 to 0.6. The plots are corresponding to channel labels, which are arranged in accordance with the optode layout that was used during the experiment, as presented in Figure 2. The plots surrounded by boarders show the channels where a significant difference was found between the means of the ground condition and the height condition for the experimental group. colors. The decisions made for the test data by the LDA and linear-SVM classifiers are also depicted. It should be noted that data distribution in feature space of 3-and 5-s history data are generally similar to 1-s history data and therefore not shown here. Also, it is worth mentioning that PCA is applied for visualization purposes only and not for feature dimension reduction.
In the data of participant 2 there is a high overlap in the test data between the distribution of the fear and non-fear classes, making it difficult to reach decent performance. In participant 9 data, LDA and linear-SVM learned to distinguish classes along the second PC, while the test data are clearly separable along the first PC, thereby yielding performance around chance level. In   Table 3 shows the accuracies of the subject-independent classifiers and the 95% confidence interval, calculated using Eq. 1. For the subject-independent classification, the 95% confidence intervals are smaller, since these are tested on n 120 samples. This also implies that if the estimated performance is above 59% then the lower bound of the 95% confidence interval is larger than 50%. This implies that for most types of classifiers 12 out of 14 subject-independent classifiers perform significantly better than random. In order to compare the mean performance of the subject-independent classification with the mean of subject-independent random classification, we take the same approach as for the subjectdependent classification. In this case we have 14 × 120 trails, and the 99% confidence interval is given by (47%, 53%). Based on the mean accuracies computed from all participants, it can be inferred that the SVM on the 5-s history performs best, with a mean accuracy of 74.46% (SD 12.34). On the contrary, the SVM on the 1-s history performs the worst on average, with a mean accuracy of 71.67% (SD 11.60). From the calculated 99% confidence interval (47%, 53%) for the mean of random subject-independent classification (see Section 2.6.2), one can easily deduce that the mean of the subject-dependent classification is significantly higher (99.5% confidence) than the mean of random subject-independent classification. Again, the difference between the accuracies of the classifiers that perform best and worst on average is only a few percent, indicating that the amount of history and the classification methods have merely minor influence on the classification performance. Again, the accuracies vary considerably among participants, ranging from 48.33% (participant 8) to 91.67% (participant 10), and the cause of this variation is also investigated by data distribution in PC space.

Subject-independent Classification
In the data of participant 8, the training data depicted in Figure 6 (P8) shows that the data labeled as no fear are mostly centered around the negative values of the first PC, while fear data were located around the positive values. However, the test data of Figure 6 (P8) have a different pattern. Instead, the data of the different labels are distributed over the positive and negative values of the second PC and are quite overlapping, indicating the difficulty to separate the test data of the different labels by a linear decision boundary. This might explain why the classifiers, which seemingly learned to separate classes along the first PC, cannot perform well on the test data, yielding low accuracy. In contrast, data distribution in test data from participant 10 [see Figure 6 (P10)] are linearly separable in the first PC. Specifically, data labeled as no fear are centered around the negative values of the first PC, and the test data labeled as fear are centered around the positive values of the first PC. The linear classifiers were therefore successful in generalizing the learned pattern along the first PC to the test data, achieving high classification performance.

DISCUSSION
The aim of this study was to measure brain activity of participants with and without fear of heights when exposed to fearful stimuli presented in VR-HMD. Additionally, the study investigated the feasibility to train a simple classifier to recognize a fear response from fNIRS signals recorded from participants with fear of heights. A successful combination of fNIRS measurement and VR-HMD can prove the ecological validity of its use in VRET.

Between-Group Analysis
The results from the between-group analysis of the fNIRS signals showed that the grand average contrast Δ[HbO] signals of the control group and the experimental group are significantly different in channel 3. No significant differences were found between the grand average contrast Δ[HbR] signals of the two groups. The evidence that only one out of 27 channels shows a FIGURE 6 | Training, test data, and classification decisions from LDA and SVM of the 1-s subject-dependent classification of participants 2 (P2) and 10 (P10), plotted against the first and second principle components.
Frontiers in Computer Science | www.frontiersin.org January 2022 | Volume 3 | Article 652550 significant difference between the two groups for only one chromophore suggests that the fNIRS signals of participants with fear of heights were not that different from those of participants without fear of heights in general. However, it is difficult to make a direct comparison of our result with the literature due to the lack of including both experimental and control groups in previous studies on this topic. Despite this, it was discovered that Δ[HbO] measured in (some areas of) the PFC of recruited homogeneous participants increased during fearful conditions (Rosenbaum et al., 2020;Glotzbach et al., 2011;Zhang et al., 2017;Köchel et al., 2013;Roos et al., 2011;Ma et al., 2013;Landowska, 2018), which is in line with our results that the Δ[HbO] signal of the experimental group peaks higher than that of the control group when exposed to fearful stimuli (see Figure 3). In contrast, we found that Δ[HbR] of both groups were quite equal, which is partly in accordance with the evidence that the majority of similar works did not report any change of Δ[HbR] after the exposure to fearful stimuli (Roos et al., 2011;Ma et al., 2013;Zhang et al., 2017;Rosenbaum et al., 2020), but there are some exceptions (Glotzbach et al., 2011;Köchel et al., 2013;Landowska et al., 2018).
Nevertheless, it was also reported in the literature that Δ[HbO] values over the PFC can increase when the participants were experiencing other mental states, such as mental workload, mental stress, affective responses, attention, deception, preference, anticipation, suspicion, and frustration (Suzuki et al., 2008;Ayaz et al., 2012;Kreplin and Fairclough, 2013;Ding et al., 2014;Hirshfield et al., 20142014;Tupak et al., 2014;Arefi Shirvan et al., 2018;Numata et al., 2019). This indicates that increased Δ[HbO] values are not only an indication of fear responses but can also be driven by other psychological factors. This effect is less likely for the experimental group in our study, as they indicated that they were feeling afraid during the height exposure, which makes it improbable that they also experienced other mental states, considering that fear is presumably the most salient feeling they would perceive.

Within-Group Analysis
The result of the within-group analysis of the fNIRS signals shows that the grand average Δ[HbO] values are significantly higher during the height condition than during the ground condition. This significant difference was observed in a total of 13 channels, which are all located towards the frontal part of the PFC. These results indicate that during fear responses, the Δ[HbO] values increase significantly as compared to no-fear responses, which is in accordance with the vast majority of previous works on fNIRS measurements taken during fear responses (Glotzbach et al., 2011;Roos et al., 2011;Köchel et al., 2013;Ma et al., 2013;Zhang et al., 2017;Landowska, 2018;Rosenbaum et al., 2020).
Additionally, the results of the within-group analysis show that the grand average Δ[HbR] values of the height condition and the ground condition are significantly different in channel 23. Surprisingly, the grand average Δ[HbR] signal of the height condition is higher than that of the ground condition in this channel. This contradicts some findings from the literature, where decreased Δ[HbR] values are reported for fearful conditions (Glotzbach et al., 2011;Köchel et al., 2013;Landowska et al., 2018). It remains unclear why our results differ from the literature.
The clear distinction of fNIRS signals due to fear exposure in experimental group suggests the possibility to train a classifier to automatically detect fear responses using fNIRS.

Subject-Dependent Classification
The subject-dependent classification results suggest that the amount of history and the choice between the LDA or the linear-SVM algorithm has minimal influence on the subject-dependent classification performance. The linear classifiers do not perform well for some participants due to the difference in data distribution between training and test data. A possible explanation is that the fear responses and accompanying fNIRS measurements of these participants were not stable over time.

Subject-Independent Classification
Similarly, the choice of classification methods and the amount of history to take into account do not have enormous influence on the performance of subject-independent classifiers. While the overall accuracy is above 71%, classification for participants 8 and 11 achieved poor performance. Our investigation on the participants' AQ, SUDS, and IPQ scores indicated that these participants had a strong fear of heights, felt very anxious during the height trials, felt relaxed during the ground trials, and felt sufficiently present in the VEs. However, we learned from the data distribution analysis that the fNIRS measurements of these participants were not stable over time. This might explain the overlap of fear and no fear fNIRS data trials in the first two PCs of the feature space when taking all data from this participant as a test set (in leave-one-subject-out cross-validation), while the cause of the instability of data over time remains unclear.
It is remarkable that for most participants, the subjectindependent classification outperforms the subject-dependent classification. While the training data from the first six trials have a different distribution than the last four trials used as a test set in subject-dependent classification, combining all trials might mitigate the discrepancy between those trials, converging to more common patterns of the other participants. This would explain the relatively good performance of the subject-independent classification. More research is needed to prove this hypothesis.

Overall Classification Performance
Overall, the average classification accuracy of the subjectdependent classification is approximately 70%, whereas the subject-independent classification has average accuracies around 74%. These accuracies are statistically significantly higher than random classification. It is noteworthy that the goal of this study is not to find the best classification model but to examine to what extent a simple linear classifier with minimal parameter tuning can discriminate between fear and nofear responses. Future work on applying sophisticated algorithms could improve the classification performance. It should also be noted that our channel selection used in classification was based solely on signal quality and not influenced by feature correlation.
The accuracies in our study are comparable to those in previous works attempting to classify fear from no-fear responses in VR using physiological signals. Despite achieving slightly higher accuracies from 76% to 89.5% compared to our research, previous work recruited a lower number of participants [seven participants in Handouzi et al. (20132013) using blood volume pulse (BVP) data, eight participants in Bălan et al. (2020) using galvanic skin response (GSR), heart rate, and EEG data]. Among similar studies that recruited a higher number of participants, the study by Šalkevicius et al. (2019) detected public speaking anxiety using BVP, GSR, and skin temperature data from 30 participant and achieved 80.1% accuracy in leaveone-subject-out cross-validation. However, this work did not include brain signals in the study. In contrast, another study (Hu et al., 2018) detecting fear of heights response in VR-HMD from EEG signal achieved 88.77%, but the results were based on 10-fold cross-validation, where the generalizability to classify unseen participant remains unknown.
In general, our classification performance has demonstrated the feasibility to detect a fear of height response from brain signals of a previously unseen participant as a 1-s rate (the size of our sliding window is 1 s). As we encourage other researchers to test other classification paradigms, the physiological data reproducing the results in this study are publicly available.

Limitations of the Study
It is noteworthy that the fNIRS signals comprise multiple components where some of them are potentially confounders that are not task-related. Our method is based on contrasting an experimental condition with a baseline condition, which can subtract out spurious hemodynamic/oxygenation responses from the experimental task (Tachtsidis and Scholkmann, 2016). Thus, it should reduce hemodynamic influences from the extracerebral layer from the fNIRS signals. Alternative approaches can be further incorporated to remove systemic confounders.
Neither the fNIRS headset nor the VR-HMD was originally designed for simultaneous usage of both devices. The incompatibility caused an uncomfortable feeling for many participants. The VR-HMD needed tightening up with a headband around the participant's head. This put an extra pressure on the optodes of the fNIRS headset, which can be unpleasant for some participants, negatively influencing the user experience of the system. Since it is difficult to quantify the effect of the uncomfortable feeling caused by the hardware components, it remains unknown to what extent this affected the participants and the consequent results.
Although the fNIRS technology is less susceptible to motion artifacts and electrical noise, it was often reported in the literature that motion artifacts still occur (Naseer and Hong, 2015;Wilcox and Biondi, 2015;Pinti et al., 2020). Therefore, the participants in our study were instructed to look around very slowly in the VEs and to limit their bodily movements, which might reduce realism of the experience of the VEs for some participants. Still, the motion artifacts were present in our study.

Recommendations for Future Work
Future research should consider adding more trials per condition and prolonging the duration of each trial. More data are needed to train sophisticated classification models. Prolonged duration opens the possibility to include heart rate variability (HRV) as a feature for the classifier; HRV can be captured from the embedded cardiac cycles in the fNIRS signals and was found to be a useful measure to detect fear (Wiederhold et al., 2002;Peterson et al., 2018). Also, it is worthy to investigate the difference among high-arousal-negative-valence responses, such as mental stress, frustration, and fear; disentanglement of such responses can potentially improve the detection of fear of height.
In this research, we measure distress and the feeling of presence by using established scales to allow the comparison with previous research in fear and VRET. While SUDS has been widely used in the context of fear exposure treatment due to its high comprehensiveness, conciseness, and validity in psychological studies, future works should also consider using recently developed measures that are correlated highly with the SUDS to confirm the validity of the measured fear by the classical SUDS. These include the scale of anxiety (Spielberger, 1972;Masia-Warner et al., 2003), discomfort (Kaplan et al., 1995), disturbance (Harris et al., 2002;Kim et al., 2008), or distress (McCullough, 2002. Similarly, although IPQ was found as the most reliable questionnaire to measure the presence in VR environment (Schwind et al., 2019) among classical measures (Witmer and Singer, 1998;Slater and Steed, 2000;Usoh et al., 2000), the alternative recent questionnaires should also be considered (Grassini and Laumann, 2020).
The statistical analyses of the fNIRS data and the classification performances are merely based on the mean Δ[HbO] and mean Δ[HbR] values. However, it is known from previous fNIRS studies investigating mental states that alternative features such as amplitude, slope, standard deviation, kurtosis, skewness, and signal peaks can provide insights and be used as discriminative features for classifying mental states (Khan and Hong, 2015;Zhang et al., 2016;Aghajani et al., 2017;Parent et al., 2019). In our study, the grand average Δ[HbO] traces revealed that the traces of the experimental group generally rise to a peak value, whereas this pattern is less apparent for the grand average traces of the control group (see Figure 3). A similar observation can be made for the grand average Δ[HbO] traces of the height condition and ground condition of the experimental group (see Figure 4). Based on these observations, it is anticipated that alternative features such as the maximum signal value, the time to peak, and the signal slope have the potential to improve the classification results or enhance the fNIRS difference between groups.

CONCLUSION
The results answer our first research question by demonstrating that there is significant difference in fNIRS signals between participants with a fear of heights and participants without it when exposed to fear conditions in a VE. Specifically, the contrast between the ground-condition and height-condition fNIRS signals in the experimental group was larger than that in the control group, despite limited statistical significance. The effect of the condition was more salient when focusing only on the experimental group that exhibited significant differences in the grand average Δ[HbO] values during fear responses and during no-fear responses. The effect was dominant in the optode area close to the frontal part of the PFC. To answer our research question regarding to what extent a machine learning model can be successfully trained to recognize fear of heights response using fNIRS, we trained different simple classifiers in a subject-dependent and subjectindependent framework and found that subject-dependent classification encountered the issue of subjective variability. Nevertheless, the subject-independent classification results show the potential for usage in online fear of height detection, and the average accuracy in classifying unseen data from a previously unseen participant is above 74.00%. Our study therefore confirmed the ecological validity of combining fNIRS measurement and VR-HMD, which may pave a way toward effective VRET.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are publicly available. This data can be found here: [https://doi.org/10.4121/ 17302865].

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Ethics Committee of the Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente (reference number: RP 2020-76). The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
LdW, NT, and MP conceived, planned, designed the experiment. LdW provided a critical review, carried out the experiment, data acquisition, analysis, processing interpretation, and wrote the full research report. NT wrote the manuscript with input from all the authors. MP supervised the research, provided critical feedback, and proofread the manuscript. All authors discussed the results and contributed to the final manuscript.

FUNDING
This work was partially supported by the European Regional Development Fund's operationeel programma oost (OP-OOST EFRO PROJ-00900) and by the Netherlands Organization for Scientific Research (NWA Startimpuls 400.17.602).