Real-time affect detection in virtual reality: a technique based on a three-dimensional model of affect and EEG signals

Pinilla, Andres; Voigt-Antons, Jan-Niklas; Garcia, Jaime; Raffe, William; Möller, Sebastian

doi:10.3389/frvir.2022.964754

ORIGINAL RESEARCH article

Front. Virtual Real., 05 January 2023
Sec. Virtual Reality and Human Behaviour
Volume 3 - 2022 | https://doi.org/10.3389/frvir.2022.964754

Real-time affect detection in virtual reality: a technique based on a three-dimensional model of affect and EEG signals

Andres Pinilla^1,2*^†

Jan-Niklas Voigt-Antons^3,4^†

Jaime Garcia²^†

William Raffe²^† www.frontiersin.org

Sebastian Möller^1,3^†

¹Quality and Usability Lab, Institute for Software Technology and Theoretical Computer Science, Faculty of Electrical Engineering and Computer Science, Technische Universitat Berlin, Berlin, Germany
²UTS Games Studio, Faculty of Engineering and IT, University of Technology Sydney UTS, Sydney, NSW, Australia
³German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
⁴Hamm-Lippstadt University of Applied Sciences, Hamm, Germany

This manuscript explores the development of a technique for detecting the affective states of Virtual Reality (VR) users in real-time. The technique was tested with data from an experiment where 18 participants observed 16 videos with emotional content inside a VR home theater, while their electroencephalography (EEG) signals were recorded. Participants evaluated their affective response toward the videos in terms of a three-dimensional model of affect. Two variants of the technique were analyzed. The difference between both variants was the method used for feature selection. In the first variant, features extracted from the EEG signals were selected using Linear Mixed-Effects (LME) models. In the second variant, features were selected using Recursive Feature Elimination with Cross Validation (RFECV). Random forest was used in both variants to build the classification models. Accuracy, precision, recall and F1 scores were obtained by cross-validation. An ANOVA was conducted to compare the accuracy of the models built in each variant. The results indicate that the feature selection method does not have a significant effect on the accuracy of the classification models. Therefore, both variations (LME and RFECV) seem equally reliable for detecting affective states of VR users. The mean accuracy of the classification models was between 87% and 93%.

1 Introduction

Affective states play a central role in some of the most common mental disorders. For example, negative affective states are prevalent in patients suffering from anxiety and depression (Barlow et al., 1991). Likewise, impaired emotional empathy is one of the symptoms of psychopathy (Brook and Kosson, 2013). At the same time, previous studies suggest that video games can be used for the assessment and treatment of mental disorders (Shiban et al., 2015; Blandon et al., 2016; Tavares Vasconcelos Oliveira et al., 2021). Those games can be developed using Virtual Reality (VR) technologies (Bermudez i Badia et al., 2019; Baghaei et al., 2021), leading to immersive (Perkis et al., 2020) and emotionally engaging experiences (Visch et al., 2010). Therefore, the development of games for mental healthcare could be fostered by developing Virtual Reality (VR) systems that automatically decode users’ affective states.

There are at least two requirements that should be fulfilled by a technique for affect detection in VR. Firstly, affective states should be analyzed automatically and in real-time. Secondly, the technique should not interrupt the interaction of the user with the virtual environment. Previous studies suggest that both requirements can be satisfied using electrophysiological signals (Picard et al., 2001; Antons et al., 2014).

Consequently, this manuscript explores the development of a technique for inferring users’ affective states from their electrophysiological responses. The technique was tested with data collected during an experiment in Virtual Reality (VR). In this experiment, participants evaluated their affective responses towards 16 videos in terms of the three dimensions of the Evaluative Space Model (ESM) (Cacioppo et al., 1997) (negativity, positivity, and net predisposition), while their brain activity was recorded using electroencephalography (EEG).

The ESM (Cacioppo et al., 1997) is a theoretical model that allows representing affective responses that contain simultaneous positive and negative activation. This is not possible in theoretical models where positive and negative activation are mapped to a single valence dimension, such as the Circumplex Model of Affect (Russell, 1980). If negative and positive activation are represented in a single dimension, an increase in positive activation necessarily implies a decrease in negative activation, and vice versa. There is evidence suggesting that positive and negative activation might operate independently (Mattek, 2011). Yet, it is not clear how to infer users’ affective states in terms of a theoretical model that does not assume the existence of a bipolar pleasure-displeasure continuum.

Additional studies indicate that it is possible to infer some characteristics of users’ affective states by analyzing their electroencephalography (EEG) signals (Ray and Cole, 1985; Davidson, 1992; Pfurtscheller and Lopes da Silva, 1999; Huster et al., 2009; Antons, 2015; Hofmann et al., 2018). The features extracted from EEG signals can be used to build Machine Learning models for affect detection. Those models can be implemented into a Brain-Computer Interface (BCI), interpreting users’ affective states as computer commands (Zander and Kothe, 2011).

There are at least two types of BCI models: user-dependent and user-independent models. The former are trained with data recorded from the user, while the latter are trained with prerecorded data from multiple users. The technique proposed in this paper aims to build user-dependent models. Thus, the models are tailored to the individual characteristics of each user (Brouwer et al., 2015).

Deep learning methods have been used previously for inferring affective states from EEG activity (Khosrowabadi et al., 2014; Zheng and Lu, 2015; Tripathi et al., 2017; Song et al., 2018). For example, Long Short-Term Memory (LSTM) recurrent neural networks (RNN) can be used to decode affective states from EEG signals (Hofmann et al., 2018; Hofmann et al., 2021). Yet, this approach has been used for decoding emotional arousal. It is still unclear whether it can be used to decode valence, or any of the three affective dimensions of the ESM (Cacioppo et al., 1997).

At the same time, neural networks tend to require more computing power than some traditional machine learning algorithms, such as Random Forest. Partly, because neural networks usually require larger datasets during the training phase to achieve similar accuracy. Additionally, real-time analysis of EEG signals is particularly demanding in terms of computing power because the data must be processed at the same speed that it is recorded.

Computing power is not a limitation when a High-Performance Computing (HPC) system is available. However, most users do not have access to an HPC center. It is possible to overcome this challenge by training a neural network at an HPC center and deploying the trained model (e.g., Singh and Tao, 2020). However, this approach is not optimal for building userdependent models in real-time, because it would require 1) capturing enough data from each user to train user-dependent neural networks, 2) transferring the data from each user to an HPC center, 3) training at least one neural network per user, 4) transferring the trained models back to the device of each user, and 5) completing the entire process at a speed that does not disrupt the experience of the user.

In contrast, a Random Forest classification model usually can be trained using consumer-grade hardware. Therefore, this algorithm could be used to train affect detection models on the user’s device. This approach is consistent with edge computing (Cao et al., 2020), an emerging paradigm that supports the benefits of processing the data on the user’s device. Some of those benefits are 1)minimizing the amount of data that is transferred over the network, reducing bandwidth consumption and avoiding potential pitfalls caused by network disruptions, 2) strengthening security and privacy, because most of the user’s data remain on their device, and 3) reducing the operational costs. Thus, Random Forest might be more suitable than neural networks for building user-dependent affect detection models in real-time.

One of the key steps when building a Random Forest classification model is identifying the most relevant features for the construct of interest. This process is known as feature selection. A common method for feature selection is Recursive Feature Elimination (RFE), which has been used previously in the field of affect detection (Val-Calvo et al., 2019). This method requires defining a fixed number of features to select. The classification model is fit multiple times, and in each iteration, the less relevant features are removed until reaching the number of features that have been previously defined.

Given that the proposed technique aims to build user-dependent models, feature selection is conducted with the data of each user. Hence the optimal number of features to choose might change according to the individual characteristics of each user. Therefore, RFE might not be suitable for building user-dependent models. This challenge can be solved using Recursive Feature Elimination with Cross Validation (RFECV) (Yin et al., 2017; Akbar et al., 2021; Zanetti et al., 2022), a method similar to RFE that automatically detects the optimal number of features that are required for training a model.

Previous studies point out that cross-validation schemes for supervised learning problems, such as RFECV, are not suitable for analyzing EEG signals (Val-Calvo et al., 2019). This is because those schemes assume independence between samples (Val-Calvo et al., 2019). Therefore, they cannot capture the sequential nature of time series that is inherent to EEG signals.

A possible solution for this challenge might be found in Linear Mixed-Effects (LME) models, a non-parametric statistical method that is particularly useful for analyzing EEG signals (Riha et al., 2020). The flexible structure of LME allows capturing variations across time and subject-specific brain activity patterns. Yet, it is not common to use LME in the context of supervised learning for selecting features extracted from EEG signals.

Therefore, two variants of the affect detection technique proposed in the manuscript are tested. The difference between both variants is the method used for feature selection. Those methods are LME and RFECV. Given that RFECV is a well-established feature selection method, it is useful as a benchmark to evaluate the performance of the LME approach proposed in this manuscript. At the same time, given that LME models are able to capture variations across time, they can be used as a benchmark to evaluate the performance of RFECV when applied to time series analysis.

2 Methods

2.1 Participants

Twenty-three students from the Technische Universität Berlin participated in the study. One participant was excluded because the file containing the electrophysiological data was corrupted. Two participants were excluded because a flat line was detected in more than 90% of their EEG recordings. Outliers were detected in two participants, whose data was removed from further analysis. Therefore, the analysis was conducted with the data of 18 participants. Their age was between 19 and 58 years old (M = 30.28; SD = 10.31). Eight were women, and ten were men. All participants provided written informed consent before participating in the experiment. They received €10 as compensation.

2.2 Virtual environment

The experiment was programmed using the software Psychopy 3.0 (Peirce et al., 2019). The computer screen was streamed into a Head-Mounted Display (HMD) using the software Virtual Desktop. The virtual environment was a home theatre (see Figure 1). Participants remained seated in a chair during the experiment and watched the stimuli on the virtual home-theater screen.

FIGURE 1

FIGURE 1. Screenshot of the virtual environment used during the experiment. Participants were inside a virtual home theater. The videos were projected on the screen of this home theater using the software Virtual Desktop (Guy, 2016).

2.3 Stimuli

The stimuli were music videos taken from the DEAP dataset (Koelstra et al., 2012). This dataset contains 40 music videos. A subset of 16 videos selected during a previous study (Kruger et al., 2020) were used in the experiment. Two additional videos were used for training trials. Thus, 18 videos were used in total. The duration of each video was 60 s.

2.4 Apparatus

The virtual environment was shown using an HTC Vive. A Brain Products amplifier was used for the ECG signals and a g.Tec amplifier was used for the EMG and EEG signals. Recordings from both systems were synchronized using Lab Streaming Layer (LSL). An ECG electrode was placed in each wrist and another in the left ankle. EMG electrodes were placed in the Zygomaticus Major and Corrugator Supercilii muscles, similar to Dimberg et al. (2000). EEG electrodes were placed at F3, F4, P3, P4, T7, T8, and Cz, according to the 10–20 system. The EEG electrode locations were selected based on a previous study conducted by Huster et al. (2009) (see Figure 2). The reference electrode for EEG was placed in the left mastoid. The ground electrode for EMG and EEG was placed in the right mastoid. The sampling frequency was 5,000 Hz for ECG, and 256 Hz for EMG and EEG. Impedance for the EEG signals was below 10 kΩ.

FIGURE 2

FIGURE 2. Diagram of electrode montage used during the experiment. Seven electrodes were placed, according to the 10–20 international system. Electrode sites were F3, F4, P3, P4, T7, T8 and Cz, similar to Huster et al. (2009). Reference and ground electrodes were placed in the left and right mastoids, respectively.

2.5 Procedure

Participants signed informed consent and completed a demographics questionnaire. Then the electrodes, HMD, and headphones were placed. Electrophysiological signals were visually inspected before starting the experimental task. Two practice trials were presented to help participants gain familiarity with the virtual environment and the rating system, followed by a 2-min grey screen with a cross in the center, similar to Koelstra et al. (2012). After the 2-min grey screen, 16 trials were presented. Each trial consisted of a 5-s grey screen with a cross in the center, followed by a video. The order of presentation of the videos was randomized. After each video, five questions were presented. All questions were answered using a slider. Participants used an optical mouse to select their answers in the sliders.

The slider for all questions ranged from 1 to 10. In each question, one word was shown in each extreme of the slider, similar to Lombard et al. (2000).

The first three questions corresponded to the three dimensions of the ESM (Cacioppo et al., 1997): negativity, positivity, and net predisposition. The questions were taken from a previous study (Pinilla et al., 2020). The statement of the first three questions was, “how did this video make you feel.” In the first question (negativity), the words at the sides of the slider were “1—Not bad at all” and “10—Very bad”; In the second (positivity), the words were “1—not good at all” and “10—very good”; In the third (net predisposition) the words were “1—Very relaxed” and “10—Very restless”.

Two additional questions were used to assess liking and familiarity. These questions were taken from Koelstra et al. (Koelstra et al., 2012). The statements of the fourth and fifth questions were “how much do you like this video?” and “how well do you know the video ?“, respectively. In the fourth question (familiarity), the words placed at the sides of the slider were “1 - Never saw it before the experiment” and “10 - Knew it very well”. In the fifth question (liking), the words were “1 - Not at all” and “10—Very much”. Results obtained with the liking and familiarity questions were analyzed but not included in this manuscript. The data is available in the public repository of the research project.

2.6 Video labels

Threve labels were assigned to each video based on participants’ evaluations. Given that the evaluation of each participant towards each video was different, the label assigned to a video could vary across participants. Each label corresponds to one of the three dimensions of the ESM (Cacioppo et al., 1997): negativity, positivity, and net predisposition. For each of these three labels, two categories were defined: high and low. Given that the maximum value of the rating scales was 10, evaluations below 5 were labeled as “low,” and evaluations above 5 were labeled as “high.” Consequently, the labels for the negativity dimension were “high negativity” or “low negativity”; for the positivity dimension, “high positivity” or “low positivity”; and for the net predisposition dimension “high net predisposition” or “low net predisposition.”

2.7 Ethics

Ethical review and approval were not required for the study following local legislation and institutional requirements. The participants provided their written informed consent to participate in the study.

3 Signal processing

The signal processing steps consisted of four main steps: (1) preprocessing, (2) feature extraction, (3) feature selection, and (4) training and testing the classification models. Steps (3) and (4) were conducted 10 times for each participant. In each iteration, data were split into two sets: (a) feature selection set and (b) training and testing set. The trials allocated to each set were randomly selected in each iteration. Out of the 16 trials available for each participant, four were used for conducting feature selection. The remaining 12 trials were used for training and testing the classification models. This was intended to avoid double-dipping (i.e., training the classification models with the same data used for feature selection) (Ball et al., 2020). A flowchart describing the process is presented in Figure 3.

FIGURE 3

FIGURE 3. Flowchart describing the signal processing steps. Steps 3 and 4 were repeated 10 times for each participant. In each iteration, 4 trials were randomly selected and used for conducting feature selection with LME and RFECV. The remaining 12 trials were used for training and testing the classification models. A subset of features was selected for each affective dimension of the ESM (Cacioppo et al., 1997) (negativity, positivity, and net predisposition). The features selected with LME were used to build one classification model for each affective dimension. Likewise, the features selected with RFECV were used to build another three classification models (one per affective dimension). The models were trained using a Random Forest Classifier and evaluated with a 12-fold cross-validation.

Anomalies were found in the ECG and EMG data. Therefore, these signals were excluded from further analysis. The EMG data recorded at the Corrugator Supercilii muscle contained noise in all participants, possibly caused by the pressure of the HMD on the electrodes. In the ECG data, the LF/HF ratio (Thayer et al., 2009) was zero for all participants. The ECG and EMG data are available in the public repository of the research project.

3.1 Preprocessing

The following preprocessing steps were conducted on the EEG signals using EEGLAB 2021.1 (Delorme and Makeig, 2004):

1) Notch filter: Remove powerline noise using a notch filter at 50 Hz.

2) Remove bad channels: Remove channels where a flatline longer than 5 s is detected, or whose correlation with nearby channels is lower than 80%.

3) Remove artifacts: Remove artifacts caused by eye movements, eye blinks, and other noise sources using Artifact Subspace Reconstruction (ASR), a method designed for real-time artifact removal (Mullen et al., 2015).

4) Re-referencing: Perform common-average referencing.

5) Band-pass filter: Apply band-pass filter to remove frequencies below 4 Hz and above 45 Hz.

6) Extract epochs: Each video (trial) is equivalent to one epoch. Thus, the length of each epoch is 60 s.

7) Baseline removal: Remove the baseline of the 3 s prior to the beginning of each epoch.

8) Down-sampling: Down-sample to 128 Hz to increase processing speed.

3.2 Feature extraction

The features extracted from the EEG signals were defined based on a literature review conducted by the authors of this manuscript (Pinilla et al., 2021) and an affect detection technique proposed by Val-Calvo et al. (2019). Consequently, the following features were extracted using Python 3.10.4, AntroPy 0.1.4 (Vallat, 2022) and NumPy 1.22.3 (Harris et al., 2020):

1) Relative Power Spectral Density (RPSD) (Antons et al., 2014)

2) Frontal asymmetry (Huster et al., 2009)

3) Spectral envelope (Kraljevic et al., 2017)

4) Number of zero-crossings (Patil et al., 2016)

5) Katz fractal dimension (Akar et al., 2015)

6) Hjorth parameters (Mehmood and Lee, 2015)

7) Petrosian fractal dimension (Balan et al., 2020)

Each feature was extracted at each electrode site available for each participant. Given that some electrodes were automatically removed during the preprocessing step (see bad channel removal in Section 3.1), not all electrodes were available for all participants. Hence the number of features extracted for each participant varied according to the number of electrodes that were removed after preprocessing the signals.

Power Spectral Density (PSD) was extracted at each electrode site at the delta, theta, alpha, beta, and gamma power bands, using Welch’s method. This method splits a signal into overlapping segments and returns an average of the power across those segments. The size of those overlapping segments is defined with a sliding window. In this experiment, an 8 s sliding Hann window was used. Welch’s method returns an average of the power across contiguous segments of a signal. Therefore, it returns a single value instead of a time series. To extract the PSD as a time series, Welch’s method was applied multiple times across contiguous segments of the signal, using a 32 s sliding Hann window. The result was a time series of the PSD at each power band, at each electrode site available for each participant.

The Relative Power Spectral Density (RPSD) was calculated by dividing the PSD of each power band by the total power, yielding a time series of the RPSD at each power band, at each electrode site. Frontal asymmetry was obtained by subtracting the RPSD at electrode site F3 from the RPSD at F4, yielding a time series of the Frontal Asymmetry at each power band, similar to Huster et al. (2009). The other features (spectral envelope, number of zero-crossings, Katz fractal dimension, Hjorth parameters, and Petrosian fractal dimension) were also extracted using an 32 s sliding Hann window, yielding a time series for each feature, at each electrode site available for each participant.

3.3 Feature selection

Two methods were used for feature selection: (1) Linear Mixed-Effects (LME) and (2) Recursive Feature Elimination with Cross Validation (RFECV). Both methods were conducted 10 times for each participant. In each iteration, a group of features was selected for each of the three dimensions of the ESM (Cacioppo et al., 1997) (negativity, positivity, and net predisposition). Therefore, 60 groups of features were obtained for each participant (2 feature selection methods * 10 iterations * 3 affective dimensions).

As mentioned in Section 2.5, the data of each participant contained 16 trials. The data used for feature selection contained four trials that were randomly chosen for each affective dimension. Out of those four trials, two were labeled by the participant as “high” and the other two as “low” in the corresponding affective dimension (see Section 2.6 for more information related to the video labels). This process was repeated 10 times per participant.

3.3.1 Linear Mixed-Effects

Linear mixed-effects models were conducted to analyze the relationship between the features extracted from the EEG signals and the participant’s evaluation of the videos. The analysis was conducted using R and the lme4 package (Bates et al., 2015). For each feature extracted from each participant, three LME models were built. That is one LME model per affective dimension (negativity, positivity, and net predisposition). The fixed effects in all the models were the participant’s rating towards the videos in the dimension of interest and time stamp. The random effects were the intercepts for trial, as well as by-trial random slopes for the feature of interest. Each model was compared against a reduced model that had the same parameters as the full model, but did not contain the feature of interest. P-values were obtained by likelihood ratio tests of the full model against the reduced model. In this analysis, a statistically significant effect indicates that the feature included in the full model contains relevant information about the affective dimension of interest. Therefore, the features that led to likelihood ratio tests with a significant statistical effect (p < 0.5) were selected.

3.3.2 Recursive Feature Elimination with Cross Validation

RFECV was conducted individually for each participant, in each affective dimension of the ESM (Cacioppo et al., 1997). The estimator was a Random Forest Classifier with 100 trees. The data were labeled based on the ratings of each participant (see Section 2.6). The process was conducted with a 4-fold cross-validation. All folds contained an equal amount of samples.

3.4 Classification models

A Random Forest Classifier with 100 trees was used to build the classification models, similar to previous studies in affect detection (Liu et al., 2016; Gupta et al., 2018). Given that the feature selection was conducted 10 times per participant, the models were built 10 times per participant as well. In each iteration, the data used for training and testing the classification models contained the 12 trials that were not used for conducting the feature selection.

As previously mentioned, there were 60 groups of features per participant (10 iterations * 2 feature selection methods * 3 affective dimensions). One classification model was built with each of those groups of features. Each classification model was evaluated using 12-fold cross-validation. All folds contained an equal amount of samples.

Accuracy, precision, recall, and F1-score were obtained for each fold. The values obtained for each of those metrics were averaged in two steps. First, means were calculated across the 12 folds. Then, values were averaged across trials, yielding the mean performance metrics for each participant. In the context of this study, accuracy is the portion of videos that were correctly classified; Precision is the portion of videos classified as “high negativity”, “high positivity” or “high net predisposition” that were correctly classified; Recall is the portion of videos previously labeled as “high negativity”, “high positivity” or “high net predisposition” that were correctly classified; And F1-score is the harmonic mean of precision and recall.

4 Results

A two-way repeated measures ANOVA was conducted to analyze whether the feature selection method (LME vs. RFE) had an effect on the accuracy of the classification models for each affective dimension (negativity, positivity, net predisposition). A Shapiro-Wilk test suggested that the assumption of normal distribution was met, W = 0.984, p > 0.05. The assumption of sphericity was met as well, as assessed by Mauchly’s test, X² (2) = 0.533, p > 0.05. There was no significant two-way interaction, F(2, 34) = .09, p > 0.05, η^p ₂ = .001.

The mean accuracy of the models trained with the features selected using LME was slightly higher than the accuracy of the models trained with the features selected using RFECV. The accuracy of the negativity classification models was 1.25% higher for LME (M = 92.03%, SD = 4.58) than for RFECV (M = 90.78%, SD = 5.07). Likewise, the accuracy of positivity classification models was 0.84% higher for LME (M = 92.86%, SD = 4.67) than for RFECV (M = 92.02%, SD = 4.18). In the net predisposition dimension, the accuracy of the LME models (M = 92.18%, SD = 2.87) was 1.18% higher than the RFECV models (M = 91%, SD = 4.08). Yet, no significant effect was found for the main effect of feature selection method, F (1,106) = 1.787. p > 0.05, ŋ_p² = 0.017, indicating that the differences between LME and RFECV were not statistically significant.

Precision, recall, and F1-score performance metrics were obtained for each model. Those metrics were not analyzed in the ANOVA. Instead, they have been included in Table 1.

TABLE 1

TABLE 1. Mean accuracy, precision, recall and F1-score obtained for each affective dimension of the ESM (Cacioppo et al., 1997). Values were obtained with a 12-fold cross-validation (see Section 3.4). Means were obtained in three steps. First, values were averaged across folds. Then, values were averaged across iterations (10 iterations per participant). Finally, means and standard deviation were calculated across participants.

5 Discussion

A technique for real-time affect detection is proposed in this manuscript. The technique was tested with data from an experiment conducted in VR, where participants’ affective responses were analyzed in terms of the three dimensions of the ESM (Cacioppo et al., 1997): negativity, positivity, and net predisposition.

The data was analyzed by emulating the steps that would be conducted in a real-time analysis. The process consists of four steps: (1) preprocess the EEG data to increase the signal-to-noise ratio (2) extract features using a sliding window; (3) for each affective dimension of the ESM (Cacioppo et al., 1997), select features using two methods: Linear Mixed-Effects (LME) and Recursive Feature Elimination with Cross Validation (RFECV); (4) build the classification models with the selected features.

A two-way repeated measures ANOVA was conducted to compare the mean accuracy of the classification models obtained with each feature selection method (LME vs. RFECV). No statistically significant differences were found. Yet, LME led to classification models slightly more accurate than their RFECV counterparts. The mean accuracy obtained with both feature selection methods was between 87% and 93%, suggesting that the proposed technique leads to reliable results, regardless of the feature selection method used (see Figure 4). These results are consistent with previous studies in affect recognition using EEG signals, where classification models with an accuracy of 90.77% (Xu and Plataniotis, 2012) and 90.4% (Song et al., 2018) were reported. However, those studies used neural networks, while the technique proposed in this manuscript used Random Forest.

FIGURE 4

FIGURE 4. Mean accuracy of the classification models for each affective dimension of the ESM (Cacioppo et al., 1997). The accuracy of the classification models trained with features selected using LME was not statistically significantly different than the accuracy of the models trained with features selected using RFECV. The accuracy of the classification models was similar across affective dimensions. Error bars depict CI, 95%.

The high accuracy of the technique is attributed to at least three factors: (1) the models were trained with the data of each participant, reducing potential biases caused by individual differences between participants; (2) the classification models were built separately for each affective dimension, including in each model only the most relevant features for each affective dimension; and (3) the features were extracted from the EEG signals using a sliding window, capturing variations across time for each feature.

Previous studies point out that cross-validation schemes, such as RFECV, assume independence between samples. Therefore, they are not able to capture the sequential nature of time series (Val-Calvo et al., 2019). In contrast, LME models are able to capture variations across time (see Section 3.3.1). Given that all features extracted in this study were time series, it was expected that the classification models trained with features selected using LME would outperform the classification models trained with features selected using RFECV. Yet, no statistically significant difference was found between the feature selection methods. This finding suggests that cross-validation schemes are suitable for time series feature selection.

The correlation between the selected features was not analyzed in this manuscript. Usually, training a classification model with features that are highly correlated (i.e., presence of multicollinearity) does not increase or decrease its predictive power. However, it can compromise the performance of the technique, because including additional features implies consuming additional computational resources. It is likely that there was multicollinearity in the features selected with LME. It is less likely that this occurred in the features selected with RFECV because this method eliminates recursively the features that do not add predictive power to the classification model. In this regard, RFECV has a key advantage as compared to LME, because it might require fewer computational resources to achieve a similar result.

Similar to most statistical tests, LME models must meet a series of assumptions. One of the challenges in using LME for feature selection for real-time analysis lies in the difficulty of assessing those assumptions automatically. Usually, the assumption of normal distribution in LME models is assessed by visual inspection of Q-Q plots, while the assumptions of linearity and homoscedasticity are assessed by visual inspection of residual plots. However, it is not feasible to visually assess each of those plots in a real-time analysis. A possible solution for this challenge would require building an image recognition model for automatically analyzing Q-Q plots and residual plots, with the aim of estimating whether a given LME model fulfills the assumptions of normal distribution, linearity, and homoscedasticity.

Additional research is required to analyze whether the results reported in this study can be replicated with other datasets. At the same time, it might be useful to conduct additional studies to analyze whether the proposed technique can be extrapolated to other theoretical models. In this manuscript, affective states were analyzed in terms of the Evaluative Space Model (ESM) (Cacioppo et al., 1997). However, there are other theoretical models, such as the Circumplex Model of Affect (Russell, 1980), or Plutchnik’s structural model of emotions (Plutchik, 1982), which might be preferable depending on the intended use case.

In this manuscript, the data was processed emulating the steps that would be conducted in a real-time setting. The next step is to implement the proposed technique into a prototype of a Brain-Computer Interface that is integrated into a VR system. There is a software framework that could ease this process, called the Excite-O-Meter (Quintero et al., 2021). This framework provides the core software components required for recording, analyzing, and visualizing bodily signals in VR. Likewise, Semertzidis et al. (2020) built a BCI for affect communication between dyads in Augmented Reality (AR) systems, called Neo-Noumena. This system could be used as a framework for implementing the technique proposed in the manuscript into an AR system.

Yet, the proposed technique is not ready for use in VR/AR experiences that require motor movements from the user. The experiment presented in this manuscript was conducted in a virtual reality environment where participants were not moving. This was intended to reduce the presence of artifacts caused by body movements. Additional research is required to analyze the performance of the proposed technique in non-stationary settings (Gramann et al., 2011). This could be achieved using a spatial navigation task that involves emotional stimuli, similar to Palmiero and Piccardi (2017). The analysis of the mobile EEG data could be conducted using existing toolboxes for this purpose, such as MoBILAB (Ojeda et al., 2014) and BeMoBil (Klug et al., 2018). However, those toolboxes are not suitable for real-time analysis. Additional research is required to understand how to remove artifacts from EEG signals, in an online fashion, during non-stationary experiments.

6 Conclusion

This manuscript explores the development of a technique for inferring affective states from electroencephalography (EEG) signals. The technique was tested in an experiment conducted in Virtual Reality. The data was analyzed emulating the steps that would be conducted for real-time affect detection. The feature selection was conducted using two methods: Linear Mixed-Effects (LME) analysis and Recursive Feature Elimination with Cross Validation (RFECV). An ANOVA was conducted to analyze whether the accuracy of the classification models was influenced by the feature selection method. A statistically significant interaction was not found. Both feature selection methods lead to classification models with similar accuracy. The classification models were trained one at a time for each affective dimension of the ESM (Cacioppo et al., 1997): negativity, positivity, and net predisposition. The accuracy of the models was between 87% and 93% across these three affective dimensions. Further work is required to (1) analyze the performance of the proposed technique with a larger population, in non-stationary settings, and (2) implement the proposed technique into a Brain-Computer Interface (BCI) that is integrated into a VR/AR system.

Data availability statement

All the data recorded during this study and reports generatedduring the analysis can be found in the public repository of theresearch project : https://osf.io/7v9kt/.

The code used for analyzing the data is open source and available at https://github.com/aepinilla/affect_detection.

Ethics statement

Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.

Author contributions

JV and SM contributed to the study conception and signal processing. JG and WR proof-read the manuscript and contributed to the Machine Learning component of the study. AP conducted the experiment, data analysis, and wrote the first draft. All authors commented on previous versions of the manuscript.

Funding

This work was supported by the strategic partnership between the Technische Universität Berlin, Germany, and the University of Technology Sydney, Australia. Additional support was provided by the School of Computer Science from the Faculty of Engineering and IT, University of Technology Sydney.

Acknowledgments

We are thankful to Kerstin Pieper for sharing her knowledge on electrophysiology; to Robert Spang for his early feedback in the design of the experiment; to Tilo Himmelsbach for his help with the code; and to Michael Gaebler, whose questions were particularly useful to improve this manuscript. We acknowledge support from the German Research Foundation and the Open Access Publication Fund of TU Berlin.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Akar, S. A., Kara, S., Agambayev, S., and Bilgic¸, V. (2015). “Nonlinear analysis of eeg in major depression with fractal dimensions,” in Proceeding of the 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, August 2015 (IEEE), 7410–7413.

PubMed Abstract | CrossRef Full Text | Google Scholar

Akbar, S., Martinez-Enriquez, A., Aslam, M., and Saleem, R. (2021). “Feature analysis of eeg based brain computer interfaces to detect motor imagery,” in International conference on brain informatics (Springer), 509–518.

ORIGINAL RESEARCH article

Real-time affect detection in virtual reality: a technique based on a three-dimensional model of affect and EEG signals

1 Introduction

2 Methods

2.1 Participants

2.2 Virtual environment

2.3 Stimuli

2.4 Apparatus

2.5 Procedure

2.6 Video labels

2.7 Ethics

3 Signal processing

3.1 Preprocessing

3.2 Feature extraction

3.3 Feature selection

3.3.1 Linear Mixed-Effects

3.3.2 Recursive Feature Elimination with Cross Validation

3.4 Classification models

4 Results

5 Discussion

6 Conclusion

Data availability statement

Ethics statement

Author contributions

Funding

Acknowledgments

Conflict of interest

Publisher’s note

References

This article is part of the Research Topic

People also looked at