
ORIGINAL RESEARCH article

Front. Psychol.

Sec. Emotion Science

Enhancing Emotion Recognition in Virtual Reality: A Multimodal Dataset and a Temporal Emotion Detector

Provisionally accepted
Chenxin Qu1, Xiaoping Che1*, Yafei Yang1, Zhongwei Zhang1, Enyao Chang1, Jianing Zhang1, Hongwei Zhu2, Ling Yang3
  • 1Beijing Jiaotong University, Beijing, China
  • 2Peking Union Medical College Hospital, Beijing, China
  • 3China National Software & Service Co Ltd, Haidian, China

The final, formatted version of the article will be published soon.

Emotion is a complex psychophysiological phenomenon elicited by external stimuli, exerting a profound influence on cognitive processes, decision-making, and social behavior. Emotion recognition holds broad application potential in healthcare, education, and entertainment. Virtual reality (VR) has emerged as a powerful tool in this field, offering an immersive and controllable experimental environment. Prior studies have confirmed the feasibility and advantages of VR for emotion elicitation and recognition, and multimodal fusion has become a key strategy for enhancing recognition accuracy. However, publicly available VR multimodal emotion datasets remain limited in both scale and diversity due to the scarcity of VR content and the complexity of data collection, and this shortage hampers further progress. Moreover, existing multimodal approaches still face challenges such as noise interference, large inter-individual variability, and insufficient model generalization. Achieving robust and accurate physiological signal processing and emotion modeling in VR environments thus remains an open challenge. To address these issues, we constructed a VR experimental environment and selected 10 emotion-eliciting video clips guided by the PAD (Pleasure-Arousal-Dominance) model. Thirty-eight participants (N = 38) were recruited, from whom electrodermal activity, eye-tracking, and questionnaire data were collected, yielding 366 valid trials. The newly collected dataset substantially extends the publicly available VREED dataset, enriching VR-based multimodal emotion resources. Furthermore, we propose MMTED (Multi-Modal Temporal Emotion Detector), a model that incorporates baseline calibration and multimodal fusion of electrodermal and eye-tracking signals for emotion recognition. Experimental results demonstrate the strong performance of MMTED, which achieves accuracies of 85.52% on the public VREED dataset, 89.27% on our self-collected dataset, and 85.29% on the combined dataset.
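
The abstract does not disclose MMTED's architecture, so the sketch below is only an illustration of the two ideas it names: baseline calibration of physiological signals against a pre-stimulus segment, and late fusion of electrodermal-activity (EDA) and eye-tracking encoders for emotion classification. All module names, signal dimensions, and hyperparameters are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: baseline calibration + EDA / eye-tracking fusion.
# Dimensions, encoder choice, and class count are assumptions, not MMTED itself.
import torch
import torch.nn as nn


def baseline_calibrate(signal: torch.Tensor, baseline: torch.Tensor) -> torch.Tensor:
    """Z-score each trial against its own pre-stimulus baseline segment.

    signal:   (batch, time, channels) recorded during the VR clip
    baseline: (batch, time_b, channels) recorded before the clip starts
    """
    mu = baseline.mean(dim=1, keepdim=True)
    sigma = baseline.std(dim=1, keepdim=True) + 1e-6
    return (signal - mu) / sigma


class ModalityEncoder(nn.Module):
    """Temporal encoder for one physiological modality (GRU over the sequence)."""

    def __init__(self, in_channels: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(in_channels, hidden, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(x)          # h: (1, batch, hidden)
        return h.squeeze(0)         # (batch, hidden)


class FusionEmotionDetector(nn.Module):
    """Late fusion of EDA and eye-tracking features, then emotion classification."""

    def __init__(self, eda_channels: int = 1, eye_channels: int = 4,
                 hidden: int = 64, num_classes: int = 4):
        super().__init__()
        self.eda_enc = ModalityEncoder(eda_channels, hidden)
        self.eye_enc = ModalityEncoder(eye_channels, hidden)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, eda, eda_base, eye, eye_base):
        eda = baseline_calibrate(eda, eda_base)
        eye = baseline_calibrate(eye, eye_base)
        fused = torch.cat([self.eda_enc(eda), self.eye_enc(eye)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = FusionEmotionDetector()
    eda, eda_base = torch.randn(8, 512, 1), torch.randn(8, 64, 1)
    eye, eye_base = torch.randn(8, 512, 4), torch.randn(8, 64, 4)
    print(model(eda, eda_base, eye, eye_base).shape)  # torch.Size([8, 4])
```

Late fusion of per-modality embeddings is only one plausible design; the paper may instead use attention-based or feature-level fusion, and the emotion label space may follow PAD quadrants rather than four discrete classes.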

Keywords: virtual reality, physiological signals, emotion recognition, deep learning, multimodal data

Received: 21 Sep 2025; Accepted: 31 Oct 2025.

Copyright: © 2025 Qu, Che, Yang, Zhang, Chang, Zhang, Zhu and Yang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Xiaoping Che, xpche@bjtu.edu.cn

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.