ORIGINAL RESEARCH article
Front. Psychiatry
Sec. Computational Psychiatry
Multimodal Physiological Signal Emotion Recognition Based on Multi-Head Cross Attention with Representation Learning
Provisionally accepted
Harbin Institute of Technology, Harbin, China
Physiological signals offer a significant advantage in emotion recognition due to their objective nature: they are less susceptible to volitional control and therefore provide a more veridical reflection of an individual's true affective state. The use of multimodal physiological signals enables a more holistic characterization of emotions, establishing multimodal emotion recognition as a critical area of research. However, existing multimodal fusion methods often fail to capture the complex, dynamic interactions and correlations between modalities and consequently cannot fully leverage complementary information from other physiological signals during feature learning. To address these shortcomings, we propose a novel framework for multimodal physiological emotion recognition. The framework learns and extracts features from multiple modalities simultaneously, effectively simulating the integrative process of human emotion perception. It uses a dual-branch representation learning architecture to process electroencephalography (EEG) and peripheral signals separately, providing high-quality inputs for subsequent feature fusion, and a multi-head cross attention mechanism tailored to multimodal signals to fully exploit the richness and complementarity of the information. This approach not only improves the accuracy of emotion recognition but also enhances robustness against missing modalities and noise, thereby achieving precise classification of emotions from multimodal signals. Experimental results on the public DEAP and SEED-IV multimodal physiological signal datasets confirm that the proposed model outperforms other state-of-the-art models on the emotion classification task. Our findings demonstrate that the proposed model can effectively extract and fuse features from multimodal physiological signals. These results underscore the potential of our model for affective computing and hold significant implications for research in healthcare and human-computer interaction.
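To make the dual-branch and cross-attention idea concrete, the sketch below shows one plausible way to wire it up in PyTorch: each modality is encoded by its own branch, each branch then queries the other through multi-head cross attention, and the fused representations are concatenated for classification. All layer sizes, the simple MLP branches, the pooling, and the classification head are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch: dual-branch encoders + multi-head cross attention fusion.
# Feature dimensions below are placeholders (e.g., 310 = 62 EEG channels x 5 bands).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, eeg_dim=310, periph_dim=64, d_model=128, n_heads=4, n_classes=4):
        super().__init__()
        # Dual-branch representation learning: one encoder per modality
        # (plain MLPs stand in for the paper's representation learners).
        self.eeg_branch = nn.Sequential(nn.Linear(eeg_dim, d_model), nn.ReLU())
        self.periph_branch = nn.Sequential(nn.Linear(periph_dim, d_model), nn.ReLU())
        # Multi-head cross attention: each modality attends to the other.
        self.eeg_to_periph = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.periph_to_eeg = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, eeg, periph):
        # eeg: (batch, segments, eeg_dim); periph: (batch, segments, periph_dim)
        e = self.eeg_branch(eeg)
        p = self.periph_branch(periph)
        # EEG queries attend over peripheral keys/values and vice versa,
        # so each branch is enriched with complementary information.
        e_fused, _ = self.eeg_to_periph(query=e, key=p, value=p)
        p_fused, _ = self.periph_to_eeg(query=p, key=e, value=e)
        # Pool over segments and fuse by concatenation before classifying.
        fused = torch.cat([e_fused.mean(dim=1), p_fused.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Example usage with random tensors standing in for extracted features:
model = CrossModalFusion()
logits = model(torch.randn(32, 10, 310), torch.randn(32, 10, 64))
print(logits.shape)  # torch.Size([32, 4])
```

The symmetric query/key swap is what distinguishes cross attention from simple feature concatenation: each modality's representation is re-weighted by its relevance to the other, which is one way the complementarity described in the abstract can be exploited.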
Keywords: emotion recognition, multimodal, cross attention, feature fusion, physiological signal
Received: 26 Sep 2025; Accepted: 06 Nov 2025.
Copyright: © 2025 Ding, Ma and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Haifeng Li, lihaifeng@hit.edu.cn
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.