ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. AI for Human Learning and Behavior Change
Volume 8 - 2025 | doi: 10.3389/frai.2025.1516295
Data Stream-pairwise Bottleneck Transformer for Engagement Estimation from Video Conversation
Provisionally accepted
Nippon Telegraph and Telephone (Japan), Tokyo, Japan
This study aims to estimate the engagement of participants in multiparty conversations from video and audio data. This task requires effectively modeling the interactions among numerous data streams, i.e., the video and audio of multiple participants, while accounting for the redundancy of video and audio across frames. To model the participants' interactions efficiently under such redundancy, a previous study proposed feeding participant feature sequences into global token-based Transformers, which constrain attention across feature sequences to pass through only a small set of internal units so that the model focuses on key information. However, this approach still suffers from redundancy in participant-feature estimation, which relies on standard cross-attention Transformers that can connect all frames across different modalities. To address this, we propose jointly modeling the interactions among all data streams with global token-based Transformers, without distinguishing cross-modal from cross-participant interactions. Experimental results on the RoomReader corpus confirm that the proposed model outperformed previous models, improving accuracy from 0.720 to 0.763, weighted F1 from 0.733 to 0.771, and macro F1 from 0.236 to 0.277.
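The core mechanism described in the abstract — constraining attention across data streams to pass through a small set of shared global (bottleneck) tokens rather than allowing direct frame-to-frame cross-attention — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function names, shapes, and the single-head, unprojected attention are simplifying assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # scaled dot-product attention (single head, no learned projections)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def bottleneck_layer(streams, bottleneck):
    """One bottleneck-attention step over multiple data streams.

    streams:    list of (T_i, d) arrays, one per data stream
                (e.g., video and audio features of each participant)
    bottleneck: (B, d) shared global tokens, with B much smaller
                than the total number of frames
    """
    # Step 1: the global tokens gather information from all streams.
    all_frames = np.concatenate(streams, axis=0)
    new_bottleneck = attend(bottleneck, all_frames, all_frames)
    # Step 2: each stream reads back only through the bottleneck,
    # so there is no direct frame-to-frame cross-stream attention.
    new_streams = [attend(s, new_bottleneck, new_bottleneck)
                   for s in streams]
    return new_streams, new_bottleneck
```

In this sketch the B global tokens are the only path by which information flows between streams, which is the efficiency-and-focus constraint the abstract attributes to global token-based Transformers; the proposed model applies this jointly to all streams rather than separately per interaction type.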
Keywords: transformer, engagement, multiparty conversation, multimodal, classification, global token
Received: 24 Oct 2024; Accepted: 05 May 2025.
Copyright: © 2025 Suzuki, Hojo, Shinoda, Mizuno and Masumura. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Keita Suzuki, Nippon Telegraph and Telephone (Japan), Tokyo, Japan
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.