AUTHOR=Pellet-Rostaing Arthur, Bertrand Roxane, Boudin Auriane, Rauzy Stéphane, Blache Philippe
TITLE=A multimodal approach for modeling engagement in conversation
JOURNAL=Frontiers in Computer Science
VOLUME=Volume 5 - 2023
YEAR=2023
URL=https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2023.1062342
DOI=10.3389/fcomp.2023.1062342
ISSN=2624-9898
ABSTRACT=Recently, engagement has emerged as a key variable explaining the success of conversation. From the perspective of human-machine interaction, automatic assessment of engagement is therefore crucial to better understand interaction dynamics and to design socially aware robots. Our work presents a predictive model of the level of engagement in conversations. It shows, in particular, the benefit of using a rich multimodal feature set, outperforming existing models in this domain. This study relies on two corpora of audio-visual recordings of naturalistic face-to-face interactions. These resources have been enriched with various annotations of verbal and nonverbal behaviors, such as smiles, head nods, and feedback. In addition, we manually annotated the intensity of manual gestures. Based on a review of previous work in psychology and human-machine interaction, we propose a new definition of the notion of engagement, suitable for describing this phenomenon in both natural and mediated environments. This definition served as a basis for our annotation scheme. In our work, engagement is studied at the turn level, which is known to be crucial for the organization of conversation. Even though there is still no consensus on the precise definition of turns, we have developed a turn detection tool. A multimodal characterization of engagement is achieved by a multi-level classification of turns. We claim that multimodal cues, involving prosodic, mimo-gestural, and morpho-syntactic information, are relevant for characterizing the level of engagement of speakers in conversation. Our results significantly outperform the baseline and reach a state-of-the-art level (0.76 weighted F-score). The most contributing modalities are identified by testing the performance of a two-layer perceptron trained on unimodal feature sets and on combinations of two to four modalities. The results support our claim regarding the usefulness of multimodality. Combining features related to speech fundamental frequency and energy with mimo-gestural features leads to the best performance.
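
The abstract does not include implementation details. As an illustration of the described setup, the following is a minimal sketch of training a two-layer perceptron on unimodal feature sets and on combinations of two to four modalities, scored with a weighted F-score. The modality names, feature dimensions, number of engagement levels, and the data itself are hypothetical placeholders, not the authors' corpus or code.

```python
# Minimal sketch (not the authors' code): a two-layer perceptron trained on
# unimodal feature sets and on modality combinations, evaluated with a
# weighted F-score. All feature matrices and labels below are synthetic
# placeholders standing in for per-turn features and engagement levels.
from itertools import combinations

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_turns, n_levels = 500, 3  # hypothetical number of turns and engagement levels

# Hypothetical per-turn feature matrices, one per modality.
modalities = {
    "prosody_f0": rng.normal(size=(n_turns, 8)),
    "prosody_energy": rng.normal(size=(n_turns, 8)),
    "mimo_gestural": rng.normal(size=(n_turns, 12)),
    "morpho_syntactic": rng.normal(size=(n_turns, 10)),
}
y = rng.integers(0, n_levels, size=n_turns)  # engagement level per turn


def weighted_f1_for(subset):
    """Train a two-layer perceptron (one hidden layer) on the given modalities."""
    X = np.hstack([modalities[m] for m in subset])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    scaler = StandardScaler().fit(X_tr)
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    clf.fit(scaler.transform(X_tr), y_tr)
    return f1_score(y_te, clf.predict(scaler.transform(X_te)), average="weighted")


# Unimodal runs first, then combinations of two to four modalities.
for k in range(1, 5):
    for subset in combinations(modalities, k):
        print(f"{'+'.join(subset):<50} weighted F1 = {weighted_f1_for(subset):.3f}")
```

With real per-turn features, comparing the weighted F-scores across subsets is one straightforward way to identify the most contributing modalities, in the spirit of the ablation described in the abstract.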