ORIGINAL RESEARCH article
Front. Robot. AI
Sec. Robot Design
Volume 12 - 2025 | doi: 10.3389/frobt.2025.1581024
This article is part of the Research Topic: Innovative Methods in Social Robot Behavior Generation
Simultaneous Text and Gesture Generation for Social Robots with Small Language Models
Provisionally accepted
Uppsala University, Uppsala, Sweden
Increased robot communication abilities come with increased user expectations for verbal and non-verbal social robot behaviour. Recent work has demonstrated the potential of Large Language Models (LLMs) to support the autonomous generation of such behaviours, including both dialogue and non-verbal behaviour. Considering non-verbal behaviour generation specifically, current LLM-based approaches often rely on multi-step reasoning and multi-turn generation with large, closed-source models. This results in significant computational overhead and limits their applicability, particularly in low-resource and/or privacy-constrained environments. In this work, we evaluate the performance of current techniques and, motivated by their ineffectiveness in low-compute environments, introduce a novel method for the simultaneous generation of text and gestures that adds minimal to no overhead compared to plain text generation. Importantly, our system does not generate low-level pose sequences or joint trajectories. Rather, it operates at a higher level of abstraction, predicting communicative intentions that are then mapped to platform-specific expressions. Our method uses robot-specific 'gesture heads' that are autonomously derived from the model and do not require specialised or pose-based datasets. This design choice enables generalisability across platforms without committing to a specific embodiment or gesture modality. These gesture heads operate in parallel with the language modelling component, adding almost no additional computation time. We validate our method on two robot platforms: (i) Furhat, capable of various facial expressions, and (ii) Pepper, capable of multiple hand gestures and movements. Experimental results demonstrate the effectiveness of our method in terms of computation time, memory requirements, and overall behavioural performance.
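As a concrete illustration of the parallel-head idea described in the abstract, the following minimal sketch (PyTorch with the Hugging Face transformers library) attaches a single linear "gesture head" to the hidden states of a small causal language model, so that a high-level communicative intention is predicted alongside each generated token. The base model ("gpt2"), the intention label set, and the single-layer head are illustrative assumptions; the abstract does not specify how the authors derive their gesture heads or which intention vocabulary each platform uses.

# Minimal sketch (not the authors' implementation): a parallel "gesture head"
# attached to a small causal LM. The base model ("gpt2"), the label set, and
# the single linear layer are illustrative assumptions based on the abstract.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical high-level communicative intentions; in practice the set would
# be platform-specific (e.g. Furhat facial expressions, Pepper arm gestures).
GESTURE_LABELS = ["neutral", "smile", "nod", "point", "wave"]

class LMWithGestureHead(nn.Module):
    def __init__(self, base_name="gpt2", num_gestures=len(GESTURE_LABELS)):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(base_name)
        hidden = self.lm.config.hidden_size
        # The gesture head reads the same hidden states as the LM head,
        # so it adds only one extra linear projection per token.
        self.gesture_head = nn.Linear(hidden, num_gestures)

    def forward(self, input_ids, attention_mask=None):
        out = self.lm(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        last_hidden = out.hidden_states[-1]              # (batch, seq, hidden)
        token_logits = out.logits                        # next-token distribution
        gesture_logits = self.gesture_head(last_hidden)  # per-token intention scores
        return token_logits, gesture_logits

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = LMWithGestureHead()
    ids = tok("Hello there, nice to meet you!", return_tensors="pt")
    token_logits, gesture_logits = model(**ids)
    # Most likely communicative intention at the final token position.
    intent = GESTURE_LABELS[gesture_logits[0, -1].argmax().item()]
    print("predicted intention:", intent)

Because the gesture head reuses the hidden states already computed for the language-modelling head, gesture prediction amounts to one extra linear projection per token, which is consistent with the abstract's claim of near-zero additional computation time.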
Keywords: social robot, behavior generation, multimodal behavior, deep learning, generative model, interactive behaviors
Received: 21 Feb 2025; Accepted: 25 Apr 2025.
Copyright: © 2025 Galatolo and Winkle. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Alessio Galatolo, Uppsala University, Uppsala, Sweden
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.