ORIGINAL RESEARCH article

Front. Psychiatry

Sec. Digital Mental Health

Volume 16 - 2025 | doi: 10.3389/fpsyt.2025.1548287

This article is part of the Research Topic: United in Diversity: Highlighting Themes from the European Society for Research on Internet Interventions 7th Conference.

Machine-learning detection of stress severity expressed on a continuous scale using acoustic, verbal, visual, and physiological data: Lessons learned

Provisionally accepted
  • 1VU Amsterdam, Amsterdam, Netherlands
  • 2Sentimentics, Dordrecht, Netherlands
  • 3Amsterdam University Medical Center, Amsterdam, Netherlands
  • 4Amsterdam Neuroscience Research Institute, Amsterdam, Netherlands
  • 5GGZ inGeest, Amsterdam, Netherlands
  • 6Tilburg University, Tilburg, Netherlands
  • 7Erasmus University Rotterdam, Rotterdam, Netherlands
  • 8University of Amsterdam, Amsterdam, Netherlands
  • 9WHO Collaborating Center for Research and Dissemination of Psychological Interventions, Amsterdam, Netherlands
  • 10Babeș-Bolyai University, Cluj-Napoca, Cluj, Romania

The final, formatted version of the article will be published soon.

Early detection of elevated acute stress is necessary if we aim to reduce the consequences associated with prolonged or recurrent stress exposure. Stress monitoring may be supported by valid and reliable machine-learning algorithms; however, algorithms that detect stress severity on a continuous scale have not yet been investigated. The use of multimodal data might contribute to such detection. We aimed to detect stress using multimodal data and to identify the challenges of such a study.

We assessed the performance of a machine-learning algorithm trained on multimodal data, namely visual, acoustic, verbal, and physiological features, in detecting stress severity following a partially automated online version of the Trier Social Stress Test. College students (n = 42; M age = 20.79 years; 69% female) completed a self-reported stress visual analogue scale (VAS) at five time points: after the resting period (P1), during the three stress-inducing tasks (i.e., preparation for a presentation, the presentation itself, and an arithmetic task; P2-P4), and after a recovery period (P5). We recorded participants' voices and facial expressions with a video camera and measured cardiovascular and electrodermal physiology with an ambulatory monitoring system. We evaluated the algorithm's performance in detecting stress severity using combinations of visual, acoustic, verbal, and physiological data collected at each period of the experiment (P1-P5).

We found a weak association between the detected and observed scores (r² = .154; p = .021). In a post-hoc analysis, we classified participants into categories of stressed and non-stressed individuals. When applying all available features (i.e., visual, acoustic, verbal, and physiological), or a combination of visual, acoustic, and verbal features, performance ranged from acceptable to good, but only for the presentation task (accuracy up to .71, F1-score up to .73).

The complexity of the input features needed for machine-learning detection of stress severity based on multimodal data requires large samples with wide variability in stress reactions and inputs among participants. Such samples are difficult to recruit for a laboratory setting because of the high time and effort demands on both researchers and participants. The resources needed may be reduced by automating experimental procedures, although automation may introduce additional technological challenges, potentially causing other recruitment setbacks.
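To make the evaluation pipeline described above concrete, the following is a minimal Python sketch (not the authors' code) of the two-stage analysis the abstract reports: regress continuous self-reported stress (VAS) scores from concatenated multimodal features, quantify agreement as the squared correlation between detected and observed scores, then binarize for a post-hoc stressed versus non-stressed classification scored with accuracy and F1. The feature arrays, regressor, fold count, and VAS cut-off are all hypothetical placeholders, not details taken from the paper.

```python
# Hedged sketch of the reported evaluation pipeline; all inputs are synthetic.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict, GroupKFold
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
n = 42 * 5                            # 42 participants x 5 periods (P1-P5)
groups = np.repeat(np.arange(42), 5)  # keep each participant in a single fold

# Hypothetical feature blocks; in the study these came from video, audio,
# speech transcripts, and an ambulatory cardiovascular/electrodermal monitor.
visual, acoustic, verbal, physio = (
    rng.normal(size=(n, d)) for d in (20, 20, 30, 10)
)
X = np.hstack([visual, acoustic, verbal, physio])
y = rng.uniform(0, 100, size=n)       # self-reported stress VAS (0-100)

# Subject-wise cross-validation so no participant appears in both
# the training and the test fold.
model = RandomForestRegressor(n_estimators=200, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=GroupKFold(n_splits=5), groups=groups)

# Stage 1: continuous detection, scored as squared Pearson correlation
# (the abstract reports r² = .154, p = .021).
r, p = pearsonr(y, y_pred)
print(f"r^2 = {r**2:.3f} (p = {p:.3f})")

# Stage 2: post-hoc binary analysis, thresholding the VAS into
# stressed vs. non-stressed (the cut-off here is invented, not the paper's).
cutoff = 50
y_true_bin, y_pred_bin = y > cutoff, y_pred > cutoff
print(f"accuracy = {accuracy_score(y_true_bin, y_pred_bin):.2f}, "
      f"F1 = {f1_score(y_true_bin, y_pred_bin):.2f}")
```

The grouped cross-validation is the one design choice worth highlighting: with five repeated measurements per participant, pooling periods P1-P5 without subject-wise folds would leak within-person information and inflate both r² and the classification scores.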

Keywords: stress, machine learning, multimodal, acoustic, verbal, video, physiology

Received: 19 Dec 2024; Accepted: 20 May 2025.

Copyright: © 2025 Ciharova, Amarti, Van Breda, Gevonden, Ghassemi, Kleiboer, Vinkers, Sep, Trofimova, Cooper, Peng, Schulte, Karyotaki, Cuijpers and Riper. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Marketa Ciharova, VU Amsterdam, Amsterdam, Netherlands

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.