BRIEF RESEARCH REPORT article
Front. Artif. Intell.
Sec. Medicine and Public Health
Comparative performance of large language models in emotional safety classification across sizes and tasks
Provisionally accepted
 - 1Leibniz Institute for Resilience Research, Mainz, Germany
 - 2Department of Psychiatry, Psychotherapy and Psychosomatic Medicine, University Medical Center Halle, Halle (Saale), Germany
 - 3German Center for Mental Health (DZPG), Site Halle-Jena-Magdeburg, Halle (Saale), Germany
 - 4Department of Psychiatry and Psychotherapy, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
 - 5School of Life Sciences, Technical University of Munich, Freising, Germany
 
Understanding how large language models (LLMs) process emotionally sensitive content is critical for building safe and reliable systems, particularly in mental health contexts. We compare the performance of LLMs of different sizes on two key tasks: trinary classification of emotional safety (safe vs. unsafe vs. borderline) and multi-label classification using a six-category safety risk taxonomy. To support this, we construct a novel dataset by merging several human-authored mental health datasets (>15K samples) and augmenting them with emotion re-interpretation prompts generated via ChatGPT. We evaluate four LLaMA models (1B, 3B, 8B, 70B) in zero-shot and few-shot settings. Our results show that larger LLMs achieve stronger average performance, particularly in nuanced multi-label classification and in zero-shot settings. However, lightweight fine-tuning enabled the 1B model to reach performance comparable to larger models and BERT in several high-data categories while requiring <2 GB of VRAM at inference. These findings suggest that smaller, on-device models can serve as viable, privacy-preserving alternatives for sensitive applications, capable of interpreting emotional context and maintaining safe conversational boundaries. This work highlights key implications for therapeutic LLM applications and the scalable alignment of safety-critical systems.
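For illustration only, the sketch below shows what zero-shot trinary safety classification of the kind described in the abstract might look like with a small instruction-tuned LLaMA model via the Hugging Face transformers pipeline. The checkpoint name, prompt wording, and fallback label are assumptions for demonstration and are not taken from the study itself.

```python
# Minimal, illustrative sketch (not the authors' code): zero-shot trinary
# emotional-safety classification with a small instruction-tuned LLaMA model.
# The checkpoint, system prompt, and fallback behavior are assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",  # assumed checkpoint; gated, requires access
)

LABELS = {"safe", "borderline", "unsafe"}

def classify_safety(message: str) -> str:
    """Map a user message onto one of three safety labels via a chat prompt."""
    prompt = [
        {"role": "system",
         "content": "You are a safety classifier. Reply with exactly one word: "
                    "safe, borderline, or unsafe."},
        {"role": "user", "content": message},
    ]
    out = generator(prompt, max_new_tokens=5, do_sample=False)
    # With chat-style input, the pipeline returns the conversation including
    # the newly generated assistant turn as the last message.
    reply = out[0]["generated_text"][-1]["content"].strip().lower()
    # Fall back to the cautious middle label if the reply is outside the label set.
    return reply if reply in LABELS else "borderline"

print(classify_safety("I feel hopeless and don't know what to do."))
```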
Keywords: large language model (LLM), scaling and fine-tuning, privacy-preserving AI, emotional safety classification, affective computing
Received: 15 Sep 2025; Accepted: 04 Nov 2025.
Copyright: © 2025 Pinzuti, Tüscher and Ferreira Castro. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: André Ferreira Castro, andre.ferreira-castro@tum.de
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.