METHODS article
Front. Psychiatry
Sec. Psychopathology
This article is part of the Research Topic: Language and Emotions in Mental Health
Leveraging Reddit data for Context-enhanced Synthetic Health Data Generation to Identify Low Self Esteem
Provisionally accepted
- 1 Mayo Clinic, Rochester, United States
- 2 Barrow Neurological Institute, Phoenix, United States
Low self-esteem (LoST) is a latent yet critical psychosocial risk factor that predisposes individuals to depressive disorders. Although structured tools exist to assess self-esteem, their limited clinical adoption suggests that relevant indicators of LoST remain buried within unstructured clinical narratives. The scarcity of annotated clinical notes impedes the development of natural language processing (NLP) models for its detection. Manual chart reviews are labor-intensive, and large language model (LLM)-driven (weak) labeling raises privacy concerns. Past studies demonstrate that NLP models trained on LLM-generated synthetic clinical notes achieve performance comparable to, and sometimes better than, that of models trained on real notes. This highlights the utility of synthetic data for augmenting scarce clinical corpora while reducing privacy concerns. Prior efforts have leveraged social media data, such as Reddit, to identify linguistic markers of low self-esteem; however, the linguistic and contextual divergence between social media and clinical text limits the generalizability of these models. To address this gap, we present a novel framework that generates context-enhanced synthetic clinical notes from social media narratives and evaluates the utility of small language models for identifying expressions of low self-esteem. Our approach includes a mixed-method evaluation framework covering (i) structure analysis, (ii) readability analysis, (iii) linguistic diversity, and (iv) contextual fidelity of LoST cues in source Reddit posts and synthetic notes. This work offers a scalable, privacy-preserving solution for synthetic data generation for early detection of psychosocial risks such as LoST, and demonstrates a pathway for translating mental health signals in clinical notes into clinically actionable insights, thereby helping to identify patients at risk.
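As an illustration of the readability and linguistic diversity components of the evaluation framework described above, the following is a minimal Python sketch. The abstract does not name specific metrics, so the choices here are assumptions: Flesch Reading Ease stands in for readability and type-token ratio for linguistic diversity. The function names, the simple tokenizer, and the vowel-group syllable heuristic are all hypothetical and are not the authors' implementation.

```python
import re


def tokenize(text):
    """Return lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(text):
    """Linguistic diversity proxy: unique tokens divided by total tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def count_syllables(word):
    """Rough heuristic: count runs of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def flesch_reading_ease(text):
    """Readability proxy using the standard Flesch Reading Ease formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = tokenize(text)
    if not sentences or not tokens:
        return 0.0
    syllables = sum(count_syllables(t) for t in tokens)
    words_per_sentence = len(tokens) / len(sentences)
    syllables_per_word = syllables / len(tokens)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical example texts: a social media style post and a note-style rewrite.
post = "I feel worthless. I feel like a failure."
note = "Patient reports persistent feelings of worthlessness and failure."

for label, text in [("post", post), ("note", note)]:
    print(label, round(type_token_ratio(text), 3), round(flesch_reading_ease(text), 2))
```

Comparing such scores between a source post and its synthetic counterpart gives a rough sense of how far the generated text has shifted in register. The syllable heuristic is approximate, so absolute readability values should be treated with caution.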
Keywords: Clinical Notes, Llama, self-esteem, small language model, Synthetic data generation
Received: 15 Oct 2025; Accepted: 15 Dec 2025.
Copyright: © 2025 Garg, Liu, Jeon, Biernacka, Frye, Geda and Sohn. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Sunghwan Sohn
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
