METHODS article

Front. Psychiatry

Sec. Psychopathology

This article is part of the Research Topic: Language and Emotions in Mental Health

Leveraging Reddit data for Context-enhanced Synthetic Health Data Generation to Identify Low Self Esteem

Provisionally accepted
Muskan Garg1, Xingyi Liu1, Eunji Jeon1, Joanna Biernacka1, Mark A. Frye1, Yonas E. Geda2, Sunghwan Sohn1*
  • 1Mayo Clinic, Rochester, United States
  • 2Barrow Neurological Institute, Phoenix, United States

The final, formatted version of the article will be published soon.

Low self-esteem (LoST) is a latent yet critical psychosocial risk factor that predisposes individuals to depressive disorders. Although structured tools exist to assess self-esteem, their limited clinical adoption suggests that relevant indicators of LoST remain buried within unstructured clinical narratives. The scarcity of annotated clinical notes impedes the development of natural language processing (NLP) models for its detection. Manual chart review is labor-intensive, and weak labeling driven by large language models (LLMs) raises privacy concerns. Past studies demonstrate that NLP models trained on LLM-generated synthetic clinical notes achieve performance comparable to, and sometimes better than, those trained on real notes. This finding highlights the utility of synthetic data for augmenting scarce clinical corpora while reducing privacy concerns. Prior efforts have leveraged social media data, such as Reddit, to identify linguistic markers of low self-esteem; however, the linguistic and contextual divergence between social media and clinical text limits the generalizability of these models. To address this gap, we present a novel framework that generates context-enhanced synthetic clinical notes from social media narratives and evaluates the utility of small language models for identifying expressions of low self-esteem. Our approach includes a mixed-method evaluation framework covering (i) structure analysis, (ii) readability analysis, (iii) linguistic diversity, and (iv) contextual fidelity of LoST cues in source Reddit posts and synthetic notes. This work offers a scalable, privacy-preserving solution for synthetic data generation to support early detection of psychosocial risks such as LoST, and demonstrates a pathway for translating mental health signals in clinical notes into clinically actionable insights, thereby helping to identify patients at risk.

Keywords: clinical notes, Llama, self-esteem, small language model, synthetic data generation

Received: 15 Oct 2025; Accepted: 15 Dec 2025.

Copyright: © 2025 Garg, Liu, Jeon, Biernacka, Frye, Geda and Sohn. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Sunghwan Sohn

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.