SYSTEMATIC REVIEW article

Front. Psychol., 27 February 2026

Sec. Health Psychology

Volume 16 - 2025 | https://doi.org/10.3389/fpsyg.2025.1715306

Exploring the application boundaries of LLMs in mental health: a systematic scoping review

  • 1. The School of Humanities, Tongji University, Shanghai, China

  • 2. Faculty of Applied Sciences, Macao Polytechnic University, Macao, Macau SAR, China

  • 3. School of Digital Technology and Innovation Design, Jiangnan University, Wuxi, China

  • 4. Information Security and Assurance, Northern Arizona University, Flagstaff, AZ, United States

  • 5. Science in Computer Science, Georgia Institute of Technology, Atlanta, GA, United States


Abstract

Background:

The rapid evolution of large language models (LLMs) has ushered in a new era of artificial intelligence (AI) with unprecedented capabilities in understanding and generating human-like text. This progress has sparked a burgeoning interest in applying LLMs across diverse fields, including healthcare. However, the use of LLMs in mental health remains a complex area that demands rigorous investigation. This systematic scoping review aims to explore the current landscape of LLM applications in mental health, identify key research trends and gaps, and delineate the ethical and practical boundaries, thereby providing a comprehensive framework for future research and clinical practice.

Methods:

This study adheres to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines. A comprehensive search was conducted across eleven databases (Web of Science, Scopus, PubMed, Medline, CINAHL, Cochrane, ACM Digital Library, IEEE Xplore, ScienceDirect, APA PsycInfo, and Google Scholar). A total of 29 articles were ultimately included in the study.

Results:

The application of LLMs in mental health is strategically focused on high-throughput screening and clinical augmentation. The application landscape is characterized by domain specialization, with the focus shifting from general-purpose models to specialized BERT-based models to achieve higher clinical accuracy, particularly for high-prevalence disorders such as depression and for high-risk conditions such as suicidality. Data analysis is powered by massive, unstructured corpora from social media, supplemented by the systematic incorporation of structured clinical knowledge. However, significant limitations exist, including insufficient cultural sensitivity in non-Western contexts, challenges in capturing longitudinal patient history, and critical risks related to model value alignment and the generation of clinically misleading information.

Conclusion:

LLMs have emerged as sophisticated “Mental Health Agents” with immense potential for providing personalized, knowledge-guided interventions. The core challenge for future development is to transcend basic functionality and achieve clinical rigor. Future research must prioritize deep specialization into psychological models, enhance multimodal integration for comprehensive patient assessment, and urgently develop robust ethical and cultural adaptation frameworks to ensure the models are safe, globally equitable, and reliable for clinical deployment, thereby fulfilling their potential to alleviate the global mental health resource crisis.

1 Introduction

Mental health disorders represent a pressing global public health issue, imposing a substantial burden on individuals, families, and societal and economic development (World Health Organization, 2022). According to the World Health Organization (WHO), mental disorders are among the leading contributors to the global disease burden (Vigo et al., 2016). These conditions severely disrupt daily life, occupational functioning, and interpersonal relationships. For instance, major depressive disorder often manifests as persistent low mood, loss of interest, and fatigue, significantly impairing an individual’s ability to study or work (Pan et al., 2019). Generalized anxiety disorder is characterized by uncontrollable worry, tension, and physical discomfort, making it difficult for patients to focus on routine tasks (Rowa et al., 2017). Post-traumatic stress disorder (PTSD), triggered by traumatic events, can lead to intrusive memories, avoidance behaviors, and emotional numbing, markedly reducing social functioning and quality of life (Omopo, 2024). These disorders are not merely collections of clinical symptoms; they exert systematic negative impacts on cognition, emotion, and behavior, ultimately resulting in decreased productivity, increased medical expenditures, social isolation, and functional decline (Fish et al., 2024). Therefore, early identification and effective support for these mental illnesses are crucial to mitigating their long-term individual and societal consequences (Kirkbride et al., 2024).

To address this challenge, the development of innovative, accessible mental health services is foundational for implementing effective interventions and support. Traditional diagnostic models for mental health rely heavily on clinicians’ subjective judgments and structured or semi-structured interviews. While these models are based on clinicians’ extensive clinical experience, they face multiple structural challenges in addressing the global, growing patient population (Stein et al., 2022). These include geographical disparities in professional medical resources, subjectivity and time consumption in the diagnostic process, and barriers to timely and effective care due to stigma, financial constraints, or geographic limitations (Perkins et al., 2018). These challenges collectively drive the urgent need for innovative, scalable, and efficient solutions. In this context, digital technologies, particularly LLMs, are emerging as key drivers of transformation in mental health service systems.

Digital technologies, particularly AI, are widely recognized as transformative forces in reshaping the delivery of mental health services (Liu et al., 2024). Early digital interventions, such as telemedicine platforms and mobile health applications (mHealth), have preliminarily demonstrated their potential to improve service accessibility (Pang et al., 2018). With the rapid advancement of AI technologies, their applications in assisting diagnosis, treatment support, and risk assessment have garnered increasing attention. Early machine learning (ML) models (Luo et al., 2023), such as those based on support vector machines (SVMs) (Sharma et al., 2024) or naive Bayes classifiers (Ibrahim et al., 2024), have shown promise in specific tasks. For instance, natural language processing (NLP) techniques have been used to analyze text sentiment (Li et al., 2025), and speech pattern recognition has aided in diagnostic assessments. However, these models are typically designed for specific tasks, lacking generalizability and the ability to understand complex contexts. This limits their effectiveness in handling the highly unstructured, metaphor-rich, and culturally diverse language data prevalent in mental health.

Recent breakthroughs in AI, particularly the rise of LLMs based on advanced architectures like Transformer, have opened new frontiers for mental health. LLMs (Orrù et al., 2025), trained on massive textual and multimodal data, have acquired robust Natural Language Understanding (NLU), Natural Language Generation (NLG), and contextual reasoning capabilities (Montejo-Raez et al., 2024). Unlike earlier shallow ML models, LLMs exhibit “emergent abilities”—performance gains that exceed what would be expected from mere scaling (Han et al., 2024). This enables them to capture subtle and nuanced semantic associations in human language, making them well-suited to address the complexity and subtlety inherent in mental health (Kumar et al., 2025a). The potential applications of LLMs in mental health diagnostics are multifaceted and are transitioning from proof-of-concept to real-world implementation.

First, LLMs can analyze unstructured text data from social media platforms, online forums, anonymous chat rooms, and personal diaries to identify language biomarkers associated with specific mental disorders (Owen et al., 2024). For example, LLM models can detect subtle linguistic cues indicative of hopelessness, social isolation, or suicidal ideation, providing a non-invasive and efficient method for large-scale mental health risk screening (Gao et al., 2025). Second, LLMs can serve as intelligent assistants for clinicians by processing and integrating complex clinical text data (Lin and Kuo, 2025). For instance, they can analyze electronic health records (EHRs), handwritten clinical notes, and initial consultation transcripts to extract key symptoms, medical history, and behavioral patterns (AlSaad et al., 2024). This information can be synthesized into structured, easily interpretable reports, supporting clinical decision-making and reducing the administrative burden on healthcare providers (Vrdoljak et al., 2025). Third, conversational AI powered by LLMs, such as AI chatbots, can offer patients 24/7 accessible emotional support and behavioral tracking (Chow and Li, 2025). These chatbots engage in natural, empathetic conversations, helping patients express emotions, track mood fluctuations, and implement psychology-based coping strategies (Shi, 2025). The interaction data, anonymized to protect privacy, can be used to monitor patients’ condition changes over time, providing clinicians with long-term, dynamic health data for more timely interventions (Li et al., 2024).

Despite their promise, existing reviews on LLMs in mental health present certain limitations, particularly in systematically exploring their application boundaries and potential risks. Most reviews, such as those by Gautam and Kellmeyer (2025) and Jin et al. (2025), primarily focus on technical performance validation or specific application scenarios, lacking a comprehensive examination of LLMs within broader clinical contexts. These reviews often highlight the performance advantages of LLMs in specific tasks but fail to delve into the ethical dilemmas, technical limitations, and safety risks they may encounter in real-world clinical applications. For example, there is a scarcity of systematic research on critical ethical and technical issues, such as algorithmic bias and data privacy, which are essential for ensuring the safe, fair, and effective deployment of LLMs in mental health. Addressing these gaps is vital to advancing the responsible use of LLMs in this field.

To fill these research gaps, this study aims to conduct a comprehensive, systematic scoping review of the application boundaries of LLMs in mental health. Adhering to the PRISMA-ScR guidelines (Mattos et al., 2023), the study will systematically search multiple key academic databases, rigorously screen relevant literature, and extract and synthesize data. The focus will be on the technical, ethical, and practical boundaries of LLMs in mental health clinical practice, providing scientific evidence and recommendations for technology developers, clinicians, and policymakers. Based on this, the study proposes five research questions:

  • (1) What are the geographic and temporal trends in the application of LLMs in mental health research?

  • (2) What are the primary application scenarios and technical types of LLMs in mental health?

  • (3) Which mental disorders are currently the main focus of LLM research, and what are the data sources and evaluation metrics used?

  • (4) What are the technical, ethical, and practical limitations and risks of LLMs in mental health?

  • (5) Based on the analysis of existing research, what are the future research directions and development trends?

2 Methods

2.1 Search strategy

This study screened relevant papers from eleven databases (Web of Science, Scopus, PubMed, Medline, CINAHL, Cochrane, ACM Digital Library, IEEE Xplore, ScienceDirect, APA PsycInfo, and Google Scholar). The search was conducted on June 27, 2025. The search terms combined an LLM-related block (“large language model*,” “LLM*,” “generative AI,” “GenAI,” “AIGC,” “AI chatbot*,” “conversational AI,” “natural language processing,” “NLP,” “ChatGPT,” “GPT,” “Bard,” and “Gemini”) with a mental health-related block (“mental health,” “mental illness,” “mental disorder*,” “psychiatric disorder*,” “depress*,” “anxiety,” “schizophrenia,” “bipolar disorder*,” “PTSD,” “suicid*,” and “self-harm”) (see Table 1). The PRISMA-ScR checklist is provided in Appendix AS1.

Table 1

Web of Science: (“large language model*” OR “LLM*” OR “generative AI” OR “GenAI” OR “AIGC” OR “AI chatbot*” OR “conversational AI” OR “natural language processing” OR “NLP” OR “ChatGPT” OR “GPT” OR “Bard” OR “Gemini”) (All fields) AND (“mental health” OR “mental illness” OR “mental disorder*” OR “psychiatric disorder*” OR “depress*” OR “anxiety” OR “schizophrenia” OR “bipolar disorder*” OR “PTSD” OR “suicid*” OR “self-harm”) (All fields) AND (Document Types: Article or Proceeding Paper) AND (Languages: English)

Scopus: (TITLE-ABS-KEY (“large language model*” OR “LLM*” OR “generative AI” OR “GenAI” OR “AIGC” OR “AI chatbot*” OR “conversational AI” OR “natural language processing” OR “NLP” OR “ChatGPT” OR “GPT” OR “Bard” OR “Gemini”)) AND (TITLE-ABS-KEY (“mental health” OR “mental illness” OR “mental disorder*” OR “psychiatric disorder*” OR “depress*” OR “anxiety” OR “schizophrenia” OR “bipolar disorder*” OR “PTSD” OR “suicid*” OR “self-harm”)) AND (LIMIT-TO (DOCTYPE, “ar”) OR LIMIT-TO (DOCTYPE, “cp”)) AND (LIMIT-TO (LANGUAGE, “English”))

PubMed: ((((large language model*) OR (LLM*) OR (generative AI) OR (GenAI) OR (AIGC) OR (AI chatbot*) OR (conversational AI) OR (natural language processing) OR (NLP) OR (ChatGPT) OR (GPT) OR (Bard) OR (Gemini)) AND ((mental health) OR (mental illness) OR (mental disorder*) OR (psychiatric disorder*) OR (depress*) OR (anxiety) OR (schizophrenia) OR (bipolar disorder*) OR (PTSD) OR (suicid*) OR (self-harm)))) Filters: Full text

Medline: large language model* OR LLM* OR generative AI OR GenAI OR AIGC OR AI chatbot* OR conversational AI OR natural language processing OR NLP OR ChatGPT OR GPT OR Bard OR Gemini AND mental health OR mental illness OR mental disorder* OR psychiatric disorder* OR depress* OR anxiety OR schizophrenia OR bipolar disorder* OR PTSD OR suicid* OR self-harm

CINAHL: ((MH “large language model*”) OR TI (“large language model*” OR “LLM*” OR “generative AI” OR “GenAI” OR “AIGC” OR “AI chatbot*” OR “conversational AI” OR “natural language processing” OR “NLP” OR “ChatGPT” OR “GPT” OR “Bard” OR “Gemini”)) AND ((MH “mental health”) OR TI (“mental health” OR “mental illness” OR “mental disorder*” OR “psychiatric disorder*” OR “depress*” OR “anxiety” OR “schizophrenia” OR “bipolar disorder*” OR “PTSD” OR “suicid*” OR “self-harm”))

Cochrane: large language model* AND mental health

ACM Digital Library: large language model* AND mental health

IEEE Xplore: (“Full Text & Metadata”: large language model* OR LLM* OR generative AI OR GenAI OR AIGC OR AI chatbot* OR conversational AI OR natural language processing OR NLP OR ChatGPT OR GPT OR Bard OR Gemini) AND (“Full Text & Metadata”: mental health OR mental illness OR mental disorder* OR psychiatric disorder* OR depress* OR anxiety OR schizophrenia OR bipolar disorder* OR PTSD OR suicid* OR self-harm) Filters Applied: Conferences, Early Access Articles, Journals

ScienceDirect: large language model* OR LLM* OR generative AI OR GenAI OR AIGC OR AI chatbot* OR conversational AI OR natural language processing OR NLP OR ChatGPT OR GPT OR Bard OR Gemini AND mental health OR mental illness OR mental disorder* OR psychiatric disorder* OR depress* OR anxiety OR schizophrenia OR bipolar disorder* OR PTSD OR suicid* OR self-harm “Article type: Research articles”

APA PsycInfo: large language model* AND mental health

Google Scholar: large language model* OR LLM* OR generative AI OR GenAI OR AIGC OR AI chatbot* OR conversational AI OR natural language processing OR NLP OR ChatGPT OR GPT OR Bard OR Gemini AND mental health AND mental illness AND mental disorder* AND psychiatric disorder* AND depress* AND anxiety AND schizophrenia AND bipolar disorder* AND PTSD AND suicid* AND self-harm

Selected databases and search formats.
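The two-block boolean structure behind the Table 1 formulas can be sketched programmatically. The snippet below (an illustrative sketch, not the authors' actual tooling) assembles a Scopus-style query from the two term lists reported in the search strategy:

```python
# Term lists copied from the review's search strategy (Table 1).
llm_terms = [
    "large language model*", "LLM*", "generative AI", "GenAI", "AIGC",
    "AI chatbot*", "conversational AI", "natural language processing",
    "NLP", "ChatGPT", "GPT", "Bard", "Gemini",
]
mh_terms = [
    "mental health", "mental illness", "mental disorder*",
    "psychiatric disorder*", "depress*", "anxiety", "schizophrenia",
    "bipolar disorder*", "PTSD", "suicid*", "self-harm",
]

def or_block(terms):
    """Quote each term and join with OR, as database interfaces expect."""
    return " OR ".join(f'"{t}"' for t in terms)

# Scopus-style query: both blocks must match in title/abstract/keywords.
query = f"TITLE-ABS-KEY ({or_block(llm_terms)}) AND TITLE-ABS-KEY ({or_block(mh_terms)})"
print(query)
```

Equivalent blocks for the other databases differ only in field tags and filter syntax, as Table 1 shows.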

2.2 Data selection and extraction

Records were first imported into the reference management software EndNote, where automated screening removed duplicates and records marked as ineligible under pre-set criteria before manual screening began. Two independent reviewers (JY and TL) then conducted a preliminary screening of article titles and abstracts against predetermined inclusion criteria; any discrepancies between the two reviewers were resolved through consultation with a third reviewer (YL). The inclusion criteria were as follows: (1) studies specifically targeting LLMs in mental health; (2) research on LLM technologies for mental health services; (3) research articles and conference papers; (4) full-text articles and conference papers published in English. The inclusion and exclusion criteria were designed with a specific focus on the application of LLMs in mental health. This study prioritizes empirical investigations of how LLMs are implemented in mental health services, thereby excluding research that primarily explores individuals’ perceptions, attitudes, or opinions regarding LLMs in these domains, as well as algorithm-comparison studies. Review articles (e.g., narrative or systematic reviews) were also excluded, as they synthesize existing literature rather than present original applications. To ensure comprehensive coverage of diverse research methodologies, the inclusion criteria encompassed qualitative, quantitative, and mixed-methods studies, thereby capturing a holistic range of evidence on the implementation of LLMs in practice. These criteria are summarized in Table 2.
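The automated duplicate-removal step described above can be sketched as follows. This is a minimal illustration with hypothetical record dicts; the review itself used EndNote's built-in deduplication, not custom code:

```python
def normalize(title: str) -> str:
    # Lowercase and strip non-alphanumerics so trivial formatting
    # differences (punctuation, casing) do not hide duplicates.
    return "".join(ch for ch in title.lower() if ch.isalnum())

def deduplicate(records):
    """Keep the first occurrence of each record, keyed by DOI when
    available, otherwise by normalized title."""
    seen, unique = set(), []
    for rec in records:
        key = rec.get("doi") or normalize(rec["title"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical records: the second differs only in punctuation and
# shares a DOI with the first, so it is dropped.
records = [
    {"title": "LLMs in Mental Health", "doi": "10.1/x"},
    {"title": "LLMs in Mental Health.", "doi": "10.1/x"},
    {"title": "AI Chatbots for Anxiety", "doi": ""},
]
print(len(deduplicate(records)))  # → 2
```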

Table 2

Inclusion criteria:

  • Research on LLM technologies for mental health services.

  • Research on the application of LLM technology in mental health services.

  • Research-type articles and conference papers.

  • Full text in English.

Exclusion criteria:

  • Research on technologies other than LLMs in the mental health field.

  • Research on algorithm comparison, attitudes, views, intentions, benefits, obstacles, impacts, experiences, and usage demands towards LLM technology.

  • Review articles, theses, non-academic publications, book chapters, etc.

  • Full text in other languages.

Inclusion and exclusion criteria.

2.3 Data charting

Following the scoping review methodology guidance of the PRISMA-ScR guidelines (McGowan et al., 2020), a data extraction table was developed. After a purposive pilot test on five articles selected to represent the diversity of the included studies (e.g., different study designs and LLM applications), the table was refined to ensure comprehensive and relevant data capture. The final data extraction table included the following items: author, year, country, research methodology, type of technology/model, type of issues, mental health application, data source, application performance metrics, limitations, and future research directions. All data were extracted by two independent reviewers, and any disagreements arising during data extraction were resolved through consultation with a third reviewer, ensuring the accuracy and consistency of the extracted information.

2.4 Collating, summarizing, and reporting the results

The extracted data were synthesized and analyzed using a narrative synthesis approach to address the research questions of the review. Descriptive findings, such as the distribution of articles by year, country, or research method, were presented through graphs and charts to provide a visual overview of the research landscape. The qualitative findings, particularly those related to the application boundaries, limitations, and future trends, were thematically analyzed and explained through a detailed narrative to provide a comprehensive and nuanced discussion. All explanations and interpretations were verified by all authors to ensure the rigor and validity of the final report.
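The descriptive tallies behind the year/country distribution charts amount to simple frequency counts. The sketch below uses a subset of (year, country) pairs drawn from Table 3 purely for illustration:

```python
from collections import Counter

# (year, country) pairs for a subset of the included studies (Table 3).
studies = [
    (2024, "Canada"), (2025, "Saudi Arabia"), (2024, "Israel"),
    (2024, "US"), (2025, "Italy"), (2025, "China"), (2024, "China"),
]

# Tally publications by year and by country for the descriptive charts.
by_year = Counter(year for year, _ in studies)
by_country = Counter(country for _, country in studies)
print(by_year.most_common(), by_country.most_common(2))
```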

3 Results

A total of 10,743 articles were retrieved through the systematic search; the search process and results are shown in Figure 1. To ensure consistency between the methods, results, and the PRISMA flowchart, we explicitly define the key terms used in the selection process: “records” refer to all initial items retrieved from the databases; “reports” or “articles” refer to unique publications assessed for eligibility; and “studies” refer to the final set of reports that met all inclusion criteria. Two reviewers independently screened article titles and abstracts, excluding 3,221 articles that were not directly related to the research topic as well as 20 non-English articles. The term “reports not retrieved” refers specifically to articles identified as potentially relevant during title and abstract screening but whose full-text documents could not be accessed by the reviewers. The two reviewers then conducted a thorough full-text assessment of the remaining 58 articles and excluded 29 whose topics did not match the main focus of this study. Ultimately, 29 articles were included in this systematic scoping review (see Table 3).

Figure 1

Table 3

Each entry lists Author/Year/Country, followed by Methods, Technology/model, Mental issue, Mental health application, Data source, Application performance metrics, Limitations, and Future trends.

Abdullah et al., 2024, Canada (Abdullah and Negied, 2024). Methods: Quantitative. Technology/model: GPT, BERT. Mental issue: Depression, Anxiety, PTSD, ADHD. Application: Diagnosis, Prediction. Data source: 167,444 clinical social media posts. Metrics: F1-score. Limitations: Cross-validation should be considered to validate the obtained results, validate the hyperparameters, and fine-tune them based on validation results. Future trends: Future mental disorders from social media data using ML, Ensemble Learning, and LLMs.

Al-Otaibi et al., 2025, Saudi Arabia (Al-Otaibi et al., 2025). Methods: Mixed. Technology/model: Transformer, AI Chatbot. Mental issue: Night terror, Depression, Social phobia, Panic attacks, Anhedonia, Borderline personality disorder. Application: Support. Data source: 7,245 tokens from six conversations. Metrics: Error. Limitations: AI risks potentially harm vulnerable users; lack of cross-system comparison. Future trends: Explore and compare the performance of alternative systems to provide a broader assessment of AI capabilities in this domain.

Bartal et al., 2024, Israel (Bartal et al., 2024). Methods: Mixed. Technology/model: Sentence-transformers PLMs, ChatGPT. Mental issue: Childbirth-related post-traumatic stress disorder (CB-PTSD). Application: Identify. Data source: Narratives of length 30 words from 1,295 women. Metrics: F1 score, sensitivity, specificity, AUC. Limitations: Limitations in analyzing shorter narratives. Future trends: This textual personal narrative-based assessment strategy employing NLP analysis has the potential to become an accurate, efficient, low-cost, and patient-friendly strategy.

Bauer et al., 2024, US (Bauer et al., 2024). Methods: Mixed. Technology/model: LLM embeddings, ChatGPT, eXplainable Artificial Intelligence (XAI). Mental issue: Suicidality. Application: Understand. Data source: 2.9 million posts from social media. Metrics: Mean (SD), proportion of closest, singular value decomposition. Limitations: Data time range restrictions and source platform diversity restrictions. Future trends: Expand the time frame of data collection; explore other web-based platforms; integrate additional data sources, such as user comments.

Belcastro et al., 2025, Italy (Belcastro et al., 2025). Methods: Quantitative. Technology/model: ChatGPT, BERT-XDD model, XAI. Mental issue: Depression. Application: Detect. Data source: 10,251 tweets + 7,650 posts from social media text. Metrics: Accuracy, Precision, Recall, F1 score. Limitations: Labeled data is scarce; lacks temporal dynamics; explanatory models may be oversimplified; cannot replace clinical judgment. Future trends: Focus on dynamic and longitudinal modeling to capture context and temporal patterns.

Cai et al., 2025, China (Cai et al., 2025). Methods: Quantitative. Technology/model: BERT, ChatGPT. Mental issue: General mental health. Application: Information extraction. Data source: Task instances generated by ChatGPT. Metrics: F1 score. Limitations: Computational efficiency and scalability constraints; cross-domain adaptability unverified. Future trends: Explore a dynamic verification framework; optimize computational efficiency and scalability; conduct cross-domain validation.

Cardamone et al., 2025, US (Cardamone et al., 2025). Methods: Mixed. Technology/model: ChatGPT. Mental issue: General mental health. Application: Prediction. Data source: 1,000 records from the clinic terms. Metrics: Recall, F1-score. Limitations: LLM hallucination and bias risks; single-label classification limits complexity. Future trends: Expand coding and classification methods (e.g., multi-label classification) and increase clinical coders.

Cremaschi et al., 2025, Italy (Cremaschi et al., 2025). Methods: Mixed. Technology/model: ChatGPT, Retrieval Augmented Generation (RAG) model. Mental issue: General mental health. Application: Decision support. Data source: Created from the ICD-11 classification system. Metrics: Accuracy, Precision, Recall, F1-score. Limitations: Lack of patient history integration (e.g., symptom evolution, age, medication history). Future trends: Integrate cloud computing; incorporate temporal evolution, age, and pharmacological information; focus on comorbidity management and atypical clinical presentations.

Dos Santos et al., 2025, Brazil (Dos Santos et al., 2025). Methods: Mixed. Technology/model: BERT, BiLSTM. Mental issue: Depression, Anxiety. Application: Prediction. Data source: 19.4 million messages from depression data (Twitter/X textual and non-textual data in Portuguese and Reddit textual data in English). Metrics: Precision, Recall, F1-score. Limitations: Discrepancies between model and human judgment (due to clinical caution); temporal information intentionally omitted. Future trends: Explore the ethical implications of the model; maintain responsible AI development.

Fan et al., 2024, China (Fan et al., 2024). Methods: Quantitative. Technology/model: ChatGPT, AI Chatbot. Mental issue: Depression, Anxiety, Bipolar Disorder, PTSD, Panic Disorder, Eating Disorder. Application: Prediction, Intervention, Suggestion. Data source: 1,667 entries labeled with psychological disorders from the efaqa dataset, and 1,300 dialogues from the ESConV dataset. Metrics: Accuracy, Precision, Recall. Limitations: Ethical and privacy issues regarding the use of data. Future trends: Future work will explore the model’s ethical implications, maintaining a focus on responsible AI development.

Fennig et al., 2024, Israel (Fennig et al., 2025). Methods: Mixed. Technology/model: ChatGPT. Mental issue: Epilepsy, Depression, Anxiety. Application: Assessment. Data source: 768,504 posts from 21,906 Reddit users. Metrics: Hazard ratios (HRs). Limitations: Reddit user base lacks general representativeness; healthcare environment constraints; text model’s content understanding variations. Future trends: Address social stigmatization caused by mental health issues.

Hadar-Shoval et al., 2024, Israel (Hadar-Shoval et al., 2024). Methods: Quantitative. Technology/model: ChatGPT. Mental issue: General mental health. Application: Assessment. Data source: 53,472 individuals across 49 nations. Metrics: Internal reliability and intercorrelations of Schwartz’s values, split-half reliability agreement, and confirmatory factor analysis. Limitations: Small LLM sample size; difficult to isolate model capabilities from built-in guardrails; value construct robustness unassessed. Future trends: Evaluate impact of subtle prompt changes on model values; validate predictive validity.

James et al., 2024, The Netherlands (James et al., 2023). Methods: Mixed. Technology/model: ChatGPT. Mental issue: Severe Mental Illnesses (SMI). Application: Measurable treatment plan goals. Data source: During the five rounds of investigation, 8 students were given two final goals by the LLM. Metrics: Average scores for both goals in each phase. Limitations: UI/UX not assessed; no testing with SMI patients or case managers. Future trends: Recruit SMI patients and case managers to evaluate workflow; improve UI; investigate SMI patient attitudes toward privacy and technology.

Karamat et al., 2024, Pakistan (Karamat et al., 2024). Methods: Quantitative. Technology/model: Hybrid transformer: MentalBERT and MelBERT models (CNN). Mental issue: Depression, Anxiety, Borderline personality disorder (BPD), PTSD. Application: Prediction. Data source: 40,000 samples from Reddit posts. Metrics: Accuracy, Precision, Recall, F1-score, ROC, AUC, Loss. Limitations: Evaluated on a small dataset, failing to capture the full spectrum of disorders/language patterns. Future trends: Scale up dataset size and diversity for validation.

Kharitonova et al., 2024, Spain (Kharitonova et al., 2025). Methods: Mixed. Technology/model: ChatGPT. Mental issue: Depression, ADHD. Application: Content extraction. Data source: 10 questions formulated for each scenario with their corresponding correct answers. Metrics: Coherence, Veracity, Evidence. Limitations: Limited context window; restricted answer search space; inconsistent multilingual support quality. Future trends: Experiment with alternative LLMs; analyze multilingual support; evaluate text simplifiers; integrate LLM engine ensembles; hierarchical information organization; extend to multimodal systems.

Kim et al., 2024, Korea (Kim et al., 2024). Methods: Mixed. Technology/model: ChatGPT. Mental issue: Depression. Application: Support. Data source: A four-week field study involving 28 patients with major depressive disorder and five psychiatrists. Metrics: Coding paired with thematic analysis. Limitations: Recruitment method impacts generalizability (e.g., young patients, fixed psychiatrists); GPT model updates cause behavioral inconsistency. Future trends: Compare multiple app versions and underlying LLMs; conduct further studies with diverse backgrounds.

Kumar et al., 2025, UK (Kumar et al., 2025b). Methods: Quantitative. Technology/model: BioGPT, DeBERTa. Mental issue: General mental health. Application: Classification. Data source: DepSeverity (3,553 Reddit posts), SDCNL (1,895 posts), and Dreaddit (1,191 posts). Metrics: Accuracy, Precision, Recall, F1-score, AUC, ROC. Limitations: Reliance on textual data restricts applicability; Reddit data limits cross-cultural/platform generalizability. Future trends: Evaluate other platforms; integrate multimodal inputs (e.g., audio, physiological signals); include real-time human feedback; explore personalized language modeling.

Lozoya et al., 2025, Australia (Lozoya et al., 2025). Methods: Mixed. Technology/model: ChatGPT. Mental issue: Depression, Anxiety. Application: Simulated psychotherapy client interactions. Data source: 19-question survey. Metrics: T test. Limitations: Small sample size of synthetic therapy sessions. Future trends: Increase professional involvement or session count; evaluate effectiveness as an educational tool; explore cross-language and cross-dictionary adaptation.

Malhotra et al., 2024, India (Malhotra and Jindal, 2024). Methods: Mixed. Technology/model: BERT, XAI. Mental issue: Depression, Suicidality. Application: Classification, Interpretation. Data source: 262,922 tweets from four datasets. Metrics: Classification performance evaluation, LIME explanation. Limitations: Self-reported data lacks manual sanity checks (e.g., for sarcasm, metaphor), leading to false positives. Future trends: Use data sanity check protocols (manual/automated); explore XAI applications; explore multiclass/multilabel/multilingual data analysis and bias detection.

Nowacki et al., 2025, Poland (Nowacki et al., 2025). Methods: Quantitative. Technology/model: MentalBERT, MentaLLaMA. Mental issue: General mental health. Application: Classification. Data source: 3,553 posts from Reddit, 6,850 SMS-like messages, 5,051 records from CAMS, 3,029 records from IRF. Metrics: F1-score. Limitations: Flan-T5 architecture required additional stabilization layers. Future trends: Different LLMs, despite their different architectures or training methods, achieve a similar level of performance, indicating their flexibility, comparable quality, and ability to process complex data.

Park et al., 2024, Korea (Park et al., 2024). Methods: Quantitative. Technology/model: ChatGPT, Zero-shot learning, Medical knowledge graphs. Mental issue: Depression. Application: Information extraction. Data source: 3,793 PubMed abstracts from BioCreative Datasets, 21,082 documents from Document-Level Relationship Extraction Datasets. Metrics: F1 score. Limitations: The limited dataset. Future trends: Extend across diverse medical datasets and expand the types of entities involved.

Pavez et al., 2024, Chile (Pavez and Allende, 2024). Methods: Quantitative. Technology/model: BERT, Explainability of Bayesian Networks. Mental issue: General mental health. Application: Diagnosis, Classification. Data source: 2.3 million data points from social media. Metrics: Precision, Recall, F1-Score, Support. Limitations: Method fails to fully capture complex nuances and inherent uncertainties of real-world patient interactions. Future trends: Deepen collaboration with field professionals for comprehensive assessment; integrate critical clinical factors (e.g., patient history, hereditary traits).

Radwan et al., 2024, US (Radwan et al., 2024). Methods: Quantitative. Technology/model: nBERT. Mental issue: General mental health. Application: Classification, Recognition. Data source: 2,021 samples. Metrics: Accuracy, Precision, Recall, F1-score. Limitations: nBERT generalizability remains a challenge in multilingual and diverse datasets. Future trends: Explore cross-lingual fine-tuning; employ transfer learning/cross-domain adaptation; integrate multimodal data.

Shayaninasab et al., 2024, Canada (Shayaninasab et al., 2024). Methods: Quantitative. Technology/model: ChatGPT, AI Chatbot. Mental issue: Depression. Application: Assessment, Support. Data source: 20 conversational examples (4 examples × 5 depression levels) completed with an average of 28 conversation turns, for a total of 562 conversation turns. Metrics: Depression level. Limitations: Model cannot classify a single input into multiple topics; RAG implementation relies on carefully selected resources. Future trends: Adopt more sophisticated strategies to address class imbalance.

Wagay et al., 2024, India (Wagay and Altaf, 2025). Methods: Quantitative. Technology/model: MentalRoBERTa(6), Capsule Layer, LIME (Local Interpretable Model-agnostic Explanations). Mental issue: General mental health. Application: Classification. Data source: 3,553 posts from Reddit. Metrics: Recall, F1-Score. Limitations: Platform-specific language remains a challenge; F1/Recall is limited for class imbalance; LIME only offers local, approximate explanations. Future trends: Adopt more sophisticated strategies to address these issues.

Wang et al., 2024, China (Wang et al., 2025). Methods: Quantitative. Technology/model: ScaleLLM. Mental issue: Depression. Application: Assessment. Data source: Responses from 70,692 participants. Metrics: Accuracy, Precision, F1-score. Limitations: Limited capability to process structured data; rapid evolution of research may outpace LLM updates; cross-cultural/language adaptability is a major challenge; lack of real-world clinical validation. Future trends: Additional language alignment steps; cross-cultural adaptability; rigorous testing and validation in practical clinical scenarios.

Wu et al., 2024, China (Wu et al., 2024). Methods: Mixed. Technology/model: AI Chatbot. Mental issue: Understanding, comforting, evoking, and scaffolding habits. Application: Persuasion. Data source: 5-week field experiment (N = 25). Metrics: Kruskal-Wallis test, Intervention Acceptance Rate. Limitations: Experimental group limited to young adults; short field experiment time; validity/reliability needs improvement; detection relies on self-reporting; only considers initial use; GPT-3.5 performance is unstable. Future trends: Expand sample size and diversity; improve experimental design; enhance validity and reliability tests; explore lighter and more robust LLMs.

Zhang T. et al., 2024, Australia (Zhang T. et al., 2024). Methods: Quantitative. Technology/model: ChatGPT, Zero-shot learning. Mental issue: General mental health. Application: Prediction. Data source: Investigation of 150 university students. Metrics: Zero-shot Mean Absolute Errors. Limitations: Subjectivity of self-reported datasets; imbalanced class distributions lead to model bias. Future trends: Conduct fine-tuning tasks for daily activity-driven models; increase dataset size or use resampling techniques.

Zhang X. et al., 2024, China (Zhang X. et al., 2024). Methods: Quantitative. Technology/model: AI Chatbot. Mental issue: Depression. Application: Diagnosis, Detection. Data source: 1,339 conversations from a depression diagnosis dataset. Metrics: Precision, Recall, F1-score. Limitations: Inappropriate for real clinical application; Chinese conversational agent; lacks reliable strategy for optimal training stopping point. Future trends: Invite wider community participation to enhance the model; work with different languages.

Overview of study characteristics.

3.1 Characteristics of studies

The surge in research on LLMs in mental health is driven by both rapid technological breakthroughs and global public health necessity. All 29 papers analyzed were published or accepted within the narrow timeframe of 2024 to 2025. This trend is a direct result of the revolutionary leaps made by general-purpose LLMs, like GPT-4, since 2023, particularly in complex reasoning, emotional understanding, and generating high-quality, human-like text. This technological capability intersected with the deepening global mental health resource crisis, in which hundreds of millions lack effective psychological support. Consequently, the research focus has fundamentally shifted: LLMs are now viewed as key strategic digital assets and are being developed as digital mental health agents capable of offering clinical decision support, multimodal data integration, and personalized therapeutic interventions, leveraging their low marginal cost and high scalability to address the resource gap.

The geographic distribution of the research (see Figure 2) (China, the US, and Israel leading the list) clearly illustrates the stratification based on regional economic strength, technological maturity, and public health strategy. China, with 5 papers, leads the world in output, a pattern reflecting its national strategy for basic AI technology localization. Chinese research heavily focuses on developing specialized Chinese foundation models and building knowledge-guided therapeutic applications, aiming to solve the massive mental health resource deficit within its vast population and unique cultural context, emphasizing model professionalism, interpretability, and safety. The US (3 papers) and Israel (3 papers) form the next tier, but with distinct foci: The US leverages its leading data infrastructure and advanced clinical IT systems (EHRs, large-scale social media data) to pursue automated risk prediction and deep integration into clinical workflows. Conversely, Israel, a high-tech innovation hub, focuses on the ethical and psychological depth of AI, concentrating on LLM’s capacity for mentalization, emotional intelligence, and rigorous evaluation of its alignment with human values before widespread deployment.

Figure 2

The middle tier, with 2 papers each from Australia, Canada, Italy, Korea, and India, represents specialized technological penetration. For example, Italy’s work is tailored to its mature but strictly regulated public healthcare system, developing RAG models based on ICD-11 to function as expert clinical decision assistants. Finally, the 8 papers contributed by countries with a single publication each (including Brazil, Saudi Arabia, the UK, and Spain) often exhibit high pragmatism and cultural adaptation. These regions, frequently facing economic and clinical resource constraints, focus on high-yield, low-cost solutions addressing localized pain points. For instance, Saudi Arabia researches communication errors in Arabic support systems to ensure cross-cultural applicability, while Brazil explores multimodal expert systems integrating textual and non-textual social data, highlighting a global trend toward diversified, culturally sensitive, and cost-effective LLM deployment against the universal mental health crisis.

3.2 Research method of studies

The overall methodological distribution in this research reveals a structural balance between technical feasibility and clinical prudence, signifying that the study of LLMs in mental health has advanced to a phase of deep interdisciplinary validation. Out of the total 29 papers analyzed (see Figure 3), 16 papers (approximately 55.2%) utilized a purely quantitative research methodology. This segment is primarily driven by computer science and engineering, focusing on quantifiable technical metrics such as model performance, diagnostic accuracy, data prediction capabilities, and system efficiency, thus establishing the technical feasibility of LLMs as digital healthcare tools. Closely following this, 13 papers (approximately 44.8%) employed mixed research methods, combining quantitative and qualitative approaches. The near-equal split highlights a critical consensus among researchers: in the complex domain of human mental health, technical metrics alone are insufficient. The prevalence of mixed methods demonstrates the essential need to integrate rigorous performance indicators with subjective data on user experience, ethical considerations, cultural adaptability, and clinical acceptance, marking a mature shift in the field from merely asking “what can it do” to “how can it be deployed safely and responsibly.”

Figure 3

3.3 LLM technology application landscape in mental health

The comprehensive analysis, drawing from the nine distinct technology combination data points, charts a research landscape in mental health LLM applications characterized by general models as the reference point, domain specialization as the driving force, and clinical trustworthiness as the core architectural principle. ChatGPT, the most frequently mentioned single model (6 mentions, 16.2%), represents the baseline for general capabilities, and its prominent pairing with zero-shot learning (5 mentions, 13.5%) highlights the field’s successful effort to leverage general LLMs for efficient, low-resource task deployment. Crucially, the dominant research focus has shifted toward vertical domain specialization, evidenced by the overwhelming proportion of dedicated BERT-family specialized models, including MentalBERT, MentaLLaMA, BioGPT, DeBERTa, MentalRoBERTa, and nBERT (27 total mentions in combinations). This confirms the community’s recognition that achieving the higher clinical accuracy and professional controllability needed in complex mental health diagnostics requires models trained and fine-tuned on specialized psychological and medical corpora, transcending the limitations of generic language understanding.
To ensure safety and confidence in sensitive clinical use cases, trustworthiness mechanisms are highly integrated into the architecture. The combination of ChatGPT with the RAG model (3 mentions, 8.1%) is a key strategy for ensuring factual accuracy by anchoring generated responses to verified knowledge, effectively mitigating the common issue of model “hallucination.” Simultaneously, the explicit incorporation of explainable AI (XAI), highlighted by the high-frequency appearance of Explainability of Bayesian Networks (4 mentions, 10.8%) and the MentalRoBERTa architecture utilizing LIME (4 mentions, 10.8%), establishes the provision of transparent, traceable decision rationale as an indispensable technical requirement, signaling that the field has transitioned into a systematic deployment phase centered on safety, professionalism, and verifiable trust.
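As an illustration of the retrieval-grounding pattern described above, the following minimal sketch retrieves the most relevant knowledge passage by term overlap and prepends it to a prompt. The knowledge snippets, scoring scheme, and function names are hypothetical illustrations, not the pipeline of any reviewed study; production RAG systems use dense vector retrieval over curated sources such as ICD-11.

```python
# Minimal retrieval-augmented prompting sketch (illustrative only).
# Knowledge snippets and the overlap score are hypothetical; real
# systems use dense embeddings and verified clinical corpora.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, knowledge_base):
    """Return the snippet sharing the most terms with the query."""
    q = tokenize(query)
    return max(knowledge_base, key=lambda doc: len(q & tokenize(doc)))

def build_prompt(query, knowledge_base):
    """Anchor the model's answer to a retrieved, verified passage."""
    context = retrieve(query, knowledge_base)
    return (f"Context: {context}\n"
            f"Question: {query}\n"
            "Answer using only the context.")

kb = [
    "ICD-11 lists depressed mood and diminished interest as core depressive symptoms.",
    "Cognitive behavioural therapy targets unhelpful thought patterns.",
]
prompt = build_prompt("What are core depressive symptoms?", kb)
```

The point of the pattern is that the generated answer is constrained to the retrieved passage, which is what mitigates hallucination.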

3.4 Mental health issues

The largest category, “General mental health” (34.5%), establishes the LLM’s role as a large-scale, low-threshold psychological support system, primarily concentrating on non-diagnostic, generalized frameworks for psychological companionship and behavioral intervention, such as providing understanding, comforting, evoking, and scaffolding habits. This reflects the model’s immense value as a universal solution to the scarcity of mental health resources. However, the research focus quickly shifts to specific disorders with high clinical demand and social impact: depression has emerged as the central research focus due to its high prevalence, whether as a single illness (20.7%) or as the primary co-morbidity foundation for common conditions like anxiety, accounting for nearly half of all disorder combinations mentioned. This intense focus on depression signifies the immense potential of LLMs in the early screening, severity assessment, and high-risk prediction of mood disorders. Furthermore, research is actively expanding into more complex, trauma-related disorders (e.g., PTSD) and difficult-to-manage co-morbidities like borderline personality disorder (BPD), alongside specialized studies on extremely high-risk, life-safety issues such as suicidality (including standalone and combined counts); this necessitates that LLMs possess heightened professional ethics and granular reasoning capabilities. Overall, the application of LLMs in mental health is transitioning from basic text analysis to a specialized tool designed for the precise identification of high-risk individuals, assistance with complex clinical diagnosis, and provision of professional risk warnings.

3.5 Data sources for the LLM mental health

The data sources across these 29 articles reveal a dual driving force in LLM mental health research: the need to quantify vast unstructured data and simultaneously deepen clinical expertise, reflecting researchers’ focus on both high-throughput screening and professional-grade validation. Social media platforms are the unequivocally dominant source, providing a massive corpus of raw, real-world, unstructured linguistic data, with millions of posts and messages from Reddit and Twitter/X, including one dataset contributing up to 19.4 million messages. This heavy reliance on online text is the primary characteristic, aiming to leverage LLMs for high-throughput, real-time detection and risk prediction of mood disorders, particularly depression. Second, the sources underscore a systematic pursuit of professional knowledge and structured data, encompassing knowledge bases created from the ICD-11 classification system, biomedical documents like PubMed abstracts, and clinical records from clinic terms, which are critical cornerstones for building clinical decision support systems and knowledge-augmented LLMs. Furthermore, model validation and optimization are achieved through diverse customized data collection, including large-scale multi-national participant surveys, specific scenario short narratives, and both human-generated task instances and small-scale clinical field studies, signaling a shift from pure text mining towards comprehensive validation of domain-specific customization, ethical alignment, and practical clinical efficacy.

3.6 Application performance metrics

The core assessment relies heavily on standard quantitative metrics, with combinations of Accuracy, Precision, Recall, and the F1-score dominating the field, reflecting a primary goal of effective high-throughput screening and detection of mental disorders. Metrics like AUC and ROC further confirm the emphasis on the LLMs’ discriminatory power in risk prediction and binary classification tasks. Crucially, the analysis extends beyond purely technical classification into professional domains: the inclusion of psychometric measures such as Internal Reliability, Split-half Reliability, and Confirmatory Factor Analysis is utilized to ensure the reliability and validity of the psychological constructs being modeled. Furthermore, the appearance of clinical and outcome metrics like Hazard Ratios and T-tests, coupled with the emerging focus on XAI methods (e.g., LIME) and qualitative criteria (e.g., Coherence, Veracity, and Evidence), signals a strategic shift. This transition highlights a commitment to moving LLM applications from simple black-box classifiers to trustworthy, ethically aligned, and clinically interpretable decision-support tools.
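For concreteness, the dominant screening metrics named above all reduce to counts from a binary confusion matrix. The sketch below computes them from illustrative counts (the numbers are made up, not drawn from any reviewed study):

```python
# Compute the standard screening metrics from a binary confusion
# matrix. The counts are illustrative, not from any reviewed study.

def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # fraction of flagged cases that are true
    recall = tp / (tp + fn)      # fraction of true cases that are flagged
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Example: 80 true positives, 20 false positives, 10 missed cases.
m = classification_metrics(tp=80, fp=20, fn=10, tn=90)
```

The precision/recall trade-off explains why the reviewed studies report F1 alongside accuracy: in imbalanced screening data, a high accuracy can coexist with many missed high-risk cases.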

4 Discussion

4.1 Main findings and results of studies

This systematic analysis of research on LLM applications in mental health reveals a field undergoing rapid acceleration within an extremely recent timeframe, driven by unprecedented technological leaps and the urgent global necessity of addressing the immense mental health crisis. This confluence has established LLMs as essential digital mental health agents. Geographically, the research output is concentrated in a handful of technologically advanced and economically strong nations. This includes a major Asian country focusing on specialized national foundation models, localization, and ethical safety, alongside the United States, which leverages its robust data infrastructure and clinical IT systems for automated risk prediction. Another key innovation hub is concentrating on the ethical and psychological depth of AI, specifically evaluating LLMs’ capacity for mentalization and alignment with human values. Methodologically, the field is characterized by an interdisciplinary phase, balancing extensive quantitative research with near-equal attention to mixed methods that integrate technical performance with qualitative data on user experience, ethics, and cultural adaptability. Technologically, while general models serve as a capability baseline, the overwhelming focus has shifted to vertical domain specialization through dedicated, fine-tuned models, a move deemed critical for achieving the necessary clinical accuracy. This specialized architecture is heavily reinforced by trustworthiness mechanisms, such as RAG, to anchor responses to verified knowledge, and the strong integration of XAI methods, ensuring transparent and traceable clinical decision rationales. In terms of application, the largest focus is on providing broad, low-threshold General mental health support, but research is intensely concentrated on common conditions like depression and is actively expanding to complex, high-stakes disorders, including PTSD, BPD, and Suicidality. 
Finally, the data fueling these advances follows a dual track: it relies heavily on massive, unstructured social media data for real-time high-throughput screening, while also systematically incorporating structured professional knowledge bases and clinical records to build expert-grade decision support systems.

4.2 Technical, ethical, and practical limitations and risks of LLMs in mental health

4.2.1 Technical and clinical limitations

The core technical risk of LLMs lies in the deficiency of their clinical accuracy and robustness, particularly when handling high-risk scenarios (Gomes, 2024). For instance, while LLMs have been applied to analyze online discussions to identify high-risk behaviors like suicidal ideation and emotional distress, their precision and reliability in predictive and interventional tasks have not yet met strict clinical standards (Lashgari et al., 2025). A major flaw is the tendency for models to generate “hallucinations,” producing seemingly plausible but false or incorrect clinical information (Kim et al., 2025), which poses a fatal threat to diagnostic decision-support systems like LLMind Chat (Cremaschi et al., 2025). Furthermore, the predictive capacity of LLMs is heavily dependent on the quality of their training data (Yang et al., 2024). Many existing models lack sufficient critical prior knowledge and evidence-based medicine (EBM) data, preventing them from offering deep reasoning and assessments supported by clinical evidence for complex psychological issues (Youngstrom et al., 2017).

4.2.2 Multimodal data challenges

In the realm of multimodal data analysis, such as integrating EEG and physiological signals with text, LLMs show potential, yet they face major technical hurdles (AlSaad et al., 2024). These include data heterogeneity, a lack of interoperability between different sensor systems, and the challenge of establishing a clear clinical correlation between raw sensor data and a person’s mental health status. Effective analysis and reasoning on this disparate data are currently limited (Mezghani et al., 2015).

4.2.3 Practical and cultural barriers

On a practical level, cultural sensitivity represents a significant obstacle. Studies show that even advanced models (like GPT-4o) struggle markedly to identify culturally embedded high-risk narratives (Kazemi et al., 2024): for example, models have failed to detect risk signals for filicide-suicide and show limitations in processing subtle psychological cues specific to certain cultures (Chen et al., 2025). In non-Western linguistic contexts, such as Arabic mental health support inquiries, models like ChatGPT have exhibited clear communication errors, indicating not just a linguistic barrier but a profound lack of understanding of non-mainstream emotional expressions, customs, and help-seeking behaviors (Aleem et al., 2024). For resource-scarce regions (e.g., parts of Africa), LLMs trained on Western therapeutic principles like Cognitive Behavioral Therapy (CBT) lack cultural resonance with local values, limiting user trust, engagement, and effectiveness (Igwe and Durrhiem, 2025).

4.2.4 Ethical and value-alignment risks

The primary ethical risk centers on the model’s value alignment and transparency. The opaque alignment processes of LLMs can unintentionally embed and amplify societal biases, leading to advice that is prejudiced or clinically problematic, potentially harming vulnerable help-seekers (Liu et al., 2023). Concurrently, the over-humanization of conversational AI may blur the lines of the professional therapeutic relationship, leading users to develop unrealistic clinical expectations and dependency (Ngo, 2025). Finally, the handling of highly personal and sensitive mental health data raises acute privacy and security concerns, requiring robust regulatory frameworks to prevent catastrophic data breaches (Kwesi et al., 2025).

4.3 Future research directions and development trends

Based on an analysis of existing research, the future direction of LLMs in mental health will be structured around four core pillars: deep specialization, multimodal fusion, ethical framework development, and global cultural adaptability.

4.3.1 Deep technical specialization

The future trend involves moving beyond general-purpose models to develop specialized LLMs dedicated solely to psychological health. These models will be trained on high-quality, evidence-based psychological datasets that include not just single-turn QA but also multi-turn dialogues and real-world case backgrounds augmented by evidence judgment to ensure deep psychological comprehension and evaluation. Research will continue to enhance the performance of unified information extraction, especially for specific languages like Chinese, by introducing components like type verification for more accurate identification of emotions, psychological states, and underlying issues from unstructured text.
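The type-verification idea can be pictured as a post-hoc filter over extracted (mention, type) pairs, rejecting extractions whose mention is not a valid instance of its assigned type. The schema and extractions below are hypothetical illustrations of the concept, not the actual component used in the reviewed Chinese-language system:

```python
# Post-hoc type verification over extracted (mention, type) pairs.
# SCHEMA and the example extractions are hypothetical; a real system
# would derive valid instances from a curated clinical ontology.

SCHEMA = {
    "emotion": {"anxiety", "sadness", "anger"},
    "symptom": {"insomnia", "fatigue", "loss of appetite"},
}

def verify(extractions):
    """Keep only pairs whose mention is a valid instance of its type."""
    return [(mention, etype) for mention, etype in extractions
            if mention in SCHEMA.get(etype, set())]

raw = [("sadness", "emotion"),
       ("insomnia", "symptom"),
       ("sadness", "symptom")]   # mistyped pair, should be rejected
checked = verify(raw)
```

The value of such a component is precisely that it catches plausible-looking but mistyped extractions before they reach downstream assessment.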

4.3.2 Multimodal integration and empathic LLMs

Multimodal integration is poised to be a breakthrough direction, focusing on how to effectively fuse text, physiological signals (e.g., wearable data, EEG), and behavioral health data. Future Physiology-Driven empathic LLMs (Dongre, 2024) will utilize sophisticated techniques like Science-Guided ML (Sharma and Liu, 2022) to automatically extract features from raw physiological data, enabling the model to achieve precise prediction and contextual awareness of the user’s emotional state, thereby providing highly personalized and empathic interventions.
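The feature-extraction step this fusion depends on can be sketched with a toy windowed summary of a raw physiological signal. The signal values, window size, and feature choices below are made-up illustrations of the kind of input a physiology-driven model might consume, not the Science-Guided ML technique itself:

```python
# Toy feature extraction from a raw physiological signal (e.g. heart
# rate). Values and window size are illustrative only; real pipelines
# learn features from multichannel wearable or EEG data.

def window_features(signal, size):
    """Mean and range per non-overlapping window of the signal."""
    features = []
    for i in range(0, len(signal) - size + 1, size):
        w = signal[i:i + size]
        features.append({"mean": sum(w) / size,
                         "range": max(w) - min(w)})
    return features

hr = [72, 75, 74, 90, 95, 92]   # simulated heart-rate samples
feats = window_features(hr, size=3)
```

Summaries like these give the language model a compact, interpretable representation of physiological state to condition its empathic responses on.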

4.3.3 Clinical reasoning and decision support enhancement

Research will prioritize enhancing LLMs’ clinical reasoning capabilities. This includes developing advanced Chain-of-Thought prompting methods to guide models in complex synthesis and reasoning of multi-sensor data, transforming data classification into deep clinical insights for conditions like depression and anxiety. Furthermore, the RAG architecture will be optimized to verify knowledge in real-time from authoritative diagnostic manuals, serving as a core component of powerful clinical decision support systems and ensuring the professional accuracy of diagnostic suggestions and intervention plans.
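The multi-sensor chain-of-thought prompting described above can be sketched as a template that lists each signal summary and then instructs the model to reason stepwise before concluding. The template, field names, and wording are hypothetical illustrations, not a prompt from any reviewed study:

```python
# Build a chain-of-thought style prompt over multiple sensor
# summaries. Template and sensor fields are hypothetical.

def cot_prompt(sensor_summaries, question):
    lines = ["You are a clinical decision-support assistant."]
    for name, summary in sensor_summaries.items():
        lines.append(f"- {name}: {summary}")
    lines.append(f"Question: {question}")
    lines.append("Think step by step: interpret each signal, "
                 "combine the evidence, then state a conclusion.")
    return "\n".join(lines)

prompt = cot_prompt(
    {"sleep": "4.5 h average over 7 days",
     "text": "negative sentiment in recent messages"},
    "Is the pattern consistent with elevated depression risk?",
)
```

In a full system, the retrieved RAG context from authoritative diagnostic manuals would be inserted into the same template so the stepwise reasoning is grounded in verified knowledge.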

4.3.4 Ethical frameworks and global cultural adaptation

Future research will place a strong emphasis on model value alignment and cultural fairness. This involves using frameworks like Schwartz’s theory of basic values to conduct continuous, systematic evaluation and correction of LLMs’ intrinsic values, ensuring their decisions align with core human values and avoiding the embedding of harmful biases. Furthermore, research into Africa-centric LLM frameworks will aim to integrate CBT principles with indigenous values like Ubuntu through fine-tuning (Forane et al., 2024), boosting the cultural relevance of LLMs, and providing support that is globally diverse, equitable, and inclusive. Ultimately, the trend is for LLMs to evolve from simple Q&A tools into highly specialized, culturally intelligent “Health Agents” that safely and reliably alleviate the global shortage of mental health resources.

4.4 Limitations

The conclusions of this systematic scoping review, which synthesizes the current literature on the boundaries, risks, and future trends of LLMs in mental health, are inevitably subject to the following key limitations. The current evidence base is largely restricted to preliminary exploratory studies and proof-of-concept analyses, significantly lacking the clinical rigor of large-scale randomized controlled trials, thus constraining the assessment of LLM efficacy, safety, and long-term impact with high-level evidence-based medicine validation. Compounding this is the rapid, near-instantaneous evolution of LLM technology, which means the literature analyzed may be quickly outdated, posing a severe timeliness challenge in capturing the newest breakthroughs and emergent risks. Furthermore, a heavy reliance on training data from predominantly English and Chinese contexts results in models with documented deficiencies in cultural sensitivity and language generalizability when dealing with non-Western or minority language narratives, highlighting fundamental issues of cultural equity. Finally, the proprietary and opaque ‘black-box’ nature of many high-performance LLMs restricts systematic scrutiny of their embedded ethical biases and value alignment, severely limiting the reproducibility of academic findings.

5 Conclusion

This systematic scoping review set out to explore the boundaries of LLMs in mental health applications, summarizing their core technological pathways, inherent limitations, and future development trends in diagnosis, intervention, and risk prediction. The findings reveal that LLMs have rapidly evolved from simple text analyzers into “Health Agents” capable of integrating multimodal data. Through knowledge-augmentation techniques and the development of specialized models, LLMs are offering novel solutions to alleviate the global shortage of mental health resources. The core strengths of LLMs lie in their advanced language understanding, potential for multimodal fusion, and significant capability to provide personalized, knowledge-guided interventions. However, this review also clearly delineates major challenges facing the field. Technical limitations include the models’ susceptibility to “hallucination,” a lack of clinical evidence-based support, and insufficient robustness and accuracy in high-risk scenarios. Ethical risks are concentrated on the non-transparent value alignment of models, the potential to embed and amplify cultural biases, and the dependency and blurring of the therapeutic relationship caused by over-humanization. Furthermore, the models’ lack of cultural sensitivity in non-Western cultural contexts severely restricts their global scalability and effectiveness. Looking forward, research should focus on: deep specialization, developing professional psychological LLMs based on authoritative psychological and EBM data; advancing cultural equity, developing culturally adaptive LLM frameworks; and establishing regulatory and ethical frameworks to ensure the transparency, trustworthiness, and safe handling of high-risk behaviors by the models.
Only through interdisciplinary collaboration and rigorous clinical validation can LLMs be safely and equitably integrated into the mental health service ecosystem to fulfill their immense potential in addressing global psychological distress.

Statements

Author contributions

JY: Conceptualization, Methodology, Writing – original draft. TL: Data curation, Software, Writing – original draft. YL: Software, Visualization, Writing – original draft. TN: Writing – review & editing. PP: Supervision, Writing – review & editing. AX: Validation, Writing – original draft. QY: Data curation, Writing – original draft.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This research was funded by the Macao Science and Technology Development Fund (FDCT; funding ID: 0032/2025/ITP1).

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that Generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2025.1715306/full#supplementary-material

References

  • 1

    AbdullahM.NegiedN. (2024). Detection and prediction of future mental disorder from social media data using machine learning, ensemble learning, and large language models. IEEE Access12, 120553120569. doi: 10.1109/ACCESS.2024.3406469

  • 2

    Al-OtaibiG. M.AlotaibiH. M.AlsalmiS. S. (2025). Communication errors in human–Chatbot interactions: a case study of ChatGPT Arabic mental health support inquiries. Behav. Sci.15:1119. doi: 10.3390/bs15081119,

  • 3

    AleemM.ZahoorI.NaseemM. (2024). Towards culturally adaptive large language models in mental health: using ChatGPT as a case study. Companion Publication of the 2024 Conference on Computer-Supported Cooperative Work and Social Computing.

  • 4

    AlSaadR.Abd-AlrazaqA.BoughorbelS.AhmedA.RenaultM.-A.DamsehR.et al. (2024). Multimodal large language models in health care: applications, challenges, and future outlook. J. Med. Internet Res.26:e59505. doi: 10.2196/59505,

  • 5

    BartalA.JagodnikK. M.ChanS. J.DekelS. (2024). AI and narrative embeddings detect PTSD following childbirth via birth stories. Sci. Rep.14:8336. doi: 10.1038/s41598-024-54242-2,

  • 6

    BauerB.NorelR.LeowA.RachedZ. A.WenB.CecchiG. (2024). Using large language models to understand suicidality in a social media–based taxonomy of mental health disorders: linguistic analysis of reddit posts. JMIR mental health11:e57234. doi: 10.2196/57234,

  • 7

    BelcastroL.CantiniR.MarozzoF.TaliaD.TrunfioP. (2025). Detecting mental disorder on social media: a ChatGPT-augmented explainable approach. Online Soc. Netw. Media48:100321. doi: 10.1016/j.osnem.2025.100321

  • 8

    CaiZ.FangH.LiuJ.XuG.LongY.GuanY.et al. (2025). Improving unified information extraction in Chinese mental health domain with instruction-tuned LLMs and type-verification component. Artif. Intell. Med.162:103087. doi: 10.1016/j.artmed.2025.103087,

  • 9

    CardamoneN. C.OlfsonM.SchmutteT.UngarL.LiuT.CullenS. W.et al. (2025). Classifying unstructured text in electronic health records for mental health prediction models: large language model evaluation study. JMIR Med. Inform.13:e65454. doi: 10.2196/65454,

  • 10

    ChenC.-C.ChenJ. A.LiangC.-S.LinY.-H. (2025). Large language models may struggle to detect culturally embedded filicide-suicide risks. Asian J. Psychiatr.105:104395. doi: 10.1016/j.ajp.2025.104395,

  • 11

    ChowJ. C.LiK. (2025). Large language models in medical chatbots: opportunities, challenges, and the need to address AI risks. Information16:549. doi: 10.3390/info16070549

  • 12

    CremaschiM.DitolveD.CurcioC.PanzeriA.SpotoA.MaurinoA. (2025). Decoding the mind: a RAG-LLM on ICD-11 for decision support in psychology. Expert Syst. Appl.279:127191. doi: 10.1016/j.eswa.2025.127191

  • 13

    DongreP. (2024) Physiology-driven empathic large language models (EmLLMs) for mental health support. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems,

  • 14

    Dos SantosW. R.ParaboniI.MatsushimaE. H.Da SilvaC. A.de Moura MeiraE. S.GuimarãesJ. V. R. F.et al. (2025). Mixture of experts for depression and anxiety disorder prediction from textual and non-textual social media data. IEEE Access.99:1. doi: 10.1109/ACCESS.2025.3583259

  • 15

    Fan, X., Yang, L., Wang, X., Lyu, D., and Chen, H. (2024). Constructing a knowledge-guided mental health chatbot with LLMs. The 16th Asian Conference on Machine Learning (Conference Track).

  • 16

    Fennig, U., Yom-Tov, E., Savitzky, L., Nissan, J., Altman, K., Loebenstein, R., et al. (2025). Bridging the conversational gap in epilepsy: using large language models to reveal insights into patient behavior and concerns from online discussions. Epilepsia 66, 686–699. doi: 10.1111/epi.18226

  • 17

    Fish, F. J., Casey, P., Casey, P. R., and Kelly, B. (2024). Fish's clinical psychopathology: Signs and symptoms in psychiatry. Cambridge: Cambridge University Press.

  • 18

    Forane, S. G., Ezugwu, A. E., and Igwe, K. (2024). Evaluating the cultural sensitivity of large language models in mental health support: a framework inspired by Ubuntu values. International Conference on Big Data Analytics.

  • 19

    Gao, Y., Fu, J., Guo, L., and Liu, H. (2025). Leveraging large language models for spontaneous speech-based suicide risk detection. arXiv:2507.00693 [Preprint]. doi: 10.48550/arXiv.2507.00693

  • 20

    Gautam, D., and Kellmeyer, P. (2025). Exploring the credibility of large language models for mental health support: protocol for a scoping review. JMIR Res. Protoc. 14:e62865. doi: 10.2196/62865

  • 21

    Gomes, T. (2024). The role of large language models in mental health: a scoping review. Universidade Catolica Portuguesa (Portugal): PQDT-Global.

  • 22

    Hadar-Shoval, D., Asraf, K., Mizrachi, Y., Haber, Y., and Elyoseph, Z. (2024). Assessing the alignment of large language models with human values for mental health integration: cross-sectional study using Schwartz’s theory of basic values. JMIR Mental Health 11:e55988. doi: 10.2196/55988

  • 23

    Han, S., Wang, M., Zhang, J., Li, D., and Duan, J. (2024). A review of large language models: fundamental architectures, key technological evolutions, interdisciplinary technologies integration, optimization and compression techniques, applications, and challenges. Electronics 13:5040. doi: 10.3390/electronics13245040

  • 24

    Ibrahim, I. M. B., Maskat, R., Aminordin, A. B., and Teo, N. H. I. (2024). Classification of mental health conditions in Reddit post using multinomial naïve Bayes algorithm. 2024 IEEE 22nd Student Conference on Research and Development (SCOReD).

  • 25

    Igwe, K., and Durrhiem, K. (2025). A scoping review of culturally sensitive large language models-based cognitive behavioural therapy for anxiety and depression: global lessons for African implementation. Interdiscip. J. Sociality Stud. 5:a06. doi: 10.38140/ijss-2025.vol5.1.06

  • 26

    James, L. J., Maessen, M., Genga, L., Montagne, B., Hagenaars, M. A., and Van Gorp, P. M. (2023). Towards augmenting mental health personnel with LLM technology to provide more personalized and measurable treatment goals for patients with severe mental illnesses. International Conference on Pervasive Computing Technologies for Healthcare.

  • 27

    Jin, Y., Liu, J., Li, P., Wang, B., Yan, Y., Zhang, H., et al. (2025). The applications of large language models in mental health: scoping review. J. Med. Internet Res. 27:e69284. doi: 10.2196/69284

  • 28

    Karamat, A., Imran, M., Yaseen, M. U., Bukhsh, R., Aslam, S., and Ashraf, N. (2024). A hybrid transformer architecture for multiclass mental illness prediction using social media text. IEEE Access 99:1. doi: 10.1109/ACCESS.2024.3519308

  • 29

    Kazemi, S., Gerhardt, G., Katz, J., Kuria, C. I., Pan, E., and Prabhakar, U. (2024). Cultural fidelity in large-language models: an evaluation of online language resources as a driver of model performance in value representation. arXiv:2410.10489 [Preprint]. doi: 10.48550/arXiv.2410.10489

  • 30

    Kharitonova, K., Pérez-Fernández, D., Gutiérrez-Hernando, J., Gutiérrez-Fandiño, A., Callejas, Z., and Griol, D. (2025). Incorporating evidence into mental health Q&A: a novel method to use generative language models for validated clinical content extraction. Behav. Inform. Technol. 44, 2333–2350. doi: 10.1080/0144929X.2024.2321959

  • 31

    Kim, T., Bae, S., Kim, H. A., Lee, S.-W., Hong, H., Yang, C., et al. (2024). MindfulDiary: harnessing large language model to support psychiatric patients' journaling. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems.

  • 32

    Kim, Y., Jeong, H., Chen, S., Li, S. S., Lu, M., Alhamoud, K., et al. (2025). Medical hallucinations in foundation models and their impact on healthcare. arXiv:2503.05777 [Preprint]. doi: 10.48550/arXiv.2503.05777

  • 33

    Kirkbride, J. B., Anglin, D. M., Colman, I., Dykxhoorn, J., Jones, P. B., Patalay, P., et al. (2024). The social determinants of mental health and disorder: evidence, prevention and recommendations. World Psychiatry 23, 58–90. doi: 10.1002/wps.21160

  • 34

    Kumar, A., Gupta, K., Verma, K., and Kumar, S. (2025a). AI-driven mental healthcare 5.0: a survey of opportunities and challenges in leveraging large language models and generative AI.

  • 35

    Kumar, A., Sharma, A., and Sangwan, S. R. (2025b). DynaMentA: dynamic prompt engineering and weighted transformer architecture for mental health classification using social media data. IEEE Trans. Comput. Soc. Syst. 12, 4193–4203. doi: 10.1109/TCSS.2025.3569400

  • 36

    Kwesi, J., Cao, J., Manchanda, R., and Emami-Naeini, P. (2025). Exploring user security and privacy attitudes and concerns toward the use of general-purpose LLM chatbots for mental health. 34th USENIX Security Symposium (USENIX Security 25).

  • 37

    Lashgari, F., Pourvahab, M., Sousa, A., Monteiro, A., and Pais, S. (2025). Risk-aware suicide detection in social media: a domain-guided framework with explainable LLMs. Int. J. Web Res. 8, 45–58. doi: 10.22133/ijwr.2025.525754.1288

  • 38

    Li, Y.-H., Li, Y.-L., Wei, M.-Y., and Li, G.-Y. (2024). Innovation and challenges of artificial intelligence technology in personalized healthcare. Sci. Rep. 14:18994. doi: 10.1038/s41598-024-70073-7

  • 39

    Li, J., Yang, Y., Mao, C., Pang, P. C.-I., Zhu, Q., Xu, D., et al. (2025). Revealing patient dissatisfaction with health care resource allocation in multiple dimensions using large language models and the international classification of diseases 11th revision: aspect-based sentiment analysis. J. Med. Internet Res. 27:e66344. doi: 10.2196/66344

  • 40

    Lin, C., and Kuo, C.-F. (2025). Roles and potential of large language models in healthcare: a comprehensive review. Biom. J. 48:100868. doi: 10.1016/j.bj.2025.100868

  • 41

    Liu, T., Cheng, Y., Luo, Y., Wang, Z., Pang, P. C.-I., Xia, Y., et al. (2024). The impact of social media on children’s mental health: a systematic scoping review. Healthcare 12:2391. doi: 10.3390/healthcare12232391

  • 42

    Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Guo, R., Cheng, H., et al. (2023). Trustworthy LLMs: a survey and guideline for evaluating large language models' alignment. arXiv:2308.05374 [Preprint]. doi: 10.48550/arXiv.2308.05374

  • 43

    Lozoya, D. C., Conway, M., De Duro, E. S., and D'Alfonso, S. (2025). Leveraging large language models for simulated psychotherapy client interactions: development and usability study of Client101. JMIR Med. Educ. 11:e68056. doi: 10.2196/68056

  • 44

    Luo, Y., Zhang, R., Wang, F., and Wei, T. (2023). Customer segment classification prediction in the Australian retail based on machine learning algorithms. Proceedings of the 2023 4th International Conference on Machine Learning and Computer Application.

  • 45

    Malhotra, A., and Jindal, R. (2024). XAI transformer based approach for interpreting depressed and suicidal user behavior on online social networks. Cogn. Syst. Res. 84:101186. doi: 10.1016/j.cogsys.2023.101186

  • 46

    Mattos, S. M., Cestari, V. R. F., and Moreira, T. M. M. (2023). Scoping protocol review: PRISMA-ScR guide refinement. Rev. Enferm. UFPI 12:e3062. doi: 10.26694/reufpi.v12i1.3062

  • 47

    McGowan, J., Straus, S., Moher, D., Langlois, E. V., O'Brien, K. K., Horsley, T., et al. (2020). Reporting scoping reviews—PRISMA ScR extension. J. Clin. Epidemiol. 123, 177–179. doi: 10.1016/j.jclinepi.2020.03.016

  • 48

    Mezghani, E., Exposito, E., Drira, K., Da Silveira, M., and Pruski, C. (2015). A semantic big data platform for integrating heterogeneous wearable data in healthcare. J. Med. Syst. 39:185. doi: 10.1007/s10916-015-0344-x

  • 49

    Montejo-Raez, A., Molina-Gonzalez, M. D., Jimenez-Zafra, S. M., Garcia-Cumbreras, M. A., and Garcia-Lopez, L. J. (2024). A survey on detecting mental disorders with natural language processing: literature review, trends and challenges. Comput. Sci. Rev. 53:100654. doi: 10.1016/j.cosrev.2024.100654

  • 50

    Ngo, V. (2025). Humanizing AI for trust: the critical role of social presence in adoption. AI & Soc., 1–17. doi: 10.1007/s00146-025-02506-4

  • 51

    Nowacki, A., Sitek, W., and Rybiński, H. (2025). LLM-based classifiers for discovering mental disorders. J. Intell. Inf. Syst., 1–18. doi: 10.1007/s10844-025-00934-8

  • 52

    Omopo, O. E. (2024). Exploring post-traumatic stress disorder: causes, diagnostic criteria, and treatment options. Int. J. Acad. Inf. Syst. Res. 8, 35–44.

  • 53

    Orrù, G., Melis, G., and Sartori, G. (2025). Large language models and psychiatry. Int. J. Law Psychiatry 101:102086. doi: 10.1016/j.ijlp.2025.102086

  • 54

    Owen, D., Lynham, A. J., Smart, S. E., Pardinas, A. F., and Camacho Collados, J. (2024). Artificial intelligence for analyzing mental health disorders in social media: a quarter-century narrative review of progress and challenges. J. Med. Internet Res.

  • 55

    Pan, Z., Park, C., Brietzke, E., Zuckerman, H., Rong, C., Mansur, R. B., et al. (2019). Cognitive impairment in major depressive disorder. CNS Spectr. 24, 22–29. doi: 10.1017/S1092852918001207

  • 56

    Pang, P. C.-I., Chang, S., Verspoor, K., and Clavisi, O. (2018). The use of web-based technologies in health research participation: qualitative study of consumer and researcher experiences. J. Med. Internet Res. 20:e12094. doi: 10.2196/12094

  • 57

    Park, C., Lee, H., and Jeong, O. R. (2024). Leveraging medical knowledge graphs and large language models for enhanced mental disorder information extraction. Future Internet 16:260. doi: 10.3390/fi16080260

  • 58

    Pavez, J., and Allende, H. (2024). A hybrid system based on Bayesian networks and deep learning for explainable mental health diagnosis. Appl. Sci. 14:8283. doi: 10.3390/app14188283

  • 59

    Perkins, A., Ridler, J., Browes, D., Peryer, G., Notley, C., and Hackmann, C. (2018). Experiencing mental health diagnosis: a systematic review of service user, clinician, and carer perspectives across clinical settings. Lancet Psychiatry 5, 747–764. doi: 10.1016/S2215-0366(18)30095-6

  • 60

    Radwan, A., Amarneh, M., Alawneh, H., Ashqar, H. I., AlSobeh, A., and Magableh, A. A. A. R. (2024). Predictive analytics in mental health leveraging LLM embeddings and machine learning models for social media analysis. Int. J. Web Serv. Res. 21, 1–22. doi: 10.4018/IJWSR.338222

  • 61

    Rowa, K., Waechter, S., Hood, H. K., and Antony, M. M. (2017). “Generalized anxiety disorder” in Psychopathology: History, diagnosis, and empirical foundations, third edition, 149–186.

  • 62

    Sharma, N., and Liu, Y. (2022). A hybrid science-guided machine learning approach for modeling chemical processes: a review. AIChE J. 68:e17609. doi: 10.1002/aic.17609

  • 63

    Sharma, C. M., Thein, K. Y. M., and Chariar, V. M. (2024). “Optimized support vector machines for detection of mental disorders” in Artificial intelligence in healthcare (CRC Press), 190–219.

  • 64

    Shayaninasab, M., Zahoor, M., and Yalçin, Ö. N. (2024). Enhancing patient intake process in mental health consultations using RAG-driven chatbot. 2024 12th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW).

  • 65

    Shi, Q. (2025). Chatbots in mental healthcare: developments and challenges. 2025 IEEE 26th China Conference on System Simulation Technology and its Applications (CCSSTA).

  • 66

    Stein, D. J., Shoptaw, S. J., Vigo, D. V., Lund, C., Cuijpers, P., Bantjes, J., et al. (2022). Psychiatric diagnosis and treatment in the 21st century: paradigm shifts versus incremental integration. World Psychiatry 21, 393–414. doi: 10.1002/wps.20998

  • 67

    Vigo, D., Thornicroft, G., and Atun, R. (2016). Estimating the true global burden of mental illness. Lancet Psychiatry 3, 171–178. doi: 10.1016/S2215-0366(15)00505-2

  • 68

    Vrdoljak, J., Boban, Z., Vilović, M., Kumrić, M., and Božić, J. (2025). A review of large language models in medical education, clinical decision support, and healthcare administration. Healthcare 13. doi: 10.3390/healthcare13060603

  • 69

    Wagay, F. A., and Altaf, Y. (2025). MentalRoBERTa-caps: a capsule-enhanced transformer model for mental health classification. MethodsX 15:103483. doi: 10.1016/j.mex.2025.103483

  • 70

    Wang, X., Zhou, Y., and Zhou, G. (2025). Enhancing health assessments with large language models: a methodological approach. Appl. Psychol. Health Well Being 17:e12602. doi: 10.1111/aphw.12602

  • 71

    World Health Organization (2022). World mental health report: Transforming mental health for all. Geneva: World Health Organization.

  • 72

    Wu, R., Yu, C., Pan, X., Liu, Y., Zhang, N., Fu, Y., et al. (2024). MindShift: leveraging large language models for mental-states-based problematic smartphone use intervention. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems.

  • 73

    Yang, J., Jin, H., Tang, R., Han, X., Feng, Q., Jiang, H., et al. (2024). Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond. ACM Trans. Knowl. Discov. Data 18, 1–26. doi: 10.1145/3653304

  • 74

    Youngstrom, E. A., Van Meter, A., Frazier, T. W., Hunsley, J., Prinstein, M. J., Ong, M. L., et al. (2017). Evidence-based assessment as an integrative model for applying psychological science to guide the voyage of treatment. Clin. Psychol. Sci. Pract. 24, 331–363. doi: 10.1111/cpsp.12207

  • 75

    Zhang, X., Cui, W., Wang, J., and Li, Y. (2024). Chat, summary and diagnosis: a LLM-enhanced conversational agent for interactive depression detection. 2024 4th International Conference on Industrial Automation, Robotics and Control Engineering (IARCE).

  • 76

    Zhang, T., Teng, S., Jia, H., and D'Alfonso, S. (2024). Leveraging LLMs to predict affective states via smartphone sensor features. Companion of the 2024 ACM International Joint Conference on Pervasive and Ubiquitous Computing.

Keywords

large language model, LLMs, mental health, mental illness, systematic scoping review

Citation

Yang J, Liu T, Luo YT, Niu T, Pang P, Xiang A and Yang Q (2026) Exploring the application boundaries of LLMs in mental health: a systematic scoping review. Front. Psychol. 16:1715306. doi: 10.3389/fpsyg.2025.1715306

Received

29 September 2025

Revised

08 December 2025

Accepted

22 December 2025

Published

27 February 2026

Volume

16 - 2025

Edited by

Wing-Yue Geoffrey Louie, Oakland University, United States

Reviewed by

Inez Y. Oh, Washington University in St. Louis, United States

Zhaoxi Fang, Shaoxing University, China

Copyright

*Correspondence: Patrick Pang

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
