Abstract
Background:
The rapid evolution of large language models (LLMs) has ushered in a new era of artificial intelligence (AI) with unprecedented capabilities in understanding and generating human-like text. This progress has sparked a burgeoning interest in applying LLMs across diverse fields, including healthcare. However, the use of LLMs in mental health remains a complex area that demands rigorous investigation. This systematic scoping review aims to explore the current landscape of LLM applications in mental health, identify key research trends and gaps, and delineate the ethical and practical boundaries, thereby providing a comprehensive framework for future research and clinical practice.
Methods:
This study adheres to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines. A comprehensive search was conducted across eleven databases (Web of Science, Scopus, PubMed, Medline, CINAHL, Cochrane, ACM Digital Library, IEEE Xplore, ScienceDirect, APA PsycInfo, and Google Scholar). A total of 29 articles were ultimately included in the study.
Results:
The application of LLMs in mental health is strategically focused on high-throughput screening and clinical augmentation. The application landscape is characterized by domain specialization, with the focus shifting from general-purpose models to domain-specialized BERT variants to achieve higher clinical accuracy, particularly for high-prevalence disorders such as depression and high-risk conditions such as suicidality. Data analysis is powered by massive, unstructured corpora from social media, supplemented by the systematic incorporation of structured clinical knowledge. However, significant limitations exist, including insufficient cultural sensitivity in non-Western contexts, challenges in capturing longitudinal patient history, and critical risks related to model value alignment and the generation of clinically misleading information.
Conclusion:
LLMs have emerged as sophisticated “Mental Health Agents” with immense potential for providing personalized, knowledge-guided interventions. The core challenge for future development is to transcend basic functionality and achieve clinical rigor. Future research must prioritize deep specialization into psychological models, enhance multimodal integration for comprehensive patient assessment, and urgently develop robust ethical and cultural adaptation frameworks to ensure the models are safe, globally equitable, and reliable for clinical deployment, thereby fulfilling their potential to alleviate the global mental health resource crisis.
1 Introduction
Mental health disorders represent a pressing global public health issue, imposing a substantial burden on individuals, families, and societal and economic development (World Health Organization, 2022). According to the World Health Organization (WHO), mental disorders are among the leading contributors to the global disease burden (Vigo et al., 2016). These conditions severely disrupt daily life, occupational functioning, and interpersonal relationships. For instance, major depressive disorder often manifests as persistent low mood, loss of interest, and fatigue, significantly impairing an individual’s ability to study or work (Pan et al., 2019). Generalized anxiety disorder is characterized by uncontrollable worry, tension, and physical discomfort, making it difficult for patients to focus on routine tasks (Rowa et al., 2017). Post-traumatic stress disorder (PTSD), triggered by traumatic events, can lead to intrusive memories, avoidance behaviors, and emotional numbing, markedly reducing social functioning and quality of life (Omopo, 2024). These disorders are not merely collections of clinical symptoms; they exert systematic negative impacts on cognition, emotion, and behavior, ultimately resulting in decreased productivity, increased medical expenditures, social isolation, and functional decline (Fish et al., 2024). Therefore, early identification and effective support for these mental illnesses are crucial to mitigating their long-term individual and societal consequences (Kirkbride et al., 2024).
To address this challenge, the development of innovative, accessible mental health services is foundational for implementing effective interventions and support. Traditional diagnostic models for mental health rely heavily on clinicians’ subjective judgments and structured or semi-structured interviews. While these models are based on clinicians’ extensive clinical experience, they face multiple structural challenges in addressing the global, growing patient population (Stein et al., 2022). These include geographical disparities in professional medical resources, subjectivity and time consumption in the diagnostic process, and barriers to timely and effective care due to stigma, financial constraints, or geographic limitations (Perkins et al., 2018). These challenges collectively drive the urgent need for innovative, scalable, and efficient solutions. In this context, digital technologies, particularly LLMs, are emerging as key drivers of transformation in mental health service systems.
Digital technologies, particularly AI, are widely recognized as transformative forces in reshaping the delivery of mental health services (Liu et al., 2024). Early digital interventions, such as telemedicine platforms and mobile health (mHealth) applications, have offered preliminary evidence of their potential to improve service accessibility (Pang et al., 2018). With the rapid advancement of AI technologies, their applications in assisting diagnosis, treatment support, and risk assessment have garnered increasing attention. Early machine learning (ML) models (Luo et al., 2023), such as those based on support vector machines (SVMs) (Sharma et al., 2024) or naive Bayes classifiers (Ibrahim et al., 2024), have shown promise in specific tasks. For instance, natural language processing (NLP) techniques have been used to analyze text sentiment (Li et al., 2025), and speech pattern recognition has aided in diagnostic assessments. However, these models are typically designed for specific tasks, lacking generalizability and the ability to understand complex contexts. This limits their effectiveness in handling the highly unstructured, metaphor-rich, and culturally diverse language data prevalent in mental health.
Recent breakthroughs in AI, particularly the rise of LLMs built on the Transformer architecture, have opened new frontiers for mental health. LLMs (Orrù et al., 2025), trained on massive textual and multimodal data, have acquired robust Natural Language Understanding (NLU), Natural Language Generation (NLG), and contextual reasoning capabilities (Montejo-Raez et al., 2024). Unlike earlier shallow ML models, LLMs exhibit “emergent abilities”—capabilities that appear abruptly at scale rather than improving smoothly with model size (Han et al., 2024). This enables them to capture subtle and nuanced semantic associations in human language, making them well-suited to address the complexity and subtlety inherent in mental health (Kumar et al., 2025a). The potential applications of LLMs in mental health diagnostics are multifaceted and are transitioning from proof-of-concept to real-world implementation.
First, LLMs can analyze unstructured text data from social media platforms, online forums, anonymous chat rooms, and personal diaries to identify language biomarkers associated with specific mental disorders (Owen et al., 2024). For example, LLM models can detect subtle linguistic cues indicative of hopelessness, social isolation, or suicidal ideation, providing a non-invasive and efficient method for large-scale mental health risk screening (Gao et al., 2025). Second, LLMs can serve as intelligent assistants for clinicians by processing and integrating complex clinical text data (Lin and Kuo, 2025). For instance, they can analyze electronic health records (EHRs), handwritten clinical notes, and initial consultation transcripts to extract key symptoms, medical history, and behavioral patterns (AlSaad et al., 2024). This information can be synthesized into structured, easily interpretable reports, supporting clinical decision-making and reducing the administrative burden on healthcare providers (Vrdoljak et al., 2025). Third, conversational AI powered by LLMs, such as AI chatbots, can offer patients 24/7 accessible emotional support and behavioral tracking (Chow and Li, 2025). These chatbots engage in natural, empathetic conversations, helping patients express emotions, track mood fluctuations, and implement psychology-based coping strategies (Shi, 2025). The interaction data, anonymized to protect privacy, can be used to monitor patients’ condition changes over time, providing clinicians with long-term, dynamic health data for more timely interventions (Li et al., 2024).
Despite their promise, existing reviews on LLMs in mental health present certain limitations, particularly in systematically exploring their application boundaries and potential risks. Most reviews, such as those by Gautam and Kellmeyer (2025) and Jin et al. (2025), primarily focus on technical performance validation or specific application scenarios, lacking a comprehensive examination of LLMs within broader clinical contexts. These reviews often highlight the performance advantages of LLMs in specific tasks but fail to delve into the ethical dilemmas, technical limitations, and safety risks they may encounter in real-world clinical applications. For example, there is a scarcity of systematic research on critical ethical and technical issues, such as algorithmic bias and data privacy, which are essential for ensuring the safe, fair, and effective deployment of LLMs in mental health. Addressing these gaps is vital to advancing the responsible use of LLMs in this field.
To fill these research gaps, this study aims to conduct a comprehensive, systematic scoping review of the application boundaries of LLMs in mental health. Adhering to the PRISMA-ScR guidelines (Mattos et al., 2023), the study will systematically search multiple key academic databases, rigorously screen relevant literature, and extract and synthesize data. The focus will be on the technical, ethical, and practical boundaries of LLMs in mental health clinical practice, providing scientific evidence and recommendations for technology developers, clinicians, and policymakers. Based on this, the study proposes five research questions:
(1) What are the geographic and temporal trends in the application of LLMs in mental health research?
(2) What are the primary application scenarios and technical types of LLMs in mental health?
(3) Which mental disorders are currently the main focus of LLM research, and what are the data sources and evaluation metrics used?
(4) What are the technical, ethical, and practical limitations and risks of LLMs in mental health?
(5) Based on the analysis of existing research, what are the future research directions and development trends?
2 Methods
2.1 Search strategy
This study screened relevant papers from eleven databases (Web of Science, Scopus, PubMed, Medline, CINAHL, Cochrane, ACM Digital Library, IEEE Xplore, ScienceDirect, APA PsycInfo, and Google Scholar). The search was conducted on June 27, 2025, and the search terms included “large language model*,” “LLM*,” “generative AI,” “GenAI,” “AIGC,” “AI chatbot*,” “conversational AI,” “natural language processing,” “NLP,” “ChatGPT,” “GPT,” “Bard,” “Gemini,” “mental health,” “mental illness,” “mental disorder*,” “psychiatric disorder*,” “depress*,” “anxiety,” “schizophrenia,” “bipolar disorder*,” “PTSD,” “suicid*,” and “self-harm,” etc. (see Table 1). The PRISMA-ScR checklist is provided in Appendix AS1.
Table 1
| Database | Search formula |
|---|---|
| Web of Science | (“large language model*” OR “LLM*” OR “generative AI” OR “GenAI” OR “AIGC” OR “AI chatbot*” OR “conversational AI” OR “natural language processing” OR “NLP” OR “ChatGPT” OR “GPT” OR “Bard” OR “Gemini”) (All fields) AND (“mental health” OR “mental illness” OR “mental disorder*” OR “psychiatric disorder*” OR “depress*” OR “anxiety” OR “schizophrenia” OR “bipolar disorder*” OR “PTSD” OR “suicid*” OR “self-harm”) (All fields) AND (Document Types: Article or Proceeding Paper) AND (Languages: English) |
| Scopus | (TITLE-ABS-KEY (“large language model*” OR “LLM*” OR “generative AI” OR “GenAI” OR “AIGC” OR “AI chatbot*” OR “conversational AI” OR “natural language processing” OR “NLP” OR “ChatGPT” OR “GPT” OR “Bard” OR “Gemini”)) AND (TITLE-ABS-KEY (“mental health” OR “mental illness” OR “mental disorder*” OR “psychiatric disorder*” OR “depress*” OR “anxiety” OR “schizophrenia” OR “bipolar disorder*” OR “PTSD” OR “suicid*” OR “self-harm”)) AND (LIMIT-TO (DOCTYPE, “ar”) OR LIMIT-TO (DOCTYPE, “cp”)) AND (LIMIT-TO (LANGUAGE, “English”)) |
| PubMed | ((((large language model*) OR (LLM*) OR (generative AI) OR (GenAI) OR (AIGC) OR (AI chatbot*) OR (conversational AI) OR (natural language processing) OR (NLP) OR (ChatGPT) OR (GPT) OR (Bard) OR (Gemini)) AND ((mental health) OR (mental illness) OR (mental disorder*) OR (psychiatric disorder*) OR (depress*) OR (anxiety) OR (schizophrenia) OR (bipolar disorder*) OR (PTSD) OR (suicid*) OR (self-harm)))) Filters: Full text |
| Medline | large language model* OR LLM* OR generative AI OR GenAI OR AIGC OR AI chatbot* OR conversational AI OR natural language processing OR NLP OR ChatGPT OR GPT OR Bard OR Gemini AND mental health OR mental illness OR mental disorder* OR psychiatric disorder* OR depress* OR anxiety OR schizophrenia OR bipolar disorder* OR PTSD OR suicid* OR self-harm |
| CINAHL | ((MH “large language model*”) OR TI (“large language model*” OR “LLM*” OR “generative AI” OR “GenAI” OR “AIGC” OR “AI chatbot*” OR “conversational AI” OR “natural language processing” OR “NLP” OR “ChatGPT” OR “GPT” OR “Bard” OR “Gemini”)) AND ((MH “mental health”) OR TI (“mental health” OR “mental illness” OR “mental disorder*” OR “psychiatric disorder*” OR “depress*” OR “anxiety” OR “schizophrenia” OR “bipolar disorder*” OR “PTSD” OR “suicid*” OR “self-harm”)) |
| Cochrane | large language model* AND mental health |
| ACM Digital Library | large language model* AND mental health |
| IEEE Xplore | (“Full Text & Metadata”: large language model* OR LLM* OR generative AI OR GenAI OR AIGC OR AI chatbot* OR conversational AI OR natural language processing OR NLP OR ChatGPT OR GPT OR Bard OR Gemini) AND (“Full Text & Metadata”: mental health OR mental illness OR mental disorder* OR psychiatric disorder* OR depress* OR anxiety OR schizophrenia OR bipolar disorder* OR PTSD OR suicid* OR self-harm) Filters Applied: Conferences Early Access Articles Journals |
| ScienceDirect | large language model* OR LLM* OR generative AI OR GenAI OR AIGC OR AI chatbot* OR conversational AI OR natural language processing OR NLP OR ChatGPT OR GPT OR Bard OR Gemini AND mental health OR mental illness OR mental disorder* OR psychiatric disorder* OR depress* OR anxiety OR schizophrenia OR bipolar disorder* OR PTSD OR suicid* OR self-harm “Article type: Research articles” |
| APA PsycInfo | large language model* AND mental health |
| Google Scholar | large language model* OR LLM* OR generative AI OR GenAI OR AIGC OR AI chatbot* OR conversational AI OR natural language processing OR NLP OR ChatGPT OR GPT OR Bard OR Gemini AND mental health AND mental illness AND mental disorder* AND psychiatric disorder* AND depress* AND anxiety AND schizophrenia AND bipolar disorder* AND PTSD AND suicid* AND self-harm |
Selected databases and search formulas.
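The database-specific formulas above all combine the same two term blocks (LLM-related terms and mental-health-related terms) with Boolean operators. As an illustrative sketch only, and not part of the original search protocol, the generic query can be assembled programmatically; the function name `or_block` is hypothetical:

```python
# Illustrative sketch: assembling the generic Boolean search string from the
# two term blocks reused across databases (not part of the review protocol).
LLM_TERMS = [
    "large language model*", "LLM*", "generative AI", "GenAI", "AIGC",
    "AI chatbot*", "conversational AI", "natural language processing",
    "NLP", "ChatGPT", "GPT", "Bard", "Gemini",
]
MH_TERMS = [
    "mental health", "mental illness", "mental disorder*",
    "psychiatric disorder*", "depress*", "anxiety", "schizophrenia",
    "bipolar disorder*", "PTSD", "suicid*", "self-harm",
]

def or_block(terms):
    """Quote each term and join the list with OR, wrapped in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# The two blocks are linked by AND, as in the Web of Science and Scopus rows.
query = f"{or_block(LLM_TERMS)} AND {or_block(MH_TERMS)}"
print(query)
```

Individual databases then add their own field tags and filters (e.g., TITLE-ABS-KEY in Scopus, document-type limits in Web of Science) around this core string.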
2.2 Data selection and extraction
Records were first imported into the reference management software EndNote, where automated screening was performed to remove duplicates and records marked as ineligible based on pre-set criteria, before manual screening. Two independent reviewers (JY and TL) conducted a preliminary screening of the article titles and abstracts based on predetermined inclusion criteria. Any discrepancies between the two reviewers were resolved through consultation with a third reviewer (YL). The inclusion criteria are as follows: (1) Studies specifically targeting LLMs in mental health; (2) Research on LLM technologies for mental health services; (3) Research articles and conference papers; (4) Full-text articles and conference papers published in English. The inclusion and exclusion criteria were designed with a specific focus on the application of LLMs in mental health. This study prioritizes empirical investigations of how LLMs are implemented in mental health services, thereby excluding research that primarily explores individuals’ perceptions, attitudes, or opinions regarding LLMs in these domains, as well as pure algorithm-comparison studies. Additionally, review articles (e.g., narrative or systematic reviews) were excluded, as they synthesize existing literature rather than present original applications. To ensure comprehensive coverage of diverse research methodologies, the inclusion criteria encompassed qualitative, quantitative, and mixed-methods studies, thereby capturing a holistic range of evidence on the implementation of LLMs in practice. These criteria are summarized in Table 2.
Table 2
| Inclusion criteria | Exclusion criteria |
|---|---|
| Research on LLM technologies for mental health services. | Research on technologies other than LLMs in the mental health field. |
| Research on the application of LLM technology in mental health services. | Research on algorithm comparison, attitudes, views, intentions, benefits, obstacles, impacts, experiences, and usage demands towards LLM technology. |
| Research-type articles and conference papers. | Review articles, theses, non-academic publications, book chapters, etc. |
| Full text in English. | Full text in other languages. |
Inclusion and exclusion criteria.
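The deduplication step described above was performed in EndNote; as a minimal sketch of the same idea (assuming records are represented as dictionaries with `doi` and `title` fields, both hypothetical names), duplicates can be keyed on the DOI when present, falling back to a normalized title:

```python
# Minimal deduplication sketch (EndNote handled this step in the actual
# workflow). Records are assumed to be dicts with "doi" and "title" fields.
def normalize(text):
    """Lowercase and strip non-alphanumeric characters for fuzzy matching."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def deduplicate(records):
    seen, unique = set(), []
    for rec in records:
        # Prefer the DOI as the duplicate key; fall back to normalized title.
        key = rec.get("doi") or normalize(rec.get("title", ""))
        if key and key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"doi": "10.1000/abc", "title": "LLMs in Mental Health"},
    {"doi": "10.1000/abc", "title": "LLMs in mental health."},  # same DOI: dropped
    {"doi": "", "title": "Chatbots for Depression Screening"},  # kept via title key
]
unique_records = deduplicate(records)
```

After automated deduplication, the remaining records proceed to the two-reviewer title/abstract screen described above.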
2.3 Data charting
Based on the scoping review methodology outlined in the PRISMA-ScR guidelines (McGowan et al., 2020), a data extraction table was developed. Following a purposive pilot test on five articles selected to represent the diversity of the included studies (e.g., different study designs and LLM applications), the table was refined to ensure comprehensive and relevant data capture. The final data extraction table included the following items: author, year, country, research methodology, type of technology/model, type of issues, mental health application, data source, application performance metrics, limitations, and future research directions. All data were extracted by two independent reviewers. Any disagreements that arose during the data extraction process were resolved through consultation with a third reviewer, ensuring the accuracy and consistency of the extracted information.
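The extraction items listed above can be thought of as a fixed record schema. As an illustrative sketch only (the class and field names are hypothetical, mirroring the extraction table items), with example values taken from the first row of Table 3:

```python
# Sketch of the data-charting record as a structured type; fields mirror the
# extraction table items described in the text (illustrative only).
from dataclasses import dataclass, field

@dataclass
class ExtractionRecord:
    author: str
    year: int
    country: str
    methodology: str               # e.g., "Quantitative", "Mixed"
    technology_model: str          # e.g., "GPT, BERT"
    issue_type: str                # e.g., "Depression, Anxiety"
    application: str               # e.g., "Diagnosis, Prediction"
    data_source: str
    performance_metrics: list[str] = field(default_factory=list)
    limitations: str = ""
    future_directions: str = ""

# Example populated from the first row of Table 3 (Abdullah et al., 2024).
rec = ExtractionRecord(
    author="Abdullah et al.", year=2024, country="Canada",
    methodology="Quantitative", technology_model="GPT, BERT",
    issue_type="Depression, Anxiety, PTSD, ADHD",
    application="Diagnosis, Prediction",
    data_source="167,444 clinical social media posts",
    performance_metrics=["F1-score"],
)
```

A fixed schema of this kind makes disagreements between the two extracting reviewers easy to localize to a specific field.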
2.4 Collating, summarizing, and reporting the results
The extracted data were synthesized and analyzed using a narrative synthesis approach to address the research questions of the review. Descriptive findings, such as the distribution of articles by year, country, or research method, were presented through graphs and charts to provide a visual overview of the research landscape. The qualitative findings, particularly those related to the application boundaries, limitations, and future trends, were thematically analyzed and explained through a detailed narrative to provide a comprehensive and nuanced discussion. All explanations and interpretations were verified by all authors to ensure the rigor and validity of the final report.
3 Results
A total of 10,743 articles were retrieved through the systematic search; the search process and results are shown in Figure 1. To ensure consistency between the methods, results, and the PRISMA flowchart, we explicitly define the key terms used in the selection process. “Records” refer to all initial items retrieved from the databases. “Reports” or “Articles” refer to unique publications assessed for eligibility. “Studies” refer to the final set of reports that met all inclusion criteria. Two reviewers independently screened the article titles and abstracts, excluding 3,221 articles that were not directly related to the research topic, as well as 20 non-English articles. The term “reports not retrieved” specifically refers to articles that were identified as potentially relevant during the title and abstract screening but for which the full-text document could not be accessed by the reviewers. The two reviewers then conducted a full-text assessment of the remaining 58 articles, excluding 29 whose topics did not match the main focus of this study. Ultimately, 29 articles were included in this scoping review (see Table 3).
Figure 1
Table 3
| Author/Year/Country | Methods | Technology/model | Mental issue | Mental health application | Data source | Application performance metrics | Limitations | Future trends |
|---|---|---|---|---|---|---|---|---|
| Abdullah et al. 2024 Canada (Abdullah and Negied, 2024) | Quantitative | GPT, BERT | Depression, Anxiety, PTSD, ADHD | Diagnosis, Prediction | 167,444 clinical social media posts | F1-score | Cross-validation should be considered to validate the obtained results, validate the hyperparameters, and fine-tune them based on validation results | Future mental disorders from social media data using ML, Ensemble Learning, and LLMs |
| Al-Otaibi et al. 2025 Saudi Arabia (Al-Otaibi et al., 2025) | Mixed | Transformer, AI Chatbot | Night terror, Depression, Social phobia, panic attacks, Anhedonia, Borderline personality disorder | Support | 7,245 tokens from six conversations | Error | AI risks potentially harm vulnerable users; Lack of cross-system comparison | Explore and compare the performance of alternative systems to provide a broader assessment of AI capabilities in this domain |
| Bartal et al. 2024 Israel (Bartal et al., 2024) | Mixed | Sentence-transformers PLMs, ChatGPT | Childbirth-related post-traumatic stress disorder (CB-PTSD) | Identify | Narratives of length 30 words from 1,295 women | F1 score, sensitivity, specificity, AUC | Limitations in analyzing shorter narratives | This textual personal narrative-based assessment strategy employing NLP analysis has the potential to become an accurate, efficient, low-cost, and patient-friendly strategy |
| Bauer et al. 2024 US (Bauer et al., 2024) | Mixed | LLM embeddings, ChatGPT, eXplainable Artificial Intelligence (XAI) | Suicidality | Understand | 2.9 million posts from social media | Mean (SD), Proportion of closest, singular value decomposition | Data time range restrictions and source platform diversity restrictions | Expand the time frame of data collection; explore other web-based platforms; and integrate additional data sources, such as user comments |
| Belcastro et al. 2025 Italy (Belcastro et al., 2025) | Quantitative | ChatGPT, BERT-XDD model, XAI | Depression | Detect | 10,251 tweets + 7,650 posts from social media text | Accuracy, Precision, Recall, F1 score | Labeled data is scarce; Lacks temporal dynamics; Explanatory models may be oversimplified; Cannot replace clinical judgment | Focus on dynamic and longitudinal modeling to capture context and temporal patterns |
| Cai et al. 2025 China (Cai et al., 2025) | Quantitative | BERT, ChatGPT | General mental health | Information extraction | ChatGPT to generate instances of the task | F1 score | Computational efficiency and scalability constraints; Cross-domain adaptability unverified | Explore a dynamic verification framework; Optimize computational efficiency and scalability, conduct cross-domain validation |
| Cardamone et al. 2025 US (Cardamone et al., 2025) | Mixed | ChatGPT | General mental health | Prediction | 1,000 records from the clinic terms | Recall, F1-score | LLM hallucination and bias risks; Single-label classification limits complexity | Expand coding and classification methods (e.g., multi-label classification), and increase clinical coders |
| Cremaschi et al. 2025 Italy (Cremaschi et al., 2025) | Mixed | ChatGPT, Retrieval Augmented Generation (RAG) model | General mental health | Decision support | Creation by the ICD-11 classification system | Accuracy, Precision, Recall, F1-score | Lack of patient history integration (e.g., symptom evolution, age, medication history) | Integrate cloud computing; Incorporate temporal evolution, age, and pharmacological information; Focus on comorbidity management and atypical clinical presentations |
| Dos Santos et al. 2025 Brazil (Dos Santos et al., 2025) | Mixed | BERT, BiLSTM | Depression, Anxiety | Prediction | 19.4 million messages from depression data (Twitter/X textual and non-textual data in Portuguese and Reddit textual data in English) | Precision, Recall, F1-score | Discrepancies between model and human judgment (due to clinical caution); Temporal information intentionally omitted | Explore the ethical implications of the model, maintain responsible AI development |
| Fan et al. 2024 China (Fan et al., 2024) | Quantitative | ChatGPT, AI Chatbot | Depression, Anxiety, Bipolar Disorder, PTSD, Panic Disorder, Eating Disorder | Prediction, Intervention, Suggestion | 1,667 entries labeled with psychological disorders from the efaqa dataset, and 1,300 dialogues from the ESConV dataset | Accuracy, Precision, Recall | Ethical and privacy issues regarding the use of data | Future work will explore the model’s ethical implications, maintaining a focus on responsible AI development |
| Fennig et al. 2024 Israel (Fennig et al., 2025) | Mixed | ChatGPT | Epilepsy, Depression, Anxiety | Assessment | 768,504 posts from Reddit’s 21,906 users | Hazard ratios (HRs) | Reddit user base lacks general representativeness; Healthcare environment constraints; Text model’s content understanding variations | Address social stigmatization caused by mental health issues |
| Hadar-Shoval et al. 2024 Israel (Hadar-Shoval et al., 2024) | Quantitative | ChatGPT | General mental health | Assessment | 53,472 individuals across 49 nations | Internal reliability and intercorrelations of Schwartz’s values, Split-half reliability agreement, and confirmatory factor analysis | Small LLM sample size; Difficult to isolate model capabilities from built-in guardrails; Value construct robustness unassessed | Evaluate impact of subtle prompt changes on model values; Validate predictive validity |
| James et al. 2024 The Netherlands (James et al., 2023) | Mixed | ChatGPT | Severe Mental Illnesses (SMI) | Measurable treatment plan goals | During the five rounds of investigation, these 8 students were given two final goals by LLM | Average scores for both goals in each phase | UI/UX not assessed; No testing with SMI patients or case managers | Recruit SMI patients and case managers to evaluate workflow; Improve UI; Investigate SMI patient attitudes toward privacy and technology |
| Karamat et al. 2024 Pakistan (Karamat et al., 2024) | Quantitative | Hybrid transformer - MentalBERT, and MelBERT models (CNN) | Depression, Anxiety, Borderline personality disorder (BPD), PTSD | Prediction | 40,000 samples from Reddit posts | Accuracy, Precision, Recall, F1-score, ROC, AUC, Loss | Evaluated on a small dataset, failing to capture the full spectrum of disorders/language patterns | Scale up dataset size and diversity for validation |
| Kharitonova et al. 2024 Spain (Kharitonova et al., 2025) | Mixed | ChatGPT | Depression, ADHD | Content extraction | Formulate 10 questions for each scenario and their corresponding correct answers | Coherence, Veracity, Evidence | Limited content window; Restricted answer search space; Inconsistent multilingual support quality | Experiment with alternative LLMs; Analyze multilingual support; Evaluate text simplifiers; Integrate LLM engine ensembles; Hierarchical information organization; Extend to multimodal systems |
| Kim et al. 2024 Korea (Kim et al., 2024) | Mixed | ChatGPT | Depression | Support | A four-week field study involving 28 patients with major depressive disorder and five psychiatrists | Coding paired with thematic analysis | Recruitment method impacts generalizability (e.g., young patients, fixed psychiatrists); GPT model updates cause behavioral inconsistency | Compare multiple APP versions and underlying LLMs; Conduct necessary studies with diverse backgrounds |
| Kumar et al. 2025 UK (Kumar et al., 2025b) | Quantitative | BioGPT, DeBERTa | General mental health | Classification data | DepSeverity consists of 3,553 Reddit posts, SDCNL contains 1895 posts, and Dreaddit comprises 1,191 posts | Accuracy, Precision, Recall, F1-score, AUC, ROC | Reliance on textual data restricts applicability; Reddit data limits cross-cultural/platform generalizability | Evaluate other platforms; Integrate multimodal inputs (e.g., audio, physiological signals); Include real-time human feedback; Explore personalized language modeling |
| Lozoya et al. 2025 Australia (Lozoya et al., 2025) | Mixed | ChatGPT | Depression, Anxiety | Simulated psychotherapy client interactions | 19-question survey | T test | Small sample size of synthetic therapy sessions | Increase professional involvement or session count; Evaluate effectiveness as an educational tool; Explore cross-language and cross-dictionary adaptation |
| Malhotra et al. 2024 India (Malhotra and Jindal, 2024) | Mixed | BERT, XAI | Depression, Suicidal | Classification, Interpretation | 262,922 tweets of data from four datasets | Classification performance evaluation, LIME explanation | Self-reported data lacks manual sanity checks (e.g., for sarcasm, metaphor), leading to false positives | Use data sanity check protocols (manual/automated); Explore XAI applications; Explore multiclass/multilabel/multilingual data analysis and bias detection |
| Nowacki et al. 2025 Poland (Nowacki et al., 2025) | Quantitative | MentalBERT, MentaLLaMA | General mental health | Classification | 3,553 posts from Reddit, 6,850 SMS-like messages, 5,051 records from CAMS, 3,029 data from IRF | F1-score | Flan-T5 architecture required additional stabilization layers | Despite different architectures and training methods, the evaluated LLMs achieve similar performance, indicating their flexibility, comparable quality, and ability to process complex data |
| Park et al. 2024 Korea (Park et al., 2024) | Quantitative | ChatGPT, Zero-shot learning, Medical knowledge graphs | Depression | Information extraction | 3,793 PubMed abstracts from BioCreative Datasets, 21,082 documents from Document-Level Relationship Extraction Datasets | F1 score | The limited dataset | Across diverse medical datasets and expanding the types of entities involved |
| Pavez et al. 2024 Chile (Pavez and Allende, 2024) | Quantitative | BERT, Explainability of Bayesian Networks | General mental health | Diagnosis, Classification | 2.3 million data points from social media | Precision, Recall, F1-Score, Support | Method fails to fully capture complex nuances and inherent uncertainties of real-world patient interactions | Deepen collaboration with field professionals for comprehensive assessment; Integrate critical clinical factors (e.g., patient history, hereditary traits) |
| Radwan et al. 2024 US (Radwan et al., 2024) | Quantitative | nBERT | General mental health | Classification, Recognition | 2021 samples | Accuracy, Precision, Recall, F1-score | nBERT generalizability remains a challenge in multilingual and diverse datasets | Explore cross-lingual fine-tuning; Employ transfer learning/cross-domain adaptation; Integrate multimodal data |
| Shayaninasab et al. 2024 Canada (Shayaninasab et al., 2024) | Quantitative | ChatGPT, AI Chatbot | Depression | Assessment, Support | 20 conversational examples (4 examples x 5 depression levels) that were completed with an average of 28 conversation turns, and a total of 562 conversation turns | Depression level | Model cannot classify single input into multiple topics; RAG implementation relies on carefully selected resources | Adopt more sophisticated strategies to address class imbalance |
| Wagay et al. 2024 India (Wagay and Altaf, 2025) | Quantitative | MentalRoBERTa(6), Capsule Layer, LIME (Local Interpretable Model-agnostic Explanations) | General mental health | Classification | 3,553 posts from Reddit | Recall, F1-Score | Platform-specific language remains a challenge; F1/Recall is limited for class imbalance; LIME only offers local, approximate explanations | Adopt more sophisticated strategies to address issues |
| Wang et al. 2024 China (Wang et al., 2025) | Quantitative | ScaleLLM | Depression | Assessment | Responses from 70,692 participants | Accuracy, Precision, F1-score | Limited capability to process structured data; Rapid evolution of research may outpace LLM updates; Cross-cultural/language adaptability is a major challenge; Lack of real-world clinical validation | Additional language alignment steps; Cross-cultural adaptability; Rigorous testing and validation in practical clinical scenarios |
| Wu et al. 2024 China (Wu et al., 2024) | Mixed | AI Chatbot | Understanding, Comforting, Evoking, and scaffolding habits | Persuasion | 5-week field experiment (N = 25) | Kruskal-Wallis test, Intervention Acceptance Rate | Experimental group limited to young adults; Short field experiment time; Validity/reliability needs improvement; Detection relies on self-reporting; Only considers initial use; GPT-3.5 performance is unstable | Expand sample size and diversity; Improve experimental design; Enhance validity and reliability tests; Explore lighter and more robust LLMs |
| Zhang et al. 2024 Australia (Zhang T. et al., 2024) | Quantitative | ChatGPT, Zero-shot learning | General mental health | Prediction | Investigation of 150 university students | Zero-shot Mean Absolute Errors | Subjectivity of self-reported datasets; Imbalanced class distributions lead to model bias | Conduct fine-tuning tasks for daily activity-driven models; Increase dataset size or use resampling techniques |
| Zhang et al. 2024 China (Zhang X. et al., 2024) | Quantitative | AI Chatbot | Depression | Diagnosis, Detection | 1,339 conversations from a depression diagnosis dataset | Precision, Recall, F1-score | Inappropriate for real clinical application; Chinese conversational agent; Lacks reliable strategy for optimal training stopping point | Invite wider community participation to enhance the model; Work with different languages |
Overview of study characteristics.
3.1 Characteristics of studies
The surge in research on LLMs in mental health is driven by both rapid technological breakthroughs and global public health necessity. All 29 papers analyzed were published or accepted within the extremely narrow timeframe of 2024 to 2025. This trend is a direct result of the revolutionary leaps made by general-purpose LLMs, like GPT-4, since 2023, particularly in complex reasoning, emotional understanding, and generating high-quality, human-like text. This technological capability intersected with the deepening global mental health resource crisis, in which hundreds of millions of people lack effective psychological support. Consequently, the research focus has fundamentally shifted: LLMs are now viewed as key strategic digital assets and are being developed as Digital Mental Health Agents capable of offering clinical decision support, multimodal data integration, and personalized therapeutic interventions, leveraging their low marginal cost and high scalability to address the resource gap.
The geographic distribution of the research (see Figure 2), with China, the US, and Israel leading the list, clearly illustrates a stratification based on regional economic strength, technological maturity, and public health strategy. China, with 5 papers, leads the world in output, a pattern reflecting its national strategy of localizing foundational AI technology. Chinese research heavily focuses on developing specialized Chinese foundation models and building knowledge-guided therapeutic applications, aiming to address the massive mental health resource deficit within its vast population and unique cultural context, with an emphasis on model professionalism, interpretability, and safety. The US (3 papers) and Israel (3 papers) form the next tier, but with distinct foci: the US leverages its leading data infrastructure and advanced clinical IT systems (EHRs, large-scale social media data) to pursue automated risk prediction and deep integration into clinical workflows. Conversely, Israel, a high-tech innovation hub, focuses on the ethical and psychological depth of AI, concentrating on LLMs' capacity for mentalization, emotional intelligence, and rigorous evaluation of their alignment with human values before widespread deployment.
Figure 2
The middle tier, with 2 papers each from Australia, Canada, Italy, Korea, and India, represents specialized technological penetration. For example, Italy’s work is tailored to its mature but strictly regulated public healthcare system, developing RAG models based on ICD-11 to function as expert clinical decision assistants. Finally, eight papers, each contributed by a single nation (including Brazil, Saudi Arabia, the UK, and Spain), often exhibit high pragmatism and cultural adaptation. These regions, frequently facing economic and clinical resource constraints, focus on high-yield, low-cost solutions addressing localized pain points. For instance, Saudi Arabian researchers study communication errors in Arabic support systems to ensure cross-cultural applicability, while Brazilian work explores multimodal expert systems integrating text and non-text social data, highlighting a global trend toward diversified, culturally sensitive, and cost-effective LLM deployment against the universal mental health crisis.
3.2 Research method of studies
The overall methodological distribution in this research reveals a structural balance between technical feasibility and clinical prudence, signifying that the study of LLMs in mental health has advanced to a phase of deep interdisciplinary validation. Out of the total 29 papers analyzed (see Figure 3), 16 papers (approximately 55.2%) utilized a purely quantitative research methodology. This segment is primarily driven by computer science and engineering, focusing on quantifiable technical metrics such as model performance, diagnostic accuracy, data prediction capabilities, and system efficiency, thus establishing the technical feasibility of LLMs as digital healthcare tools. Closely following this, 13 papers (approximately 44.8%) employed mixed research methods, combining quantitative and qualitative approaches. The near-equal split highlights a critical consensus among researchers: in the complex domain of human mental health, technical metrics alone are insufficient. The prevalence of mixed methods demonstrates the essential need to integrate rigorous performance indicators with subjective data on user experience, ethical considerations, cultural adaptability, and clinical acceptance, marking a mature shift in the field from merely asking “what can it do” to “how can it be deployed safely and responsibly.”
Figure 3
3.3 LLM technology application landscape in mental health
The comprehensive analysis, drawing on nine distinct technology-combination data points, charts a sophisticated research landscape in mental health LLM applications: general models serve as the reference point, domain specialization as the driving force, and clinical trustworthiness as the core architectural principle. ChatGPT, the most frequently mentioned single model (6 mentions, 16.2%), represents the baseline for general capabilities, and its prominent pairing with zero-shot learning (5 mentions, 13.5%) highlights the field’s successful effort to leverage general LLMs for efficient, low-resource task deployment. Crucially, the dominant research focus has shifted decisively toward vertical domain specialization, evidenced by the overwhelming proportion of dedicated BERT-family specialized models, including MentalBERT, MentaLLaMA, BioGPT, DeBERTa, MentalRoBERTa, and nBERT (27 total mentions in combinations). This confirms the community’s recognition that achieving the higher clinical accuracy and professional controllability required for complex mental health diagnostics demands models trained and fine-tuned on specialized psychological and medical corpora, transcending the limitations of generic language understanding.
To ensure safety and confidence in sensitive clinical use cases, trustworthiness mechanisms are deeply integrated into the architecture. The combination of ChatGPT with RAG (3 mentions, 8.1%) is a key strategy for ensuring factual accuracy by anchoring generated responses to verified knowledge, effectively mitigating the common problem of model “hallucination.” Simultaneously, the explicit incorporation of explainable AI (XAI), highlighted by the high-frequency appearance of the explainability of Bayesian networks (4 mentions, 10.8%) and the MentalRoBERTa architecture utilizing LIME (4 mentions, 10.8%), establishes the provision of transparent, traceable decision rationales as an indispensable technical requirement, signaling that the field has unequivocally transitioned into a systematic deployment phase centered on safety, professionalism, and verifiable trust.
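As a rough illustration of how retrieval grounding constrains generation, the minimal Python sketch below ranks snippets from a tiny stand-in knowledge base by token overlap with the user’s question and prepends the best match to the prompt. The snippets, function names, and scoring scheme are invented for illustration only; the reviewed systems use far richer retrievers and curated clinical sources.

```python
# Illustrative sketch of the retrieval step in a RAG pipeline.
# All knowledge snippets below are invented stand-ins, not clinical guidance.

def tokenize(text: str) -> set[str]:
    """Lowercase bag-of-words tokenization (commas/periods stripped)."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())

def retrieve(query: str, knowledge_base: list[str], k: int = 1) -> list[str]:
    """Rank knowledge snippets by token overlap with the query."""
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(tokenize(doc) & tokenize(query)),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(query: str, knowledge_base: list[str]) -> str:
    """Compose a prompt that anchors the model to retrieved evidence."""
    context = "\n".join(retrieve(query, knowledge_base))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

# Tiny stand-in knowledge base (invented examples for illustration only).
kb = [
    "Persistent low mood lasting two weeks is a core criterion for a depressive episode.",
    "Exposure therapy is a first-line behavioural treatment for specific phobias.",
]

prompt = build_grounded_prompt("What duration of low mood suggests a depressive episode?", kb)
print(prompt.splitlines()[1])  # prints the top-ranked snippet used as context
```

Anchoring the answer to a retrieved snippet in this way is what allows a deployed system to trace each response back to a verified source rather than relying on free-form model recall.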
3.4 Mental health issues
The largest category, “General mental health” (34.5%), establishes the LLM’s role as a large-scale, low-threshold psychological support system, primarily concentrating on non-diagnostic, generalized frameworks for psychological companionship and behavioral intervention, such as providing understanding, comforting, evoking, and scaffolding habits. This reflects the model’s immense value as a universal response to the scarcity of mental health resources. However, the research focus quickly shifts to specific disorders with high clinical demand and social impact. Depression has emerged as the central research focus due to its high prevalence, whether as a single illness (20.7%) or as the primary comorbidity foundation for common conditions such as anxiety, accounting for nearly half of all disorder combinations mentioned. This intense focus on depression signifies the immense potential of LLMs in the early screening, severity assessment, and high-risk prediction of mood disorders. Furthermore, research is actively expanding into more complex, trauma-related disorders (e.g., PTSD) and difficult-to-manage comorbidities such as borderline personality disorder (BPD), alongside specialized studies of extremely high-risk, life-threatening issues such as suicidality (including standalone and combined counts); such work requires LLMs to possess heightened professional ethics and granular reasoning capabilities. Overall, the application of LLMs in mental health is transitioning from basic text analysis to a specialized toolset designed for the precise identification of high-risk individuals, assistance with complex clinical diagnosis, and provision of professional risk warnings.
3.5 Data sources for the LLM mental health
The data sources across these 29 articles reveal a dual driving force in LLM mental health research: the need to quantify vast unstructured data and simultaneously deepen clinical expertise, reflecting researchers’ focus on both high-throughput screening and professional-grade validation. Social media platforms are the unequivocally dominant source, providing a massive corpus of raw, real-world, unstructured linguistic data, with millions of posts and messages from Reddit and Twitter/X, including one dataset contributing up to 19.4 million messages. This heavy reliance on online text is the primary characteristic, aiming to leverage LLMs for high-throughput, real-time detection and risk prediction of mood disorders, particularly depression. Second, the sources underscore a systematic pursuit of professional knowledge and structured data, encompassing knowledge bases created from the ICD-11 classification system, biomedical documents like PubMed abstracts, and clinical records from clinic terms, which are critical cornerstones for building clinical decision support systems and knowledge-augmented LLMs. Furthermore, model validation and optimization are achieved through diverse customized data collection, including large-scale multi-national participant surveys, specific scenario short narratives, and both human-generated task instances and small-scale clinical field studies, signaling a shift from pure text mining towards comprehensive validation of domain-specific customization, ethical alignment, and practical clinical efficacy.
3.6 Application performance metrics
The core assessment relies heavily on standard quantitative metrics, with combinations of Accuracy, Precision, Recall, and the F1-score dominating the field, reflecting a primary goal of effective high-throughput screening and detection of mental disorders. Metrics like AUC and ROC further confirm the emphasis on the LLMs’ discriminatory power in risk prediction and binary classification tasks. Crucially, the analysis extends beyond purely technical classification into professional domains: the inclusion of psychometric measures such as Internal Reliability, Split-half Reliability, and Confirmatory Factor Analysis is utilized to ensure the reliability and validity of the psychological constructs being modeled. Furthermore, the appearance of clinical and outcome metrics like Hazard Ratios and T-tests, coupled with the emerging focus on XAI methods (e.g., LIME) and qualitative criteria (e.g., Coherence, Veracity, and Evidence), signals a strategic shift. This transition highlights a commitment to moving LLM applications from simple black-box classifiers to trustworthy, ethically aligned, and clinically interpretable decision-support tools.
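For readers less familiar with these screening metrics, the following minimal Python sketch (with invented example counts) shows how accuracy, precision, recall, and the F1-score are derived from a binary confusion matrix, the computation underlying most of the reviewed evaluations.

```python
# Minimal illustration of the screening metrics dominating the reviewed studies,
# computed from binary confusion-matrix counts. The counts below are invented.

def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)       # share of flagged cases that are true cases
    recall = tp / (tp + fn)          # sensitivity: share of true cases caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical screening run: 80 true positives, 20 false alarms,
# 10 missed cases, 890 correctly cleared.
m = classification_metrics(tp=80, fp=20, fn=10, tn=890)
print({k: round(v, 3) for k, v in m.items()})
```

The example also illustrates why accuracy alone is misleading for imbalanced screening data: with few true cases, a model can score high accuracy while missing many of them, which is why recall and F1 are weighted so heavily in this literature.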
4 Discussion
4.1 Main findings and results of studies
This systematic analysis of research on LLM applications in mental health reveals a field undergoing rapid acceleration within an extremely recent timeframe, driven by unprecedented technological leaps and the urgent global necessity of addressing the immense mental health crisis. This confluence has established LLMs as essential digital mental health agents. Geographically, the research output is concentrated in a handful of technologically advanced and economically strong nations. This includes a major Asian country focusing on specialized national foundation models, localization, and ethical safety, alongside the United States, which leverages its robust data infrastructure and clinical IT systems for automated risk prediction. Another key innovation hub is concentrating on the ethical and psychological depth of AI, specifically evaluating LLMs’ capacity for mentalization and alignment with human values. Methodologically, the field is characterized by an interdisciplinary phase, balancing extensive quantitative research with near-equal attention to mixed methods that integrate technical performance with qualitative data on user experience, ethics, and cultural adaptability. Technologically, while general models serve as a capability baseline, the overwhelming focus has shifted to vertical domain specialization through dedicated, fine-tuned models, a move deemed critical for achieving the necessary clinical accuracy. This specialized architecture is heavily reinforced by trustworthiness mechanisms, such as RAG, to anchor responses to verified knowledge, and the strong integration of XAI methods, ensuring transparent and traceable clinical decision rationales. In terms of application, the largest focus is on providing broad, low-threshold General mental health support, but research is intensely concentrated on common conditions like depression and is actively expanding to complex, high-stakes disorders, including PTSD, BPD, and Suicidality. 
Finally, the data fueling these advances follows a dual track: it relies heavily on massive, unstructured social media data for real-time high-throughput screening, while also systematically incorporating structured professional knowledge bases and clinical records to build expert-grade decision support systems.
4.2 Technical, ethical, and practical limitations and risks of LLMs in mental health
4.2.1 Technical and clinical limitations
The core technical risk of LLMs lies in the deficiency of their clinical accuracy and robustness, particularly when handling high-risk scenarios (Gomes, 2024). For instance, while LLMs have been applied to analyze online discussions to identify high-risk behaviors such as suicidal ideation and emotional distress, their precision and reliability in predictive and interventional tasks have not yet met strict clinical standards (Lashgari et al., 2025). A major flaw is the tendency of models to generate “hallucinations,” producing seemingly plausible but false or incorrect clinical information (Kim et al., 2025), which poses a serious threat to diagnostic decision-support systems like LLMind Chat (Cremaschi et al., 2025). Furthermore, the predictive capacity of LLMs is heavily dependent on the quality of their training data (Yang et al., 2024). Many existing models lack sufficient critical prior knowledge and evidence-based medicine (EBM) data, preventing them from offering deep reasoning and assessments supported by clinical evidence for complex psychological issues (Youngstrom et al., 2017).
4.2.2 Multimodal data challenges
In the realm of multimodal data analysis, such as integrating EEG and physiological signals with text, LLMs show potential, yet they face major technical hurdles (AlSaad et al., 2024). These include data heterogeneity, a lack of interoperability between different sensor systems, and the challenge of establishing a clear clinical correlation between raw sensor data and a person’s mental health status. Effective analysis and reasoning on this disparate data are currently limited (Mezghani et al., 2015).
4.2.3 Practical and cultural barriers
On a practical level, cultural sensitivity represents a significant obstacle. Studies show that even advanced models (like GPT-4o) struggle markedly to identify culturally embedded high-risk narratives (Kazemi et al., 2024): for example, models have failed to detect risk signals for filicide-suicide and show limitations in processing subtle psychological cues specific to certain cultures (Chen et al., 2025). In non-Western linguistic contexts, such as Arabic mental health support inquiries, models like ChatGPT have exhibited clear communication errors, indicating not just a linguistic barrier but a profound lack of understanding of non-mainstream emotional expressions, customs, and help-seeking behaviors (Aleem et al., 2024). In resource-scarce regions (e.g., Africa), LLMs trained on Western principles such as Cognitive Behavioral Therapy (CBT) lack cultural resonance with local values, limiting user trust, engagement, and effectiveness (Igwe and Durrhiem, 2025).
4.2.4 Ethical and value-alignment risks
The primary ethical risk centers on the model’s value alignment and transparency. The opaque alignment processes of LLMs can unintentionally embed and amplify societal biases, leading to advice that is prejudiced or clinically problematic, potentially harming vulnerable help-seekers (Liu et al., 2023). Concurrently, the over-humanization of conversational AI may blur the lines of the professional therapeutic relationship, leading users to develop unrealistic clinical expectations and dependency (Ngo, 2025). Finally, the handling of highly personal and sensitive mental health data raises acute privacy and security concerns, requiring robust regulatory frameworks to prevent catastrophic data breaches (Kwesi et al., 2025).
4.3 Future research directions and development trends
Based on an analysis of existing research, the future direction of LLMs in mental health will be structured around four core pillars: deep specialization, multimodal fusion, ethical framework development, and global cultural adaptability.
4.3.1 Deep technical specialization
The future trend involves moving beyond general-purpose models to develop specialized LLMs dedicated solely to psychological health. These models will be trained on high-quality, evidence-based psychological datasets that include not just single-turn QA but also multi-turn dialogues and real-world case backgrounds augmented by evidence judgment to ensure deep psychological comprehension and evaluation. Research will continue to enhance the performance of unified information extraction, especially for specific languages like Chinese, by introducing components like type verification for more accurate identification of emotions, psychological states, and underlying issues from unstructured text.
4.3.2 Multimodal integration and empathic LLMs
Multimodal integration is poised to be a breakthrough direction, focusing on how to effectively fuse text, physiological signals (e.g., wearable data, EEG), and behavioral health data. Future Physiology-Driven empathic LLMs (Dongre, 2024) will utilize sophisticated techniques like Science-Guided ML (Sharma and Liu, 2022) to automatically extract features from raw physiological data, enabling the model to achieve precise prediction and contextual awareness of the user’s emotional state, thereby providing highly personalized and empathic interventions.
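As a schematic illustration of the fusion step only (not the method of any reviewed study), the sketch below concatenates a stand-in text embedding with min-max-scaled physiological readings into a single feature vector for a downstream classifier. All values, ranges, and function names are invented; real systems would use learned encoders per modality.

```python
# Schematic late-fusion of a text embedding with physiological features.
# Vectors and physical ranges below are invented for illustration.

def min_max_scale(values: list[float], lo: float, hi: float) -> list[float]:
    """Scale raw sensor readings into [0, 1] given an expected physical range."""
    return [(v - lo) / (hi - lo) for v in values]

def fuse(text_embedding: list[float], physio: list[float]) -> list[float]:
    """Concatenate modality vectors into one input for a downstream classifier."""
    return list(text_embedding) + list(physio)

text_vec = [0.12, -0.40, 0.88]                       # stand-in sentence embedding
hr = min_max_scale([72.0], 40, 180)                  # heart rate, bpm
eda = min_max_scale([4.2], 0, 20)                    # electrodermal activity, µS
fused = fuse(text_vec, hr + eda)
print(len(fused), fused[3:])
```

Even this toy version surfaces the interoperability problem noted above: each sensor stream needs its own plausible physical range and alignment before the modalities can be combined at all.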
4.3.3 Clinical reasoning and decision support enhancement
Research will prioritize enhancing LLMs’ clinical reasoning capabilities. This includes developing advanced Chain-of-Thought prompting methods to guide models in complex synthesis and reasoning of multi-sensor data, transforming data classification into deep clinical insights for conditions like depression and anxiety. Furthermore, the RAG architecture will be optimized to verify knowledge in real-time from authoritative diagnostic manuals, serving as a core component of powerful clinical decision support systems and ensuring the professional accuracy of diagnostic suggestions and intervention plans.
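A minimal sketch of the prompting idea, assuming a hypothetical set of reasoning steps and sensor summaries (none of which are taken from the reviewed systems): the scaffold assembles multi-source evidence and explicit reasoning steps into a single prompt, which a real deployment would then send to an LLM.

```python
# Illustrative chain-of-thought prompt scaffold for multi-sensor synthesis.
# The step wording and evidence summaries are hypothetical examples.

REASONING_STEPS = [
    "1. Summarize each data source separately.",
    "2. Note agreements and contradictions across sources.",
    "3. Map the combined picture onto clinical criteria.",
    "4. State a tentative assessment with an uncertainty estimate.",
]

def build_cot_prompt(sources: dict[str, str]) -> str:
    """Assemble evidence plus explicit reasoning steps into one prompt."""
    evidence = "\n".join(f"- {name}: {summary}" for name, summary in sources.items())
    steps = "\n".join(REASONING_STEPS)
    return f"Evidence:\n{evidence}\n\nReason step by step:\n{steps}"

prompt = build_cot_prompt({
    "diary text": "reports low energy and poor sleep for three weeks",
    "wearable": "resting heart rate elevated; nightly sleep under five hours",
})
print(prompt)
```

Making the reasoning steps explicit in the prompt is what turns raw multi-sensor classification into an auditable chain of intermediate conclusions, which is the property clinical decision support requires.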
4.3.4 Ethical frameworks and global cultural adaptation
Future research will place a strong emphasis on model value alignment and cultural fairness. This involves using frameworks like Schwartz’s theory of basic values to conduct continuous, systematic evaluation and correction of LLMs’ intrinsic values, ensuring their decisions align with core human values and avoiding the embedding of harmful biases. Furthermore, research into Africa-centric LLM frameworks will aim to integrate CBT principles with indigenous values like Ubuntu through fine-tuning (Forane et al., 2024), boosting the cultural relevance of LLMs, and providing support that is globally diverse, equitable, and inclusive. Ultimately, the trend is for LLMs to evolve from simple Q&A tools into highly specialized, culturally intelligent “Health Agents” that safely and reliably alleviate the global shortage of mental health resources.
4.4 Limitations
The conclusions of this systematic scoping review, which synthesizes the current literature on the boundaries, risks, and future trends of LLMs in mental health, are inevitably subject to the following key limitations. The current evidence base is largely restricted to preliminary exploratory studies and proof-of-concept analyses, significantly lacking the clinical rigor of large-scale randomized controlled trials and thus precluding high-level, evidence-based assessment of LLM efficacy, safety, and long-term impact. Compounding this is the rapid, near-instantaneous evolution of LLM technology, which means the literature analyzed may be quickly outdated, posing a severe timeliness challenge in capturing the newest breakthroughs and emergent risks. Furthermore, a heavy reliance on training data from predominantly English and Chinese contexts results in models with documented deficiencies in cultural sensitivity and language generalizability when dealing with non-Western or minority language narratives, highlighting fundamental issues of cultural equity. Finally, the proprietary and opaque ‘black-box’ nature of many high-performance LLMs restricts systematic scrutiny of their embedded ethical biases and value alignment, severely limiting the reproducibility of academic findings.
5 Conclusion
This systematic scoping review aims to systematically explore the boundaries of LLMs in mental health applications, summarizing their core technological pathways, inherent limitations, and future development trends in diagnosis, intervention, and risk prediction. The findings reveal that LLMs have rapidly evolved from simple text analyzers into “Health Agents” capable of integrating multimodal data. Through such techniques and the development of specialized models, LLMs are offering novel solutions to alleviate the global shortage of mental health resources. The core strengths of LLMs lie in their advanced language understanding, potential for multimodal fusion, and significant capability to provide personalized, knowledge-guided interventions. However, this review also clearly delineates major challenges facing the field. Technical limitations include the models’ susceptibility to “hallucination,” a lack of clinical evidence-based support, and insufficient robustness and accuracy in high-risk scenarios. Ethical risks are concentrated on the non-transparent value alignment of models, the potential to embed and amplify cultural biases, and the dependency and blurring of the therapeutic relationship caused by over-humanization. Furthermore, the models’ lack of cultural sensitivity in non-Western cultural contexts severely restricts their global scalability and effectiveness. Looking forward, research should focus on: deep specialization, developing psychological professional LLMs based on authoritative psychological and EBM data; advancing cultural equity, developing culturally adaptive LLM frameworks; and establishing regulatory and ethical frameworks to ensure the transparency, trustworthiness, and safe handling of high-risk behaviors by the models.
Only through interdisciplinary collaboration and rigorous clinical validation can LLMs be safely and equitably integrated into the mental health service ecosystem to fulfill their immense potential in addressing global psychological distress.
Statements
Author contributions
JY: Conceptualization, Methodology, Writing – original draft. TL: Data curation, Software, Writing – original draft. YL: Software, Visualization, Writing – original draft. TN: Writing – review & editing. PP: Supervision, Writing – review & editing. AX: Validation, Writing – original draft. QY: Data curation, Writing – original draft.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This research was funded by the Macao Science and Technology Development Fund (FDCT; funding ID: 0032/2025/ITP1).
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that Generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2025.1715306/full#supplementary-material
References
1
AbdullahM.NegiedN. (2024). Detection and prediction of future mental disorder from social media data using machine learning, ensemble learning, and large language models. IEEE Access12, 120553–120569. doi: 10.1109/ACCESS.2024.3406469
2
Al-OtaibiG. M.AlotaibiH. M.AlsalmiS. S. (2025). Communication errors in human–Chatbot interactions: a case study of ChatGPT Arabic mental health support inquiries. Behav. Sci.15:1119. doi: 10.3390/bs15081119,
3
AleemM.ZahoorI.NaseemM. (2024). Towards culturally adaptive large language models in mental health: using ChatGPT as a case study. Companion Publication of the 2024 Conference on Computer-Supported Cooperative Work and Social Computing.
4
AlSaadR.Abd-AlrazaqA.BoughorbelS.AhmedA.RenaultM.-A.DamsehR.et al. (2024). Multimodal large language models in health care: applications, challenges, and future outlook. J. Med. Internet Res.26:e59505. doi: 10.2196/59505,
5
BartalA.JagodnikK. M.ChanS. J.DekelS. (2024). AI and narrative embeddings detect PTSD following childbirth via birth stories. Sci. Rep.14:8336. doi: 10.1038/s41598-024-54242-2,
6
BauerB.NorelR.LeowA.RachedZ. A.WenB.CecchiG. (2024). Using large language models to understand suicidality in a social media–based taxonomy of mental health disorders: linguistic analysis of reddit posts. JMIR mental health11:e57234. doi: 10.2196/57234,
7
BelcastroL.CantiniR.MarozzoF.TaliaD.TrunfioP. (2025). Detecting mental disorder on social media: a ChatGPT-augmented explainable approach. Online Soc. Netw. Media48:100321. doi: 10.1016/j.osnem.2025.100321
8
CaiZ.FangH.LiuJ.XuG.LongY.GuanY.et al. (2025). Improving unified information extraction in Chinese mental health domain with instruction-tuned LLMs and type-verification component. Artif. Intell. Med.162:103087. doi: 10.1016/j.artmed.2025.103087,
9
CardamoneN. C.OlfsonM.SchmutteT.UngarL.LiuT.CullenS. W.et al. (2025). Classifying unstructured text in electronic health records for mental health prediction models: large language model evaluation study. JMIR Med. Inform.13:e65454. doi: 10.2196/65454,
10
ChenC.-C.ChenJ. A.LiangC.-S.LinY.-H. (2025). Large language models may struggle to detect culturally embedded filicide-suicide risks. Asian J. Psychiatr.105:104395. doi: 10.1016/j.ajp.2025.104395,
11
ChowJ. C.LiK. (2025). Large language models in medical chatbots: opportunities, challenges, and the need to address AI risks. Information16:549. doi: 10.3390/info16070549
12
CremaschiM.DitolveD.CurcioC.PanzeriA.SpotoA.MaurinoA. (2025). Decoding the mind: a RAG-LLM on ICD-11 for decision support in psychology. Expert Syst. Appl.279:127191. doi: 10.1016/j.eswa.2025.127191
13
DongreP. (2024) Physiology-driven empathic large language models (EmLLMs) for mental health support. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems,
14
Dos SantosW. R.ParaboniI.MatsushimaE. H.Da SilvaC. A.de Moura MeiraE. S.GuimarãesJ. V. R. F.et al. (2025). Mixture of experts for depression and anxiety disorder prediction from textual and non-textual social media data. IEEE Access.99:1. doi: 10.1109/ACCESS.2025.3583259
15
FanX.YangL.WangX.LyuD.ChenH. (2024). Constructing a knowledge-guided mental health chatbot with LLMS. The 16th Asian Conference on Machine Learning (Conference Track).
16
FennigU.Yom-TovE.SavitzkyL.NissanJ.AltmanK.LoebensteinR.et al. (2025). Bridging the conversational gap in epilepsy: using large language models to reveal insights into patient behavior and concerns from online discussions. Epilepsia66, 686–699. doi: 10.1111/epi.18226,
17
FishF. J.CaseyP.CaseyP. R.KellyB. (2024). Fish's clinical psychopathology: Signs and symptoms in psychiatry. Cambridge: Cambridge University Press.
18
ForaneS. G.EzugwuA. E.IgweK. (2024). Evaluating the cultural sensitivity of large language models in mental health support: a framework inspired by Ubuntu values. International Conference on Big Data Analytics.
Gao, Y., Fu, J., Guo, L., and Liu, H. (2025). Leveraging large language models for spontaneous speech-based suicide risk detection. arXiv:2507.00693 [Preprint]. doi: 10.48550/arXiv.2507.00693
Gautam, D., and Kellmeyer, P. (2025). Exploring the credibility of large language models for mental health support: protocol for a scoping review. JMIR Res. Protoc. 14:e62865. doi: 10.2196/62865
Gomes, T. (2024). The role of large language models in mental health: a scoping review. Universidade Catolica Portuguesa (Portugal): PQDT-Global.
Hadar-Shoval, D., Asraf, K., Mizrachi, Y., Haber, Y., and Elyoseph, Z. (2024). Assessing the alignment of large language models with human values for mental health integration: cross-sectional study using Schwartz's theory of basic values. JMIR Mental Health 11:e55988. doi: 10.2196/55988
Han, S., Wang, M., Zhang, J., Li, D., and Duan, J. (2024). A review of large language models: fundamental architectures, key technological evolutions, interdisciplinary technologies integration, optimization and compression techniques, applications, and challenges. Electronics 13:5040. doi: 10.3390/electronics13245040
Ibrahim, I. M. B., Maskat, R., Aminordin, A. B., and Teo, N. H. I. (2024). Classification of mental health conditions in Reddit post using multinomial naïve Bayes algorithm. 2024 IEEE 22nd Student Conference on Research and Development (SCOReD).
Igwe, K., and Durrhiem, K. (2025). A scoping review of culturally sensitive large language models-based cognitive behavioural therapy for anxiety and depression: global lessons for African implementation. Interdiscip. J. Sociality Stud. 5:a06-a06. doi: 10.38140/ijss-2025.vol5.1.06
James, L. J., Maessen, M., Genga, L., Montagne, B., Hagenaars, M. A., and Van Gorp, P. M. (2023). Towards augmenting mental health personnel with LLM technology to provide more personalized and measurable treatment goals for patients with severe mental illnesses. International Conference on Pervasive Computing Technologies for Healthcare.
Jin, Y., Liu, J., Li, P., Wang, B., Yan, Y., Zhang, H., et al. (2025). The applications of large language models in mental health: scoping review. J. Med. Internet Res. 27:e69284. doi: 10.2196/69284
Karamat, A., Imran, M., Yaseen, M. U., Bukhsh, R., Aslam, S., and Ashraf, N. (2024). A hybrid transformer architecture for multiclass mental illness prediction using social media text. IEEE Access 99:1. doi: 10.1109/ACCESS.2024.3519308
Kazemi, S., Gerhardt, G., Katz, J., Kuria, C. I., Pan, E., and Prabhakar, U. (2024). Cultural fidelity in large-language models: an evaluation of online language resources as a driver of model performance in value representation. arXiv:2410.10489 [Preprint]. doi: 10.48550/arXiv.2410.10489
Kharitonova, K., Pérez-Fernández, D., Gutiérrez-Hernando, J., Gutiérrez-Fandiño, A., Callejas, Z., and Griol, D. (2025). Incorporating evidence into mental health Q&A: a novel method to use generative language models for validated clinical content extraction. Behav. Inform. Technol. 44, 2333–2350. doi: 10.1080/0144929X.2024.2321959
Kim, T., Bae, S., Kim, H. A., Lee, S.-W., Hong, H., Yang, C., et al. (2024). MindfulDiary: harnessing large language model to support psychiatric patients' journaling. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems.
Kim, Y., Jeong, H., Chen, S., Li, S. S., Lu, M., Alhamoud, K., et al. (2025). Medical hallucinations in foundation models and their impact on healthcare. arXiv:2503.05777 [Preprint]. doi: 10.48550/arXiv.2503.05777
Kirkbride, J. B., Anglin, D. M., Colman, I., Dykxhoorn, J., Jones, P. B., Patalay, P., et al. (2024). The social determinants of mental health and disorder: evidence, prevention and recommendations. World Psychiatry 23, 58–90. doi: 10.1002/wps.21160
Kumar, A., Gupta, K., Verma, K., and Kumar, S. (2025a). AI-driven mental healthcare 5.0: a survey of opportunities and challenges in leveraging large language models and generative AI.
Kumar, A., Sharma, A., and Sangwan, S. R. (2025b). DynaMentA: dynamic prompt engineering and weighted transformer architecture for mental health classification using social media data. IEEE Trans. Comput. Soc. Syst. 12, 4193–4203. doi: 10.1109/TCSS.2025.3569400
Kwesi, J., Cao, J., Manchanda, R., and Emami-Naeini, P. (2025). Exploring user security and privacy attitudes and concerns toward the use of general-purpose LLM chatbots for mental health. 34th USENIX Security Symposium (USENIX Security 25).
Lashgari, F., Pourvahab, M., Sousa, A., Monteiro, A., and Pais, S. (2025). Risk-aware suicide detection in social media: a domain-guided framework with explainable LLMs. Int. J. Web Res. 8, 45–58. doi: 10.22133/ijwr.2025.525754.1288
Li, Y.-H., Li, Y.-L., Wei, M.-Y., and Li, G.-Y. (2024). Innovation and challenges of artificial intelligence technology in personalized healthcare. Sci. Rep. 14:18994. doi: 10.1038/s41598-024-70073-7
Li, J., Yang, Y., Mao, C., Pang, P. C.-I., Zhu, Q., Xu, D., et al. (2025). Revealing patient dissatisfaction with health care resource allocation in multiple dimensions using large language models and the International Classification of Diseases 11th Revision: aspect-based sentiment analysis. J. Med. Internet Res. 27:e66344. doi: 10.2196/66344
Lin, C., and Kuo, C.-F. (2025). Roles and potential of large language models in healthcare: a comprehensive review. Biom. J. 48:100868. doi: 10.1016/j.bj.2025.100868
Liu, T., Cheng, Y., Luo, Y., Wang, Z., Pang, P. C.-I., Xia, Y., et al. (2024). The impact of social media on children's mental health: a systematic scoping review. Healthcare 12:2391. doi: 10.3390/healthcare12232391
Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Guo, R., Cheng, H., et al. (2023). Trustworthy LLMs: a survey and guideline for evaluating large language models' alignment. arXiv:2308.05374 [Preprint]. doi: 10.48550/arXiv.2308.05374
Lozoya, D. C., Conway, M., De Duro, E. S., and D'Alfonso, S. (2025). Leveraging large language models for simulated psychotherapy client interactions: development and usability study of Client101. JMIR Med. Educ. 11:e68056. doi: 10.2196/68056
Luo, Y., Zhang, R., Wang, F., and Wei, T. (2023). Customer segment classification prediction in the Australian retail based on machine learning algorithms. Proceedings of the 2023 4th International Conference on Machine Learning and Computer Application.
Malhotra, A., and Jindal, R. (2024). XAI transformer based approach for interpreting depressed and suicidal user behavior on online social networks. Cogn. Syst. Res. 84:101186. doi: 10.1016/j.cogsys.2023.101186
Mattos, S. M., Cestari, V. R. F., and Moreira, T. M. M. (2023). Scoping protocol review: PRISMA-ScR guide refinement. Rev. Enferm. UFPI 12:e3062. doi: 10.26694/reufpi.v12i1.3062
McGowan, J., Straus, S., Moher, D., Langlois, E. V., O'Brien, K. K., Horsley, T., et al. (2020). Reporting scoping reviews—PRISMA ScR extension. J. Clin. Epidemiol. 123, 177–179. doi: 10.1016/j.jclinepi.2020.03.016
Mezghani, E., Exposito, E., Drira, K., Da Silveira, M., and Pruski, C. (2015). A semantic big data platform for integrating heterogeneous wearable data in healthcare. J. Med. Syst. 39:185. doi: 10.1007/s10916-015-0344-x
Montejo-Raez, A., Molina-Gonzalez, M. D., Jimenez-Zafra, S. M., Garcia-Cumbreras, M. A., and Garcia-Lopez, L. J. (2024). A survey on detecting mental disorders with natural language processing: literature review, trends and challenges. Comput. Sci. Rev. 53:100654. doi: 10.1016/j.cosrev.2024.100654
Ngo, V. (2025). Humanizing AI for trust: the critical role of social presence in adoption. AI & Soc., 1–17. doi: 10.1007/s00146-025-02506-4
Nowacki, A., Sitek, W., and Rybiński, H. (2025). LLM-based classifiers for discovering mental disorders. J. Intell. Inf. Syst., 1–18. doi: 10.1007/s10844-025-00934-8
Omopo, O. E. (2024). Exploring post-traumatic stress disorder: causes, diagnostic criteria, and treatment options. Int. J. Acad. Inf. Syst. Res. 8, 35–44.
Orrù, G., Melis, G., and Sartori, G. (2025). Large language models and psychiatry. Int. J. Law Psychiatry 101:102086. doi: 10.1016/j.ijlp.2025.102086
Owen, D., Lynham, A. J., Smart, S. E., Pardinas, A. F., and Camacho Collados, J. (2024). Artificial intelligence for analyzing mental health disorders in social media: a quarter-century narrative review of progress and challenges. J. Med. Internet Res.
Pan, Z., Park, C., Brietzke, E., Zuckerman, H., Rong, C., Mansur, R. B., et al. (2019). Cognitive impairment in major depressive disorder. CNS Spectr. 24, 22–29. doi: 10.1017/S1092852918001207
Pang, P. C.-I., Chang, S., Verspoor, K., and Clavisi, O. (2018). The use of web-based technologies in health research participation: qualitative study of consumer and researcher experiences. J. Med. Internet Res. 20:e12094. doi: 10.2196/12094
Park, C., Lee, H., and Jeong, O. R. (2024). Leveraging medical knowledge graphs and large language models for enhanced mental disorder information extraction. Future Internet 16:260. doi: 10.3390/fi16080260
Pavez, J., and Allende, H. (2024). A hybrid system based on Bayesian networks and deep learning for explainable mental health diagnosis. Appl. Sci. 14:8283. doi: 10.3390/app14188283
Perkins, A., Ridler, J., Browes, D., Peryer, G., Notley, C., and Hackmann, C. (2018). Experiencing mental health diagnosis: a systematic review of service user, clinician, and carer perspectives across clinical settings. Lancet Psychiatry 5, 747–764. doi: 10.1016/S2215-0366(18)30095-6
Radwan, A., Amarneh, M., Alawneh, H., Ashqar, H. I., AlSobeh, A., and Magableh, A. A. A. R. (2024). Predictive analytics in mental health leveraging LLM embeddings and machine learning models for social media analysis. Int. J. Web Serv. Res. 21, 1–22. doi: 10.4018/IJWSR.338222
Rowa, K., Waechter, S., Hood, H. K., and Antony, M. M. (2017). "Generalized anxiety disorder" in Psychopathology: History, diagnosis, and empirical foundations. 3rd Edn., 149–186.
Sharma, N., and Liu, Y. (2022). A hybrid science-guided machine learning approach for modeling chemical processes: a review. AIChE J. 68:e17609. doi: 10.1002/aic.17609
Sharma, C. M., Thein, K. Y. M., and Chariar, V. M. (2024). "Optimized support vector machines for detection of mental disorders" in Artificial intelligence in healthcare (CRC Press), 190–219.
Shayaninasab, M., Zahoor, M., and Yalçin, Ö. N. (2024). Enhancing patient intake process in mental health consultations using RAG-driven chatbot. 2024 12th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW).
Shi, Q. (2025). Chatbots in mental healthcare: developments and challenges. 2025 IEEE 26th China Conference on System Simulation Technology and its Applications (CCSSTA).
Stein, D. J., Shoptaw, S. J., Vigo, D. V., Lund, C., Cuijpers, P., Bantjes, J., et al. (2022). Psychiatric diagnosis and treatment in the 21st century: paradigm shifts versus incremental integration. World Psychiatry 21, 393–414. doi: 10.1002/wps.20998
Vigo, D., Thornicroft, G., and Atun, R. (2016). Estimating the true global burden of mental illness. Lancet Psychiatry 3, 171–178. doi: 10.1016/S2215-0366(15)00505-2
Vrdoljak, J., Boban, Z., Vilović, M., Kumrić, M., and Božić, J. (2025). A review of large language models in medical education, clinical decision support, and healthcare administration. Healthcare 13. doi: 10.3390/healthcare13060603
Wagay, F. A., and Altaf, Y. (2025). MentalRoBERTa-caps: a capsule-enhanced transformer model for mental health classification. MethodsX 15:103483. doi: 10.1016/j.mex.2025.103483
Wang, X., Zhou, Y., and Zhou, G. (2025). Enhancing health assessments with large language models: a methodological approach. Appl. Psychol. Health Well Being 17:e12602. doi: 10.1111/aphw.12602
World Health Organization (2022). World mental health report: Transforming mental health for all. Geneva: World Health Organization.
Wu, R., Yu, C., Pan, X., Liu, Y., Zhang, N., Fu, Y., et al. (2024). MindShift: leveraging large language models for mental-states-based problematic smartphone use intervention. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems.
Yang, J., Jin, H., Tang, R., Han, X., Feng, Q., Jiang, H., et al. (2024). Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond. ACM Trans. Knowl. Discov. Data 18, 1–26. doi: 10.1145/3653304
Youngstrom, E. A., Van Meter, A., Frazier, T. W., Hunsley, J., Prinstein, M. J., Ong, M. L., et al. (2017). Evidence-based assessment as an integrative model for applying psychological science to guide the voyage of treatment. Clin. Psychol. Sci. Pract. 24, 331–363. doi: 10.1111/cpsp.12207
Zhang, X., Cui, W., Wang, J., and Li, Y. (2024). Chat, summary and diagnosis: a LLM-enhanced conversational agent for interactive depression detection. 2024 4th International Conference on Industrial Automation, Robotics and Control Engineering (IARCE).
Zhang, T., Teng, S., Jia, H., and D'Alfonso, S. (2024). Leveraging LLMs to predict affective states via smartphone sensor features. Companion of the 2024 ACM International Joint Conference on Pervasive and Ubiquitous Computing.
Keywords
large language model, LLMs, mental health, mental illness, systematic scoping review
Citation
Yang J, Liu T, Luo YT, Niu T, Pang P, Xiang A and Yang Q (2026) Exploring the application boundaries of LLMs in mental health: a systematic scoping review. Front. Psychol. 16:1715306. doi: 10.3389/fpsyg.2025.1715306
Received
29 September 2025
Revised
08 December 2025
Accepted
22 December 2025
Published
27 February 2026
Volume
16 - 2025
Edited by
Wing-Yue Geoffrey Louie, Oakland University, United States
Reviewed by
Inez Y. Oh, Washington University in St. Louis, United States
Zhaoxi Fang, Shaoxing University, China
Copyright
© 2026 Yang, Liu, Luo, Niu, Pang, Xiang and Yang.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Patrick Pang, mail@patrickpang.net
ORCID: Jinhua Yang, orcid.org/0009-0004-4633-1830; Ting Liu, orcid.org/0009-0001-0331-262X; Yiming Taclis Luo, orcid.org/0009-0002-6117-738X; Tianyue Niu, orcid.org/0009-0008-3410-6301; Patrick Pang, orcid.org/0000-0002-8820-5443; Ao Xiang, orcid.org/0009-0003-8828-4510; Qin Yang, orcid.org/0009-0007-9843-514X
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.