DATA REPORT article

Front. Psychiatry

Sec. Digital Mental Health

Volume 16 - 2025 | doi: 10.3389/fpsyt.2025.1646974

Evaluation of Large Language Models on Mental Health: From Knowledge Test to Illness Diagnosis

Provisionally accepted
Yijun Xu1, Zhaoxi Fang1, Weinan Lin1, Yue Jiang1, Wen Jin1, Prasanalakshmi Balaji2, Jiangda Wang1, Ting Xia1*
  • 1Shaoxing University, Shaoxing, China
  • 2King Khalid University, Abha, Saudi Arabia

The final, formatted version of the article will be published soon.

Large language models (LLMs) have opened up new possibilities in the field of mental health, offering applications in areas such as mental health assessment, psychological counseling, and education. This study systematically evaluates 15 state-of-the-art LLMs, including DeepSeek-R1/V3 (March 24, 2025), GPT-4.1 (April 15, 2025), Llama4 (April 5, 2025), and QwQ (March 6, 2025, developed by Alibaba), on two key tasks: mental health knowledge testing and mental illness diagnosis. We use publicly available datasets, including Dreaddit, SDCNL, and questions from the CAS Counsellor Qualification Exam. Results indicate that DeepSeek-R1, QwQ, and GPT-4.1 outperform other models in both knowledge accuracy and diagnostic performance. Our findings highlight the strengths and limitations of current LLMs in mental health scenarios and provide clear guidance for selecting and improving models in this sensitive domain.
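The knowledge-test evaluation described above can be illustrated with a minimal accuracy-scoring harness. This is a hypothetical sketch, not the authors' code: the item format, the `ask_model` callable, and all names below are assumptions introduced for illustration.

```python
# Hypothetical sketch of multiple-choice knowledge-test scoring.
# The item dict format and ask_model() interface are assumptions,
# not taken from the article.

def score_knowledge_test(items, ask_model):
    """Return a model's accuracy on multiple-choice items.

    items: list of dicts with 'question', 'options' (list of 4 strings),
           and 'answer' (a letter 'A'-'D')  -- assumed format
    ask_model: callable mapping a prompt string to the model's chosen letter
    """
    correct = 0
    for item in items:
        # Render the question and lettered options into one prompt.
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}. {text}"
            for letter, text in zip("ABCD", item["options"])
        )
        if ask_model(prompt).strip().upper() == item["answer"]:
            correct += 1
    return correct / len(items) if items else 0.0

# Usage with a stand-in "model" that always answers A:
items = [
    {"question": "Q1", "options": ["w", "x", "y", "z"], "answer": "A"},
    {"question": "Q2", "options": ["w", "x", "y", "z"], "answer": "B"},
]
print(score_knowledge_test(items, lambda prompt: "A"))  # 0.5
```

A real harness would also need prompt templates per model, answer-extraction logic for free-form replies, and per-dataset adapters (e.g. for Dreaddit or SDCNL labels), but the accuracy computation itself reduces to this loop.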

Keywords: Large language models, Model evaluation, Mental Health, Knowledge test, Illness diagnosis

Received: 14 Jun 2025; Accepted: 14 Jul 2025.

Copyright: © 2025 Xu, Fang, Lin, Jiang, Jin, Balaji, Wang and Xia. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Ting Xia, Shaoxing University, Shaoxing, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.