ORIGINAL RESEARCH article
Front. Oral Health
Sec. Oral Health Promotion
This article is part of the Research Topic: Cutting-Edge Technologies in Digital Dentistry
Accuracy and Reliability of Manus, ChatGPT, and Claude in Case-Based Dental Diagnosis
Provisionally accepted
1 University of Hail College of Dentistry, Hail, Saudi Arabia
2 Queen Mary University of London, London, United Kingdom
Background: Artificial intelligence (AI), particularly large language models (LLMs), is transforming healthcare education and clinical decision-making. While models such as ChatGPT and Claude have demonstrated utility in medical contexts, their performance in dental diagnostics remains underexplored, and the potential of emerging platforms such as Manus has yet to be evaluated.
Objective: To compare the diagnostic accuracy and consistency of ChatGPT, Claude, and Manus using authentic, case-based dental scenarios.
Methods: A set of 117 multiple-choice questions based on validated clinical dental vignettes spanning various specialities was administered to each model under standardised conditions at two separate time points. Responses were scored against expert-validated answer keys. Intra-model reliability across the two administrations was assessed using Cohen's kappa, and statistical comparisons were made using chi-square, McNemar, and t-tests.
Results: Claude and Manus consistently outperformed ChatGPT across both testing phases. In the second round, Claude and Manus each achieved a diagnostic accuracy of 92.3%, compared with ChatGPT's 76.9%. Claude and Manus also demonstrated higher intra-model consistency (Cohen's kappa = 0.714 and 0.782, respectively) than ChatGPT (kappa = 0.560). Although the numerical trends favoured Claude and Manus, pairwise differences in accuracy did not reach statistical significance.
Conclusion: Claude and Manus demonstrated numerically higher diagnostic performance and greater response stability than ChatGPT; however, these differences did not reach statistical significance and should therefore be interpreted cautiously. This variability across models highlights the need for larger-scale evaluations. These findings underscore the importance of considering both accuracy and consistency when selecting AI tools for integration into dental practice and curricula.
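For readers who want to reproduce this style of analysis, the sketch below is a minimal illustration, not the authors' code: the per-question results are simulated placeholders, and the library choices (scikit-learn, SciPy, statsmodels) are assumptions. It shows how diagnostic accuracy, intra-model consistency via Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), McNemar's test on paired round-1/round-2 outcomes, and a chi-square comparison of two models' accuracies could be computed.

```python
# Minimal analysis sketch (not the study's actual code or data).
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import chi2_contingency
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_items = 117  # number of case-based MCQs, as in the study

# Hypothetical correctness vectors (True = correct) for one model
# answering the same 117 questions at two time points.
round1 = rng.random(n_items) < 0.90
round2 = rng.random(n_items) < 0.92

# Diagnostic accuracy per round.
print(f"Round 1 accuracy: {round1.mean():.1%}")
print(f"Round 2 accuracy: {round2.mean():.1%}")

# Intra-model consistency across the two rounds (Cohen's kappa).
print(f"Cohen's kappa: {cohen_kappa_score(round1, round2):.3f}")

# McNemar's test on the paired round-1/round-2 outcomes per item.
paired = np.array([
    [np.sum(round1 & round2),  np.sum(round1 & ~round2)],
    [np.sum(~round1 & round2), np.sum(~round1 & ~round2)],
])
print(mcnemar(paired, exact=True))

# Chi-square test comparing two models' correct/incorrect counts
# in one round (e.g. 92.3% vs 76.9% of 117 items).
model_a_correct, model_b_correct = 108, 90
contingency = np.array([
    [model_a_correct, n_items - model_a_correct],
    [model_b_correct, n_items - model_b_correct],
])
chi2, p, dof, _ = chi2_contingency(contingency)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

McNemar's test is used for the within-model round-1 versus round-2 comparison because the same items are answered twice (paired data), while the chi-square test suits the between-model comparison of independent correct/incorrect counts.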
Keywords: artificial intelligence, large language models, clinical decision-making, Manus AI, ChatGPT, Claude, intra-model consistency, dental education
Received: 14 Aug 2025; Accepted: 09 Dec 2025.
Copyright: © 2025 Madfa, Alshammari, Anazi, Alenezi and Alkurdi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Ahmed A. Madfa
Abdullah F. Alshammari
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
