Brief Research Report article
Front. Digit. Health
Sec. Ethical Digital Health
This article is part of the Research Topic: Ethical Considerations of Large Language Models: Challenges and Best Practices
Large Language Model Bias Auditing for Periodontal Diagnosis Using an Ambiguity-Probe Methodology: A Pilot Study
Provisionally accepted
Teerachate Nantakeeratipat, Faculty of Dentistry, Srinakharinwirot University, Bangkok, Thailand
Background: The use of Large Language Models (LLMs) in healthcare holds immense promise yet carries the risk of perpetuating social biases. While artificial intelligence (AI) fairness is a growing concern, a gap remains in understanding how these models perform under conditions of clinical ambiguity, a common feature of real-world practice.

Methods: We conducted a study using an ambiguity-probe methodology with a set of 42 sociodemographic personas and 15 clinical vignettes based on the 2018 classification of periodontal diseases. Ten vignettes were clear-cut scenarios with established ground truths, while five were intentionally ambiguous. OpenAI's GPT-4o and Google's Gemini 2.5 Pro were prompted to provide periodontal stage and grade assessments across 630 vignette-persona combinations per model.

Results: In clear-cut scenarios, GPT-4o demonstrated significantly higher combined (stage and grade) accuracy (70.5%) than Gemini 2.5 Pro (33.3%). However, a robust fairness analysis using cumulative link models with false discovery rate correction revealed no statistically significant sociodemographic bias in either model. This finding held across both clear-cut and ambiguous clinical scenarios.

Conclusion: To our knowledge, this is among the first studies to use simulated clinical ambiguity to reveal the distinct ethical fingerprints of LLMs in a dental context. While LLM performance gaps exist, our analysis decouples accuracy from fairness, demonstrating that both models maintain sociodemographic neutrality. The observed errors reflect not sociodemographic bias but diagnostic boundary instability, highlighting a critical need for future research to differentiate between these two distinct types of model failure in order to build genuinely reliable AI.
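For readers unfamiliar with the statistical approach named in the abstract, the following is a minimal Python sketch of a fairness test that pairs a cumulative link (proportional-odds ordinal) model with Benjamini-Hochberg false discovery rate correction. It runs on synthetic data sized to match the study's 630 vignette-persona responses per model (42 personas x 15 vignettes); the column names, persona attributes, and statsmodels-based workflow are illustrative assumptions, not the authors' actual pipeline.

    # Sketch: ordinal fairness audit with FDR correction (illustrative only;
    # not the authors' code). Requires numpy, pandas, and statsmodels.
    import numpy as np
    import pandas as pd
    from statsmodels.miscmodels.ordinal_model import OrderedModel
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)

    # Synthetic stand-in for one LLM's 630 responses (42 personas x 15
    # vignettes): an ordinal periodontal stage (I-IV) plus two hypothetical
    # sociodemographic persona attributes.
    n = 630
    df = pd.DataFrame({
        "stage": pd.Categorical(rng.integers(1, 5, n), ordered=True),
        "sex": rng.choice(["female", "male"], n),
        "income": rng.choice(["low", "middle", "high"], n),
    })

    # Cumulative link (proportional-odds) model: does any sociodemographic
    # attribute shift the distribution of assigned stages?
    exog = pd.get_dummies(df[["sex", "income"]], drop_first=True, dtype=float)
    fit = OrderedModel(df["stage"], exog, distr="logit").fit(
        method="bfgs", disp=False)

    # Benjamini-Hochberg FDR correction over the attribute coefficients only,
    # excluding the model's threshold (cut-point) parameters.
    pvals = fit.pvalues[exog.columns]
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    print(pd.DataFrame({"p_raw": pvals, "p_fdr": p_adj, "significant": reject}))

With responses unrelated to the persona attributes, no coefficient should survive correction, which mirrors the paper's null finding of no statistically significant sociodemographic bias; in a real audit, one such model would be fitted per outcome (stage and grade) and per LLM.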
Keywords: Large language models, AI Bias, Clinical Ambiguity, Dental Informatics, GPT-4o, Gemini Pro, Ethical Auditing
Received: 18 Aug 2025; Accepted: 08 Dec 2025.
Copyright: © 2025 Nantakeeratipat. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Teerachate Nantakeeratipat
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.