
BRIEF RESEARCH REPORT article

Front. Digit. Health

Sec. Ethical Digital Health

This article is part of the Research Topic: Ethical Considerations of Large Language Models: Challenges and Best Practices.

Uncovering Bias and Variability in How Large Language Models Attribute Cardiovascular Risk

Provisionally accepted
  • 1School of Clinical Medicine, University of Cambridge, Cambridge, United Kingdom
  • 2Nanyang Technological University, Singapore, Singapore

The final, formatted version of the article will be published soon.

Large language models (LLMs) are increasingly used in medicine, but their decision-making in cardiovascular risk attribution remains underexplored. This pilot study examined how an LLM apportioned relative cardiovascular risk across different demographic and clinical domains. A structured prompt set was developed across six domains: general cardiovascular risk, body mass index (BMI), diabetes, depression, smoking, and hyperlipidaemia, and each prompt was submitted in triplicate to ChatGPT 4.0 mini. For each domain, a neutral prompt assessed the LLM's risk attribution, while paired comparative prompts examined whether including the domain changed the LLM's judgement of which demographic group was at higher risk. The LLM attributed higher cardiovascular risk to men than to women, and to Black rather than white patients, across most neutral prompts. In comparative prompts, the LLM's sex-based judgement changed in two of six domains: when depression was included, risk attribution became equal between men and women; for smoking, the attribution shifted from females being at higher risk in scenarios without smoking to males being at higher risk when smoking was present. In contrast, race-based judgements of relative risk were stable across domains, with the LLM consistently attributing higher risk to Black patients. Agreement across repeated runs was strong (ICC 0.949, 95% CI: 0.819-0.992, p < 0.001). The LLM exhibited bias and variability across cardiovascular risk domains. Although sex-based judgements sometimes changed when comorbidities were included, race-based judgements remained the same. This pilot study suggests that careful evaluation of LLM clinical decision-making is needed to avoid reinforcing inequities.
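For readers who wish to see how such a protocol could be replicated in outline, the minimal sketch below shows triplicate submission of a prompt to a chat-completion endpoint and a two-way intraclass correlation across runs. It is not the authors' code: it assumes the OpenAI Python client and the pingouin package, and the model name, prompts, and risk scores are illustrative placeholders rather than the study's materials.

```python
# Illustrative sketch only: triplicate prompting and run-to-run agreement (ICC).
# Assumptions: OpenAI Python client, pandas, pingouin; "gpt-4o-mini" is a
# placeholder model name, and the scores below are invented examples.
from openai import OpenAI
import pandas as pd
import pingouin as pg

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str, runs: int = 3) -> list[str]:
    """Submit the same prompt `runs` times and collect the raw replies."""
    replies = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder for the model under study
            messages=[{"role": "user", "content": prompt}],
        )
        replies.append(resp.choices[0].message.content)
    return replies


# Hypothetical risk scores extracted from three repeated runs per domain,
# arranged in long form for the intraclass correlation calculation.
scores = pd.DataFrame({
    "domain": ["bmi", "bmi", "bmi",
               "smoking", "smoking", "smoking",
               "diabetes", "diabetes", "diabetes"],
    "run":    [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "risk":   [0.62, 0.60, 0.63, 0.71, 0.70, 0.72, 0.68, 0.69, 0.67],
})

icc = pg.intraclass_corr(data=scores, targets="domain",
                         raters="run", ratings="risk")
print(icc[["Type", "ICC", "CI95%", "pval"]])
```

The reported ICC of 0.949 in the abstract would correspond to one row of such a table; which ICC form (e.g. absolute agreement vs. consistency) the study used is not stated here, so the sketch simply prints all forms returned by pingouin.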

Keywords: large language model, cardiovascular risk, gender equality, bias, artificial intelligence

Received: 22 Sep 2025; Accepted: 24 Nov 2025.

Copyright: © 2025 Tin Nok Chan and Kwek. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Justine Tin Nok Chan

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.