ORIGINAL RESEARCH article

Front. Pharmacol.

Sec. Drugs Outcomes Research and Policies

This article is part of the Research Topic: AI in Healthcare: Transforming Clinical Risk Prediction, Medical Large Language Models, and Beyond

Large language models management of complex medication regimens: a case-based evaluation

Provisionally accepted
Aaron Chase1, Amoreena Most2, Shaochen Xu2, Erin Barreto3, Brian Murray4, Kelli Henry5, Susan Smith2, Tanner Hedrick6, XianYan Chen2, Sheng Li7, Tianming Liu2 and Andrea Sikora4,8*
  • 1University of Georgia, Tbilisi, Georgia
  • 2University of Georgia, Athens, Georgia, United States
  • 3Mayo Clinic, Rochester, Minnesota, United States
  • 4University of Colorado, Denver, Colorado, United States
  • 5WellStar Health System, Atlanta, Georgia, United States
  • 6University of North Carolina Hospitals, Chapel Hill, North Carolina, United States
  • 7University of Virginia, Charlottesville, Virginia, United States
  • 8College of Pharmacy, University of Georgia, Athens, United States

The final, formatted version of the article will be published soon.

Background: Large language models (LLMs) have shown the ability to diagnose complex medical cases, but few studies have evaluated their performance in developing evidence-based treatment plans. The purpose of this evaluation was to test four LLMs on their ability to develop safe and efficacious treatment plans for complex patients managed in the intensive care unit (ICU).

Methods: Eight high-fidelity patient cases focused on medication management were developed by critical care clinicians; each case included the history of present illness, laboratory values, vital signs, home medications, and current medications. Four LLMs [ChatGPT (GPT-3.5), ChatGPT (GPT-4), Claude-2, and Llama-2-70b] were prompted to develop an optimized medication regimen for each case. The LLM-generated medication regimens were then reviewed by a panel of seven critical care clinicians to assess safety and efficacy, defined by the medication errors identified and the appropriateness of treatment for the clinical conditions. Appropriateness of treatment was measured as the average rate of clinician agreement to continue each medication in the regimen and was compared across LLMs using analysis of variance (ANOVA).

Results: Clinicians identified a median of 4.1 to 6.9 medication errors per recommended regimen, and life-threatening medication recommendations were present in 16.3% to 57.1% of the regimens, depending on the LLM. Clinicians continued LLM-recommended medications at a rate of 54.6% to 67.3%; GPT-4 had the highest rate of medication continuation among all LLMs tested (p < 0.001) and the lowest rate of life-threatening medication errors (p < 0.001).

Conclusions: Caution is warranted in using present LLMs to manage medication regimens, given the number of medication errors identified in this pilot study. However, the LLMs did demonstrate potential to serve as clinical decision support for the management of complex medication regimens, although domain-specific prompting and further testing are needed.
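As a minimal sketch of the statistical comparison described in the Methods, the snippet below shows how a one-way ANOVA across the four models could be run in Python. The per-case continuation rates here are hypothetical placeholders, not the study's data; the real inputs would be the clinician agreement rates for each LLM-generated regimen.

    # Minimal sketch of the ANOVA comparison described in the Methods.
    # The continuation rates below are hypothetical placeholders, NOT the
    # study's data; real inputs would be the clinician agreement rates for
    # each LLM-generated regimen (one value per case per model).
    from scipy.stats import f_oneway

    # Hypothetical per-case rates of clinician agreement to continue
    # medications, one list per LLM (eight cases each).
    continuation_rates = {
        "GPT-3.5":     [0.55, 0.60, 0.52, 0.58, 0.54, 0.57, 0.53, 0.56],
        "GPT-4":       [0.68, 0.65, 0.70, 0.66, 0.67, 0.69, 0.64, 0.71],
        "Claude-2":    [0.58, 0.61, 0.57, 0.60, 0.59, 0.62, 0.56, 0.63],
        "Llama-2-70b": [0.54, 0.52, 0.56, 0.53, 0.55, 0.51, 0.57, 0.50],
    }

    # One-way ANOVA: does the mean continuation rate differ across the four LLMs?
    f_stat, p_value = f_oneway(*continuation_rates.values())
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

A significant p-value from this test would indicate that at least one model's mean continuation rate differs from the others; pairwise post-hoc comparisons would then be needed to single out a model such as GPT-4.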

Keywords: Large Language Model, artificial intelligence, Pharmacy, Medication regimen complexity, Natural language processing (NLP)

Received: 13 Mar 2025; Accepted: 05 Nov 2025.

Copyright: © 2025 Chase, Most, Xu, Barreto, Murray, Henry, Smith, Hedrick, Chen, Li, Liu and Sikora. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Andrea Sikora, andrea.sikora@cuanschutz.edu

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.