ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

AI for Evidence-based Treatment Recommendation in Oncology: A Blinded Evaluation of Large Language Models and Agentic Workflows

Provisionally accepted
Guannan Zhai (1), Merav Bar (2), Andrew Cowan (3, 4), Samuel Rubinstein (5), Qian Shi (6), Ningjie Zhang (7), En Xie (8), Will Ma (8)*
  • 1 The George Washington University Department of Statistics, Washington, United States
  • 2 Bristol Myers Squibb, New Jersey, United States
  • 3 Fred Hutchinson Cancer Center Clinical Research Division, Seattle, United States
  • 4 The University of Alabama at Birmingham Division of Hematology and Oncology, Birmingham, United States
  • 5 The University of North Carolina at Chapel Hill Division of Hematology, Chapel Hill, United States
  • 6 Mayo Clinic Department of Quantitative Health Sciences, Rochester, United States
  • 7 Shanghai Jiao Tong University Department of Bioinformatics and Biostatistics, Shanghai, China
  • 8 Hope AI, Inc, Princeton, NJ, United States

The final, formatted version of the article will be published soon.

Background: Evidence-based medicine is crucial for clinical decision-making, yet studies suggest that a significant proportion of treatment decisions do not fully incorporate the latest evidence. Large Language Models (LLMs) show promise in bridging this gap, but their reliability for medical recommendations remains uncertain.

Method: We conducted an evaluation study comparing five LLMs' recommendations across 50 clinical scenarios related to multiple myeloma diagnosis, staging, treatment, and management, using a unified evidence cutoff of June 2024. The evaluation included three general-purpose LLMs (OpenAI o1-preview, Claude 3.5 Sonnet, Gemini 1.5 Pro), one retrieval-augmented generation (RAG) system (Myelo), and one agentic workflow-based system (HopeAI). General-purpose LLMs generated responses based solely on their internal knowledge, while the RAG system enhanced these capabilities by incorporating external knowledge retrieval. The agentic workflow system extended the RAG approach by implementing multi-step reasoning and coordinating with multiple tools and external systems for complex task execution. Three independent hematologist-oncologists evaluated the LLM-generated responses using standardized scoring criteria developed specifically for this study. Performance assessment encompassed five dimensions: accuracy, relevance, comprehensiveness, hallucination rate, and clinical use readiness.

Results: HopeAI demonstrated superior performance across accuracy (82.0%), relevance (85.3%), and comprehensiveness (74.0%), compared to OpenAI o1-preview (64.7%, 57.3%, 36.0%), Claude 3.5 Sonnet (50.0%, 51.3%, 29.3%), Gemini 1.5 Pro (48.0%, 46.0%, 30.0%), and Myelo (58.7%, 56.0%, 32.7%). Hallucination rates were consistently low across all systems: HopeAI (5.3%), OpenAI o1-preview (3.3%), Claude 3.5 Sonnet (10.0%), Gemini 1.5 Pro (8.0%), and Myelo (5.3%). Clinical use readiness scores were relatively low for all systems: HopeAI (25.3%), OpenAI o1-preview (6.0%), Claude 3.5 Sonnet (2.7%), Gemini 1.5 Pro (4.0%), and Myelo (4.0%).

Conclusion: This study demonstrates that while current LLMs show promise in medical decision support, their recommendations require careful clinical supervision to ensure patient safety and optimal care. Further research is needed to improve their clinical use readiness before integration into oncology workflows. These findings provide valuable insights into the capabilities and limitations of LLMs in oncology, guiding future research and development efforts toward integrating AI into clinical workflows.
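
For readers who want to compare the reported figures side by side, the following Python sketch tabulates the percentages quoted in the abstract and ranks the systems on a chosen dimension. The percentages are taken from the abstract; the names `REPORTED`, `DIMENSIONS`, `rank_by`, `implied_count`, and `is_consistent` are hypothetical helpers introduced here for illustration. The denominator of 150 (50 scenarios times 3 reviewers, one binary judgment each) is an assumption: the abstract does not state how individual ratings were aggregated into percentages.

```python
# Percentages as reported in the abstract, in the order given by DIMENSIONS.
DIMENSIONS = (
    "accuracy",
    "relevance",
    "comprehensiveness",
    "hallucination",
    "clinical_use_readiness",
)

REPORTED = {
    "HopeAI": (82.0, 85.3, 74.0, 5.3, 25.3),
    "OpenAI o1-preview": (64.7, 57.3, 36.0, 3.3, 6.0),
    "Claude 3.5 Sonnet": (50.0, 51.3, 29.3, 10.0, 2.7),
    "Gemini 1.5 Pro": (48.0, 46.0, 30.0, 8.0, 4.0),
    "Myelo": (58.7, 56.0, 32.7, 5.3, 4.0),
}

# ASSUMPTION (not stated in the abstract): each percentage pools
# 50 scenarios x 3 reviewers = 150 binary judgments.
N_RATINGS = 50 * 3


def rank_by(dimension, lower_is_better=False):
    """Return system names sorted from best to worst on one dimension."""
    idx = DIMENSIONS.index(dimension)
    return sorted(
        REPORTED,
        key=lambda name: REPORTED[name][idx],
        reverse=not lower_is_better,
    )


def implied_count(pct, n=N_RATINGS):
    """Whole number of positive ratings closest to pct under the assumed n."""
    return round(pct / 100 * n)


def is_consistent(pct, n=N_RATINGS, tol=0.05):
    """True if some whole count out of n reproduces pct to one decimal place."""
    return abs(implied_count(pct, n) / n * 100 - pct) <= tol


if __name__ == "__main__":
    print(rank_by("accuracy"))
    print(rank_by("hallucination", lower_is_better=True))
    print(all(is_consistent(p) for row in REPORTED.values() for p in row))
```

Calling `rank_by("accuracy")` places HopeAI first and Gemini 1.5 Pro last, matching the ordering in the Results. Every reported percentage is reproducible as a whole count out of 150, which is compatible with the assumed pooling but does not confirm it; the full article's methods should be consulted for the actual scoring scheme.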

Keywords: Agentic Workflows, Artificial intelligence in medicine, clinical evaluation, Evidence-Based Medicine, Large language models, Multiple Myeloma, Oncology Decision Support, Retrieval-Augmented Generation

Received: 10 Aug 2025; Accepted: 24 Nov 2025.

Copyright: © 2025 Zhai, Bar, Cowan, Rubinstein, Shi, Zhang, Xie and Ma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Will Ma

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.