
ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

This article is part of the Research Topic: GenAI in Healthcare: Technologies, Applications and Evaluation

Improving reliability and accuracy of structured data extraction using a consensus Large-Language Model approach – a use case description in Multiple Sclerosis

Provisionally accepted
Philip Lennart Poser1*, Rafael Klimas1, Justus Luerweg1, Emilie Reuter1, Christoph Hanefeld2, Ralf Karl Gold1, Anke Salmen1, Jeremias Motte1
  • 1Department of Neurology, St. Josef-Hospital, Bochum, Germany
  • 2Department of Internal Medicine, Katholisches Klinikum Bochum gGmbH, Bochum, Germany

The final, formatted version of the article will be published soon.

Background: The absence of standardization in the documentation of routine clinical data complicates large-scale research use of retrospective data. Medically trained personnel are required to interpret the data and convert it into a structured format, making the process time- and cost-intensive and introducing a potential bias into such data. To address these challenges, we developed a semi-automated approach for evaluating Multiple Sclerosis (MS) outpatient reports that uses different large language models (LLMs) and their consensus, compared against manual evaluation.

Methods: We used several commercially available LLMs by OpenAI, Anthropic and Google to extract several variables of differing complexity from 30 anonymized outpatient reports into a structured output using zero-shot learning. We added a consensus output by combining the results of three different LLMs. Over several runs, we adapted the prompt, compared the results with a reference and assessed the error rate. Any deviation from the reference was counted as an error. In addition, a true-error rate, in which only content deviations count as errors, was determined for the LLM consensus output and the output of a neurology specialist.

Results: Across 9 iterations of improving the structure and content of the prompt, we observed a clear reduction in the error rates of the various LLMs. By creating an LLM consensus with the final prompt design, we were able to overcome a ceiling effect in reducing the error rate. With a true-error rate of 1.48%, the LLM consensus shows an error rate similar to that of neurologists (around 2%) in the creation of structured data.

Discussion: Our method enables fast and reliable LLM-based analysis of large clinical routine data sets of varying complexity with a low technical barrier to entry. By generating an LLM consensus, we considerably improved the quality of the output, making it comparable to data created by neurology specialists. This approach allows large amounts of unstructured data to be analyzed in a time- and cost-efficient manner. Nevertheless, the evaluation of errors in results produced by LLMs remains difficult, and scientific work using such methods must continue to be subject to strict testing of the method's validity.
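As an illustrative sketch only (not part of the published article), the consensus step described in the Methods could be approximated as a field-wise majority vote across the structured outputs of three models, with deviations from a manually created reference counted as errors. All function names, field names and values below are hypothetical.

```python
# Minimal sketch of a field-wise LLM consensus and error-rate calculation.
# Hypothetical example; the article's actual variables and pipeline may differ.
from collections import Counter

def consensus(outputs: list[dict]) -> dict:
    """Majority vote per field across structured outputs from different LLMs."""
    result = {}
    for field in outputs[0]:
        values = [o.get(field) for o in outputs]
        value, count = Counter(values).most_common(1)[0]
        # If no two models agree, flag the field for manual review.
        result[field] = value if count >= 2 else "REVIEW"
    return result

def error_rate(extracted: dict, reference: dict) -> float:
    """Share of fields deviating from the reference annotation."""
    errors = sum(extracted.get(f) != reference[f] for f in reference)
    return errors / len(reference)

# Hypothetical outputs for one outpatient report from three LLMs
llm_outputs = [
    {"edss": "3.5", "relapse_last_year": "yes", "therapy": "ocrelizumab"},
    {"edss": "3.5", "relapse_last_year": "no",  "therapy": "ocrelizumab"},
    {"edss": "3.5", "relapse_last_year": "yes", "therapy": "ocrelizumab"},
]
reference = {"edss": "3.5", "relapse_last_year": "yes", "therapy": "ocrelizumab"}

merged = consensus(llm_outputs)
print(merged)                        # {'edss': '3.5', 'relapse_last_year': 'yes', 'therapy': 'ocrelizumab'}
print(error_rate(merged, reference)) # 0.0
```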

Keywords: Data extraction, large language model (LLM), Multiple Sclerosis, Neurology, Real world evidence (RWE), Structured data

Received: 02 Jul 2025; Accepted: 28 Jan 2026.

Copyright: © 2026 Poser, Klimas, Luerweg, Reuter, Hanefeld, Gold, Salmen and Motte. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Philip Lennart Poser

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.