ORIGINAL RESEARCH article

Front. Digit. Health

Sec. Health Informatics

MEDAI-LLM-SUMM: A Reporting Checklist for Medical Text Summarization Studies Using Large Language Models

  • Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies, Moscow, Russia


Abstract

Background: Medical text summarization using large language models (LLMs) reached an inflection point in 2024-2025, with adapted models demonstrating the capability to match or exceed human expert performance on specific tasks. However, critical gaps persist in safety validation, evaluation frameworks, and clinical deployment readiness: a comprehensive review found that only 7% of studies conducted external validation and only 3% performed patient safety assessments, with reported hallucination rates ranging from 1.47% to 61.6%. Existing reporting guidelines, including CONSORT-AI, SPIRIT-AI, TRIPOD-LLM, and DEAL, do not adequately address the specific requirements of medical text summarization tasks.

Objective: To develop MEDAI-LLM-SUMM, the first specialized reporting checklist for research on medical text summarization using LLMs, addressing critical gaps in existing reporting standards.

Methods: A modified iterative consensus approach was employed, comprising three sequential stages: (1) a systematic literature review of 216 publications from PubMed and eLibrary (2023-2025), conducted in accordance with PRISMA guidelines, together with an analysis of existing reporting standards (TRIPOD-LLM, DEAL, CONSORT-AI, SPIRIT-AI, TRIPOD+AI, CLAIM, and STARD-AI); (2) development of an initial 44-item, seven-section checklist by a supervisory group; and (3) three rounds of face-to-face consensus discussions with a multidisciplinary expert panel of 11 specialists (3 radiologists, 2 clinicians, 3 medical informatics experts, 1 biostatistician, and 2 medical LLM developers). The consensus criterion required unanimous agreement from all panel members.

Results: The final MEDAI-LLM-SUMM checklist comprises 24 items organized into six sections: (A) Clinical Validity (4 items addressing clinical task definition, expert involvement, hypothesis formulation, and medical expertise requirements); (B) Model Selection (5 items covering model justification, system requirements, deployment environment, the LLM-as-judge approach, and prompt documentation); (C) Data (3 items on datasets, reference summaries with expert consensus, and data stratification); (D) Quality Assessment (8 items including evaluation metrics, clinical metrics, expert evaluation, hallucination detection, LLM-judge assessment, sample size justification, pilot testing, and limitations documentation); (E) Safety (2 items on ethical approval and data anonymization); and (F) Data Availability (2 items on code and dataset accessibility). A comparative analysis with six existing reporting standards demonstrated that MEDAI-LLM-SUMM uniquely addresses hallucination assessment requirements, reference summary creation methodology, LLM-as-judge validation protocols, and detailed pilot testing specifications.
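To illustrate the kind of quantity the Quality Assessment section asks authors to report, the sketch below shows one minimal way a hallucination rate over a set of audited summaries could be computed. It is not taken from the article: the SummaryAudit structure and hallucination_rate function are hypothetical names, and the claim-level audit they assume stands in for whatever expert- or LLM-as-judge-based verification procedure a given study actually documents.

```python
# Hypothetical illustration (not from the article): computing a claim-level
# hallucination rate, assuming summaries have already been audited by experts
# or a validated LLM judge.

from dataclasses import dataclass

@dataclass
class SummaryAudit:
    summary_id: str
    total_claims: int        # atomic factual claims extracted from the summary
    unsupported_claims: int  # claims not supported by the source document

def hallucination_rate(audits: list[SummaryAudit]) -> float:
    """Fraction of all extracted claims that are unsupported by the source."""
    total = sum(a.total_claims for a in audits)
    unsupported = sum(a.unsupported_claims for a in audits)
    return unsupported / total if total else 0.0

# Example with three audited summaries.
audits = [
    SummaryAudit("s1", total_claims=12, unsupported_claims=1),
    SummaryAudit("s2", total_claims=9,  unsupported_claims=0),
    SummaryAudit("s3", total_claims=15, unsupported_claims=2),
]
print(f"Hallucination rate: {hallucination_rate(audits):.2%}")  # 8.33%
```

A study reporting under the checklist's hallucination detection and LLM-judge assessment items would additionally specify who performed the claim audit and, where an LLM judge was used, how that judge was validated against expert raters.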


Keywords

expert consensus, large language models, medical text summarization, patient safety, reporting guidelines, reproducibility

Received: 05 December 2025

Accepted: 31 January 2026

Copyright

© 2026 Khoruzhaya, Varyukhina, Erizhokov, Blokhin, Reshetnikov, Kodenko, Pamova, Burtsev, Arzamasov, Omelyanskaya, Vladzymyrskyy and Vasilev. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Anna N. Khoruzhaya

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
