ORIGINAL RESEARCH article

Front. Med.

Sec. Healthcare Professions Education

Evaluating the Quality of Large Language Model-Generated Preoperative Patient Education Material: A Comparative Study Across Models and Surgery Types

Provisionally accepted
Junwei Ma1, Yunshan Zhang2, Huifeng Tang2, Xuemei Yi2, Tangsheng Zhong2, Xinyun Li1 and Gang Wang2*
  • 1Jilin University School of Nursing, Changchun, China
  • 2First Affiliated Hospital of Jilin University, Changchun, China

The final, formatted version of the article will be published soon.

Background: Numerous studies have confirmed the effectiveness of large language models (LLMs) as patient education tools; however, these studies primarily relied on asking the models isolated medical questions. To date, no study has comprehensively assessed the quality of complete preoperative patient education materials (PEMs) generated by LLMs across different models and surgical types.

Objective: This study aims to comprehensively assess and compare the quality of complete preoperative PEMs generated by six common LLMs across different surgical types.

Design: A cross-sectional comparative study.

Methods: We prompted six LLMs to generate preoperative PEMs for six distinct surgical types. For each surgical type, the materials were rated for accuracy and completeness on a 5-point scale by three groups of experts from the relevant fields. Two researchers assessed the materials for understandability and actionability using the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P), and for suitability using the Suitability Assessment of Materials (SAM). We also analyzed readability with the Flesch-Kincaid metrics and sentiment with the VADER sentiment analysis tool. Statistical analysis was performed using the Friedman test, followed by Conover's post-hoc test with Bonferroni correction.

Results: Each model showed strengths in different dimensions. All models demonstrated excellent accuracy, understandability, and actionability, with no statistically significant differences between them. In terms of completeness, Grok-4 and Claude-Opus-4 significantly outperformed GPT-4o. For suitability, Claude-Opus-4 performed best and Grok-4 worst. For readability, Grok-4 and Gemini-2.5-Pro produced the most readable text, while Claude-Opus-4 produced the least readable. Moreover, only Gemini-2.5-Pro consistently generated content with a positive sentiment.

Conclusion: The materials generated by these models can reach high quality across multiple dimensions, but no single model excels in all of them. Medical staff can use these models to generate initial drafts of preoperative PEMs; however, the drafts must still be reviewed and supplemented by medical staff before being provided to patients.
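The automated part of the evaluation pipeline (readability, sentiment, and the Friedman/Conover statistics) can be reproduced with standard Python libraries. The sketch below is illustrative only, not the authors' code: the packages (textstat, vaderSentiment, scipy, scikit-posthocs), the score_material helper, and the randomly generated score matrix are all assumptions standing in for the study's real materials and scores.

```python
# Minimal sketch (assumed tooling, not the published pipeline) of the
# automated metrics named in the Methods: Flesch-Kincaid readability,
# VADER sentiment, and a Friedman test with Conover post-hoc comparisons
# under Bonferroni correction.
# Assumes: pip install numpy textstat vaderSentiment scipy scikit-posthocs
import numpy as np
import textstat
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

def score_material(text: str) -> dict:
    """Readability and sentiment scores for one generated PEM (hypothetical helper)."""
    analyzer = SentimentIntensityAnalyzer()
    return {
        "fk_grade": textstat.flesch_kincaid_grade(text),     # US school grade level
        "reading_ease": textstat.flesch_reading_ease(text),  # higher = easier to read
        "sentiment": analyzer.polarity_scores(text)["compound"],  # -1 (neg) .. 1 (pos)
    }

# Hypothetical score matrix: rows = 6 surgical types (blocks),
# columns = 6 LLMs (treatments); values stand in for, e.g., reading-ease scores.
rng = np.random.default_rng(0)
scores = rng.normal(loc=60.0, scale=8.0, size=(6, 6))

# Friedman test across the related samples (one column of scores per model).
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")

# Conover post-hoc pairwise comparisons with Bonferroni correction.
posthoc = sp.posthoc_conover_friedman(scores, p_adjust="bonferroni")
print(posthoc.round(3))
```

Note that this covers only the automated metrics; the expert ratings of accuracy and completeness, and the PEMAT-P and SAM assessments, are manual instruments and have no code equivalent here.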

Keywords: artificial intelligence (AI), large language model (LLM), preoperative education, patient education material (PEM), comparative analysis

Received: 08 Sep 2025; Accepted: 27 Nov 2025.

Copyright: © 2025 Ma, Zhang, Tang, Yi, Zhong, Li and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Gang Wang

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.