ORIGINAL RESEARCH article
Front. Med.
Sec. Healthcare Professions Education
Evaluating the Quality of Large Language Model-Generated Preoperative Patient Education Material: A Comparative Study Across Models and Surgery Types
Provisionally accepted
1 Jilin University School of Nursing, Changchun, China
2 First Affiliated Hospital of Jilin University, Changchun, China
Background: Numerous studies have confirmed the effectiveness of large language models (LLMs) as patient education tools; however, these studies primarily relied on asking individual medical questions. To date, no study has comprehensively assessed the quality of complete preoperative patient education materials (PEMs) generated by LLMs across different models and surgery types.

Objective: This study aims to comprehensively assess and compare the quality of complete preoperative PEMs generated by six common LLMs across different surgery types.

Design: A cross-sectional comparative study.

Methods: We prompted six LLMs to generate preoperative PEMs for six distinct surgery types. For each surgery type, three groups of experts from the relevant fields rated the materials for accuracy and completeness on a 5-point scale. Two researchers assessed the materials for understandability and actionability using the PEMAT-P, and for suitability using the SAM. We also analyzed the materials for readability with the Flesch-Kincaid formulas and for sentiment with the VADER sentiment analysis tool. Statistical analysis was performed using the Friedman test, followed by Conover's post-hoc test with Bonferroni correction.

Results: Each model showed strengths in different dimensions. All models demonstrated excellent accuracy, understandability, and actionability, with no statistically significant differences among them. In completeness, Grok-4 and Claude-Opus-4 significantly outperformed GPT-4o. In suitability, Claude-Opus-4 performed best, while Grok-4 performed worst. In readability, Grok-4 and Gemini-2.5-Pro were the easiest to understand, while Claude-Opus-4 scored lowest. Moreover, only Gemini-2.5-Pro consistently generated content with a positive sentiment.

Conclusion: The materials generated by these models can reach high quality across multiple dimensions, but no single model excels in all of them. Medical staff can use these models to generate initial drafts of preoperative PEMs; however, the drafts must still be reviewed and supplemented by medical staff before being provided to patients.
Keywords: artificial intelligence (AI), large language model (LLM), preoperative education, patient education material (PEM), comparative analysis
Received: 08 Sep 2025; Accepted: 27 Nov 2025.
Copyright: © 2025 Ma, Zhang, Tang, Yi, Zhong, Li and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Gang Wang
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
