AUTHOR=Hang Hao , Yang Liankai , Wang Zhongjie , Lin Zhebing , Li Pengchong , Zhu Jiayue , Liu Rang , Pu Shuai , Cheng Xinghua 

TITLE=Comparative analysis of accuracy and completeness in standardized database generation for complex multilingual lung cancer pathological reports: large language model-based assisted diagnosis system vs. DeepSeek, GPT-3.5, and healthcare professionals with varied professional titles, with task load variation assessment among medical staff

JOURNAL=Frontiers in Medicine

VOLUME=Volume 12 - 2025

YEAR=2025

URL=https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2025.1618858

DOI=10.3389/fmed.2025.1618858

ISSN=2296-858X

ABSTRACT=BackgroundThis study evaluates how AI enhances EHR efficiency by comparing a lung cancer-specific LLM with general-purpose models (DeepSeek, GPT-3.5) and clinicians across expertise levels, assessing accuracy and completeness in complex lung cancer pathology documentation and task load changes pre−/post-AI implementation.MethodsThis study analyzed 300 lung cancer cases (Shanghai Chest Hospital) and 60 TCGA cases, split into training/validation/test sets. Ten clinicians (varying expertise) and three AI models (GPT-3.5, DeepSeek, lung cancer-specific LLM) generated pathology reports. Accuracy/completeness were evaluated against LeapFrog/Joint Commission/ACS standards (non-parametric tests); task load changes pre/post-AI implementation were assessed via NASA-TLX (paired t-tests, p < 0.05).ResultsThis study analyzed 1,390 structured pathology databases: 1,300 from 100 Chinese cases (generated by 10 clinicians and three LLMs) and 90 from 30 TCGA English reports. The lung cancer-specific LLM outperformed nurses, residents, interns, and general AI models (DeepSeek, GPT-3.5) in lesion/lymph node analysis and pathology extraction for Chinese records (p < 0.05), with total scores slightly below chief physicians. In English reports, it matched mainstream AI in lesion analysis (p > 0.05) but excelled in lymph node/pathology metrics (p < 0.05). Task load scores decreased by 38.3% post-implementation (413.90 ± 78.09 vs. 255.30 ± 65.50, t = 26.481, p < 0.001).ConclusionThe fine-tuned lung cancer LLM outperformed non-chief physicians and general LLMs in accuracy/completeness, significantly reduced medical staff workload (p < 0.001), with future optimization potential despite current limitations.