- Department of Oncology, The Second Hospital of Dalian Medical University, Dalian, Liaoning, China
Background: Organoids have become central platforms in precision oncology and translational research, increasing the need for communication that is accurate, transparent, and clinically responsible. Large language models (LLMs) are now widely consulted for organoid-related explanations, but their ability to balance readability, scientific rigor, and educational suitability has not been systematically established.
Methods: Five mainstream LLMs (GPT-5, DeepSeek, Doubao, Tongyi Qianwen, and Wenxin Yiyan) were systematically evaluated using a curated set of thirty representative organoid-related questions. For each model, twenty outputs were independently scored using the C-PEMAT-P scale, the Global Quality Score (GQS), and seven validated readability indices. Between-model differences were analyzed using one-way ANOVA or Kruskal–Wallis tests, and correlation analyses were performed to examine associations between readability and quality measures.
Results: Model performance differed markedly, with GPT-5 achieving the highest C-PEMAT and GQS scores (16.05 ± 1.10; 4.70 ± 0.47; both P < 0.001), followed by intermediate performance from DeepSeek and Doubao (C-PEMAT 11.75 ± 2.07 and 12.05 ± 1.82; GQS 3.65 ± 0.49 and 3.35 ± 0.49). Tongyi Qianwen and Wenxin Yiyan comprised the lowest-performing tier (C-PEMAT 7.85 ± 1.09 and 9.00 ± 2.05; GQS 1.55 ± 0.51 and 2.10 ± 0.55). Score-distribution patterns further highlighted reliability gaps, with GPT-5 showing tightly clustered values and domestic models displaying broader dispersion and unstable performance. Readability differed significantly across models and question categories, with safety-related, diagnostic, and technical questions showing the highest linguistic and conceptual complexity. Correlation analyses showed strong internal coherence among readability indices but only weak-to-moderate associations with C-PEMAT, GQS, and reliability metrics, indicating that linguistic simplicity is not a dependable surrogate for scientific quality.
Conclusion: LLMs exhibited substantial variability in communicating organoid-related information, forming distinct performance tiers with direct implications for patient education and translational decision-making. Because readability, scientific quality, and reliability diverged across models, linguistic simplification alone is insufficient to guarantee accurate or dependable interpretation. These findings underscore the need for organoid-adapted AI systems that integrate domain-specific knowledge, convey uncertainty transparently, ensure output reliability, and safeguard safety-critical information.
1 Introduction
Organoid technology has rapidly evolved into a core platform in modern bioengineering and translational oncology, providing three-dimensional systems that more faithfully recapitulate human tissue architecture, lineage dynamics and treatment response than traditional models (Xia et al., 2019; Verstegen et al., 2025). These models now support a broad array of precision-oncology applications, including drug screening, toxicity evaluation, host–microbe interaction research, and early-phase therapeutic development (Abdel-Rehim et al., 2025). As their use expands from specialized laboratories to multi-center translational pipelines and early clinical testing, the demand for communication that is accurate, accessible, and contextualized to experimental and clinical realities has intensified (Wang et al., 2024; Wang D. et al., 2025). Yet organoid science remains conceptually complex and operationally heterogeneous, creating persistent challenges for users who often turn to online resources to navigate this rapidly advancing field.
Large language models (LLMs) now function as major intermediaries in biomedical communication and are increasingly consulted for organoid-related information, from basic definitions to culture systems, drug-testing workflows and safety considerations (Chen et al., 2025; Sandmann et al., 2025). Their ability to deliver fluent, structured, and seemingly authoritative explanations positions them as promising tools for bridging knowledge gaps among diverse user groups. Yet studies in oncology, rheumatology, and dermatology show that LLMs often fail to balance mechanistic accuracy with appropriate uncertainty disclosure and safety framing (Venerito et al., 2023). They may overlook key distinctions between organoids and genetic testing, downplay sampling risks, misrepresent predictive validity, or omit essential caveats related to assay limitations (Jensen and Little, 2023). Furthermore, LLM outputs can vary substantially across prompts, sessions, and domains, raising concerns not only about accuracy but also about the reliability and stability of generated explanations—an issue of particular relevance in organoid science, where misunderstandings regarding culture conditions, lineage stability, translational readiness, or discordant results may shape experimental decisions, therapeutic choices, financial planning, and patient expectations (Puschhof et al., 2021; Wang Q. et al., 2025).
Although LLMs are increasingly incorporated into laboratory workflows, clinical counselling, and public-facing biomedical communication, their performance specific to organoid science remains poorly characterized. Key uncertainties persist: whether LLMs can accurately articulate core biological principles such as niche dependence and self-organization; whether they provide appropriately cautious interpretations of drug-response data, hereditary risk, and safety considerations; whether their outputs are consistent and reliable across similar queries; whether they correctly frame the clinical utility, turnaround time, and financial aspects of organoid testing; and whether their language is sufficiently readable and actionable for users with diverse scientific literacy levels (Grippaudo et al., 2024; Steyvers et al., 2025). In the absence of systematic evaluation, the safety and reliability of LLM-mediated organoid communication cannot be assured.
To fill this gap, we conducted a systematic, multi-dimensional benchmarking analysis of five widely used LLMs using thirty representative organoid-related questions spanning five practical domains: Technical Cognition, Diagnostic and Therapeutic Value, Safety Concerns, Cost and Process, and Decision Reference. Model outputs were assessed using validated patient-education suitability metrics (C-PEMAT-P), a global scientific quality score, and seven established readability indices, alongside inter-rater consistency measures that enabled evaluation of output reliability (Tam et al., 2024). This framework allowed us to disentangle intrinsic model-level performance differences from the domain-specific communication challenges inherent to organoid-related information.
By mapping how contemporary LLMs interpret, simplify, and at times distort organoid science, this study delivers the first comprehensive and evidence-based evaluation of AI-mediated communication in this high-complexity biomedical domain (Liu Y. et al., 2025). Importantly, it reveals that readability, scientific quality, and reliability diverge substantially across models, highlighting where LLMs can responsibly contribute to knowledge dissemination, where they introduce risks requiring caution, and how next-generation domain-adapted systems and governance frameworks should be constructed to align AI-generated explanations with the conceptual, ethical, and translational demands of organoid research, clinical decision-making, and public communication. Based on differences in model architecture, training strategies, and domain exposure, we hypothesized that large language models would exhibit systematic performance stratification rather than uniform capability in communicating organoid-related concepts. We further anticipated that linguistic readability would not align consistently with scientific quality or reliability, reflecting a structural dissociation between surface accessibility and mechanistic fidelity. Finally, we expected that questions involving safety considerations and therapeutic interpretation would pose greater challenges than descriptive or logistical queries, given their reliance on multi-step reasoning, uncertainty handling, and clinically bounded inference.
2 Materials and methods
2.1 Ethical considerations
All data used in this study were generated by LLMs and did not involve human participants, patient-identifiable information, biological specimens, or animal experiments. No content was obtained from clinical records, and no interventions were performed. In accordance with institutional and international academic standards, research based solely on publicly accessible AI-generated data does not require ethical approval.
2.2 Research procedure
Three specialists in organoid biology and translational oncology designed a structured set of 30 representative questions to capture the practical information needs surrounding organoid technology. Question development was informed by authoritative literature, laboratory training materials, and recurrent inquiries from patients, clinicians, and early-career researchers. After several rounds of refinement, the questions were consolidated and classified into five domains: Technical Cognition, Diagnostic and Therapeutic Value, Safety Concerns, Cost and Process, and Decision Reference, as shown in Table 1. Each question was then submitted verbatim to five widely accessible LLMs within a fixed 2-day period. To approximate real-world user behavior, in which only the initial response is usually consulted, a new session was initiated for each query and no follow-up prompts, clarifications, or optimization strategies were provided; when multiple responses were generated, the first complete answer was selected, while acknowledging that alternative sampling strategies could capture within-model variability. All outputs were compiled into a standardized dataset, anonymized, and assigned randomized identifiers; any metadata that could reveal the model’s identity was removed prior to evaluation. The resulting corpus served as a domain-specific benchmark that blinded expert reviewers assessed across three predefined dimensions: readability, reliability, and scientific quality. The five large language models evaluated in this study were selected because they are among the most widely accessible and commonly consulted systems for biomedical information in real-world settings, collectively representing both internationally deployed and regionally dominant platforms. This selection prioritized ecological validity and generalizability, enabling a focused assessment of how commonly used LLMs communicate organoid-related concepts in practice.
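To make the blinding and randomization procedure concrete, the following minimal sketch illustrates one way such a corpus could be anonymized; the file name, field names, and example responses are assumptions for illustration and do not reproduce the study’s actual pipeline.

```python
import csv
import random
import uuid

# Hypothetical list of (model, question_id, response_text) tuples collected from the chat sessions
responses = [
    ("ModelA", "Q01", "An organoid is a three-dimensional culture..."),
    ("ModelB", "Q01", "Organoids are miniaturized tissue models..."),
]

random.shuffle(responses)  # remove any ordering that could hint at the source model

# The model-to-identifier key is kept separate from the file given to blinded reviewers
key = {}
with open("blinded_responses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["response_id", "question_id", "text"])
    for model, qid, text in responses:
        rid = uuid.uuid4().hex[:8]   # randomized identifier
        key[rid] = model
        writer.writerow([rid, qid, text])
```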
2.3 Readability evaluation
We employed multiple formulas from the Text Readability Assessment Tool (http://readabilityformulas.com/) to quantitatively assess the readability of LLM-generated responses. Because no single optimal metric has been established and no universally accepted gold standard exists for biomedical text readability, we applied a set of widely used indices that have been consistently adopted in prior research.
The following metrics were calculated for each response (Table 2): the Coleman–Liau Index (CLI), Linsear Write formula (LW), Automated Readability Index (ARI), Simple Measure of Gobbledygook (SMOG), Gunning Fog Index (GFOG), Flesch Reading Ease Score (FRES), and Flesch–Kincaid Grade Level (FKGL) (Özduran and Hanci, 2022; Yilmaz Hanci, 2023; Hanci et al., 2024). Each index emphasizes distinct linguistic features, including sentence length, word length, syllabic complexity, and lexical density, so this multi-metric approach captures complementary dimensions of language difficulty and provides estimates of how closely model-generated language approximates standard written English and how comprehensible it is for non-expert readers. All indices were computed on the unedited raw outputs using identical software settings to ensure objective, model-agnostic comparison. To preserve the integrity of model-generated linguistic features, no manual correction, segmentation, or text cleaning was performed.
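As an illustration of how these indices are derived, the minimal sketch below computes two of them, FRES and FKGL, directly from their published formulas using a naive syllable heuristic; it is not the implementation used by the online tool in this study, whose exact counting rules may differ.

```python
import re

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic; dedicated tools use more refined syllable counting."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability_scores(text: str) -> dict:
    """Return Flesch Reading Ease (FRES) and Flesch-Kincaid Grade Level (FKGL)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)          # average words per sentence
    spw = syllables / len(words)               # average syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw  # higher values = easier text
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # approximate US school grade level
    return {"FRES": round(fres, 1), "FKGL": round(fkgl, 1)}

print(readability_scores("Organoids are three-dimensional cultures. "
                         "They recapitulate tissue architecture and drug response."))
```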
2.4 Reliability and quality assessment
This study used the C-PEMAT-P scale and the Global Quality Score (GQS) to evaluate the comprehensibility, actionability, and scientific quality of LLM-generated responses. Prior to formal scoring, the two evaluators underwent a calibration process using a subset of representative responses to ensure consistent interpretation of the scoring criteria. All responses were then independently scored by both reviewers. Inter-rater agreement was assessed using Cohen’s kappa coefficient, and discrepancies were resolved through discussion and adjudication by a third senior reviewer. This workflow was designed to ensure scoring consistency, reliability, and reproducibility across evaluators. The C-PEMAT-P includes 24 binary-scored items across two domains: Comprehensibility (16 items), which assesses logical structure, clarity of biological explanations, terminological accuracy, and sufficiency of background context; and Actionability (8 items), which evaluates whether the text provides specific, usable guidance, appropriately framed safety information, and content aligned with user needs (Gunduz et al., 2024). Total scores range from 0 to 24, with higher scores indicating greater suitability for patient education.
The GQS provides a global qualitative rating of content accuracy, coherence, depth, and practical relevance using a five-point scale, where scores from 1 to 5 correspond to poor, weak, moderate, good, and excellent scientific quality, respectively (Nian et al., 2024).
To assess reliability, the same two senior experts in organoid biology and translational medicine independently evaluated all responses. Inter-rater agreement was quantified using Cohen’s kappa coefficient, with values > 0.75 interpreted as excellent reliability (Rau and Shih, 2021), and residual discrepancies were adjudicated by the third senior reviewer. Both assessment tools demonstrated high inter-rater consistency, ensuring that the evaluation of LLM performance was methodologically robust and reproducible across reviewers (Faherty et al., 2020).
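For reference, agreement statistics of this kind can be computed as in the brief sketch below; the ratings shown are illustrative placeholders rather than data from this study.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical GQS ratings (1-5) from two independent reviewers for ten responses
rater_a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
rater_b = [5, 4, 3, 3, 5, 2, 4, 4, 5, 4]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # values > 0.75 interpreted here as excellent agreement

# For ordinal scales such as GQS, a weighted variant can also be reported
weighted = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Quadratically weighted kappa = {weighted:.2f}")
```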
2.5 Statistical analysis
Statistical analyses were performed according to the distributional characteristics of each variable. Continuous variables that met normality requirements, including C-PEMAT-P and GQS values, were summarized as mean ± standard deviation and compared across the five LLMs using one-way analysis of variance (ANOVA) with Bonferroni-adjusted post hoc testing. Metrics that did not show normal distribution, such as the ARI and FRES, were summarized as median with interquartile range and analyzed using the Kruskal–Wallis H test, followed by Dunn’s post hoc tests with adjusted significance thresholds when applicable. All statistical tests were two-tailed, with significance defined as P < 0.05. Data analysis was conducted using IBM SPSS Statistics version 25.0, and visualizations were generated with GraphPad Prism version 9.0. All prompts and model settings used for model querying are provided in the Supplementary Material to facilitate reproducibility.
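The between-model comparisons described above follow a standard workflow that can be sketched as follows; the score arrays are synthetic placeholders, and the SPSS Bonferroni procedure is approximated here by multiplying pairwise P values by the number of comparisons.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic placeholder scores, one array of 20 values per model (not the study's data)
groups = [rng.normal(loc=m, scale=1.0, size=20) for m in (16.0, 12.0, 11.8, 9.0, 7.9)]

# Normally distributed metrics (e.g., C-PEMAT-P, GQS): one-way ANOVA
f_stat, p_anova = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, P = {p_anova:.3g}")

# Non-normal metrics (e.g., ARI, FRES): Kruskal-Wallis H test
h_stat, p_kw = stats.kruskal(*groups)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, P = {p_kw:.3g}")

# Bonferroni-style post hoc testing: pairwise comparisons with adjusted P values
n_pairs = len(groups) * (len(groups) - 1) // 2
for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        _, p = stats.ttest_ind(groups[i], groups[j])
        print(f"model {i + 1} vs model {j + 1}: adjusted P = {min(p * n_pairs, 1.0):.3g}")
```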
3 Results
3.1 Readability analysis
This study systematically examined how different LLMs and content categories influence the readability and overall quality of organoid-related educational text. We compared five widely used models (DeepSeek, Doubao, GPT-5, Tongyi Qianwen, and Wenxin Yiyan) across three evaluation dimensions: patient-education suitability (C-PEMAT-P), overall scientific quality (GQS), and seven established readability indices (ARI, FRES, GFOG, FKGL, CLI, SMOG, and LW). We also analyzed these metrics across five thematic domains of organoid education—Technical Cognition, Diagnostic and Therapeutic Value, Safety Concerns, Cost and Process, and Decision Reference. By integrating model-level and domain-level analyses, we identified how algorithmic characteristics and question type jointly determine the clarity and educational value of AI-generated organoid information.
At the model level (Table 3), the five LLMs demonstrated substantial variation in patient-education suitability and overall scientific quality. Both C-PEMAT-P and GQS differed significantly across models (F = 71.22 and 124.88; both P < 0.001). GPT-5 showed the strongest performance on both metrics (C-PEMAT 16.05 ± 1.10; GQS 4.70 ± 0.47), DeepSeek and Doubao formed an intermediate tier (C-PEMAT 11.75 ± 2.07 and 12.05 ± 1.82; GQS 3.65 ± 0.49 and 3.35 ± 0.49), and Tongyi Qianwen and Wenxin Yiyan represented the lowest-performing group (C-PEMAT 7.85 ± 1.09 and 9.00 ± 2.05; GQS 1.55 ± 0.51 and 2.10 ± 0.55). All seven readability indices also differed significantly among models (all P < 0.001). Based on median values, GPT-5 and Wenxin Yiyan generated text with higher ARI, GFOG, FKGL, CLI, and SMOG and lower FRES, indicating longer sentences, denser terminology, and greater reading difficulty. DeepSeek and Tongyi Qianwen produced more readable outputs, with Doubao falling in between. LW values also varied significantly, with GPT-5 showing the lowest and Wenxin Yiyan the highest median LW, suggesting model-specific patterns in sentence and paragraph structuring.
At the content-category level (Table 4), question topic had a pronounced influence on readability but only a minimal effect on quality scores. C-PEMAT-P and GQS showed no significant differences across the five domains (F = 0.21 and 0.04; P = 0.934 and 0.997), suggesting that educational suitability and scientific quality were generally stable regardless of whether questions focused on process, decision-making, or technical considerations. In contrast, most readability indices showed significant domain-related variation. ARI differed across domains (P = 0.033), while FRES, GFOG, FKGL, CLI, and SMOG exhibited even stronger differences (all P ≤ 0.002). LW was the only index without significant variation (P = 0.784). Responses addressing Diagnostic and Therapeutic Value, Safety Concerns, and Technical Cognition displayed higher ARI, GFOG, FKGL, and SMOG and lower FRES, indicating longer, more technical, and more difficult text. By comparison, questions related to Cost and Process and Decision Reference yielded relatively more readable outputs, though still not fully accessible for lay audiences. Overall, these findings indicate that readability is jointly shaped by intrinsic model characteristics and by the inherent complexity of the question category.
3.2 Reliability and quality assessment
Across the five models, C-PEMAT-P scores varied substantially (Figure 1), demonstrating strong model dependence in the patient-education suitability of organoid-related content and indicating considerable variability in the reliability of information provided. GPT-5 represented the highest-performing tier, with markedly higher C-PEMAT-P scores than all other models, reflecting explanations that were both cognitively accessible and operationally actionable. Doubao and DeepSeek formed a mid-level tier, with median scores above 12 and tight distribution patterns, suggesting relatively reliable and stable delivery of clear, stepwise guidance that could meaningfully support patient decision-making. In contrast, Wenxin Yiyan and Tongyi Qianwen constituted the lowest tier, with scores clustered at or below 10 and visibly broader distributions, indicating frequent production of content that is difficult to act upon, inconsistently structured, and poorly matched to patient literacy levels. This three-tier pattern suggests that LLMs not only differ in average performance but also segregate into distinct strata of educational reliability, with some domestic models approaching clinically usable clarity while others exhibit systematic limitations likely to hinder comprehension and behavioral uptake.
Figure 1. C-PEMAT scores across five large language models. Violin plots display the distribution of C-PEMAT scores for all five models. GPT-5 shows the highest and most concentrated values, indicating superior educational suitability. Doubao and DeepSeek demonstrate intermediate performance, whereas Wenxin Yiyan and Tongyi Qianwen yield consistently lower scores. Statistical significance was evaluated using one-way ANOVA followed by post hoc testing (ns, not significant; *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001).
A similar performance hierarchy was observed for GQS (Figure 2), highlighting model architecture–related differences in scientific rigor. GPT-5 again occupied the top tier, with consistently high and tightly clustered GQS values, reflecting outputs that were accurate, coherent, and contextually appropriate in explaining organoid principles and applications. Doubao and DeepSeek formed a middle tier, characterized by overlapping medians and narrow interquartile ranges, indicating generally reliable performance but not uniformly strong adherence to evidence-based communication standards. Wenxin Yiyan and Tongyi Qianwen remained in the lowest tier, with GQS distributions skewed toward the lower end and elongated violin plots, suggesting greater variability in factual robustness and internal consistency. This graded performance pattern indicates that only a subset of current LLMs can be considered suitable for high-stakes scientific communication on organoids, whereas others may produce unstable or partially credible explanations that could compromise safe and accurate knowledge translation.
Figure 2. Global Quality Score distributions among five large language models. Violin plots illustrate GQS distributions across models. GPT-5 performs best, producing accurate and coherent explanations. Doubao and DeepSeek represent mid-range performers with moderate variability, while Wenxin Yiyan and Tongyi Qianwen exhibit the lowest and most dispersed scores. One-way ANOVA with post hoc comparisons was applied to assess significance (ns, not significant; *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001).
3.3 Correlation analysis
Correlation analysis showed a clear separation between readability metrics and overall text quality, although several consistent correlation patterns were identified (Figure 3). The C-PEMAT score displayed moderate correlations with several readability indicators. Positive correlations with SMOG (0.46), FKGL (0.45), ARI (0.44) and CLI (0.24) indicate that higher patient-education suitability is often associated with greater lexical and syntactic complexity. Negative correlations with FRES (−0.33) and LW (−0.36) suggest that responses that are too brief or overly simplified may lack the depth or actionable detail required for effective understanding. These findings indicate that high-quality patient-education materials require a balance between clarity and informational completeness, and that the use of appropriate professional vocabulary does not reduce accessibility when embedded in clear structure and contextual explanation. The presence of moderate rather than strong correlations further suggests that readability alone cannot be used as a substitute for assessing the reliability of LLM-generated content.
Figure 3. Correlation matrix of readability and quality metrics. Heatmap visualizing Pearson correlations between readability indices and quality measures. Readability metrics cluster closely, reflecting shared assessment of linguistic complexity, whereas C-PEMAT and GQS show only weak-to-moderate correlations with these indices. The pattern indicates that textual readability and scientific quality represent partially independent dimensions.
In contrast, the GQS score demonstrated a different pattern of correlations. Modest positive associations with ARI (0.40), SMOG (0.42), FKGL (0.40) and CLI (0.15), together with weak or negative correlations with FRES (−0.25) and LW (−0.35), indicate that higher scientific quality often involves more structured reasoning and more specialized terminology, which strengthens accuracy and coherence. The weak-to-moderate correlations show that the reliability of scientific explanations depends more on conceptual precision and internal consistency than on linguistic simplicity. Strong internal correlations among readability metrics, including ARI with FKGL (0.94), SMOG with FKGL (0.95) and SMOG with GFOG (0.81), confirm that these indicators primarily measure linguistic complexity rather than the quality of scientific reasoning. Overall, the results demonstrate that readability, quality and reliability represent partially independent dimensions. Scientific quality is shaped by accuracy, completeness and the organization of knowledge, while readability metrics describe the linguistic form through which information is conveyed.
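A correlation matrix of this kind can be reproduced in a few lines of code; the sketch below assumes a table with one row per scored response and illustrative column names and values, not the study’s actual dataset.

```python
import pandas as pd

# Each row is one scored response; columns and values are assumed for illustration only
df = pd.DataFrame({
    "ARI":     [12.1, 10.3, 14.8, 9.5, 13.0],
    "FKGL":    [11.8, 10.0, 14.1, 9.2, 12.6],
    "SMOG":    [12.5, 11.1, 14.0, 10.4, 13.2],
    "FRES":    [38.0, 47.5, 28.1, 52.3, 33.9],
    "C_PEMAT": [12, 11, 16, 9, 14],
    "GQS":     [3, 3, 5, 2, 4],
})

corr = df.corr(method="pearson")  # pairwise Pearson correlation matrix
print(corr.round(2))
```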
4 Discussion
Organoid technology has progressed from a specialized experimental method to a central platform in precision oncology, regenerative medicine and translational therapeutics (Zeng et al., 2024). Its conceptual complexity, protocol sensitivity and variable clinical readiness create persistent challenges for accurate communication across scientific, clinical and public settings (Getz and Campo, 2017). At the same time, LLMs have become widely used sources of organoid-related information for patients, trainees and clinicians, meaning that their strengths and limitations directly influence how this rapidly growing biotechnology is perceived (Li et al., 2025). In this context, both the quality and reliability of LLM-generated explanations are essential for maintaining scientific integrity. This study provides the first systematic and data-driven evaluation of how major LLMs communicate organoid concepts, revealing performance patterns with implications for safety, comprehension and translation (Goodman et al., 2023).
Three major insights emerge. First, model performance forms distinct tiers rather than a continuous scale. GPT-5 consistently produced the highest scores in both educational suitability and scientific quality, offering explanations that were coherent, accurate and stable. DeepSeek and Doubao constituted an intermediate tier that delivered information of acceptable structure but variable depth and inconsistent articulation of limitations. Tongyi Qianwen and Wenxin Yiyan formed a lower tier characterized by fragmented logic, missing mechanistic steps and limited educational relevance. This stratification indicates that architecture-level differences translate directly into disparities in factual stability and interpretive correctness. Because organoid communication involves sampling risk, assay interpretation and the distinction between experimental prediction and clinical judgement, using models with unstable performance may distort a knowledge base that is already technically complex and actively evolving (Gu et al., 2023; Liu X. et al., 2025). This stratification likely reflects differences in linguistic representation capacity, domain-specific biomedical exposure during training, stability of long-chain reasoning, and the depth of integrated biomedical knowledge, all of which directly shape how LLMs handle technically complex and safety-sensitive organoid-related questions.
Beyond overall score differences, the variability in reliability observed across models was associated with several recurring error patterns. Factual inaccuracies were common in responses describing organoid capabilities or experimental constraints, often involving oversimplified or imprecise biological statements. Unwarranted clinical extrapolation occurred when experimental organoid findings were implicitly framed as predictive of patient-level therapeutic outcomes without appropriate qualification. Omission of safety-related information was frequent in discussions of sampling procedures, assay limitations, and downstream decision-making, reducing the completeness of risk communication. Incomplete reasoning chains were also evident in responses that presented locally fluent explanations but failed to maintain logical continuity across multiple inferential steps, particularly in safety- and treatment-related scenarios. Together, these patterns help explain why surface-level coherence does not necessarily translate into reliable or clinically responsible organoid communication.
Second, the readability analysis highlights a structural challenge. Although readability indices differed across models, the largest increases in linguistic and conceptual difficulty consistently appeared in responses about diagnostic and therapeutic value, safety considerations and technical principles. These categories require multi-step reasoning, mechanistic accuracy and precise terminology, all of which naturally increase reading difficulty. In comparison, questions on cost, logistics or general decision advice were somewhat easier to read yet remained non-trivial for individuals without a biomedical background. These findings suggest that expectations for uniformly simplified language are misaligned with the cognitive demands of organoid science (Di Marco et al., 2024). Moreover, excessive simplification may obscure mechanistic constraints and boundary conditions, increasing the risk of misinterpretation in technically complex contexts (Lee et al., 2025).
Third, the correlation analysis shows that readability and content quality function as partially independent dimensions. Readability indices were strongly correlated with one another but only weakly to moderately correlated with C-PEMAT and GQS. Higher-quality responses tended to contain moderate lexical and syntactic complexity, reflecting the need for structured reasoning and domain-specific terminology to preserve mechanistic fidelity (Wang J. et al., 2025). In contrast, responses with very high readability often lacked the depth required to support informed decisions or meaningful understanding (Büker and Mercan, 2025). These findings demonstrate that reliability, defined as the stability, correctness and internal coherence of explanations, cannot be inferred from readability alone. Reliable communication depends on factual accuracy and explicit clarification of uncertainty rather than simplified vocabulary or shorter sentences (Magnaguagno et al., 2022). For advanced biotechnologies such as organoid systems, conceptual precision and transparent reasoning contribute more to user comprehension than reductions in linguistic complexity.
Together, these results carry practical implications. General-purpose language models should not be deployed uncritically in patient communication, laboratory onboarding or early clinical decision support. Only a subset of models demonstrates reliability adequate for safety-critical content, and reliability varies systematically by domain and query type (Ouertani et al., 2023). Communication strategies should emphasize structured and layered explanations, preserving essential terminology while adding context, stepwise logic and explicit caveats about assay limitations. Given that topics related to safety and treatment are consistently more difficult to express, additional safeguards are necessary, such as standardized descriptions of variability sources, reproducibility constraints and clear reminders that organoid assays cannot replace clinical expertise or informed consent processes (Anderson and Ledford, 2024).
From an ethical and governance perspective, responsible use of large language models in organoid communication requires the definition of minimum safeguards rather than reliance on general principles alone. Inaccurate or misleading organoid-related content generated by large language models carries distinct ethical risks. Overstated or imprecise descriptions of organoid capabilities may inflate patient expectations regarding diagnostic certainty or therapeutic benefit. Misrepresentation of experimental readiness or clinical applicability may influence financial decisions, including out-of-pocket testing, participation in unproven interventions, or resource allocation. Such misinformation may also complicate communication between patients and clinicians by introducing unrealistic assumptions or misaligned interpretations of experimental findings, thereby undermining informed discussion and shared decision-making. Against these risks, four safeguards appear essential. First, transparency of information sources is needed, including clear indication of whether statements are derived from experimental literature, synthesized inference, or generalized biomedical knowledge. Second, model outputs should be auditable, with mechanisms to trace claims back to supporting evidence and to flag content generated under uncertainty or low confidence. Third, explicit boundaries for clinical interpretation must be enforced, ensuring that organoid-based explanations are not presented as substitutes for clinical judgment, regulatory approval, or informed consent processes. Finally, standardized safety warnings should be mandatory for topics involving tissue sampling, therapeutic interpretation, or downstream decision-making, to prevent implicit normalization of experimental findings as clinical recommendations. Together, these safeguards provide a concrete foundation for responsible AI deployment in high-risk biomedical communication contexts.
It is essential to clearly distinguish between general educational information and content that may be interpreted as personalized medical advice. Although the LLM-generated responses evaluated in this study were intended to provide general explanations of organoid concepts, topics involving drug sensitivity, disease risk, or treatment interpretation are particularly vulnerable to misinterpretation. Without explicit boundary signaling, users may construe descriptive or probabilistic statements as individualized clinical guidance, despite the absence of patient-specific data. Such unintended medical interpretation may influence treatment expectations, self-directed decision-making, and clinician–patient communication, with potentially serious implications for clinical safety and shared decision-making. These findings highlight the need for explicit disclaimers, contextual framing, and clear separation between educational explanation and individualized medical decision-making when deploying general-purpose language models in clinically adjacent domains.
The observed disparities in model performance across languages raise important fairness and equity considerations. Uneven access to high-quality, domain-specific biomedical training data across languages likely contributes to these differences, resulting in systematically lower reliability and completeness in non-dominant language contexts. Such disparities may translate into unequal access to accurate organoid-related information, disproportionately affecting patients and clinicians who rely on models operating in under-resourced languages. Addressing these gaps will require targeted investment in multilingual biomedical corpora, transparent reporting of language-specific performance, and fairness-aware evaluation frameworks to ensure equitable information access across linguistic settings.
Methodologically, this study establishes a multiparametric evaluation framework that integrates educational suitability, global quality, multi-index readability profiling and correlation structures (Eweje et al., 2021). Incorporating reliability analysis further clarifies how linguistic form relates to epistemic integrity, revealing tensions that single-metric approaches cannot capture (Chen et al., 2024). This framework can be applied to other emerging biotechnologies, including CAR-based therapies, organ-on-chip systems and genome-editing platforms, where misunderstanding risk is amplified by rapid development and limited standardization (Wang H. et al., 2025). The observed performance stratification also provides a foundation for next-generation domain-adapted models, which may integrate structured ontologies, validated protocol repositories, uncertainty representations and curated translational datasets (Jain et al., 2025).
This study has several limitations. First, because large language models are subject to ongoing system updates and context-dependent behavior, performance was assessed at a single point in time, which limits generalizability as models evolve rapidly. Second, although the curated question set reflects common organoid-related inquiries, it does not include highly specialized scenarios such as immune co-culture or lineage engineering. Third, the evaluation relied solely on textual outputs, including readability, quality and inferred reliability, without directly assessing user comprehension or decision-making. Fourth, reliability was inferred from internal consistency rather than from adversarial or uncertainty-focused testing, which may overestimate real-world robustness. Finally, the study examined general-purpose models with limited transparency in their training data, and future domain-adapted or multimodal architectures may perform differently. These considerations should inform interpretation of the findings.
Future research can advance in three interconnected directions. First, expanding model and question diversity across languages, health-literacy groups and disease-specific organoid scenarios will strengthen external validity. This expansion should be supported by multidimensional evaluation frameworks that connect objective performance metrics with user comprehension, behavioral change and clinical relevance. Incorporating behavioral endpoints will clarify how AI-generated explanations influence understanding, expectation management, adherence and shared decision-making. Second, reliability assessment should include longitudinal stability, clarity and consistency of uncertainty disclosure, cross-session reproducibility and resilience to ambiguous or clinically nuanced prompts. These elements are essential for a balanced evaluation of model quality. Third, organoid-specific models developed from curated ontologies, standardized protocols and high-quality translational datasets are needed to ensure biological accuracy. When combined with adaptive text-generation systems that adjust readability to user literacy, such models can improve the clarity and safety of organoid communication. In parallel, governance frameworks should establish minimum standards for accuracy, uncertainty disclosure, provenance transparency, reliability monitoring and long-term auditing. These efforts will support the development of safer, more interpretable and more equitable AI systems for organoid science. To advance from benchmarking to responsible deployment, future work should consolidate organoid communication into an auditable infrastructure centered on a domain-specific knowledge graph that encodes experimental constraints, translational boundaries, and clinically relevant context. This must be paired with standardized uncertainty and risk-disclosure protocols that explicitly separate experimental inference from clinical actionability in safety-critical scenarios. Embedding these mechanisms within a lifecycle governance framework with expert review and post-deployment surveillance would convert organoid-informed AI outputs from persuasive narratives into accountable scientific communication.
5 Conclusion
Organoid technology has become integral to precision oncology, which makes the clarity, accuracy and reliability of AI-generated explanations increasingly consequential. This study shows marked performance differences among current LLMs, and only a limited number of models can provide organoid information that is both scientifically robust and suitable for educational use. The weak correspondence between readability and scientific quality further indicates that simplified language alone cannot ensure reliable interpretation. Taken together, these findings highlight the need for organoid-adapted AI systems that integrate domain-specific knowledge, express uncertainty with clarity and support communication practices that enable safe, trustworthy and equitable translation of organoid science.
Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.
Author contributions
MS: Conceptualization, Data curation, Formal Analysis, Methodology, Project administration, Writing – original draft. DZ: Funding acquisition, Validation, Writing – original draft. JC: Funding acquisition, Project administration, Writing – review and editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the National Natural Science Foundation of China (Grant No. 82203056).
Acknowledgements
We thank all evaluators and contributors for their efforts in data collection and validation.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fbioe.2026.1750225/full#supplementary-material
References
Abdel-Rehim, A., Orhobor, O., Griffiths, G., Soldatova, L., and King, R. D. (2025). Establishing predictive machine learning models for drug responses in patient derived cell culture. npj Precis. Oncol. 9, 180. doi:10.1038/s41698-025-00937-2
Anderson, L. N., and Ledford, C. J. W. (2024). Improving patient comprehension through explanatory communication. JAMA 332, 2027–2028. doi:10.1001/jama.2024.20868
Büker, M., and Mercan, G. (2025). Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: a comparative assessment. Int. J. Med. Inf. 201, 105948. doi:10.1016/j.ijmedinf.2025.105948
Chen, K., Song, N., Zhao, Y., Peng, J., and Chen, Y. (2024). Online attention versus knowledge utilization: exploring how linguistic features of scientific papers influence knowledge diffusion. Inf. Process. Manage. 61, 103691. doi:10.1016/j.ipm.2024.103691
Chen, X., Yi, H., You, M., Liu, W., Wang, L., Li, H., et al. (2025). Enhancing diagnostic capability with multi-agents conversational large language models. Npj Digit. Med. 8, 159. doi:10.1038/s41746-025-01550-0
Di Marco, N., Loru, E., Bonetti, A., Serra, A. O. G., Cinelli, M., and Quattrociocchi, W. (2024). Patterns of linguistic simplification on social media platforms over time. Proc. Natl. Acad. Sci. U. S. A. 121, e2412105121. doi:10.1073/pnas.2412105121
Eweje, F. R., Bao, B., Wu, J., Dalal, D., Liao, W.-H., He, Y., et al. (2021). Deep learning for classification of bone lesions on routine MRI. EBioMedicine 68, 103402. doi:10.1016/j.ebiom.2021.103402
Faherty, A., Counihan, T., Kropmans, T., and Finn, Y. (2020). Inter-rater reliability in clinical assessments: do examiner pairings influence candidate ratings? BMC Med. Educ. 20, 147. doi:10.1186/s12909-020-02009-4
Getz, K. A., and Campo, R. A. (2017). Trends in clinical trial design complexity. Nat. Rev. Drug Discov. 16, 307. doi:10.1038/nrd.2017.65
Goodman, R. S., Patrinely, J. R., Stone, C. A., Zimmerman, E., Donald, R. R., Chang, S. S., et al. (2023). Accuracy and reliability of chatbot responses to physician questions. JAMA Netw. Open 6, e2336483. doi:10.1001/jamanetworkopen.2023.36483
Grippaudo, F. R., Nigrelli, S., Patrignani, A., and Ribuffo, D. (2024). Quality of the information provided by ChatGPT for patients in breast plastic surgery: are we already in the future? JPRAS Open 40, 99–105. doi:10.1016/j.jpra.2024.02.001
Gu, Y., Zhang, W., Wu, X., Zhang, Y., Xu, K., and Su, J. (2023). Organoid assessment technologies. Clin. Transl. Med. 13, e1499. doi:10.1002/ctm2.1499
Gunduz, M. E., Matis, G. K., Ozduran, E., and Hanci, V. (2024). Evaluating the readability, quality, and reliability of online patient education materials on spinal cord stimulation. Turk. Neurosurg. 34 (3), 588–599. doi:10.5137/1019-5149.JTN.42973-22
Hanci, V., Otlu, B., and Biyikoğlu, A. S. (2024). Assessment of the readability of the online patient education materials of intensive and critical care societies. Crit. Care Med. 52, e47–e57. doi:10.1097/CCM.0000000000006121
Jain, A., Gut, G., Sanchis-Calleja, F., Tschannen, R., He, Z., Luginbühl, N., et al. (2025). Morphodynamics of human early brain organoid development. Nature 644, 1010–1019. doi:10.1038/s41586-025-09151-3
Jensen, K. B., and Little, M. H. (2023). Organoids are not organs: sources of variation and misinformation in organoid biology. Stem Cell Rep. 18, 1255–1270. doi:10.1016/j.stemcr.2023.05.009
Lee, H.-S., Song, S.-H., Park, C., Seo, J., Kim, W. H., Kim, J., et al. (2025). The ethics of simplification: balancing patient autonomy, comprehension, and accuracy in AI-generated radiology reports. BMC Med. Ethics 26, 136. doi:10.1186/s12910-025-01285-3
Li, J., Zhou, Z., Lyu, H., and Wang, Z. (2025). Large language models-powered clinical decision support: enhancing or replacing human expertise? Intell. Med. 5, 1–4. doi:10.1016/j.imed.2025.01.001
Liu, X., Zhou, Z., Zhang, Y., Zhong, H., Cai, X., and Guan, R. (2025). Recent progress on the organoids: techniques, advantages and applications. Biomed. Pharmacother. 185, 117942. doi:10.1016/j.biopha.2025.117942
Liu, Y., He, H., Han, T., Zhang, X., Liu, M., Tian, J., et al. (2025). Understanding LLMs: a comprehensive overview from training to inference. Neurocomputing 620, 129190. doi:10.1016/j.neucom.2024.129190
Magnaguagno, L., Zahno, S., Kredel, R., and Hossner, E.-J. (2022). Contextual information in situations of uncertainty: the value of explicit-information provision depends on expertise level, knowledge acquisition and prior-action congruency. Psychol. Sport Exerc. 59, 102109. doi:10.1016/j.psychsport.2021.102109
Nian, P. P., Saleet, J., Magruder, M., Wellington, I. J., Choueka, J., Houten, J. K., et al. (2024). ChatGPT as a source of patient information for lumbar spinal fusion and laminectomy: a comparative analysis against Google web search. Clin. Spine Surg. 37, E394–E403. doi:10.1097/BSD.0000000000001582
Ouertani, A., Krini, O., and Börcsök, H. J. (2023). “A practical approach for reliability prediction of safety critical software using multi-model ensemble techniques,” in 2023 7th International Conference on System Reliability and Safety (ICSRS), 498–506. doi:10.1109/ICSRS59833.2023.10381372
Özduran, E., and Hanci, V. (2022). Evaluating the readability, quality and reliability of online information on Behçet’s disease. Reumatismo 74. doi:10.4081/reumatismo.2022.1495
Puschhof, J., Pleguezuelos-Manzano, C., and Clevers, H. (2021). Organoids and organs-on-chips: insights into human gut-microbe interactions. Cell Host Microbe 29, 867–878. doi:10.1016/j.chom.2021.04.002
Rau, G., and Shih, Y.-S. (2021). Evaluation of cohen’s kappa and other measures of inter-rater agreement for genre analysis and other nominal data. J. Engl. Acad. Purp. 53, 101026. doi:10.1016/j.jeap.2021.101026
Sandmann, S., Hegselmann, S., Fujarski, M., Bickmann, L., Wild, B., Eils, R., et al. (2025). Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. 31, 2546–2549. doi:10.1038/s41591-025-03727-2
Steyvers, M., Tejeda, H., Kumar, A., Belem, C., Karny, S., Hu, X., et al. (2025). What large language models know and what people think they know. Nat. Mach. Intell. 7, 221–231. doi:10.1038/s42256-024-00976-7
Tam, T. Y. C., Sivarajkumar, S., Kapoor, S., Stolyar, A. V., Polanska, K., McCarthy, K. R., et al. (2024). A framework for human evaluation of large language models in healthcare derived from literature review. npj Digit. Med. 7, 258. doi:10.1038/s41746-024-01258-7
Venerito, V., Puttaswamy, D., Iannone, F., and Gupta, L. (2023). Large language models and rheumatology: a comparative evaluation. Lancet Rheumatol. 5, e574–e578. doi:10.1016/S2665-9913(23)00216-3
Verstegen, M. M. A., Coppes, R. P., Beghin, A., De Coppi, P., Gerli, M. F. M., de Graeff, N., et al. (2025). Clinical applications of human organoids. Nat. Med. 31, 409–421. doi:10.1038/s41591-024-03489-3
Wang, H., Ning, X., Zhao, F., Zhao, H., and Li, D. (2024). Human organoids-on-chips for biomedical research and applications. Theranostics 14, 788–818. doi:10.7150/thno.90492
Wang, D., Villenave, R., Stokar-Regenscheit, N., and Clevers, H. (2025). Human organoids as 3D in vitro platforms for drug discovery: opportunities and challenges. Nat. Rev. Drug Discov., 1–23. doi:10.1038/s41573-025-01317-y
Wang, H., Zhu, W., Xu, C., Su, W., and Li, Z. (2025). Engineering organoids-on-chips for drug testing and evaluation. Metab. Clin. Exp. 162, 156065. doi:10.1016/j.metabol.2024.156065
Wang, J., Kim, Y.-S. G., Lam, J. H. Y., and Leachman, M. A. (2025). A meta-analysis of relationships between syntactic features and writing performance and how the relationships vary by student characteristics and measurement features. Assess. Writ. 63, 100909. doi:10.1016/j.asw.2024.100909
Wang, Q., Yuan, F., Zuo, X., and Li, M. (2025). Breakthroughs and challenges of organoid models for assessing cancer immunotherapy: a cutting-edge tool for advancing personalised treatments. Cell Death Discov. 11, 222. doi:10.1038/s41420-025-02505-w
Xia, X., Li, F., He, J., Aji, R., and Gao, D. (2019). Organoid technology in cancer precision medicine. Cancer Lett. 457, 20–27. doi:10.1016/j.canlet.2019.04.039
Yilmaz Hanci, S. (2023). How readable and quality are online patient education materials about Helicobacter pylori? Assessment of the readability, quality and reliability. Med. Baltim. 102, e35543. doi:10.1097/MD.0000000000035543
Keywords: artificial intelligence, large language models, online medical information, organoids, readability
Citation: Sun M, Zang D and Chen J (2026) Benchmarking readability, reliability, and scientific quality of large language models in communicating organoid science. Front. Bioeng. Biotechnol. 14:1750225. doi: 10.3389/fbioe.2026.1750225
Received: 20 November 2025; Accepted: 02 January 2026;
Published: 16 January 2026.
Edited by:
Xuyong Wei, Hangzhou First People’s Hospital, China
Reviewed by:
Bin Ai, Peking University, China
Peng Liu, China Academy of Chinese Medical Sciences, China
Copyright © 2026 Sun, Zang and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jun Chen, chenjun_dmu@126.com
†These authors have contributed equally to this work
Dan Zang†