MINI REVIEW article

Front. Med., 12 January 2026

Sec. Ophthalmology

Volume 12 - 2025 | https://doi.org/10.3389/fmed.2025.1741888

Making Chatbots more human: deep reasoning large language models in ophthalmology


Xuanqiao Lin1, Yizhou Yang1†, Yuecheng Ren2*†
  • 1Department of Ophthalmology, Eye, Ear, Nose, and Throat Hospital of Fudan University, Shanghai, China
  • 2Department of Ophthalmology, Jiaxing Traditional Chinese Medicine Hospital Affiliated to Zhejiang Chinese Medical University, Jiaxing, Zhejiang, China

Recent advances in deep-reasoning large language models (LLMs)—including OpenAI's GPT series and open-source DeepSeek models—have expanded their potential applications in ophthalmology. Within the specialty, image interpretation continues to rely primarily on conventional computer vision and vision language model pipelines, whereas text-based LLMs contribute to language-centric workflows, such as report interpretation, patient education drafting, and electronic health record (EHR) summarization. Multimodal systems that integrate visual inputs with reasoning have been explored in simulated or retrospective settings for tasks such as personalized planning. Although these approaches may enhance workflow efficiency and decision-making, their direct clinical benefits have not yet been established. Nevertheless, practical implementation remains challenging because of computational demands, privacy and bias considerations, and persistent issues with transparency and interpretability. Additionally, system congestion and inconsistent response times further complicate real-world clinical use. Therefore, future research should focus on addressing operational and ethical constraints, tailoring AI systems to ophthalmic workflows, and ensuring that such tools remain assistive, equitable, and transparent partners in clinical decision-making. Thoughtful integration of deep reasoning models appears promising for ophthalmic practice, but prospective interventional studies are required before making any claims regarding patient outcomes.

1 Introduction

With recent advancements in deep learning (DL)—particularly the development of transformer-based models such as BERT and GPT—large language models (LLMs) have achieved remarkable accuracy in domains including natural language processing and image recognition (1, 2). Although they demonstrate impressive capabilities, conventional LLMs often struggle with handling tasks that require deeper reasoning and deliberate, multi-step logic (Table 1).


Table 1. Comparison between new large language models and traditional artificial intelligence models.

In ophthalmology, the rapid progress of LLMs has primarily led to applications in language-centered workflows, such as interpretation of narrative clinical reports, generation of patient education materials, summarization of electronic health record (EHR) notes, and supporting referral triage (3, 4). More recently, frontier multimodal models that incorporate visual processing have begun to provide workflow-level assistance—for example, structuring biometry data for intraocular lens (IOL) calculations or interpreting the spatial layout of complete visual field test reports (5, 6).

Within this review, Deep Reasoning is defined as the capability of a model to perform explicit, multistep decomposition, verification, and self-correction during the reasoning process, and to improve the stability of its conclusions by leveraging test-time compute or self-consistency when necessary (7). Deep Reasoning does not mean "larger models"; it refers to a structured reasoning workflow that systematically mitigates common failure modes of traditional LLMs in clinical reasoning, including hallucinations, cross-step logical breaks, and inconsistent conclusions (8).
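
As a purely illustrative sketch of this definition, the snippet below chains decomposition, verification, and self-correction around a hypothetical `call_llm` function; it outlines the pattern only and does not represent any vendor's API.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for any chat-completion API; not a real client.
    raise NotImplementedError("plug in a model API here")

def deep_reason(question: str, max_revisions: int = 2) -> str:
    # 1) Explicit decomposition into numbered steps.
    steps = call_llm(f"Decompose the problem into numbered steps, then solve:\n{question}")
    # 2) Verification and self-correction loop (test-time compute).
    for _ in range(max_revisions):
        verdict = call_llm(f"Check each step for errors:\n{steps}\nReply OK or list fixes.")
        if verdict.strip().startswith("OK"):
            break
        steps = call_llm(f"Revise the reasoning to apply these fixes:\n{verdict}\n\n{steps}")
    # 3) Final answer extracted from the verified trace.
    return call_llm(f"State only the final answer:\n{steps}")
```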

Advances in reasoning-oriented LLMs are particularly relevant to ophthalmology. These models employ mechanisms such as extended chain-of-thought (CoT), test-time scaling, and reinforcement learning from human feedback (RLHF) to improve logical consistency, internal verification, and iterative refinement (9). Such capabilities create opportunities for enhancing diagnostic reasoning and individualized patient counseling in both benchmarked and simulated settings. However, translating these systems into clinical practice remains challenging due to issues related to computational efficiency, ethical safeguards, and model interpretability (4).

Ophthalmology is particularly suitable for deep-reasoning LLMs because most ophthalmic conditions require longitudinal decision-making and multi-step therapeutic planning rather than one-off answers (10, 11). Disease management often proceeds across repeated visits, necessitating an integration of symptoms, imaging-derived reports, refraction, intraocular pressure, comorbidities, and treatment responses (12). Clinicians use this evolving information to update differential diagnoses, refine risk stratification, and adjust stepwise care pathways. Such workflows demand multi-step reasoning, consistency checks, and plan verification, matching the core strengths of deep-reasoning LLMs compared with conventional single-pass generation.

This review provides a comprehensive overview of the evolution of LLMs, highlights the emergence of deep-reasoning architectures, and examines their current and potential applications in ophthalmology. It also discusses major limitations and outlines future research priorities essential for safe and effective integration of these technologies into clinical practice.

2 Method

Following PRISMA-aligned best practices for narrative reviews, we searched PubMed and the Web of Science Core Collection using the query (ophthalmology OR eye) AND ("large language model" OR LLM OR "vision language model" OR "chatbot") AND (diagnos* OR triage OR report* OR "IOL" OR surgery OR "instrument tracking" OR "decision support") with a publication window from 1980 to 2025. This search identified 251 unique records in PubMed and 215 in Web of Science. After cross-database deduplication, 357 records remained. Title screening retained 159 records and excluded 198, while abstract screening retained 99 and excluded 60. The most common reasons for exclusion during abstract screening were lack of an available abstract (n = 33), tasks falling outside the scope of the review (n = 21), and reviews lacking empirical evaluation (n = 6). We compiled an inclusion list with the following source-specific identifiers: PMID or UT, publication year, and article title. Consistent with the narrative-review format, we did not register a review protocol or apply a formal risk-of-bias assessment tool.
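
The screening arithmetic reported above can be checked directly; the snippet below reproduces every count from this section and verifies that the stages are internally consistent.

```python
# Pure-arithmetic sanity check of the screening flow; all counts are taken
# directly from the Method section above.
pubmed, wos = 251, 215
after_dedup = 357
duplicates = pubmed + wos - after_dedup                  # 109 cross-database duplicates
title_kept, title_excluded = 159, 198
assert title_kept + title_excluded == after_dedup        # 159 + 198 = 357
abstract_kept, abstract_excluded = 99, 60
assert abstract_kept + abstract_excluded == title_kept   # 99 + 60 = 159
assert 33 + 21 + 6 == abstract_excluded                  # stated exclusion reasons sum to 60
print(f"{duplicates} duplicates removed; {abstract_kept} records retained")
```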

3 Development and current status of deep reasoning LLMs

3.1 Development of deep reasoning LLMs

The advent of deep reasoning LLMs can be traced to the introduction of transformer architectures in natural language processing (NLP). Introduced in 2017, the Transformer model replaced traditional recurrent neural networks (RNNs) with an attention mechanism, substantially enhancing sequence data processing capabilities (13). OpenAI's GPT series, launched in 2018, further advanced LLM development through a pretraining-fine-tuning framework (14). In 2020, GPT-3 demonstrated significant emergent reasoning abilities, although limitations in complex, multi-step reasoning tasks remained (15).

By 2022, studies showed that prompting models to generate intermediate reasoning steps—known as chain-of-thought (CoT)—significantly improved performance on complex reasoning benchmarks (16). Techniques such as self-consistency, which samples multiple reasoning paths and selects a consensus answer, further increased accuracy (17). More recently, reinforcement learning has been introduced to encourage coherent reasoning traces. Although these techniques enhance benchmark performance, their contribution to genuine reasoning remains under debate. Recent analyses question whether CoT reliably scales to high-complexity or out-of-distribution problems (18). Consistent with this view, a more rigorous framing characterizes current models as strong in crystallized intelligence—the recall and application of accumulated knowledge—yet still limited in fluid intelligence, which requires adaptive and flexible problem solving.
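
A minimal sketch of the contrast between direct prompting, CoT prompting, and self-consistency voting is shown below, again assuming a hypothetical `call_llm` sampler rather than any specific API.

```python
from collections import Counter

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a stochastic chat-completion sampler.
    raise NotImplementedError("plug in a model API here")

DIRECT = "Q: {q}\nA:"
COT = "Q: {q}\nLet's think step by step, then give the final answer on its own line."

def direct_answer(question: str) -> str:
    return call_llm(DIRECT.format(q=question))  # single-pass baseline

def self_consistent_answer(question: str, k: int = 10) -> str:
    # Sample k independent CoT traces, keep each trace's final line,
    # and return the majority (consensus) answer.
    finals = [call_llm(COT.format(q=question)).splitlines()[-1] for _ in range(k)]
    return Counter(finals).most_common(1)[0][0]
```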

Concurrently, the field has shifted toward a slow-thinking paradigm that integrates training-time optimization with test-time computation strategies. Test-time scaling, structured multistep reasoning, and verification are increasingly combined with reinforcement learning from human or programmatic feedback to improve the reliability and consistency of reasoning processes. This integration is often described as a pathway toward more robust stepwise deliberation, although it still faces challenges in terms of efficiency, interpretability, and generalization (4).

Beyond text-based approaches, hybrid pipelines have emerged in which structured or relational models complement LLMs. Graph neural networks (GNNs) can capture structural dependencies, and when integrated with convolutional neural network (CNN) feature extractors, they enable tasks that combine imaging with language-based reasoning. Such modular integrations do not necessarily make the LLM itself multimodal; rather, they illustrate a system-level design in which visual features inform downstream reasoning for diagnostic and imaging analysis scenarios (19).

To avoid conflating different technical routes, this review classifies ophthalmic "image interpretation" systems into three categories: (1) traditional computer-vision models, which process images alone for tasks such as classification or segmentation; (2) native vision–language models, which take multimodal inputs of images and text and conduct unified reasoning across vision and language; and (3) text LLMs coupled with external visual encoders, where image information is first converted to structured descriptors in the form of features, captions, or reports, and subsequently reasoned over in text. Thus, if a deep reasoning LLM does not take native image input, such as DeepSeek-R1, its primary clinical value should be framed in terms of superior textual reasoning—for example, interpreting ophthalmic reports, operative notes, and longitudinal records—rather than direct image reading.
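
A minimal sketch of route (3) follows, using hypothetical `encode_fundus` and `call_llm` placeholders: the image never enters the LLM, only structured descriptors do, and all reasoning happens over text.

```python
from dataclasses import dataclass

@dataclass
class FundusDescriptors:
    # Illustrative descriptor fields; a real encoder would emit many more.
    cup_disc_ratio: float
    microaneurysm_count: int
    macular_thickness_um: float

def encode_fundus(image_path: str) -> FundusDescriptors:
    raise NotImplementedError("CNN/ViT encoder + report head goes here")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("text-only reasoning model, e.g., DeepSeek-R1")

def interpret(image_path: str, history: str) -> str:
    d = encode_fundus(image_path)  # vision stays outside the LLM
    prompt = (f"Findings: cup/disc {d.cup_disc_ratio:.2f}, "
              f"{d.microaneurysm_count} microaneurysms, "
              f"macular thickness {d.macular_thickness_um:.0f} um.\n"
              f"History: {history}\nGive a differential with reasoning.")
    return call_llm(prompt)  # reasoning operates on text descriptors only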

3.2 Representative deep reasoning LLMs

3.2.1 GPT-o1

In 2024, OpenAI introduced GPT-o1, a model specifically engineered for deep reasoning. Unlike previous LLMs, GPT-o1 generates extended sequences of intermediate reasoning steps before producing a final response, thereby strengthening its ability to manage complex, multi-step logical tasks. Using reinforcement learning, particularly RLHF, GPT-o1 was trained to refine these intermediate reasoning processes, which improved its benchmark performance on multistep tasks. GPT-o1 has also demonstrated strong performance in mathematics competitions such as the American Invitational Mathematics Examination (AIME) and programming challenges.

In medical imaging analysis, GPT-o1's ability to process visual input and leverage internal reasoning processes has the potential to improve diagnostic accuracy. Recent studies have shown that GPT-o1 achieves diagnostic accuracies comparable to those of professional clinicians in clinical case analyses (20). However, the computational intensity and associated costs of GPT-o1 pose practical challenges for its widespread clinical implementation (20).

3.2.2 ChatGPT 5

In August 2025, OpenAI introduced GPT-5, a unified system featuring two complementary modes—a fast default mode and a dedicated thinking mode—automatically selected by a routing mechanism according to task complexity and user requirements. This design allows the adaptive allocation of computational resources and effectively balances efficiency with reasoning depth. GPT-5 achieved 94.6% accuracy on the AIME-2025 benchmark for mathematical reasoning and set new state-of-the-art results on SWE-bench Verified, MMMU, and HealthBench-Hard. Additionally, it demonstrated improvements in instruction adherence, reduced hallucination rates, and greater robustness against sycophantic responses (21).
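
OpenAI has not published the router's internals; the toy sketch below only illustrates the general idea of dispatching queries by estimated task complexity, using an invented heuristic and illustrative mode names.

```python
def estimate_complexity(query: str) -> float:
    # Invented heuristic: count reasoning cues and weigh query length.
    cues = ("differential", "calculate", "compare", "step", "why")
    return min(1.0, 0.2 * sum(c in query.lower() for c in cues) + len(query) / 2000)

def route(query: str, threshold: float = 0.4) -> str:
    # A real router would also weigh user preference and latency budget.
    return "thinking" if estimate_complexity(query) >= threshold else "fast"

print(route("Compare IOL formulas and calculate power step by step"))  # -> thinking
print(route("What is a cataract?"))                                    # -> fast
```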

In biomedical and clinical contexts, GPT-5 functions as an active thought partner rather than a decision-maker, showing cautious reasoning behavior and dynamic contextual adaptation. On the MedXpertQA-MM benchmark, GPT-5 outperformed GPT-4o by +29.26% in reasoning and +26.18% in comprehension performance, exceeding the scores of licensed human experts by +24.23% and +29.40%, respectively (22). Despite these promising outcomes, GPT-5 remains a newly released model, and its broader implications for medical diagnosis and clinical education warrant further systematic evaluation.

3.2.3 DeepSeek-R1

In early 2025, the open-source community introduced DeepSeek-R1, a large language model designed to advance deep reasoning through reinforcement learning (RL). Unlike conventionally supervised models, DeepSeek-R1 employs a multi-stage optimization pipeline that integrates "cold-start" supervised fine-tuning, rule-based RL, rejection sampling, and an additional RL phase incorporating preference and safety rewards (23). This approach enhances the coherence and interpretability of reasoning processes while reducing computational costs to approximately 5–10% of those required by GPT-o1, yet maintaining comparable performance on complex reasoning benchmarks (24).

A defining feature of DeepSeek-R1 is its emphasis on the chain-of-thought (CoT), which enables systematic problem analysis and synthesis. Notably, emergent behaviors such as self-reflection and verification have been observed during training. Furthermore, the model's reasoning capability has been successfully distilled into smaller variants, increasing accessibility for research and clinical developers (23). Its performance on medical reasoning tasks, including ophthalmic case studies, matches that of proprietary models, with an accuracy of approximately 82% (24). Because of its open-source nature, it supports fine-tuning tailored to specific medical domains. Although it does not natively support multimodal inputs, DeepSeek-R1 can be integrated with external vision encoders (e.g., CNN-based feature extractors) to enable language–image workflows in medical imaging analysis, thereby providing a flexible and cost-efficient platform for both research and clinical applications. Reported limitations include suboptimal structured output and tool use, reduced token efficiency on simple queries, prompt sensitivity (zero-shot prompting is preferred), and prior language-mixing issues that have since been mitigated through a language-consistency reward mechanism (23).

3.2.4 Grok-3

In early 2025, xAI introduced Grok-3, an advanced LLM designed with enhanced reasoning capabilities, supported by specialized inference methods such as "Think Mode" and "Big Brain Mode," and capable of more in-depth analysis and nuanced decision-making. Grok-3 achieved competitive scores on complex reasoning tasks, including a high score on the 2025 AIME (25). In the biomedical context, Grok-3 shows promising potential applications in ophthalmology, such as enhancing diagnostic accuracy, facilitating personalized treatment planning, and supporting patient education.

3.3 Additional research in deep reasoning LLMs

In addition to GPT-o1 and DeepSeek-R1, other institutions have explored alternative strategies for enhancing LLM reasoning abilities. Anthropic's Claude series utilizes extended context windows to improve the handling of long-chain reasoning tasks (26). Google's Gemini series explores dynamic allocation of computational resources during reasoning—often referred to as test-time compute—to enhance performance on complex tasks. Furthermore, ongoing research focuses on integrating external knowledge bases and computational tools, such as code execution, to further optimize LLM reasoning capabilities and highlight potential future developments.

4 Application of deep reasoning models in ophthalmology

4.1 Diagnostic assistance

Previous studies have demonstrated that LLMs achieve high diagnostic accuracy in ophthalmology (27, 28). Deep reasoning models—such as DeepSeek-R1, GPT-o1, Gemini 2.0, and Grok-3—have shown significant potential in ophthalmic diagnostics, particularly for conditions such as glaucoma, diabetic retinopathy (DR), and age-related macular degeneration (AMD) (29, 30). Notably, most of these evaluations were conducted in retrospective, benchmark, or simulated settings and should therefore be interpreted as performance evidence rather than demonstrated real-world patient outcome benefit. By analyzing large volumes of medical data, including symptoms, test results, and medical history, these models can provide doctors with diagnostic recommendations and differential diagnosis lists (31). In principle, this capability may reduce the risk of diagnostic errors and assist ophthalmologists in making more informed clinical decisions. A comparative evaluation of DeepSeek-R1 and GPT-o1 in pediatric clinical decision support reported diagnostic accuracies comparable to those of clinical specialists (92.8% for GPT-o1 and 87.0% for DeepSeek-R1), underscoring their potential as clinical decision support tools (32).

Beyond ophthalmology, LLMs have also demonstrated high accuracy in automated extraction of Coronary Artery Disease-Reporting and Data System (CAD-RADS) 2.0 components from semi-structured coronary CT angiography reports, particularly when used with CoT prompting (33). These findings underscore the broader value of structured reasoning strategies in improving reliability across clinical domains and further support their potential utility in ophthalmology.
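
As an illustration of how such structured extraction might be wired up for ophthalmic reports (a sketch under assumptions, not the cited study's pipeline), a CoT prompt can be paired with strict output validation; `call_llm` and the schema fields are hypothetical.

```python
import json

# Illustrative schema for an ophthalmic report; field names are assumptions.
SCHEMA = {"diagnosis": str, "laterality": str, "severity": str}

PROMPT = (
    "Read the ophthalmic report below. First reason step by step about the "
    "findings. Then output ONLY a JSON object with keys "
    "diagnosis, laterality, severity.\n\nReport:\n{report}"
)

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model API here")

def extract(report: str) -> dict:
    raw = call_llm(PROMPT.format(report=report))
    payload = raw[raw.find("{"): raw.rfind("}") + 1]  # strip the reasoning trace
    data = json.loads(payload)
    for key, typ in SCHEMA.items():  # reject malformed or incomplete outputs
        if not isinstance(data.get(key), typ):
            raise ValueError(f"missing or invalid field: {key}")
    return data
```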

Ophthalmology is uniquely positioned for the integration of deep-reasoning LLMs because of the abundance of quantifiable ocular biometric parameters, such as axial length, anterior chamber depth, lens thickness, and corneal curvature, routinely obtained through imaging modalities including OCT, B-scan ultrasonography, and anterior segment photography. This extensive repository of structured and image-derived data provides a strong foundation for applying reasoning models to diagnostic and analytical workflows. However, some limitations remain because not all deep-reasoning models support multimodal learning. For instance, DeepSeek-R1 remains confined to text-based inputs and cannot natively process images. Nevertheless, its open-source framework enables customized server configurations that may partially mitigate this limitation and extend its applicability to ophthalmology (Figure 1).
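
These parameters are straightforward to serialize into a structured record that a text-only reasoning model can consume; the field names and values below are illustrative only.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class OcularBiometry:
    # Routinely measured parameters; units noted in the field names.
    axial_length_mm: float
    anterior_chamber_depth_mm: float
    lens_thickness_mm: float
    mean_keratometry_d: float

biometry = OcularBiometry(23.5, 3.1, 4.2, 43.8)  # illustrative values
prompt_fragment = json.dumps(asdict(biometry))   # ready to embed in an LLM prompt
print(prompt_fragment)
```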


Figure 1. Application and risk of deep reasoning models in ophthalmology.

4.2 Surgical assistance

Traditional LLMs have been explored for perioperative support in both retrospective and simulation settings (34, 35). Through improvements in preoperative planning and postoperative monitoring, deep reasoning models are progressively influencing ophthalmic surgical procedures. Most deep-reasoning LLMs, with their transparent reasoning pathways, can provide explicit justifications for surgical decision-making, thereby enhancing clinician confidence (6). LLMs are capable of interpreting IOLMaster outputs and assisting with the selection of toric or non-toric IOLs, although the power calculation itself continues to rely on conventional statistical or machine learning formulas (5). Consequently, the LLM contribution is more accurately described as IOL selection support rather than IOL power calculation, clearly distinguishing language-model assistance from formula-based computation. Moreover, in the future, multimodal pipelines that couple external visual encoders with text-based deep-reasoning LLMs could support intraoperative decision-making, further augmenting surgeons' capabilities and enhancing patient safety. Postoperatively, AI systems facilitate remote wound-healing assessment, early identification of complications, and improved recovery outcomes.
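
This division of labor can be sketched as follows, using the classic SRK regression formula (P = A - 2.5·AL - 0.9·K) purely for illustration; modern practice uses newer formulas, and `call_llm` is a hypothetical placeholder. The number comes from the formula; the LLM only drafts selection rationale for the surgeon to review.

```python
def srk_power(a_constant: float, axial_length_mm: float, mean_k_d: float) -> float:
    # Classic SRK regression: P = A - 2.5 * AL - 0.9 * K (illustrative only).
    return a_constant - 2.5 * axial_length_mm - 0.9 * mean_k_d

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a reasoning-model API.
    raise NotImplementedError

def counsel_iol(a_constant: float, axial_length_mm: float, mean_k_d: float,
                corneal_astigmatism_d: float) -> str:
    power = srk_power(a_constant, axial_length_mm, mean_k_d)  # formula, not the LLM
    return call_llm(
        f"SRK suggests {power:.1f} D. Corneal astigmatism {corneal_astigmatism_d:.1f} D. "
        "Draft toric vs non-toric considerations for the surgeon to review."
    )

print(srk_power(118.4, 23.5, 43.8))  # 20.23 D with these illustrative inputs
```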

4.3 Patient education

Deep-reasoning LLMs can enhance patient management by providing personalized therapeutic recommendations, educational content, and remote monitoring support. These models effectively simplify complex medical information, making it more understandable and engaging for patients, which in turn improves comprehension and adherence to ophthalmic treatments. For example, they can generate easy-to-understand explanations of medical conditions, treatments, and procedures to help patients understand their health better and make informed decisions. They can also create personalized educational materials based on the specific needs and understanding levels of patients, thereby enhancing the effectiveness of patient education (36, 37).

Several comparative studies have examined the role of different LLMs in patient education and demonstrated promising results (38, 39). Some models have produced materials with better readability and comprehensibility than those written by clinicians. Semeraro et al. (32) evaluated models such as ChatGPT-4o, Gemini Advanced, and DeepSeek-R1 for their ability to convey critical medical guidelines—such as those from the European Resuscitation Council (ERC) on CPR—to the public. These studies showed promising results in terms of readability and correctness metrics for patient education.

4.4 Patients' psychological care

A recent study examined the ability of six LLMs—GPT-4o, GPT-o1, DeepSeek-R1, Claude 3.5 Sonnet, Sonar Large (LLaMA-3.1), and Gemma-2-2b—to detect risks of domestic violence, suicide, and filicide suicide within the Taiwanese flash fiction Barbecue (40). Notably, GPT-o1 demonstrated an ability to identify suicide risk based on subtle cultural cues, suggesting that deep reasoning–based LLMs may outperform traditional models in recognizing latent psychological states during conversations. This observation holds particular significance in ophthalmology, where patients with ocular trauma, advanced glaucoma, or optic nerve atrophy often experience irreversible vision loss and poor surgical outcomes, which may lead to despair, resentment, or even suicidal ideation (41, 42).

However, recent reports such as Chatbot psychosis have highlighted that the unconstrained conversational use of LLMs in psychiatric contexts may inadvertently exacerbate symptoms or undermine the clinician–patient relationship (43). In light of these concerns, future applications should avoid allowing LLMs to provide unsupervised psychiatric counseling. Instead, more restricted formats—such as one-way question-answer screening tools or non-dialogue-based linguistic analyses—could enable the early recognition of at-risk patients while mitigating the psychological risks associated with unrestricted conversational interactions. Thus, although deep-reasoning LLMs show considerable promise in identifying latent psychological distress, their role must be carefully designed with strict ethical oversight.

4.5 Evidence gap: performance vs. clinical benefit

Although deep-reasoning LLMs show promising accuracy and workflow performance in benchmarks, retrospective analyses, and simulated scenarios, direct clinical benefits have not yet been established. Importantly, recent benchmark research in medicine has highlighted a broader shift from knowledge-based testing (where leading models may reach near-saturated performance) to practice-based assessment, revealing a substantial knowledge–practice gap in which high examination scores can be misleading proxies for clinical readiness and safety (44). Accordingly, future work in ophthalmology should prioritize prospective interventional studies and practice-oriented validation that evaluate clinically meaningful endpoints beyond accuracy, such as reductions in diagnostic errors, time-to-decision, unnecessary testing and referrals, and downstream visual outcomes (e.g., postoperative refractive error, visual acuity, complication rates), as well as patient-centered outcomes, ideally within robust human oversight frameworks.

5 Risks and challenges in clinical implementation

Despite these promising developments, deep-reasoning LLMs face several significant challenges. Models such as DeepSeek-R1 and GPT-o1 require considerable computational resources, raising concerns about practical deployment in clinical settings. These intensive computational demands may hinder widespread implementation, especially in resource-limited environments (24, 45). These resource demands are closely related to test-time compute, which operationalizes deliberate "slow thinking" by allocating extra inference steps to improve the reliability of reasoning (46). However, this introduces an intrinsic trade-off in latency and operational cost that may conflict with the immediate analytical access needed by clinicians in time-sensitive practice. In addition, DeepSeek-R1 has been reported to experience intense system congestion during operation, and marked discrepancies in reasoning time have been observed among models, further complicating real-time use. Tolerance for latency varies significantly across ophthalmic workflows (47). Scenarios such as acute triage, real-time decision-making, and intraoperative support demand rapid responses, whereas tasks such as pre-visit planning, report auditing, longitudinal risk stratification, and patient education can tolerate longer inference times. To address this variance, practical strategies include a tiered response mode providing fast preliminary guidance followed by optional deeper reasoning; precomputation and caching of common scenarios; explicit time-budget controls; and strong human-in-the-loop oversight where rapid decisions are required to ensure safety.
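
A tiered-response pattern of this kind can be sketched with a strict time budget; `fast_llm` and `reasoning_llm` are hypothetical placeholders for a low-latency model and a slower deep-reasoning model.

```python
import concurrent.futures as cf

def fast_llm(prompt: str) -> str:
    raise NotImplementedError("low-latency model goes here")

def reasoning_llm(prompt: str) -> str:
    raise NotImplementedError("slower deep-reasoning model goes here")

def tiered_answer(prompt: str, budget_s: float = 2.0) -> str:
    preliminary = fast_llm(prompt)  # always produce quick preliminary guidance
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(reasoning_llm, prompt)
    try:
        return future.result(timeout=budget_s)  # deeper answer if within budget
    except cf.TimeoutError:
        return preliminary  # budget exceeded: fall back to the fast answer
    finally:
        pool.shutdown(wait=False, cancel_futures=True)
```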

Ethical and regulatory concerns also pose substantial barriers to clinical adoption (48, 49). The protection of patient privacy, data security, and compliance with healthcare regulations remain critical priorities. Because ocular images serve as biometric identifiers, the development of privacy-preserving pipelines—and, when appropriate, synthetic ophthalmic media—is essential for secure data sharing and educational purposes. Recent text-to-video systems that convert fundus fluorescein angiography (FFA) reports into dynamic angiography videos have demonstrated realistic retinal findings while preserving privacy for multicenter use (50). Such approaches offer a template for privacy-preserving pipelines in ophthalmology.

Chatbot psychosis and unsupervised psychiatric counseling should be framed as a systemic ethical and patient-safety risk, rather than an isolated edge case (43). The central failure mode arises when a generative model is deployed in an open-ended “counseling” format without clinical assessment, risk stratification, and ongoing human oversight, potentially producing confident but inappropriate guidance that can reinforce delusional beliefs, exacerbate anxiety or dependency, and delay timely access to professional care (51). As a mitigation strategy, clinical deployments should preferentially adopt restricted application formats, constraining LLM use to structured psychoeducation, resource navigation, and visit preparation (e.g., organizing concerns and questions) rather than delivering diagnostic or therapeutic psychiatric advice.

Furthermore, addressing biases related to race, sex, and age is imperative to avoid unfair treatment and diagnostic disparities; hence the need for rigorous bias detection and mitigation strategies (52).

Moreover, the transparency and interpretability of models remain essential for clinician trust and practical use (53, 54). Clinicians require understandable explanations of model reasoning processes to effectively integrate AI-driven insights into clinical workflows. For example, during IOL selection, ophthalmic report interpretation, and the generation of pre- and post-operative patient education materials, clinician oversight remains essential. Although human oversight cannot be eliminated entirely, LLMs may eventually enable a single clinician to supervise multiple model instances simultaneously.

In addition, delegating tasks to LLMs introduces distinct ethical challenges. Recent experimental evidence demonstrates that when users (principals) provide high-level goals or example-based prompts rather than explicit rules, LLMs are more prone to exhibit dishonest or ethically questionable behaviors. Moreover, machine agents tend to comply with unethical requests more completely than human agents. Although the addition of strongly worded prohibitive guardrails at the user level has been shown to reduce this tendency, such measures rarely eliminate it entirely. This highlights the importance of explicit and auditable instruction pathways in clinical deployment to safeguard against unsafe compliance (55).

6 Conclusion

The development of LLMs has evolved significantly, from simple prompt-engineering methods to more structured inference mechanisms that support deep multi-step reasoning. These changes indicate possible applications in complex clinical decision-making and medical diagnostics. However, practical translation into medical environments must proceed cautiously, weighing cost, computational efficiency, and safety. Future research should focus on domain-specific medical adaptation, enhancement of reasoning transparency and interpretability, and the development of a regulatory and governance framework specifying requirements for transparency, auditability, and accountability prior to general clinical use. Importantly, interventional studies will be needed before claims of benefit regarding real-world patient outcomes can extend beyond benchmark or simulated accuracy.

Author contributions

XL: Supervision, Validation, Conceptualization, Writing – review & editing, Visualization. YY: Writing – original draft, Formal analysis, Data curation, Investigation. YR: Data curation, Conceptualization, Writing – original draft, Formal analysis.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Acknowledgments

We sincerely thank Dr. Jin Yang for her encouragement and support throughout the study.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774. (2023).

2. Wang Y, Sun Y, Fu Y, Zhu D, Tian Z. Spectrum-BERT: pretraining of deep bidirectional transformers for spectral classification of Chinese liquors. IEEE Trans Instrum Meas. (2024) 73:1–13. doi: 10.1109/TIM.2024.3374300

3. Kuang JF, Wang JH, Zeng MB, Chen FL, Huang WB. Application prospect of large language model represented by ChatGPT in ophthalmology. Int J Ophthalmol. (2025) 18:1790–6. doi: 10.18240/ijo.2025.09.21

4. Zhang Q, Wang S, Wang X, Xu C, Liang J, Liu Z. Advancing ophthalmology with large language models: applications, challenges, and future directions. Surv Ophthalmol. (2025) 70:1019–28. doi: 10.1016/j.survophthal.2025.02.009

5. Tan JCK. Using a large language model to process biometry reports and select intraocular lens for cataract surgery. J Cataract Refract Surg. (2025) 51:351–2. doi: 10.1097/j.jcrs.0000000000001620

6. Tan JCK. Coherent interpretation of entire visual field test reports using a multimodal large language model (ChatGPT). Vision. (2025) 9. doi: 10.3390/vision9020033

7. Wang X, Wei J, Schuurmans D, Le Q, Chi EH, Zhou D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171. (2022). doi: 10.48550/arXiv.2203.11171

8. Sim SZY, Chen T. Critique of impure reason: unveiling the reasoning behaviour of medical large language models. Elife. (2025) 14. doi: 10.7554/eLife.106187

9. Pan Q, Ji W, Ding Y, Li J, Chen S, Wang J, et al. A survey of slow thinking-based reasoning LLMs using reinforcement learning and test-time scaling law. Inf Process Manag. (2026) 63:104394. doi: 10.1016/j.ipm.2025.104394

10. American Diabetes Association Professional Practice Committee. 12. Retinopathy, neuropathy, and foot care: standards of care in diabetes—2024. Diabetes Care. (2023) 47(Supplement_1):S231–43. doi: 10.2337/dc24-S012

11. Chaikitmongkol V, Sagong M, Lai TYY, Tan GSW, Ngah NF, Ohji M, et al. Treat-and-extend regimens for the management of neovascular age-related macular degeneration and polypoidal choroidal vasculopathy: consensus and recommendations from the Asia-Pacific Vitreo-retina Society. Asia Pac J Ophthalmol. (2021) 10:507–18. doi: 10.1097/APO.0000000000000445

12. Gedde SJ, Vinod K, Wright MM, Muir KW, Lind JT, Chen PP, et al. Primary open-angle glaucoma preferred practice pattern®. Ophthalmology. (2021) 128:P71–150. doi: 10.1016/j.ophtha.2020.10.022

13. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. (2017) 30.

14. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI Blog. (2019) 1:9.

15. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. (2020) 33:1877–901.

16. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, et al. Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst. (2022) 35:24824–37.

17. Wang X, Wei J, Schuurmans D, Le QV, Chi H, Narang S, et al. Self-consistency improves chain of thought reasoning in language models. In: The Eleventh International Conference on Learning Representations (2023).

18. Shojaee P, Mirzadeh I, Alizadeh K, Horton M, Bengio S, Farajtabar M. The illusion of thinking: understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941. (2025).

19. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data. (2021) 8:1–74. doi: 10.1186/s40537-021-00444-8

20. Rajesh AE, Davidson OQ, Lee CS, Lee AY. Artificial intelligence and diabetic retinopathy: AI framework, prospective studies, head-to-head validation, and cost-effectiveness. Diabetes Care. (2023) 46:1728–39. doi: 10.2337/dci23-0032

21. OpenAI. Introducing GPT-5. (2025). Available online at: https://openai.com/index/introducing-gpt-5/ (Accessed September 17, 2025).

22. Wang S, Hu M, Li Q, Safari M, Yang X. Capabilities of GPT-5 on multimodal medical reasoning. arXiv preprint arXiv:2508.08224. (2025).

23. Guo D, Yang D, Zhang H, Song J, Wang P, Zhu Q, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature. (2025) 645:633–8.

24. Guo D, Yang D, Zhang H, Song J, Zhang R, Xu R, et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948. (2025).

25. xAI. Grok 3 Beta — The Age of Reasoning Agents. (2025). Available online at: https://x.ai/news/grok-3 (Accessed September 18, 2025).

26. Caruccio L, Cirillo S, Polese G, Solimando G, Sundaramurthy S, Tortora G. Claude 2.0 large language model: tackling a real-world classification problem with a new iterative prompt engineering approach. Intell Syst Appl. (2024) 21:200336. doi: 10.1016/j.iswa.2024.200336

27. Mihalache A, Huang RS, Popovic MM, Patil NS, Pandya BU, Shor R, et al. Accuracy of an artificial intelligence chatbot's interpretation of clinical ophthalmic images. JAMA Ophthalmol. (2024) 142:321–6. doi: 10.1001/jamaophthalmol.2024.0017

28. Sorin V, Kapelushnik N, Hecht I, Zloto O, Glicksberg BS, Bufman H, et al. Integrated visual and text-based analysis of ophthalmology clinical cases using a large language model. Sci Rep. (2025) 15:4999. doi: 10.1038/s41598-025-88948-8

29. Srinivasan S, Ai X, Zou M, Zou K, Kim H, Lo TWS, et al. Can OpenAI o1 reason well in ophthalmology? A 6,990-question head-to-head evaluation study. arXiv preprint arXiv:2501.13949. (2025).

30. Xu P, Wu Y, Jin K, Chen X, He M, Shi D. DeepSeek-R1 outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in bilingual complex ophthalmology reasoning. arXiv preprint arXiv:2502.17947. (2025). doi: 10.1016/j.aopr.2025.05.001

31. Kaczmarczyk R, Wilhelm TI, Martin R, Roos J. Evaluating multimodal AI in medical diagnostics. NPJ Digit Med. (2024) 7:205. doi: 10.1038/s41746-024-01208-3

32. Semeraro F, Cascella M, Montomoli J, Bellini V, Bignami EG. Comparative analysis of AI tools for disseminating CPR guidelines: implications for cardiac arrest education. Resuscitation. (2025) 208:110528. doi: 10.1016/j.resuscitation.2025.110528

33. Min D, Jin KN, Bang S, Kim MY, Kim HL, Jeong WG, et al. Large language models for CAD-RADS 2.0 extraction from semi-structured coronary CT angiography reports: a multi-institutional study. Korean J Radiol. (2025) 26:817–31. doi: 10.3348/kjr.2025.0293

34. Mehraeen E, Attarian N, Tabari A, SeyedAlinaghi S. Transforming plastic surgery: an innovative role of ChatGPT in plastic surgery practices. Updates Surg. (2025). doi: 10.1007/s13304-025-02149-6. [Epub ahead of print].

35. Kurapati SS, Barnett DJ, Yaghy A, Sabet CJ, Younessi DN, Nguyen D, et al. Eyes on the text: assessing readability of AI & ophthalmologist responses to patient surgery queries. Ophthalmologica. (2025) 1–18. doi: 10.1159/000546049. [Epub ahead of print].

36. Cohen SA, Brant A, Fisher AC, Pershing S, Do D, Pan C. Dr. Google vs. Dr. ChatGPT: exploring the use of artificial intelligence in ophthalmology by comparing the accuracy, safety, and readability of responses to frequently asked patient questions regarding cataracts and cataract surgery. Semin Ophthalmol. (2024) 39:472–9. doi: 10.1080/08820538.2024.2326058

37. Yang Y, Bai L, Ren Y, Lin X. Assessing the quality and educational applicability of AI-generated anterior segment images in ophthalmology. Sci Rep. (2025) 15:42778. doi: 10.1038/s41598-025-27020-x

38. Shi R, Liu S, Xu X, Ye Z, Yang J, Le Q, et al. Benchmarking four large language models' performance of addressing Chinese patients' inquiries about dry eye disease: a two-phase study. Heliyon. (2024) 10:e34391. doi: 10.1016/j.heliyon.2024.e34391

39. Huang Y, Shi R, Chen C, Zhou X, Zhou X, Hong J, et al. Evaluation of large language models for providing educational information in orthokeratology care. Cont Lens Anterior Eye. (2025) 102384. doi: 10.1016/j.clae.2025.102384. [Epub ahead of print].

40. Chen CC, Chen JA, Liang CS, Lin YH. Large language models may struggle to detect culturally embedded filicide-suicide risks. Asian J Psychiatr. (2025) 105:104395. doi: 10.1016/j.ajp.2025.104395

41. Jesus J, Ambrósio J, Meira D, Rodriguez-Uña I, Beirão JM. Blinded by the mind: exploring the hidden psychiatric burden in glaucoma patients. Biomedicines. (2025) 13. doi: 10.3390/biomedicines13010116

42. Global estimates on the number of people blind or visually impaired by glaucoma: a meta-analysis from 2000 to 2020. Eye. (2024) 38:2036–46.

43. Østergaard SD. Will generative artificial intelligence chatbots generate delusions in individuals prone to psychosis? Schizophr Bull. (2023) 49:1418–9. doi: 10.1093/schbul/sbad128

44. Gong EJ, Bang CS, Lee JJ, Baik GH. Knowledge-practice performance gap in clinical large language models: systematic review of 39 benchmarks. J Med Internet Res. (2025) 27:e84120. doi: 10.2196/84120

45. Temsah A, Alhasan K, Altamimi I, Jamal A, Al-Eyadhy A, Malki KH, et al. DeepSeek in healthcare: revealing opportunities and steering challenges of a new open-source artificial intelligence frontier. Cureus. (2025) 17. doi: 10.7759/cureus.79221

46. Yang W, Ma S, Lin Y, Wei F. Towards thinking-optimal scaling of test-time compute for LLM reasoning. arXiv preprint arXiv:2502.18080. (2025).

47. Chen J, Chen CM, Zheng Y, Zhong L. Characteristics of eye-related emergency visits and triage differences by nurses and ophthalmologists: perspective from a single eye center in southern China. Front Med. (2023) 10:1091128. doi: 10.3389/fmed.2023.1241114

48. Murdoch B. Privacy and artificial intelligence: challenges for protecting health information in a new era. BMC Med Ethics. (2021) 22:1–5. doi: 10.1186/s12910-021-00687-3

49. Mennella C, Maniscalco U, De Pietro G, Esposito M. Ethical and regulatory challenges of AI technologies in healthcare: a narrative review. Heliyon. (2024) 10:e26297. doi: 10.1016/j.heliyon.2024.e26297

50. Wu X, Wang L, Chen R, Liu B, Zhang W, Yang X, et al. Generation of fundus fluorescein angiography videos for health care data sharing. JAMA Ophthalmol. (2025) 143:623–32. doi: 10.1001/jamaophthalmol.2025.1419

51. Moore J, Grabb D, Agnew W, Klyman K, Chancellor S, Ong DC, et al. Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers. In: Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. New York, NY: Association for Computing Machinery (2025). p. 599–627. doi: 10.1145/3715275.3732039

52. Colón-Rodríguez C. Shedding Light on Healthcare Algorithmic and Artificial Intelligence Bias. Rockville, MD: US Department of Health and Human Services Office of Minority Health (2023).

53. Marey A, Arjmand P, Alerab ADS, Eslami MJ, Saad AM, Sanchez N, et al. Explainability, transparency and black box challenges of AI in radiology: impact on patient care in cardiovascular radiology. Egypt J Radiol Nucl Med. (2024) 55:183. doi: 10.1186/s43055-024-01356-2

54. Rosenbacke R, Melhus Å, McKee M, Stuckler D. How explainable artificial intelligence can increase or decrease clinicians' trust in AI applications in health care: systematic review. JMIR AI. (2024) 3:e53207. doi: 10.2196/53207

55. Köbis N, Rahwan Z, Rilla R, Supriyatno BI, Bersch C, Ajaj T, et al. Delegation to artificial intelligence can increase dishonest behaviour. Nature. (2025). doi: 10.31219/osf.io/dnjgz_v2. [Epub ahead of print].

Keywords: large language models, artificial intelligence, ophthalmology, clinical decision support, Chatbots, deep reasoning

Citation: Lin X, Yang Y and Ren Y (2026) Making Chatbots more human: deep reasoning large language models in ophthalmology. Front. Med. 12:1741888. doi: 10.3389/fmed.2025.1741888

Received: 07 November 2025; Revised: 07 December 2025;
Accepted: 16 December 2025; Published: 12 January 2026.

Edited by:

Sayan Basu, L V Prasad Eye Institute, India

Reviewed by:

Lu Yuan, Zhejiang University, China

Copyright © 2026 Lin, Yang and Ren. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yuecheng Ren, renyuecheng1998@163.com

†These authors have contributed equally to this work
