EDITORIAL article

Front. Digit. Health

Sec. Ethical Digital Health

Editorial: Ethical Considerations of Large Language Models: Challenges and Best Practices

  • 1. University Health Network (UHN), Toronto, Canada

  • 2. University of Toronto Dalla Lana School of Public Health, Toronto, Canada

  • 3. IQVIA Applied AI Science, Ottawa, Canada

  • 4. The American College of Greece, Athens, Greece

Abstract

Responsibility for LLM-assisted decisions that lead to adverse outcomes remains unresolved, representing a critical gap for wide-scale adoption. Fareed et al.'s systematic review highlights gaps in how accountability is addressed, proposing that regulatory safeguards, technical controls, human oversight, and transparency and accountability mechanisms are necessary for clinical integration (Fareed et al., 2025). Qi and Pan extend this by examining general-purpose LLMs in evidence-based medicine tasks, highlighting risks including disembodiment (separation from clinical context), deinstitutionalization (bypass of review processes), and depragmatization (loss of clinical judgment) (Qi & Pan, 2026). LLMs may also exhibit other limitations, including numeric errors and unverifiable citations, reinforcing the need for auditable, reviewable workflows (Ji et al., 2023).

LLMs can greatly increase access in resource-constrained environments. A developing-nation perspective on medical education shows that LLMs can improve access and learning but introduce risks including plagiarism and misinformation, underscoring the need for clear AI use policies, authorship rules, and training for faculty and students (Jaleel et al., 2025). Tung et al.'s survey further confirms that fragmented governance requires multi-layered socio-technical frameworks integrating technical fixes with robust oversight and legal guidelines (Tung et al., 2025).

Bias and fairness concerns appear in 26% of the studies examined by Fareed et al., while Chan and Kwek demonstrate that LLMs assigned higher cardiovascular risk to men and to Black or South Asian patients. Notably, race-based decisions remained stable across contexts while sex-based judgments varied, suggesting deeply embedded biases (Chan & Kwek, 2025; Fareed et al., 2025). The same study also revealed inconsistent citations, hallucinations, and systematic omission of social determinants of health as related risks, echoing broader evidence that systems trained on historical data can perpetuate existing health disparities (Obermeyer et al., 2019).

At the infrastructure layer, biobanking-related work focuses on size, site, access, and speed: prioritizing quality over volume, recognizing biobanks as socio-technical 'boundary objects,' coupling FAIR principles with fairness and data sovereignty, and maintaining human oversight as AI accelerates workflows (Mayrhofer, 2025). While data networks build critical mass and increase adoption, scale carries strategic and political power that can exacerbate inequities without careful governance. Jaleel et al. emphasize that the digital divide in developing nations creates unequal access to LLMs (Jaleel et al., 2025; Wiens et al., 2019).

Privacy vulnerabilities demand technical and governance solutions. DP-CARE's framework performs differentially private, classifier-only training atop a frozen domain encoder, formally bounding privacy loss while favoring recall where missed positives are costlier, at a modest compute overhead, demonstrating the feasibility of privacy-preserving training in sensitive mental health applications (Karpontinis & Soufleri, 2025). The mathematical foundations of differential privacy, which bound the influence of any individual training record, provide the formal guarantee underlying such approaches (Abadi et al., 2016). Mayrhofer complements this with infrastructure-level privacy frameworks balancing AI advancement with data sovereignty (Mayrhofer, 2025). Tung et al. identify privacy as one of four major risks requiring multi-layered technical, procedural, and security solutions (Tung et al., 2025).
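As a brief illustration of the formal guarantee referenced above (the standard (ε, δ)-differential privacy definition rather than DP-CARE's specific formulation), a randomized training mechanism M is (ε, δ)-differentially private if, for any two datasets D and D′ differing in a single record and any set of outputs S:

\[
\Pr\bigl[\mathcal{M}(D) \in S\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\mathcal{M}(D') \in S\bigr] + \delta
\]

Smaller values of ε and δ mean that the presence or absence of any single patient record has a provably limited effect on what the trained classifier can reveal, which is what makes the privacy loss in such approaches auditable rather than merely asserted (Abadi et al., 2016).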
Several studies tested different LLMs across a range of use cases and conditions. Nantakeeratipat used an ambiguity-probe audit (structured clinical vignettes with clear-cut and intentionally ambiguous cases) to show that apparent errors arise from different failure modes, distinguishing bias from diagnostic boundary instability, a crucial difference for mitigation strategies (dataset diversification vs. edge-case calibration). Models exhibit "model-specific ethical fingerprints," requiring ambiguity-sensitive evaluation and periodic re-audits (Nantakeeratipat, 2026). Complementing this, HEAL-Summ illustrates multi-dimensional evaluation for health communication summarization, assessing outputs across semantic consistency, readability, lexical diversity, emotional alignment, and toxicity. Notably, this evaluation approach is paired with low-resource deployment, supporting scalable health communication while flagging different kinds of potential harms (Fisher et al., 2025). A comprehensive bias taxonomy such as the one described by Mehrabi et al. provides conceptual grounding for such multi-dimensional evaluation efforts (Mehrabi et al., 2022).

Matching model capabilities to contexts and resources emerges as critical. Fisher et al. demonstrate that smaller, specialized models can provide effective, ethical communication without large-scale infrastructure (Fisher et al., 2025). A companion engineering-focused review underscores that automation gains require validation, standards, and human-in-the-loop safeguards in safety-critical contexts (Nguyen & Kittur, 2025). Jaleel et al. argue that benefits in resource-constrained environments must be balanced against integrity risks (Jaleel et al., 2025). That deployment context, not capability alone, determines whether AI safely serves the population of interest is reinforced in the broader literature, such as Rajpurkar et al.'s perspective on AI in health and medicine (Rajpurkar et al., 2022).

Informed by the papers above, three key insights emerge. First, Nantakeeratipat's distinction between diagnostic boundary instability and bias suggests that a multi-method approach is essential for resolving inaccuracies in LLM deployment; for example, bias requires dataset diversification while instability requires improved training on edge cases (Nantakeeratipat, 2026) (Fisher et al., 2025; Jaleel et al., 2025).

LLMs are now present across health and education. Used well, they can improve access, quality, and efficiency; used poorly, they can amplify inequities and erode trust. The contributions here point to a practical charter for implementation: institutionalize governance and auditability; embed equity, fairness, and integrity; perform multi-method evaluation beyond accuracy; design for privacy and defensibility; and implement right-sized deployment with experts-in-the-loop.
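To make "multi-method evaluation beyond accuracy" concrete, the sketch below scores a generated health summary on several dimensions at once. The function names and the crude proxies used here (type-token ratio for lexical diversity, a Flesch-style readability estimate, word overlap for semantic consistency, and a keyword flag as a stand-in for a safety check) are our own illustrative simplifications, not the metrics of HEAL-Summ or any contributed paper; production systems would use validated readability formulas, embedding-based consistency checks, and dedicated toxicity classifiers.

```python
import re
from dataclasses import dataclass

@dataclass
class SummaryScores:
    lexical_diversity: float  # type-token ratio (0-1)
    readability: float        # rough Flesch reading-ease estimate
    consistency: float        # fraction of summary words grounded in the source (0-1)
    flagged_terms: list       # crude stand-in for a toxicity/safety check

def _words(text: str) -> list:
    return re.findall(r"[a-zA-Z']+", text.lower())

def _syllables(word: str) -> int:
    # Very rough syllable count: runs of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def evaluate_summary(source: str, summary: str,
                     blocklist=("guaranteed cure", "no side effects")) -> SummaryScores:
    """Score one summary on several dimensions at once (illustrative only)."""
    src_words, sum_words = _words(source), _words(summary)
    sentences = max(1, len(re.findall(r"[.!?]+", summary)))

    # Lexical diversity: unique words divided by total words.
    diversity = len(set(sum_words)) / max(1, len(sum_words))

    # Flesch-style readability estimate (higher = easier to read).
    words_per_sentence = len(sum_words) / sentences
    syllables_per_word = sum(_syllables(w) for w in sum_words) / max(1, len(sum_words))
    readability = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

    # Crude consistency proxy: how much of the summary vocabulary appears in the source.
    consistency = len(set(sum_words) & set(src_words)) / max(1, len(set(sum_words)))

    # Crude safety proxy: flag phrases a clinician reviewer should inspect.
    flagged = [p for p in blocklist if p in summary.lower()]

    return SummaryScores(diversity, readability, consistency, flagged)

if __name__ == "__main__":
    note = "The patient has stage 2 hypertension and was started on lisinopril."
    summary = "You have high blood pressure. Your doctor started a medicine called lisinopril."
    print(evaluate_summary(note, summary))
```

In such a setup, a low consistency score or any flagged phrase would route the output to human review rather than automatic release, in line with the experts-in-the-loop principle above.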

Keywords

access, artificial intelligence, bias, equity, ethics, governance, large language models

Received

09 February 2026

Accepted

20 February 2026

Copyright

© 2026 Velmovitsky, Arbuckle and Papadopoulou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Pedro Elkind Velmovitsky; Luk Arbuckle; Paraskevi Papadopoulou

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
