ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Medicine and Public Health
Volume 8 - 2025 | doi: 10.3389/frai.2025.1557920
Swedish Medical LLM Benchmark (SMLB): Development and Evaluation of a Framework for Assessing Large Language Models in the Swedish Medical Domain
Provisionally accepted
- 1Division of Speech Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
- 2ETH Zürich, Zurich, Switzerland
We present the Swedish Medical LLM Benchmark (SMLB), an evaluation framework for assessing Large Language Models (LLMs) in the Swedish medical domain. The SMLB addresses the lack of language-specific, clinically relevant benchmarks by incorporating four datasets: translated PubMedQA questions, Swedish Medical Exams, Emergency Medicine scenarios, and General Medicine cases. Our evaluation of 18 state-of-the-art LLMs identifies GPT-4-turbo, Claude-3.5 (October 2023), and the o3 model as top performers, demonstrating strong alignment between medical reasoning and general language understanding capabilities. Hybrid systems incorporating retrieval-augmented generation (RAG) improved accuracy on clinical knowledge questions, highlighting promising directions for safe implementation. Beyond serving as an evaluation tool, the SMLB reveals fundamental insights into LLM capabilities and limitations in Swedish healthcare applications, including significant performance variations between models. By open-sourcing the benchmark, we enable transparent assessment of medical LLMs while promoting responsible development through community-driven refinement. This work underscores the critical need for rigorous evaluation frameworks as LLMs become increasingly integrated into clinical workflows, particularly for non-English medical contexts where linguistic and cultural specificity are paramount.
Keywords: large language model (LLM), Swedish Medical LLM Benchmark, Healthcare AI Evaluation, Emergency Medicine, General Medicine, Medical knowledge, MCQ, Open Source

Abbreviations: AI, Artificial Intelligence; EM, Emergency Medicine (benchmark component); GM, General Medicine (benchmark component); LLM, Large Language Model; MCQ, Multiple-Choice Question; MMLU, Massive Multitask Language Understanding
Received: 09 Jan 2025; Accepted: 19 Jun 2025.
Copyright: © 2025 Moell, Beskow and Farenstam. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Birger Moell, Division of Speech Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.