
BRIEF RESEARCH REPORT article

Front. Digit. Health

Sec. Health Informatics

Volume 7 - 2025 | doi: 10.3389/fdgth.2025.1660887

This article is part of the Research Topic: The Digitization of Neurology - Volume II

Performance of Vision-Language Models for Optic Disc Swelling Identification on Fundus Photographs

Provisionally accepted
  • 1National Healthcare Group Eye Institute, 63709, Singapore, Singapore
  • 2Stanford University, Stanford, United States
  • 3University of California Davis, Davis, United States

The final, formatted version of the article will be published soon.

Introduction: Vision-language models (VLMs) combine image-analysis capabilities with large language models (LLMs). Because of their multimodal capabilities, VLMs offer a clinical advantage over image-classification models for the diagnosis of optic disc swelling, as they allow consideration of clinical context. We compared the performance of non-specialty-trained VLMs, using prompts of varying detail, in the classification of optic disc swelling on fundus photographs.

Methods: A diagnostic test accuracy study was conducted using an open-source dataset. Five prompts of increasing contextual detail were used with each of five VLMs (Llama 3.2-vision, LLaVA-Med, LLaVA, GPT-4o, and DeepSeek-4V), yielding 25 prompt-model pairs. Each pair's performance in classifying photographs with and without optic disc swelling was measured using Youden's index (YI), F1 score, and accuracy.

Results: In total, 779 images of normal optic discs and 295 images of swollen discs were obtained from an open-source image database. Among the 25 prompt-model pairs, valid response rates ranged from 7.8% to 100% (median 93.6%). Diagnostic performance ranged from 0.00 to 0.231 for YI (median 0.042), 0.00 to 0.716 for F1 score (median 0.401), and 27.5% to 70.5% for accuracy (median 58.8%). The best-performing prompt-model pair was GPT-4o with role-playing combined with Chain-of-Thought (CoT) and few-shot prompting. On average, Llama 3.2-vision performed best (average YI across prompts 0.181). There was no consistent relationship between the amount of information given in the prompt and model performance.

Conclusions: Non-specialty-trained VLMs can classify photographs of swollen and normal optic discs better than chance, with performance varying by model. Increasing prompt complexity did not consistently improve performance. Specialty-specific VLMs may be necessary to improve ophthalmic image-analysis performance.
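For readers unfamiliar with the reported metrics, the sketch below shows how Youden's index (sensitivity + specificity - 1), F1 score, and accuracy are derived from a binary confusion matrix. The counts are illustrative placeholders, not values from the study.

```python
# Minimal sketch of the reported evaluation metrics, computed from a
# 2x2 confusion matrix. All counts below are hypothetical placeholders.

tp, fn = 120, 175   # swollen discs classified correctly / incorrectly (hypothetical)
tn, fp = 600, 179   # normal discs classified correctly / incorrectly (hypothetical)

sensitivity = tp / (tp + fn)              # true positive rate
specificity = tn / (tn + fp)              # true negative rate
youden_index = sensitivity + specificity - 1

precision = tp / (tp + fp)
f1_score = 2 * precision * sensitivity / (precision + sensitivity)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"YI={youden_index:.3f}  F1={f1_score:.3f}  accuracy={accuracy:.1%}")
```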
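The exact prompts used in the study are not reproduced on this page. The sketch below only illustrates the general pattern of the best-performing strategy (role-playing combined with CoT and few-shot prompting) applied to a fundus photograph, using the documented GPT-4o multimodal chat format of the OpenAI Python client; the prompt wording and file name are hypothetical.

```python
# Illustrative role-playing + Chain-of-Thought + few-shot prompt for a VLM.
# The wording is hypothetical and is NOT the prompt used in the study.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "You are an experienced neuro-ophthalmologist reviewing fundus "       # role-playing
    "photographs. Example 1: blurred disc margins with vessel obscuration "  # few-shot
    "indicate swelling -> SWOLLEN. Example 2: crisp margins and a visible "
    "cup indicate a normal disc -> NORMAL. Think step by step about the "  # Chain-of-Thought
    "disc margins, cup, and vessels, then answer SWOLLEN or NORMAL."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{encode_image('fundus.jpg')}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```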

Keywords: vision-language model, disc swelling, papilledema, prompt engineering, artificial intelligence, machine learning

Received: 07 Jul 2025; Accepted: 07 Aug 2025.

Copyright: © 2025 Li, Nguyen and Moss. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Heather E Moss, Stanford University, Stanford, United States

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.