BRIEF RESEARCH REPORT article
Front. Digit. Health
Sec. Health Informatics
Volume 7 - 2025 | doi: 10.3389/fdgth.2025.1660887
This article is part of the Research Topic: The Digitization of Neurology - Volume II
Performance of Vision-Language Models for Optic Disc Swelling Identification on Fundus Photographs
Provisionally accepted
1 National Healthcare Group Eye Institute, 63709, Singapore, Singapore
2 Stanford University, Stanford, United States
3 University of California Davis, Davis, United States
Introduction: Vision-language models (VLMs) combine image analysis capabilities with large language models (LLMs). Because of these multimodal capabilities, VLMs offer a clinical advantage over image classification models for the diagnosis of optic disc swelling by allowing consideration of clinical context. We compared the performance of non-specialty-trained VLMs, using prompts of varying complexity, in the classification of optic disc swelling on fundus photographs.

Methods: A diagnostic test accuracy study was conducted using an open-source dataset. Five prompts (increasing in context) were used with each of five VLMs (Llama 3.2-vision, LLaVA-Med, LLaVA, GPT-4o, and DeepSeek-4V), resulting in 25 prompt-model pairs. Each pair's performance in classifying photographs with and without optic disc swelling was measured using Youden's index (YI), F1 score, and accuracy.

Results: 779 images of normal optic discs and 295 images of swollen discs were obtained from an open-source image database. Among the 25 prompt-model pairs, valid response rates ranged from 7.8% to 100% (median 93.6%). Diagnostic performance ranged from 0.00 to 0.231 for YI (median 0.042), from 0.00 to 0.716 for F1 score (median 0.401), and from 27.5% to 70.5% for accuracy (median 58.8%). The best-performing prompt-model pair was GPT-4o with role-playing combined with chain-of-thought (CoT) and few-shot prompting. On average, Llama 3.2-vision performed best (average YI across prompts 0.181). There was no consistent relationship between the amount of information given in the prompt and model performance.

Conclusions: Non-specialty-trained VLMs can classify photographs of swollen and normal optic discs better than chance, with performance varying by model. Increasing prompt complexity did not consistently improve performance. Specialty-specific VLMs may be necessary to improve ophthalmic image analysis performance.
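For readers unfamiliar with the reported metrics, the following minimal Python sketch (illustrative only, not the authors' code) shows how Youden's index, F1 score, accuracy, and a valid response rate can be computed from parsed VLM outputs. It assumes binary labels (1 = swollen disc, 0 = normal disc) and that unparseable responses count against the valid response rate but are excluded from the diagnostic metrics; all function names are hypothetical.

    from typing import Optional, Sequence

    def valid_response_rate(responses: Sequence[Optional[int]]) -> float:
        """Fraction of responses that parsed to a 0/1 classification."""
        return sum(r is not None for r in responses) / len(responses)

    def diagnostic_metrics(y_true: Sequence[int], y_pred: Sequence[Optional[int]]):
        """Youden's index, F1 score, and accuracy over valid responses only."""
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        sens = tp / (tp + fn) if tp + fn else 0.0   # sensitivity (recall)
        spec = tn / (tn + fp) if tn + fp else 0.0   # specificity
        prec = tp / (tp + fp) if tp + fp else 0.0   # precision
        youden = sens + spec - 1.0                  # YI = sensitivity + specificity - 1
        f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
        acc = (tp + tn) / len(pairs) if pairs else 0.0
        return youden, f1, acc

    # Example: three valid predictions and one unparseable response.
    truth = [1, 0, 1, 0]
    preds = [1, 0, None, 1]
    print(valid_response_rate(preds))        # 0.75
    print(diagnostic_metrics(truth, preds))  # (0.5, 0.666..., 0.666...)

A chance-level classifier has YI = 0, so the reported YI range of 0.00 to 0.231 corresponds to the conclusion that the models performed no better than, to modestly better than, chance.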
Keywords: vision-language model, optic disc swelling, papilledema, prompt engineering, artificial intelligence, machine learning
Received: 07 Jul 2025; Accepted: 07 Aug 2025.
Copyright: © 2025 Li, Nguyen and Moss. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Heather E Moss, Stanford University, Stanford, United States
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.