ORIGINAL RESEARCH article
Front. Med.
Sec. Gastroenterology
This article is part of the Research Topic: Advancing Gastrointestinal Disease Diagnosis with Interpretable AI and Edge Computing for Enhanced Patient Care
Improving Image-Retrieval Performance of Foundation Models in Gastrointestinal Endoscopic Images
Provisionally accepted
- 1 Chungbuk National University, Cheongju-si, Republic of Korea
- 2 Soonchunhyang University, Asan, Republic of Korea
The quality of gastrointestinal endoscopy is verified by documenting specific required images. However, identifying these images among the numerous photographs captured during a procedure is tedious. Conventional deep-learning approaches to automating this process are often limited by subjective assessments and poor interpretability. We introduce a novel content-based image-retrieval framework that employs a dual-backbone architecture, integrating a general-purpose vision foundation model (DINOv2) and a domain-specific endoscopic model (GastroNet). The system is trained using parameter-efficient metric learning and generates discriminative embeddings for efficient similarity searches. The framework was evaluated on 3,500 public endoscopic images (from the Kvasir and HyperKvasir datasets) and validated on entirely unseen real-world and synthetic data, wherein it achieved state-of-the-art performance (97.71% Recall@1, 99.14% Recall@5, and 96.74% mean average precision). These results are significantly superior to those of single-backbone baseline models. Ablation studies confirm that this improvement is primarily due to the two backbones capturing complementary features. The framework offers an accurate and automated tool for assessing the procedural quality of gastrointestinal endoscopy.
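To make the retrieval metrics in the abstract concrete, the following is a minimal NumPy sketch of how Recall@K and mean average precision are typically computed over embedding vectors. It is not the authors' implementation. The abstract does not state how the two backbones are integrated, so the concatenation in `fuse_embeddings` is an assumption made purely for illustration, and all function names here are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def fuse_embeddings(general_feats, domain_feats):
    """ASSUMPTION: fuse two backbones by concatenating their normalized features.

    The source does not specify the fusion scheme; concatenation is only one
    plausible choice.
    """
    fused = np.concatenate(
        [l2_normalize(general_feats), l2_normalize(domain_feats)], axis=1
    )
    return l2_normalize(fused)


def rank_gallery(query_emb, gallery_emb):
    """Return gallery indices sorted by descending similarity for each query."""
    sims = query_emb @ gallery_emb.T
    return np.argsort(-sims, axis=1, kind="stable")


def recall_at_k(ranking, query_labels, gallery_labels, k):
    """Fraction of queries with at least one same-class image in the top k."""
    topk_labels = gallery_labels[ranking[:, :k]]
    hits = (topk_labels == query_labels[:, None]).any(axis=1)
    return float(hits.mean())


def mean_average_precision(ranking, query_labels, gallery_labels):
    """Mean over queries of the average precision at each relevant rank."""
    ap_values = []
    for order, label in zip(ranking, query_labels):
        relevant = gallery_labels[order] == label
        if not relevant.any():
            continue  # queries with no relevant gallery item are skipped
        ranks = np.flatnonzero(relevant) + 1  # 1-based ranks of relevant items
        precision_at_hits = np.cumsum(relevant)[relevant] / ranks
        ap_values.append(precision_at_hits.mean())
    return float(np.mean(ap_values)) if ap_values else 0.0
```

In this sketch, a query counts as a Recall@1 hit only when its single nearest gallery image shares its class, whereas mean average precision also rewards placing every relevant image near the top of the ranking.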
Keywords: gastrointestinal endoscope, artificial intelligence, deep learning, image retrieval, foundation model
Received: 18 Oct 2025; Accepted: 26 Nov 2025.
Copyright: © 2025 Kangsan, Park, Kim and Hwang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Youngbae Hwang
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
