Improving Image-Retrieval Performance of Foundation Models in Gastrointestinal Endoscopic Images

Kangsan, Kim; Park, Junseok; Kim, Sang Hyun; Hwang, Youngbae

doi:10.3389/fmed.2025.1727884

Original Research Article

Front. Med.

Sec. Gastroenterology

This article is part of the Research Topic: Advancing Gastrointestinal Disease Diagnosis with Interpretable AI and Edge Computing for Enhanced Patient Care


Provisionally accepted

Kim Kangsan¹

Junseok Park²

Sang Hyun Kim²

Youngbae Hwang¹*

¹Chungbuk National University, Cheongju-si, Republic of Korea
²Soonchunhyang University, Asan, Republic of Korea

The final, formatted version of the article will be published soon.

The quality of gastrointestinal endoscopy is verified by documenting specific required images. However, identifying these images among the numerous photographs captured during a procedure is tedious. Conventional deep-learning approaches to automating this process are often limited by subjective assessments and poor interpretability. We introduce a novel content-based image-retrieval framework with a dual-backbone architecture that integrates a general-purpose vision foundation model (DINOv2) and a domain-specific endoscopic model (GastroNet). The system is trained with parameter-efficient metric learning and generates discriminative embeddings for efficient similarity search. The framework was evaluated on 3,500 public endoscopic images (from the Kvasir and HyperKvasir datasets) and validated on entirely unseen real-world and synthetic data, achieving state-of-the-art performance (97.71% Recall@1, 99.14% Recall@5, and 96.74% mean average precision). These results are significantly superior to those of single-backbone baseline models. Ablation studies confirm that this improvement is primarily due to the two backbones capturing complementary features. This framework offers an accurate, automated tool for assessing the procedural quality of gastrointestinal endoscopy.
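The abstract outlines a retrieval pipeline: two backbones produce embeddings, images are matched by similarity search, and quality is reported as Recall@K and mean average precision (mAP). The sketch below illustrates that pipeline under stated assumptions, not as the authors' implementation. The fusion rule (per-backbone L2 normalization, concatenation, renormalization), cosine similarity as the search metric, and the Recall@K definition (a query counts as a hit if any of its top K gallery images shares its class label) are all assumptions. The helper names are hypothetical, synthetic random vectors stand in for DINOv2 and GastroNet features, and the metric-learning training step is omitted.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def fuse_embeddings(general: np.ndarray, domain: np.ndarray) -> np.ndarray:
    """Hypothetical fusion: normalize each backbone's features, concatenate,
    then renormalize. The paper's actual fusion scheme may differ."""
    fused = np.concatenate([l2_normalize(general), l2_normalize(domain)], axis=1)
    return l2_normalize(fused)


def rank_gallery(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted by descending cosine similarity per query."""
    sims = query @ gallery.T  # rows are unit vectors, so dot product = cosine
    return np.argsort(-sims, axis=1, kind="stable")


def recall_at_k(ranking: np.ndarray, q_labels: np.ndarray,
                g_labels: np.ndarray, k: int) -> float:
    """Fraction of queries with at least one same-label image in the top k."""
    top_k_labels = g_labels[ranking[:, :k]]
    hits = (top_k_labels == q_labels[:, None]).any(axis=1)
    return float(hits.mean())


def mean_average_precision(ranking: np.ndarray, q_labels: np.ndarray,
                           g_labels: np.ndarray) -> float:
    """Mean over queries of average precision across the full ranked list."""
    aps = []
    for i, order in enumerate(ranking):
        relevant = (g_labels[order] == q_labels[i]).astype(float)
        if relevant.sum() == 0:
            continue  # assumption: queries with no relevant gallery image are skipped
        precision_at_rank = np.cumsum(relevant) / np.arange(1, relevant.size + 1)
        aps.append(float((precision_at_rank * relevant).sum() / relevant.sum()))
    return float(np.mean(aps)) if aps else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_classes, per_class = 4, 5
    labels = np.repeat(np.arange(n_classes), per_class)
    # Synthetic stand-ins for the two backbones (dimensions are arbitrary):
    # class centers plus noise, so same-class images cluster together.
    general = rng.normal(size=(n_classes, 64))[labels] + 0.3 * rng.normal(size=(labels.size, 64))
    domain = rng.normal(size=(n_classes, 32))[labels] + 0.3 * rng.normal(size=(labels.size, 32))
    emb = fuse_embeddings(general, domain)

    # First image of each class acts as the query; the rest form the gallery.
    is_query = np.arange(labels.size) % per_class == 0
    ranking = rank_gallery(emb[is_query], emb[~is_query])
    q_lab, g_lab = labels[is_query], labels[~is_query]
    print("Recall@1:", recall_at_k(ranking, q_lab, g_lab, 1))
    print("Recall@5:", recall_at_k(ranking, q_lab, g_lab, 5))
    print("mAP:", mean_average_precision(ranking, q_lab, g_lab))
```

Recall@K only asks whether a correct match appears near the top, whereas mAP rewards placing every same-class image early in the ranking, which is why the two metrics are usually reported together.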

Keywords: gastrointestinal endoscope, artificial intelligence, deep learning, image retrieval, foundation model

Received: 18 Oct 2025; Accepted: 26 Nov 2025.

Copyright: © 2025 Kangsan, Park, Kim and Hwang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Youngbae Hwang

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.