ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Medicine and Public Health
Volume 8 - 2025 | doi: 10.3389/frai.2025.1557508
This article is part of the Research Topic "Data Science and Digital Health Technologies for Personalized Healthcare".
VMDU-Net: A Dual Encoder Network with Transformer and Mamba Fusion for Enhanced Long-Distance Dependency in Polyp Segmentation
Provisionally accepted
1 Asia Pacific University of Technology & Innovation, Kuala Lumpur, Malaysia
2 Gansu Provincial Tumor Hospital, Wuwei, Gansu, China
Colorectal cancer typically develops from polyps; early screening and timely removal of polyps can effectively prevent colorectal cancer and halt its progression to malignancy. Although polyp segmentation algorithms play a key role in polyp removal, accurate segmentation remains challenging because polyps vary widely in shape and size and often have indistinct boundaries. Moreover, segmentation models must capture long-range dependencies, yet current polyp segmentation algorithms often struggle to converge when doing so, which limits their practical application. To address these issues, this study proposes VMDU-Net, a dual-encoder fusion network that combines a Transformer and Mamba. One encoder integrates the Vision Mamba component, while the other employs the proposed Cross-Shaped Transformer; combining the Mamba structure with the Cross-Shaped Transformer strengthens the network's ability to extract semantic information about polyp shapes and boundaries. To promote fusion between the two encoders, we design a feature fusion module named Mamba-Transformer-Merge (MTM), which performs attention-weighted fusion along both the spatial and channel dimensions, fully exploiting the complementary strengths of Transformer and Mamba features. To mitigate potential convergence issues during training, the model employs depthwise separable convolutions for multiscale feature extraction, using the inductive bias of convolution to accelerate convergence. Experiments on five widely used polyp datasets demonstrate outstanding segmentation accuracy and edge-detail preservation. Notably, our method achieves a Dice score of 0.934 on the Kvasir-SEG dataset and 0.951 on the CVC-ClinicDB dataset, surpassing existing state-of-the-art algorithms.
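The abstract describes MTM as an attention-weighted fusion of the two encoder branches along the spatial and channel dimensions. The sketch below is a minimal, framework-agnostic NumPy illustration of that general idea, assuming one branch produces Transformer features and the other Mamba features of the same shape; the gate functions, pooling choices, and combination rule here are hypothetical and do not reproduce the paper's actual MTM module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mtm_fuse(f_trans, f_mamba):
    """Illustrative attention-weighted fusion of two feature maps
    (shape: C x H x W) along channel and spatial dimensions.
    Hypothetical sketch; the paper's MTM design may differ."""
    f = f_trans + f_mamba                           # coarse merge
    # Channel attention: global average pool -> one gate per channel.
    ch_gate = sigmoid(f.mean(axis=(1, 2)))[:, None, None]
    # Spatial attention: channel-wise mean -> one gate per pixel.
    sp_gate = sigmoid(f.mean(axis=0))[None, :, :]
    # Re-weight each branch with its gate and combine.
    return ch_gate * f_trans + sp_gate * f_mamba

rng = np.random.default_rng(0)
ft = rng.standard_normal((4, 8, 8))   # stand-in Transformer features
fm = rng.standard_normal((4, 8, 8))   # stand-in Mamba features
out = mtm_fuse(ft, fm)
print(out.shape)  # (4, 8, 8)
```

In a real implementation the gates would be learned (e.g. small convolutions or MLPs rather than plain sigmoids over pooled statistics), but the broadcasting pattern of per-channel and per-pixel weights is the same.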
Keywords: polyp segmentation, Mamba, Transformer, feature fusion, medical image segmentation
Received: 08 Jan 2025; Accepted: 16 May 2025.
Copyright: © 2025 Li, Ding and Lim. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Peng Li, Asia Pacific University of Technology & Innovation, Kuala Lumpur, Malaysia
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.