ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

Volume 8 - 2025 | doi: 10.3389/frai.2025.1557508

This article is part of the Research Topic: Data Science and Digital Health Technologies for Personalized Healthcare

VMDU-Net: A Dual Encoder Network with Transformer and Mamba Fusion for Enhanced Long-Distance Dependency in Polyp Segmentation

Provisionally accepted
  • 1 Asia Pacific University of Technology & Innovation, Kuala Lumpur, Malaysia
  • 2 Gansu Provincial Tumor Hospital, Gan Su WuWei, China

The final, formatted version of the article will be published soon.

Rectal cancer typically manifests as polyps. Early screening and timely removal of polyps can effectively prevent colorectal cancer and help halt its progression to malignancy. Although polyp segmentation algorithms play a key role in polyp removal, accurate segmentation remains challenging because polyps vary in shape and size and often have indistinct boundaries. These algorithms also need to capture long-range dependencies, but current polyp segmentation algorithms often struggle to converge when attempting this, which limits their practical use. To address these issues, this study proposes VMDU-Net, a dual-encoder fusion network that combines Transformer and Mamba. In this model, one encoder integrates the Vision Mamba component, while the other employs the designed Cross-Shaped Transformer. Combining the Mamba structure with the Cross-Shaped Transformer enhances the network's ability to extract semantic information about polyp shapes and boundaries. Additionally, to promote dual-encoder fusion, we design a feature fusion module named Mamba-Transformer-Merge (MTM), which performs attention-weighted fusion along both the spatial and channel dimensions, fully leveraging the advantages of both Transformer and Mamba features. To address potential convergence issues during model training, this study employs depthwise separable convolutions for multiscale feature extraction and accelerates convergence using the inductive bias of convolution. Experiments were conducted on five widely used polyp datasets, and the results demonstrated outstanding performance in segmentation accuracy and edge detail preservation. Notably, our method achieved a Dice score of 0.934 on the Kvasir-SEG dataset and 0.951 on the CVC-ClinicDB dataset, surpassing existing state-of-the-art algorithms.
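
For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of the standard definition, Dice = 2|P ∩ G| / (|P| + |G|), for a predicted binary mask P and a ground-truth mask G. The function name `dice_score` and the smoothing term `eps` are illustrative assumptions; the abstract does not describe the authors' exact evaluation protocol (for example, thresholding or per-image averaging), so this should not be read as their implementation.

```python
from typing import Sequence


def dice_score(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    eps: float = 1e-7,
) -> float:
    """Dice coefficient between two binary masks (values 0 or 1) of equal shape."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    # Pixels that are foreground in both masks.
    intersection = sum(
        p & t
        for p_row, t_row in zip(pred, target)
        for p, t in zip(p_row, t_row)
    )
    pred_area = sum(sum(row) for row in pred)
    target_area = sum(sum(row) for row in target)

    # eps is an assumed smoothing term: it keeps the ratio defined
    # (and equal to 1.0) when both masks are empty.
    return (2 * intersection + eps) / (pred_area + target_area + eps)


# Toy 2x3 masks: 2 overlapping pixels, 3 foreground pixels in each mask.
pred_mask = [[0, 1, 1], [0, 1, 0]]
true_mask = [[0, 1, 0], [0, 1, 1]]
score = dice_score(pred_mask, true_mask)  # approximately 2*2 / (3+3)
```

A score of 1.0 means the predicted and ground-truth masks coincide exactly, and 0.0 means they do not overlap at all.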

Keywords: polyp segmentation, Mamba, Transformer, feature fusion, medical image segmentation

Received: 08 Jan 2025; Accepted: 16 May 2025.

Copyright: © 2025 Li, Ding and Lim. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Peng Li, Asia Pacific University of Technology & Innovation, Kuala Lumpur, Malaysia

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.