ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

Volume 8 - 2025 | doi: 10.3389/frai.2025.1614544

This article is part of the Research Topic: Digital Medicine and Artificial Intelligence

Multi-Scale Feature Fusion and Visual State Space Models for Echocardiogram Segmentation

Provisionally accepted
  • 1Zhuhai College of Science and Technology, Zhuhai, China
  • 2School of Medicine, Jinan University, Guangzhou, Guangdong Province, China
  • 3City University of Macau, Macao, Macao, SAR China

The final, formatted version of the article will be published soon.

Echocardiography plays a vital role in early cardiac disease intervention, enabling accurate and noninvasive assessment of cardiac function. However, echocardiogram segmentation remains challenging due to low image quality, anatomical variability, and the need to model both local structures and global contextual information. Convolutional neural networks (CNNs) are effective at local feature extraction but lack global modeling capability, while Transformer-based models offer stronger context reasoning at the cost of quadratic complexity and large parameter counts. To address these limitations, we propose EchoUMamba, a multi-scale feature fusion network built upon Mamba, a recently introduced state space model that enables global context modeling with linear complexity. By leveraging the structured state-space formulation, EchoUMamba reduces computational cost while maintaining long-range feature representation, offering a principled alternative to conventional self-attention mechanisms. EchoUMamba integrates a multi-scale visual state-space block (EchoVSSB) to facilitate cross-scale feature interaction and global information flow. Residual fusion pathways are employed to enhance low-level feature preservation and gradient propagation. Unlike prior Mamba-based designs trained from scratch, our model utilizes ImageNet pretraining to improve convergence and generalization in medical imaging. Extensive experiments on publicly available datasets validate the effectiveness of our approach. EchoUMamba achieves a precision of 93.75%, recall of 95.28%, Dice score of 94.51%, and IoU of 89.66%. It is also efficient, requiring only 26.3M parameters, 32.7 GFLOPs, and 22.8 ms inference time.
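The linear-complexity claim rests on the discrete state-space recurrence that underlies Mamba: each output depends on a hidden state updated once per step, so cost grows linearly with sequence length rather than quadratically as in self-attention. The sketch below is a minimal, non-selective dense scan for illustration only; the function name `ssm_scan` and the matrix shapes are assumptions for this example, not the paper's EchoVSSB implementation.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal discrete state-space scan (illustrative, not EchoVSSB):
        h_t = A @ h_{t-1} + B @ x_t
        y_t = C @ h_t
    One state update per step gives O(T) cost in sequence length T,
    versus the O(T^2) pairwise interactions of self-attention."""
    T, _ = x.shape
    d_state = A.shape[0]
    h = np.zeros(d_state)                  # hidden state, carried across steps
    ys = np.empty((T, C.shape[0]))
    for t in range(T):
        h = A @ h + B @ x[t]               # state update (linear recurrence)
        ys[t] = C @ h                      # readout
    return ys
```

In Mamba proper, A, B, and C are input-dependent ("selective") and the scan is computed with a parallel hardware-aware kernel, but the asymptotic cost is the same linear recurrence shown here.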

Keywords: echocardiogram segmentation, Mamba, Multi-scale feature fusion, Pre-Trained Model, Computational efficiency

Received: 19 Apr 2025; Accepted: 29 May 2025.

Copyright: © 2025 Luo, Wang, Liao, Lv and Zhu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Xiaolin Zhu, Zhuhai College of Science and Technology, Zhuhai, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.