AUTHOR=Triyono Liliek, Gernowo Rahmat, Prayitno

TITLE=MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention

JOURNAL=Frontiers in Computer Science

VOLUME=Volume 7 - 2025

YEAR=2025

URL=https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2025.1510252

DOI=10.3389/fcomp.2025.1510252

ISSN=2624-9898

ABSTRACT=Aruco markers are crucial for navigation in complex indoor environments, especially for people with visual impairments. Traditional CNNs handle image segmentation well, while transformers excel at capturing long-range dependencies, which are essential for machine vision tasks. Our study introduces MoNetViT (Mini-MobileNet MobileViT), a lightweight model that combines CNNs and MobileViT in a dual-path encoder to capture both global and spatial image details. This design reduces complexity and boosts segmentation performance. The addition of a multi-query attention (MQA) module enhances multi-scale feature integration, allowing end-to-end learning guided by ground truth. Experiments show that MoNetViT outperforms other semantic segmentation algorithms in efficiency and effectiveness, particularly in detecting Aruco markers, making it a promising tool for improving navigation aids for the visually impaired.
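
The abstract names a multi-query attention (MQA) module but does not specify its layout. As a rough illustration only, the sketch below assumes the common formulation in which several query heads share a single key/value projection, which is what keeps the parameter and memory cost low. The function name, weight shapes, and token dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_query_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Illustrative MQA (assumed formulation, not the paper's exact module).

    x:   (n, d) token features, e.g. a flattened feature map
    w_q: (d, d) query projection, split into num_heads heads
    w_k: (d, d_head) single key projection shared by all heads
    w_v: (d, d_head) single value projection shared by all heads
    w_o: (d, d) output projection
    """
    n, d = x.shape
    d_head = d // num_heads
    # Per-head queries: (num_heads, n, d_head)
    q = (x @ w_q).reshape(n, num_heads, d_head).transpose(1, 0, 2)
    # One key and one value tensor reused by every head: (n, d_head)
    k = x @ w_k
    v = x @ w_v
    # Scaled dot-product scores, broadcast over heads: (num_heads, n, n)
    scores = q @ k.T / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    # Weighted sum of the shared values: (num_heads, n, d_head)
    heads = weights @ v
    # Concatenate heads back to (n, d) and project.
    out = heads.transpose(1, 0, 2).reshape(n, d) @ w_o
    return out, weights


# Hypothetical sizes: a 4x4 feature map flattened to 16 tokens of width 32.
rng = np.random.default_rng(0)
n_tokens, dim, n_heads = 16, 32, 4
tokens = rng.standard_normal((n_tokens, dim))
w_q = rng.standard_normal((dim, dim)) * 0.1
w_k = rng.standard_normal((dim, dim // n_heads)) * 0.1
w_v = rng.standard_normal((dim, dim // n_heads)) * 0.1
w_o = rng.standard_normal((dim, dim)) * 0.1

out, weights = multi_query_attention(tokens, w_q, w_k, w_v, w_o, n_heads)
print(out.shape, weights.shape)
```

Compared with standard multi-head attention, the key and value projections here are `num_heads` times smaller, which is one plausible reason such a module suits a lightweight model.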