ResFormer: An Efficient Transformer Framework for Scalable Semantic Segmentation of Remote Sensing Images

Ou, Yanglin; Wang, Xiqi

doi:10.3389/frsen.2025.1683696

ORIGINAL RESEARCH article

Front. Remote Sens.

Sec. Image Analysis and Classification

This article is part of the Research Topic: Machine Learning for Advanced Remote Sensing: From Theory to Applications and Societal Impact

Provisionally accepted

Yanglin Ou¹,²

Xiqi Wang³*

¹Huizhou Technician Institute, Huizhou, China
²Huazhong University of Science and Technology, Wuhan, China
³Shandong Jianzhu University, Jinan, China

The final, formatted version of the article will be published soon.

Translating machine learning theory into operational remote sensing applications that deliver measurable societal value remains a major challenge. Doing so requires models that are not only accurate but also scalable, reliable, and directly applicable to real-world problems such as climate resilience and sustainable urban development. While Convolutional Neural Networks (CNNs) have been foundational, their limited receptive fields often fail to capture the global context essential for interpreting complex scenes. Vision Transformers, with their global self-attention mechanism, offer a powerful alternative but typically incur prohibitive computational costs, because standard self-attention scales quadratically with the number of tokens. To address these challenges, this paper introduces ResFormer, an architecture designed to bridge the gap between algorithmic innovation and demonstrable public good. Specifically, we propose a novel linear-complexity Transformer block integrated with residual connections, which reduces the computational overhead from quadratic to linear without sacrificing global context modeling. This efficiency enables the processing of high-resolution remote sensing imagery on commodity hardware. On the large-scale UAVid urban-scene dataset, ResFormer achieves a mean Intersection-over-Union (mIoU) of 68.7%, and on the ISPRS Potsdam dataset, it attains 85.9% mIoU. By holistically addressing scalability, reliability, and impact, ResFormer serves as a reproducible exemplar that moves the field toward machine learning systems that generate trustworthy and actionable knowledge for the public good. The implementation will be made publicly available to foster further research and application.
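The abstract does not say how the ResFormer block reaches linear complexity, so the following is only an illustrative sketch of one well-known way to do it: kernelized linear attention, swapped in here as a stand-in and not taken from the paper. The feature map (elu(x) + 1), the function names, and the single-head layout without normalization layers are all assumptions made for the example. It shows how reordering the matrix products avoids building the n × n attention matrix, and how a residual connection wraps the result.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1, a positive feature map (an assumption, not the paper's choice).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v, eps=1e-6):
    # q, k: (n, d); v: (n, d_v). Cost grows linearly with the token count n.
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                 # (d, d_v) summary of all keys and values
    z = q @ k.sum(axis=0)        # (n,) per-query normalizer
    return (q @ kv) / (z[:, None] + eps)


def quadratic_attention(q, k, v, eps=1e-6):
    # Same kernel, but materializes the full (n, n) score matrix.
    q, k = feature_map(q), feature_map(k)
    scores = q @ k.T
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)


def residual_block(x, wq, wk, wv):
    # Hypothetical single-head block: attention output added back to the input.
    return x + linear_attention(x @ wq, x @ wk, x @ wv)


# Small demonstration on random tokens (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
n, d = 64, 8
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = residual_block(x, wq, wk, wv)
same = np.allclose(
    linear_attention(x @ wq, x @ wk, x @ wv),
    quadratic_attention(x @ wq, x @ wk, x @ wv),
)
print(out.shape, same)
```

Both attention functions compute the same values; the linear form only changes the order of multiplication, so memory and time scale with n rather than n². Whether ResFormer uses this particular kernel or a different approximation cannot be determined from the abstract.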

Keywords: image segmentation, UAV, Transformer, scalability, reliability

Received: 11 Aug 2025; Accepted: 08 Oct 2025.

Copyright: © 2025 Ou and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Xiqi Wang, wangxiqi24@sdjzu.edu.cn

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.