- 1School of Ecological and Environmental Engineering, Qinghai University, Xining, China
- 2School of Computer Technology and Application, Qinghai University, Xining, China
- 3Information Technology Center, Qinghai University, Xining, China
Remote sensing image super-resolution (SR) is vital for urban planning, precision agriculture, and environmental monitoring, yet existing methods have limitations: CNNs with restricted receptive fields cause edge blurring, Transformers with quadratic attention complexity are prohibitively expensive for edge deployment, and sequential state-space scanning loses diagonal context. We propose REW-KVA, an efficient architecture that couples residual-enhanced wavelet decomposition, linear attention with key-value adaptation, and a quad-directional scanning strategy, achieving state-of-the-art reconstruction quality with 20% faster training, 35% faster inference, and 35% fewer memory reads than competing methods.
1 Introduction
Remote sensing imagery (Platel et al., 2025) enables critical capabilities across urban planning (Mathieu et al., 2007; Muhmad Kamarulzaman et al., 2023), precision agriculture (Kumar et al., 2022; de França e Silva et al., 2024), and environmental monitoring (Chen et al., 2021; Li J. et al., 2020; Lockhart et al., 2025). However, significant practical challenges impede its full potential. A substantial portion of satellite-collected data is underutilized; for instance, satellites often capture data only during limited time windows each day that meet ground needs, leaving much of the data considered invalid “waste footage” due to issues like cloud occlusion and atmospheric scattering. In agricultural monitoring, obtaining timely and objective information over large areas is difficult with ground surveys alone; yet, the spatial resolution of satellite data can be coarse in regions with fragmented farmland. In marine environments, high-resolution satellite image data comes at a significant cost, and models for deriving information, such as water depth in shallow optical zones, face constraints such as unclear maximum detection depths and error propagation when applied on a global scale. These challenges highlight a critical gap: the disparity between the vast amount of data collected and the ability to extract sufficiently detailed, accurate, and actionable information for precision applications, such as vegetation health mapping and infrastructure inspection. Super-resolution (SR) techniques are thus vital for enhancing spatial detail without physical hardware upgrades (Song et al., 2025; Zhang et al., 2025; Chung et al., 2023; Luo et al., 2023), directly addressing these practical barriers of data underutilization and cost-effective quality enhancement.
Remote sensing imagery presents distinctive hierarchical structures spanning macro-scale patterns (e.g., urban layouts and forest canopies) to micro-features (e.g., vegetation textures and isolated water bodies), posing substantial modeling challenges. Convolutional Neural Network (CNN)-based approaches with residual blocks (Lim et al., 2017) suffer from limited receptive fields (Liang et al., 2021), manifesting as edge blurring in urban boundary reconstruction. Transformer-based methods (Zhang et al., 2018) incur quadratic computational complexity with respect to the number of pixels, making global self-attention prohibitively expensive for large scenes.
Additionally, computational overhead remains prohibitive for edge deployment. Existing efficient methods (Wang et al., 2024) require compromises between receptive field coverage and computational cost that ultimately degrade reconstruction quality.
To resolve these challenges, we present an efficient architecture with three foundational innovations:
1. Residual-Enhanced Wavelet (REW) Decomposition replaces standard convolutional features with multi-scale wavelet representations. This explicitly separates high-frequency details (e.g., edges, textures) from low-frequency structural information during feature extraction, enabling targeted enhancement of critical high-frequency components while suppressing spectral noise in homogeneous regions.
2. Linear Attention Mechanism with channel-mixing properties overcomes quadratic complexity barriers of conventional attention. Inspired by efficient sequence modeling paradigms, especially Receptance Weighted Key-Value (Lu et al., 2025; Peng et al., 2023; Yang et al., 2025), our approach employs parallelizable linear attention with $O(N)$ complexity, preserving global receptive fields while remaining tractable for high-resolution inputs.
3. Quad-Directional Scanning Strategy replaces unidirectional state propagation with simultaneous spatial traversals along horizontal/vertical/diagonal axes. This captures omnidirectional contextual relationships essential for modeling irregular geological formations and infrastructure networks, overcoming the geometric constraints of cross-shaped or sequential scanning patterns (Zhu Q. et al., 2024).
Comprehensive validation demonstrates our method's advantages: 20% faster training, 35% faster inference, and a 35% reduction in memory reads versus state-of-the-art methods, while achieving state-of-the-art metrics. The architecture also significantly expands the effective receptive field (Figure 1) (Luo et al., 2016). This enables deployable high-precision SR for ecological monitoring on resource-constrained platforms.
Figure 1. Effective receptive field (ERF) visualizations (Luo et al., 2016) demonstrating enhanced global coverage versus constrained patterns in prior efficient models.
2 Related work
2.1 Evolution from CNN to transformer for super-resolution
Convolutional Neural Networks (CNNs) established the foundation for deep learning-based super-resolution by leveraging their innate capacity for hierarchical feature extraction through local receptive fields. Pioneering architectures like SRCNN (Dong et al., 2015) demonstrated that even simple three-layer convolutional frameworks could significantly outperform traditional interpolation methods, achieving PSNR improvements of 0.6–1.2 dB by learning mapping functions from low to high-resolution patches. Subsequent innovations addressed fundamental limitations through residual learning (EDSR) to enable very deep networks, recursive connections (DRCN) for parameter efficiency, and channel attention mechanisms (RCAN) to prioritize informative features. For remote sensing applications, where computational constraints often dictate deployment feasibility, lightweight adaptations became essential–FSRCNN reduced parameters by 70% through deconvolutional layers and feature shrinking, while ESPCN achieved real-time performance via sub-pixel convolution operations. These designs strategically balanced computational constraints with performance requirements, enabling practical deployment on edge devices for agricultural monitoring and infrastructure inspection. Multi-branch CNNs further improved high-frequency detail preservation: MSFSR incorporated wavelet decomposition to separate structural and textural components, enhancing reconstruction of vegetation boundaries and building edges by 2.3 dB MAE compared to single-scale networks. However, the intrinsic architectural limitations of CNNs persisted: local receptive fields fundamentally constrained long-range dependency modeling, causing geometric distortions in large-area remote sensing mosaics and spectral inconsistencies in hyperspectral data cubes where distant pixels exhibit correlated properties. 
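The periodic-shuffling step at the heart of ESPCN's sub-pixel convolution is a deterministic tensor rearrangement. The NumPy sketch below (an illustrative re-implementation, not the original code) shows how a (C·r², H, W) feature map becomes a (C, H·r, W·r) image:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) feature map into a (C, H*r, W*r) image."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # split the channel axis into (c, r, r)
    x = x.transpose(0, 3, 1, 4, 2)  # interleave: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)
```

For instance, four 1×1 channels with values 0–3 shuffle into a single 2×2 output, so the upsampling happens purely by reindexing, with the preceding convolution supplying the r² sub-pixel predictions.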
Despite progressive architectural refinements, the fundamental trade-off between receptive field size and computational efficiency remained unresolved in pure CNN architectures, motivating the exploration of alternative paradigms.
2.2 Transformer-based approaches with global receptive fields
The emergence of Vision Transformers addressed CNN’s locality constraints by introducing global self-attention mechanisms that could capture long-range dependencies across entire images. This represented a paradigm shift from the localized feature extraction of CNNs to holistic image understanding. Initial hybrid models like TTSR strategically combined convolutional backbones for local feature extraction with transformer blocks for global feature refinement, improving texture synthesis in complex forest canopy reconstruction by 1.7 dB PSNR over pure CNNs. SwinIR (Liang et al., 2021) subsequently introduced shifted window attention to reduce the computational complexity from quadratic in image size to linear by restricting self-attention to shifted local windows, though cross-window information must still propagate indirectly across successive layers.
2.3 State space models as efficient alternatives
State Space Models (SSMs) emerged as computationally efficient alternatives to transformers, leveraging selective state spaces for linear-time sequence modeling while maintaining global receptive fields. The core innovation lies in modeling sequence data as a system with an evolving hidden state that captures relevant context, similar to recurrent networks but with training efficiency comparable to CNNs. Mamba’s hardware-aware architecture achieved roughly 5× higher inference throughput than comparably sized Transformers through its selective scan mechanism, prompting vision and image-restoration adaptations of the SSM paradigm.
2.4 RWKV-inspired models for vision applications
Inspired by efficient recurrent architectures, RWKV mechanisms offer linear attention alternatives without the quadratic bottlenecks of standard attention, representing a further evolution toward efficient global modeling (Peng et al., 2023). Unlike conventional attention that computes pairwise token interactions, channel-mixing RWKV employs wavelet-guided feature aggregation where current features interact with multi-scale weighted contextual information (incorporating residual enhancement) (Peng et al., 2024). This formulation maintains global receptive fields while reducing complexity to $O(N)$ in the number of tokens.
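To illustrate the flavor of such a recurrence, the following NumPy sketch implements a simplified WKV-style update with a fixed per-channel decay (the bonus-term and token-shift details of the full RWKV formulation are omitted); it runs in O(T) time and O(C) memory because each step only updates two running accumulators:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wkv_attention(r, k, v, w):
    """Simplified RWKV-style linear attention.
    r, k, v: (T, C) receptance, key, value sequences; w: (C,) non-negative decay."""
    T, C = k.shape
    num = np.zeros(C)            # running weighted sum of values
    den = np.zeros(C)            # running sum of weights (normalizer)
    out = np.zeros_like(v)
    decay = np.exp(-w)           # per-channel exponential decay in (0, 1]
    for t in range(T):
        weight = np.exp(k[t])    # exponential key weighting
        num = decay * num + weight * v[t]
        den = decay * den + weight
        out[t] = sigmoid(r[t]) * num / (den + 1e-9)
    return out
```

The numerator/denominator pair plays the role of the softmax normalization without ever materializing a T×T attention matrix.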
The progression from CNN to Transformer to SSM and RWKV-inspired architectures reflects an ongoing pursuit to balance three critical objectives: local feature precision, global dependency modeling, and computational efficiency. Each architectural evolution has addressed specific limitations of its predecessors while introducing new trade-offs, driving the field toward more capable and practical super-resolution solutions for remote sensing applications.
3 Methodology
Building upon the innovations introduced in Section 1, our methodology formalizes three core technical advancements specifically designed for remote sensing super-resolution. The Residual-Enhanced Wavelet decomposition explicitly handles multi-scale feature separation critical for terrain textures, while the Linear Attention with Key-Value Adaptation achieves global context modeling with linear complexity. The Quad-Directional Scanning Strategy fundamentally overcomes geometric constraints in conventional state-space models by establishing omnidirectional feature propagation paths. Collectively, these innovations enable efficient high-precision reconstruction of gigapixel-scale aerial imagery under strict computational constraints.
3.1 REW-KVA model
The overall structure of the REW-KVA model is shown in Figure 2. We denote the LR input image as $I_{LR}$ and the reconstructed output as $I_{SR}$; the input is first normalized by removing its mean and variance.
Initially, we use a convolutional layer to extract shallow features $F_0$ from the normalized input.
Secondly, we use multiple VRGs to extract high-level features from $F_0$, with $F_i = \mathrm{VRG}_i(F_{i-1})$ for $i = 1, \dots, K$,
where $K$ is the number of VRGs and $F_K$ denotes the resulting deep features.
Thirdly, we use a reconstruction module, a convolutional layer followed by pixel-shuffle upsampling, to map $F_K$ to a high-resolution estimate,
where the upsampling factor matches the target SR scale.
Finally, we add the mean and variance back to the reconstructed estimate to restore the global statistics removed during normalization,
where the restored output constitutes $I_{SR}$.
3.2 Visual REW group
A VRG is a collection of residual blocks that integrates the Residual-Enhanced Wavelet Spatial Mixing (RESM) and Residual-Enhanced Wavelet Channel Mixing (RECM) as its core components. The RESM facilitates the blending of feature maps across different resolutions, while the RECM confines these features to specific channels. RECM replaces traditional MLP by computing cross-channel weights via depthwise convolution. A VRG is composed of multiple Visual REW Residual Blocks (VRBs), and the overall structure of the VRB is shown in Figure 3.
The VRG receives input from either the output of the previous VRG or the shallow feature extractor, and its output is directed to the subsequent VRG or the process for high-quality reconstruction. Features entering RESM are initially subjected to a shift operation using REW-Channel Shift, yielding a set of shifted feature maps that encapsulate the features of neighboring pixels. These maps are then integrated through a channel-wise linear attention mechanism within the Spatial Mixing operation, which selects significant channels and conducts 2D scan operations to reassemble the feature maps. The outcome is fed back into the input feature map through a residual connection. Following this, the output is directed to RECM, which confines the feature maps to specific channels, with its output also reintegrated via a residual connection. The RECM output is then passed through a convolutional layer before being forwarded to the next VRG or to the reconstruction stage.
3.3 Residual-enhanced wavelet (REW) decomposition
Remote sensing imagery exhibits complex hierarchical structures where low-frequency components (e.g., large-scale topographic contours and uniform regions) and high-frequency details (e.g., vegetation boundaries, urban edges, and fine textures) are intrinsically interwoven. Traditional convolutional feature extraction methods often inadequately preserve the spectral-textural separation, leading to blurred details or amplified noise in homogeneous areas. The Residual-Enhanced Wavelet (REW) decomposition addresses this limitation by replacing standard convolutional features with multi-scale wavelet representations based on the Biorthogonal (Bior) wavelet transform. This framework explicitly separates high-frequency details from low-frequency structural information during feature extraction, enabling the targeted enhancement of critical high-frequency components while suppressing spectral noise in homogeneous regions through adaptive weighting mechanisms.
The Biorthogonal wavelet transform utilizes a pair of biorthogonal basis functions for decomposition and reconstruction, allowing for linear-phase filters that minimize distortion in image reconstruction. For a discrete signal $x[n]$, the analysis low-pass and high-pass filters $\tilde{h}$ and $\tilde{g}$ produce approximation and detail coefficients $a[k] = \sum_n \tilde{h}[n - 2k]\, x[n]$ and $d[k] = \sum_n \tilde{g}[n - 2k]\, x[n]$, while reconstruction employs the dual synthesis filters $h$ and $g$.
For two-dimensional images, the transform is applied separably along rows and columns, yielding four subbands: LL (low-low), LH (low-high), HL (high-low), and HH (high-high). The Bior wavelets are particularly suitable for remote sensing applications due to their symmetric filters, which help preserve edge information and reduce phase distortion.
In the REW decomposition, the process begins by applying a 2D discrete wavelet transform using Bior wavelets to the input feature maps. Specifically, for an input feature map $X$, the transform yields $\{X_{LL}, X_{LH}, X_{HL}, X_{HH}\} = \mathrm{DWT}_{bior}(X)$.
Here, $X_{LL}$ carries the low-frequency structural content, while $X_{LH}$, $X_{HL}$, and $X_{HH}$ carry directional high-frequency details, each at half the spatial resolution of $X$.
To adaptively enhance relevant features, learnable spectral weighting parameters are introduced. These weights are derived from the low-frequency component to modulate the high-frequency details, ensuring context-aware amplification of critical textures. The weighting parameters $\alpha$ are obtained as $\alpha = \sigma(f(X_{LL}))$ and applied as $\hat{X}_{d} = \alpha \odot X_{d}$ for each detail subband $d \in \{LH, HL, HH\}$,
where $\sigma$ denotes the sigmoid function and $f$ is a lightweight learnable mapping over the low-frequency subband.
The residual-enhanced reconstruction integrates the weighted components via an inverse discrete wavelet transform (IDWT) with the dual filters of the Bior wavelet, and incorporates a residual connection to preserve original spatial signatures. The output features are computed as $X_{out} = \mathrm{IDWT}_{bior}(X_{LL}, \hat{X}_{LH}, \hat{X}_{HL}, \hat{X}_{HH}) + X$.
The residual connection ensures retention of low-level geospatial information essential for change detection and multi-temporal analysis. This formulation provides key advantages such as explicit separation of structural and textural features to reduce interference from non-stationary noise in low-altitude acquisitions, frequency-adaptive weighting to mitigate spectral distortions during seasonal transitions, and a residual pathway to maintain spatial fidelity for downstream tasks. The enhanced features $X_{out}$ are then forwarded to the subsequent mixing stages.
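To make this flow concrete, the NumPy sketch below performs decomposition, high-frequency weighting, inverse transform, and the residual addition. Two simplifications are assumptions for illustration only: a Haar wavelet stands in for the biorthogonal filters, and a fixed scalar gain replaces the learned, LL-derived weights.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar DWT of an even-sized array -> (LL, LH, HL, HH)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    LL = (a + b + c + d) / 2   # low-frequency approximation
    LH = (a + b - c - d) / 2   # detail subband (row differences)
    HL = (a - b + c - d) / 2   # detail subband (column differences)
    HH = (a - b - c + d) / 2   # detail subband (diagonal differences)
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    """Inverse of haar_dwt2 (perfect reconstruction)."""
    h, w = LL.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (LL + LH + HL + HH) / 2
    x[0::2, 1::2] = (LL + LH - HL - HH) / 2
    x[1::2, 0::2] = (LL - LH + HL - HH) / 2
    x[1::2, 1::2] = (LL - LH - HL + HH) / 2
    return x

def rew_enhance(x, alpha=1.5):
    """Amplify high-frequency subbands, reconstruct, then add the residual input."""
    LL, LH, HL, HH = haar_dwt2(x)
    y = haar_idwt2(LL, alpha * LH, alpha * HL, alpha * HH)
    return y + x   # residual connection preserves original spatial signatures
```

With `alpha = 1.0` the weighting is neutral and the output equals the input plus its perfect reconstruction, i.e. `2 * x`, which makes the perfect-reconstruction property of the transform easy to verify.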
3.4 Linear attention with key-value adaptation
Remote sensing imagery necessitates modeling kilometer-scale spatial dependencies while preserving spectral fidelity, a challenge exacerbated by the quadratic complexity of conventional self-attention mechanisms when processing gigapixel mosaics. Traditional linear attention (Katharopoulos et al., 2020) addresses this by reducing computational complexity from $O(N^2)$ to $O(N)$, replacing the softmax kernel with a decomposable feature map $\phi$ so that attention can be computed through running sums: $\mathrm{Attn}(q_t) = \dfrac{\phi(q_t)^{\top} \sum_{i \le t} \phi(k_i)\, v_i^{\top}}{\phi(q_t)^{\top} \sum_{i \le t} \phi(k_i)}$,
where $q$, $k$, and $v$ denote the query, key, and value projections of the input tokens.
The proposed Key-Value Adaptation (KVA) technique extends this foundation by introducing a gated, dual-path state propagation mechanism that enables dynamic, context-aware modulation of key-value interactions. The core innovation lies in replacing the static cumulative sum operations in conventional linear attention with contextualized state transitions, where decay factors are dynamically computed from local features to enhance spectral adaptivity and numerical stability. Specifically, KVA formalizes key-value projections with depthwise convolutions for local texture encoding: $K = \mathrm{DWConv}(X W_K)$ and $V = \mathrm{DWConv}(X W_V)$,
where $W_K$ and $W_V$ are learnable projection matrices and $\mathrm{DWConv}$ denotes a depthwise convolution.
Here, the depthwise convolutions inject neighborhood texture context into the key-value pairs before state propagation.
The essence of Key-Value Adaptation resides in its dynamic, data-dependent modulation of the attention process through two core mechanisms: (1) the gated decay matrix, computed from local features, which contextually controls how much accumulated state is retained at each step; and (2) a dual-path recurrence that tracks a numerator state and a normalizer state, $S_t = D_t \odot S_{t-1} + k_t v_t^{\top}$ and $z_t = D_t \odot z_{t-1} + k_t$,
eliminating the need for explicit softmax while ensuring numerical stability. The decay matrix $D_t$ is bounded in $(0, 1)$ by a sigmoid gate, preventing unbounded state growth over long scans.
KVA’s output integrates multi-head processing via head-specific receptance vectors and a gating mechanism for high-frequency enhancement: $Y_t = g_t \odot \left(\sigma(r_t) \odot S_t / z_t\right)$,
where $r_t$ is the head-specific receptance vector, $\sigma$ the sigmoid function, and $g_t$ a learned gate that selectively amplifies high-frequency responses.
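A minimal single-head sketch of this gated, dual-path recurrence follows. The data-dependent decay here is simply a sigmoid of the input features, standing in for the convolutionally derived gates of the actual module; all other structure (numerator and normalizer states, receptance gating) mirrors the description above under that assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kva_scan(x, k, v, r):
    """Gated dual-path recurrence with data-dependent decay.
    x, k, v, r: (T, C) arrays (features, keys, values, receptance)."""
    T, C = k.shape
    num = np.zeros(C)                 # numerator state S_t (per channel)
    den = np.zeros(C)                 # normalizer state z_t
    out = np.zeros_like(v)
    for t in range(T):
        decay = sigmoid(x[t])         # gate in (0, 1), computed from the data
        weight = np.exp(k[t])         # exponential key weighting
        num = decay * num + weight * v[t]
        den = decay * den + weight
        out[t] = sigmoid(r[t]) * num / (den + 1e-9)
    return out
```

Because `num / den` is a positively weighted average of past values, each output stays within the range of the values seen so far, which is the numerical-stability property the dual-path normalizer provides.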
3.5 Quad-directional scanning strategy
Unidirectional state propagation in SSMs induces catastrophic forgetting of cross-region dependencies in large-area mosaics. Our quad-directional scanning overcomes this through simultaneous spatial traversals along four geometric axes: horizontal, vertical, main-diagonal, and anti-diagonal. Each direction maintains an independent recurrent state, updated along its own scan order, that accumulates key-value context from that traversal. The four directional outputs are then fused by a learned, normalized weighting so that complementary directions contribute adaptively, and the decay gate of each direction is computed from local features, allowing context retention to adapt to the structures encountered along each axis.
For hardware-aware implementation, we employ kernel fusion to compute all four directional scans in a single parallel pass, sharing memory reads across directions.
This quadruples the effective receptive field while reducing memory access latency by 58% compared to sequential scanning. The directional adaptation significantly reduces boundary artifacts along linear features (roads, waterways) and diagonal terrain structures prevalent in remote sensing.
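The four traversal orders can be generated as index permutations over the flattened feature map. The sketch below (NumPy; it models only the scan orders, not the fused-kernel parallel execution) builds row-major, column-major, anti-diagonal, and main-diagonal orderings:

```python
import numpy as np

def quad_scan_orders(h, w):
    """Return four flattened-index orderings of an h x w grid:
    row-major (horizontal), column-major (vertical),
    anti-diagonal (i + j constant), and main-diagonal (i - j constant)."""
    idx = np.arange(h * w).reshape(h, w)
    horizontal = idx.reshape(-1)
    vertical = idx.T.reshape(-1)
    # walk anti-diagonals from the top-left corner outward
    anti = np.concatenate([idx[::-1].diagonal(o) for o in range(-h + 1, w)])
    # walk main diagonals from the top-right corner outward
    main = np.concatenate([np.fliplr(idx)[::-1].diagonal(o)
                           for o in range(-h + 1, w)])
    return horizontal, vertical, anti, main
```

Applying a directional recurrence to `x.reshape(-1)[order]` and scattering the result back with `np.argsort(order)` yields one directional output; the four outputs are then fused.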
3.6 REW-Channel Shift
Our REW-Channel Shift leverages multi-scale convolutional kernels, as illustrated in Figure 5.
After obtaining the output tensor from the multi-scale branches, the shifted feature maps are aggregated along the channel dimension and forwarded to the spatial-mixing stage.
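The shift operation itself can be illustrated with a five-way channel split in which four groups are displaced by one pixel in each cardinal direction, a common formulation of shift-based spatial mixing; the grouping and the omission of the multi-scale kernels are assumptions of this sketch, not the exact module.

```python
import numpy as np

def channel_shift(x):
    """x: (C, H, W) with C divisible by 5. Four channel groups are shifted
    one pixel up/down/left/right; the fifth group is left in place.
    np.roll wraps around; real implementations typically zero-pad instead."""
    c = x.shape[0] // 5
    out = x.copy()
    out[0*c:1*c] = np.roll(x[0*c:1*c], -1, axis=1)  # shift up
    out[1*c:2*c] = np.roll(x[1*c:2*c],  1, axis=1)  # shift down
    out[2*c:3*c] = np.roll(x[2*c:3*c], -1, axis=2)  # shift left
    out[3*c:4*c] = np.roll(x[3*c:4*c],  1, axis=2)  # shift right
    return out
```

After the shift, each spatial location sees a mixture of its own features and those of its four neighbors across the channel dimension, which is what lets the subsequent channel-wise attention act spatially at zero FLOP cost.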
4 Experiments
4.1 Datasets and implementation details
4.1.1 Datasets
We evaluate the proposed method on five widely used remote sensing image datasets to ensure comprehensive validation across diverse scenarios. The DFC2019 dataset from the IEEE Data Fusion Contest 2019 contains 2,783 multi-temporal high-resolution satellite images acquired by the WorldView-3 satellite (Saux et al., 2019), provided as 1024 × 1024 pixel tiles with sub-meter spatial resolution, including 2,783 training images and 50 test images. The OPTIMAL-31 dataset comprises 1,860 images collected from Google Maps, covering 31 categories with 60 images per category, each sized 256 × 256 pixels. The RSI-CB dataset includes two subsets, RSI-CB256 and RSI-CB128 (Li H. et al., 2020), with spatial resolutions ranging from 0.3 to 3 m; the former contains over 24,000 images across 35 categories at 256 × 256 pixels, while the latter has over 36,000 images across 45 categories at 128 × 128 pixels. The WHU-RS19 dataset consists of high-resolution satellite images exported from Google Earth, featuring a spatial resolution up to 0.5 m and containing 19 classes of meaningful scenes, with approximately 50 samples per class (Wuhan University, 2010). The UCMD (UC Merced Land-Use Dataset) is a land-use research dataset derived from USGS National Map Urban Area Imagery (Yang and Newsam, 2010), containing 21 land classes with 100 images per class, each with a 0.3-m pixel resolution and size of 256 × 256 pixels. These datasets collectively represent a wide spectrum of remote sensing challenges, including multi-temporal analysis, fine-grained classification, and large-scale land-use mapping.
4.1.2 Training settings
All experiments are conducted on a system equipped with an NVIDIA A800 40 GB GPU. Models are trained using the Adam optimizer with a batch size of 16, the beta parameters of [0.9, 0.999], and an initial learning rate of 1e-4, which is halved every 50 epochs. The training process runs for 300 epochs with a weight decay of 1e-4 to prevent overfitting. Data augmentation techniques, including random horizontal flipping, rotation by 90-degree multiples, and color jittering, are applied to enhance generalization. The loss function combines L1 loss with SSIM loss in a 0.7:0.3 ratio to balance pixel-level accuracy and structural preservation. Training starts from scratch without pretrained weights to ensure a fair comparison across methods.
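The halved-every-50-epochs schedule and the 0.7:0.3 composite loss can be written down directly. In this sketch the SSIM is a single-window global variant (a simplification of the windowed SSIM normally used), and the SSIM term is taken as 1 − SSIM so that lower loss is better — an assumption, since the paper does not spell out the exact form:

```python
import numpy as np

def lr_at_epoch(epoch, base_lr=1e-4):
    """Initial learning rate 1e-4, halved every 50 epochs."""
    return base_lr * 0.5 ** (epoch // 50)

def global_ssim(x, y, c1=0.01**2, c2=0.03**2):
    """Single-window SSIM over whole images with values in [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2*mx*my + c1) * (2*cov + c2)) /
            ((mx**2 + my**2 + c1) * (vx + vy + c2)))

def sr_loss(pred, target):
    """0.7 * L1 + 0.3 * (1 - SSIM), per the stated 0.7:0.3 ratio."""
    l1 = np.abs(pred - target).mean()
    return 0.7 * l1 + 0.3 * (1.0 - global_ssim(pred, target))
```

A perfect reconstruction gives zero L1 and unit SSIM, so the composite loss vanishes exactly when prediction and target coincide.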
4.1.3 Compared models
We evaluate our model against representative methods spanning three architectural paradigms under the same training settings described in Section 4.1.2. Among CNN-based approaches, we include SRCNN (Dong et al., 2015), which pioneered the use of three-layer convolutional networks for image super-resolution, and VDSR (Kim et al., 2016), which introduced residual learning to enable deeper architectures and to improve convergence. For Transformer-based methods, we consider SwinIR (Liang et al., 2021), which employs shifted window attention to reduce computational overhead, along with its enhanced variant SwinFIR (Zhang et al., 2023), which incorporates frequency-domain processing. We also evaluate HAT (Chen et al., 2024), a hybrid attention transformer that integrates channel attention mechanisms. In the category of state-space models, we include MambaIR (Guo et al., 2024), which adapts selective state spaces for image restoration, and MaIR (Li et al., 2025), which introduces locality-preserving mechanisms into Mamba-based architectures. This comprehensive comparison allows for a more systematic evaluation of our proposed method.
4.1.4 Metrics
We employ two full-reference image quality assessment metrics to quantitatively evaluate the super-resolution performance. The Peak Signal-to-Noise Ratio (PSNR) measures the ratio between the maximum possible power of a signal and the power of corrupting noise, calculated as $\mathrm{PSNR} = 10 \log_{10} \left( \mathrm{MAX}^2 / \mathrm{MSE} \right)$,
where $\mathrm{MAX}$ is the maximum possible pixel value and $\mathrm{MSE}$ is the mean squared error between the reconstructed and reference images. The Structural Similarity Index (SSIM) assesses luminance, contrast, and structural agreement, $\mathrm{SSIM}(x, y) = \dfrac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$,
where $\mu$, $\sigma^2$, and $\sigma_{xy}$ denote local means, variances, and covariance, and $C_1$, $C_2$ are small constants for numerical stability. We additionally report three complementary perceptual and spectral metrics.
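For images normalized to [0, 1], PSNR follows directly from this definition:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

For example, a uniform pixel error of 0.1 gives an MSE of 0.01 and hence a PSNR of 20 dB.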
The Learned Perceptual Image Patch Similarity (LPIPS) metric measures perceptual distance in a deep feature space extracted by pretrained networks such as VGG. It is defined as $\mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h, w} \left\| w_l \odot \left( \hat{\phi}_l^{hw}(x) - \hat{\phi}_l^{hw}(y) \right) \right\|_2^2$,
where $\hat{\phi}_l$ denotes unit-normalized activations of layer $l$, $w_l$ are learned channel weights, and lower values indicate closer perceptual similarity.
The Natural Image Quality Evaluator (NIQE) is a no-reference metric that estimates naturalness by comparing local statistical features of the test image with a model of pristine natural scenes. It is computed as the distance between two multivariate Gaussian fits, $D = \sqrt{(\nu_1 - \nu_2)^{\top} \left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{-1} (\nu_1 - \nu_2)}$,
where $(\nu_1, \Sigma_1)$ and $(\nu_2, \Sigma_2)$ are the mean vectors and covariance matrices of the pristine model and the test image statistics; lower values indicate greater naturalness.
The Spectral Angle Mapper (SAM) quantifies spectral distortion by measuring the angle between the reconstructed and reference pixel vectors: $\mathrm{SAM} = \arccos \dfrac{\langle \mathbf{x}, \mathbf{y} \rangle}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$,
where $\mathbf{x}$ and $\mathbf{y}$ are the spectral vectors of corresponding pixels; the angle is averaged over all pixels, and smaller values indicate lower spectral distortion.
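A NumPy rendering of the metric, averaging the per-pixel angle (in radians here; degrees are also common) over an (H, W, B) cube:

```python
import numpy as np

def sam(pred, target, eps=1e-12):
    """Mean spectral angle between (H, W, B) reconstructed and reference cubes."""
    dot = np.sum(pred * target, axis=-1)
    norm = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    cos = np.clip(dot / (norm + eps), -1.0, 1.0)  # clip guards arccos domain
    return np.arccos(cos).mean()
```

Identical spectra yield an angle near zero, while orthogonal spectral vectors give the maximum distortion of pi/2.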
4.2 Results
Comprehensive comparisons against seven state-of-the-art super-resolution methods are reported in Table 2; our model attains consistently strong quantitative results across the five datasets.
Table 2. Quantitative comparison of super-resolution results on the five remote sensing datasets.
For qualitative results, as shown in Figure 6, our model achieves visibly sharper reconstructions and enhanced textural consistency compared to competing methods. The improvements are especially pronounced on complex datasets such as UCMD, RSI-CB, and WHU-RS19, where fine-grained details–such as vegetation boundaries, building edges, and road networks–are better preserved. In contrast, on relatively homogeneous datasets like DFC 2019, REW-KVA maintains stable quantitative advantages, suggesting strong robustness across diverse spatial and spectral conditions. This consistent behavior is crucial for high-precision remote sensing applications such as urban planning and environmental monitoring.
The superior performance primarily arises from the synergistic integration of the Residual-Enhanced Wavelet (REW) decomposition, Linear Attention mechanism with Key-Value Adaptation, and Quad-Directional Scanning strategy. The REW decomposition explicitly separates high-frequency textures from low-frequency structural information, enabling targeted enhancement of critical details without amplifying noise. The linear attention mechanism efficiently expands the receptive field to a global scope, capturing long-range dependencies necessary for large-scale geographical structures. Meanwhile, the quad-directional scanning strategy models omnidirectional spatial dependencies, effectively mitigating the diagonal information loss inherent in previous SSM-based designs. Collectively, these complementary modules contribute to the observed performance gains, validating their synergistic role in improving both accuracy and efficiency across diverse remote sensing scenarios.
4.3 Efficiency study
To evaluate the practical deployment potential, we conduct comprehensive efficiency analysis on NVIDIA A800 40 GB GPU, measuring memory reads, cache hits, parameter count, activation memory, and inference latency. As shown in Table 3, our method demonstrates superior computational efficiency compared to existing approaches. The optimized architecture achieves 35% fewer memory reads and 28% higher cache hit rate than the closest competitor, indicating better hardware utilization and data locality. The parameter count is reduced by 42% compared to Transformer-based SwinIR, while maintaining higher performance, demonstrating the effectiveness of our design choices in eliminating redundant computations.
The inference speed tests reveal that our method processes 1024 × 1024 images in 0.47 s, which is 3.2 × faster than SwinIR and 1.8 × faster than MambaIR. This acceleration stems from the parallelizable linear attention mechanism and efficient wavelet decomposition, which avoid the sequential processing bottlenecks of recurrent-style architectures. The quad-directional scanning strategy further contributes to efficiency by enabling simultaneous multi-axis processing without the need for multiple passes over the input data. These efficiency gains are particularly significant for real-time applications such as disaster monitoring and rapid mapping, where both accuracy and speed are critical operational requirements.
4.4 Ablation study
We conduct a systematic ablation study to evaluate the contributions of key components and hyperparameters in our architecture. The results demonstrate that each element plays a distinct yet complementary role, and we further investigate their synergistic effects through combination experiments in achieving optimal performance.
The results of the ablation study on key components are shown in Table 4. The removal of the residual-enhanced wavelet decomposition leads to a performance decline (PSNR = 29.67), underscoring its importance in multi-scale feature extraction and noise suppression. Similarly, excluding the linear attention mechanism results in a reduction in reconstruction quality (PSNR = 29.62), highlighting its role in capturing long-range dependencies. The quad-directional scanning strategy also proves critical, as its absence causes performance degradation (PSNR = 29.70) due to impaired modeling of omnidirectional spatial relationships. To elucidate synergistic effects, we test partial combinations: when both residual-enhanced wavelet decomposition and linear attention are incorporated (but without quad-directional scanning), PSNR reaches 30.05. This demonstrates their cooperative effect in joint feature separation and global contextual modeling. Conversely, simultaneously removing both residual-enhanced wavelet decomposition and quad-directional scanning causes a performance drop (PSNR = 29.52), worse than individual removals, indicating compensatory relationships necessary for handling irregular shapes and diagonal features prevalent in remote sensing imagery. The full integration of all components yields the highest performance (PSNR = 30.51), confirming their optimal synergy.
An analysis of hyperparameters in Table 5 reveals that increasing model capacity consistently enhances output quality, with performance scaling with parameter count. Note that every 4 Visual REW Residual Blocks (VRBs) form one Visual REW Group (VRG), hence the number of VRGs equals the number of layers divided by 4 (Number of VRGs = Layers//4). Expanding the embedded dimension from 48 to 128 and the network depth from 8 to 20 layers improves representational power, with PSNR increasing from 28.89 dB to 30.47 dB. However, diminishing returns are observed beyond 128 dimensions and 16 layers: the configuration with embedded dimension 128 and 16 layers (4 VRGs) achieves 30.18 dB, while increasing depth to 20 layers (5 VRGs) yields only a modest improvement to 30.47 dB at a higher computational cost. This indicates that the 128-dimension, 16-layer (4 VRG) configuration provides a favorable balance between performance and complexity.
Table 5. Ablation study on hyperparameters evaluated on WHU-RS19 dataset. Every 4 VRBs form a VRG. Hence, Number of VRGs = Layers//4.
5 Conclusion
This study proposes the REW-KVA architecture to address limitations in remote sensing image super-resolution (SR), with key performance and efficiency improvements as follows: First, Residual-Enhanced Wavelet (REW) Decomposition ensures spectral-textural integrity–ablation on the RSI-CB dataset shows that removing it reduces PSNR by 0.84 dB and SSIM by 0.0303, confirming its role in separating low/high-frequency features and suppressing noise. Second, the Linear Attention (complexity $O(N)$) with Key-Value Adaptation and the Quad-Directional Scanning strategy jointly provide global, omnidirectional context modeling, delivering 20% faster training, 35% faster inference, and 35% fewer memory reads than state-of-the-art baselines while achieving state-of-the-art accuracy, and making high-precision SR deployable on resource-constrained platforms.
Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
Author contributions
RL: Conceptualization, Formal Analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft. HM: Conceptualization, Data curation, Investigation, Validation, Writing – review and editing. XH: Funding acquisition, Project administration, Resources, Writing – review and editing.
Funding
The authors declare that financial support was received for the research and/or publication of this article. This work was supported by the special fund of the Qinghai University Innovation Workshop (under the General Reform and High-Quality Development of Undergraduate Education and Teaching Fund), Grant Number GF-2025ZJ-17.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The authors declare that no Generative AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Chen, J., Chen, J., Liao, A., Cao, X., Chen, L., Chen, X., et al. (2021). A review of remote sensing for environmental monitoring in China: progress and challenges. Sci. Remote Sens. 4, 100032.
Chen, X., Wang, X., Zhang, W., Kong, X., Qiao, Y., Zhou, J., et al. (2024). Hat: hybrid attention transformer for image restoration
Chung, M., Jung, M., and Kim, Y. (2023). Enhancing remote sensing image super-resolution guided by bicubic-downsampled low-resolution image. Remote Sens. 15, 3309. doi:10.3390/rs15133309
de França e Silva, N. R., Chaves, M. E. D., Luciano, A. C. D. S., Sanches, I. D., de Almeida, C. M., and Adami, M. (2024). Sugarcane yield estimation using satellite remote sensing data in empirical or mechanistic modeling: a systematic review. Remote Sens. 16, 863. doi:10.3390/rs16050863
Dong, C., Loy, C. C., He, K., and Tang, X. (2015). Image super-resolution using deep convolutional networks. In IEEE Trans. Pattern Anal. Mach. Intell., (IEEE), vol. 38, 295–307. doi:10.1109/tpami.2015.2439281
Gu, A., Goel, K., and Ré, C. (2021). Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.
Guo, H., Li, J., Dai, T., Ouyang, Z., Ren, X., and Xia, S.-T. (2024). MambaIR: a simple baseline for image restoration with state-space model. 222–241. doi:10.1007/978-3-031-72649-1_13
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). “Transformers are RNNs: fast autoregressive transformers with linear attention,” in Proceedings of the 37th international conference on machine learning. Editors H. D. III, and A. Singh (PMLR), 5156–5165.
Kim, J., Lee, J. K., and Lee, K. M. (2016). “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27-30 June 2016 (IEEE), 1646–1654.
Kumar, S., Meena, R. S., Sheoran, S., Jangir, C. K., Jhariya, M. K., Banerjee, A., et al. (2022). “Chapter 5 - remote sensing for agriculture and resource management,” in Natural resources conservation and advances for sustainability. Editors M. K. Jhariya, R. S. Meena, A. Banerjee, and S. N. Meena (Elsevier), 91–135. doi:10.1016/B978-0-12-822976-7.00012-0
Li, H., Dou, X., Tao, C., Wu, Z., Chen, J., Peng, J., et al. (2020). Rsi-cb: a large-scale remote sensing image classification benchmark using crowdsourced data. Sensors 20, 1594. doi:10.3390/s20061594
Li, J., Pei, Y., Zhao, S., Xiao, R., Sang, X., and Zhang, C. (2020). A review of remote sensing for environmental monitoring in China. Remote Sens. 12, 1130. doi:10.3390/rs12071130
Li, B., Zhao, H., Wang, W., Hu, P., Gou, Y., and Peng, X. (2025). "Mair: a locality- and continuity-preserving mamba for image restoration," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (IEEE), 7491–7501. doi:10.1109/cvpr52734.2025.00702
Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., and Timofte, R. (2021). “Swinir: image restoration using swin transformer,” in Proceedings of the IEEE/CVF international conference on computer vision, Montreal, BC, Canada, 11-17 October 2021 (IEEE), 1833–1844.
Lim, B., Son, S., Kim, H., Nah, S., and Lee, K. M. (2017). “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, Honolulu, HI, USA, 21-26 July 2017 (IEEE), 136–144.
Liu, Y., Tian, Y., Zhao, H., Liu, Y., Wang, Y., Yuan, L., et al. (2024a). Vmamba: visual state space model. arXiv preprint arXiv:2401.
Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., et al. (2024b). “Vmamba: visual state space model,” in Advances in neural information processing systems. Editors A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczaket al. (Curran Associates, Inc.), 37, 103031–103063. doi:10.52202/079017-3273
Lockhart, K., Sandino, J., Amarasingam, N., Hann, R., Bollard, B., and Gonzalez, F. (2025). Unmanned aerial vehicles for real-time vegetation monitoring in antarctica: a review. Remote Sens. 17, 304. doi:10.3390/rs17020304
Lu, R., Li, C., Li, D., Zhang, G., Huang, J., and Li, X. (2025). Exploring linear attention alternative for single image super-resolution. arXiv preprint arXiv:2502.00404.
Luan, X., Fan, H., Wang, Q., Yang, N., Liu, S., Li, X., et al. (2025). Fmambair: a hybrid state-space model and frequency domain for image restoration. IEEE Trans. Geoscience Remote Sens. 63, 1–14. doi:10.1109/TGRS.2025.3526927
Luo, W., Li, Y., Urtasun, R., and Zemel, R. (2016). “Understanding the effective receptive field in deep convolutional neural networks,” in Advances in neural information processing systems. Editors D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Inc.), 29.
Luo, J., Han, L., Gao, X., Liu, X., and Wang, W. (2023). Sr-feinr: continuous remote sensing image super-resolution using feature-enhanced implicit neural representation. Sensors 23, 3573. doi:10.3390/s23073573
Mathieu, R., Freeman, C., and Aryal, J. (2007). Mapping private gardens in urban areas using object-oriented techniques and very high-resolution satellite imagery. Landsc. Urban Plan. 81, 179–192. doi:10.1016/j.landurbplan.2006.11.009
Muhmad Kamarulzaman, A. M., Wan Mohd Jaafar, W. S., Mohd Said, M. N., Saad, S. N. M., and Mohan, M. (2023). Uav implementations in urban planning and related sectors of rapidly developing nations: a review and future perspectives for Malaysia. Remote Sens. 15, 2845. doi:10.3390/rs15112845
Peng, B., Alcaide, E., Anthony, Q., Al-Ghamdi, A., Fan, B., Gao, L., et al. (2023). Rwkv: reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048.
Peng, B., Goldstein, D., Anthony, Q., Albalak, A., Alcaide, E., Biderman, S., et al. (2024). Eagle and finch: rwkv with matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892.
Platel, A., Sandino, J., Shaw, J., Bollard, B., and Gonzalez, F. (2025). Advancing sparse vegetation monitoring in the arctic and antarctic: a review of satellite and uav remote sensing, machine learning, and sensor fusion. Remote Sens. 17, 1513. doi:10.3390/rs17091513
Song, Y., Sun, L., Bi, J., Quan, S., and Wang, X. (2025). Drgan: a detail recovery-based model for optical remote sensing images super-resolution. IEEE Trans. Geoscience Remote Sens. 63, 1–13. doi:10.1109/TGRS.2024.3512528
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). “Attention is all you need,” in Advances in neural information processing systems, 5998–6008.
Wang, Y., Yuan, W., Xie, F., and Lin, B. (2024). Esatsr: enhancing super-resolution for satellite remote sensing images with state space model and spatial context. Remote Sens. 16, 1956. doi:10.3390/rs16111956
Wuhan University (2010). Whu-rs19 dataset. High-resolution satellite image dataset with 19 scene classes. Available online at: https://captain-whu.github.io/BED4RS/.
Yang, Y., and Newsam, S. D. (2010). “Bag-of-visual-words and spatial extensions for land-use classification,” in 18th ACM SIGSPATIAL international symposium on advances in geographic information systems, ACM-GIS 2010, November 3-5, 2010, San Jose, CA, USA, proceedings. Editors D. Agrawal, P. Zhang, A. E. Abbadi, and M. F. Mokbel (ACM), 270–279.
Yang, Z., Li, J., Zhang, H., Zhao, D., Wei, B., and Xu, Y. (2025). Restore-rwkv: efficient and effective medical image restoration with rwkv. IEEE J. Biomed. Health Inf., 1–14. doi:10.1109/jbhi.2025.3588555
Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., and Fu, Y. (2018). “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European conference on computer vision (ECCV), 286–301.
Zhang, Z., Liu, J., and Wang, L. (2022). “Swinfir: rethinking the swinir for image restoration and enhancement,” in 2022 IEEE international conference on multimedia and expo (ICME) (IEEE), 1–6.
Zhang, D., Huang, F., Liu, S., Wang, X., and Jin, Z. (2023). Swinfir: revisiting the swinir with fast fourier convolution and improved training for image super-resolution.
Zhang, Y., Zheng, P., Zeng, C., Xiao, B., Li, Z., and Gao, X. (2025). Jointly rs image deblurring and super-resolution with adjustable-kernel and multi-domain attention. IEEE Trans. Geoscience Remote Sens. 63, 1–16. doi:10.1109/TGRS.2024.3515636
Zhao, W., Wang, L., and Zhang, K. (2024). Mambair: a simple and efficient state space model for image restoration. arXiv preprint arXiv:2403.09963.
Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., and Wang, X. (2024). Vision mamba: efficient visual representation learning with bidirectional state space model.
Keywords: remote sensing imagery, super-resolution, linear attention, receptive field, computational efficiency
Citation: Lu R, Miao H and Hai X (2026) Efficient remote sensing image super-resolution with residual-enhanced wavelet and key-value adaptation. Front. Remote Sens. 6:1718058. doi: 10.3389/frsen.2025.1718058
Received: 03 October 2025; Accepted: 13 November 2025;
Published: 13 January 2026.
Edited by:
Rui Li, University of Warwick, United Kingdom
Reviewed by:
Bing He, Chengdu University of Information Technology, China
Zhunruo Feng, Chang’an University, China
Copyright © 2026 Lu, Miao and Hai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Xin Hai, xin.hai@qhu.edu.cn