
ORIGINAL RESEARCH article

Front. Remote Sens., 13 January 2026

Sec. Image Analysis and Classification

Volume 6 - 2025 | https://doi.org/10.3389/frsen.2025.1718058

This article is part of the Research Topic "Machine Learning for Advanced Remote Sensing: From Theory to Applications and Societal Impact".

Efficient remote sensing image super-resolution with residual-enhanced wavelet and key-value adaptation

  • 1School of Ecological and Environmental Engineering, Qinghai University, Xining, China
  • 2School of Computer Technology and Application, Qinghai University, Xining, China
  • 3Information Technology Center, Qinghai University, Xining, China

Remote sensing image super-resolution (SR) is vital for urban planning, precision agriculture, and environmental monitoring, yet existing methods have limitations: CNNs with restricted receptive fields cause edge blurring, Transformers with $O(L^2 d)$ complexity fail at gigapixel-scale processing, and SSMs (e.g., Mamba) have directional biases that miss diagonal features. To address these issues, this study proposes the REW-KVA architecture, integrating three innovations: Residual-Enhanced Wavelet Decomposition for separating low/high-frequency features and suppressing noise; Linear Attention with Key-Value Adaptation (complexity $O(Ld)$) for global context modeling; and Quad-Directional Scanning for omnidirectional feature capture. Validated on five datasets (DFC 2019, OPTIMAL-31, RSI-CB, WHU-RS19, UCMD), REW-KVA achieves state-of-the-art PSNR (29.17 dB on DFC 2019, 31.08 dB on RSI-CB) and SSIM (0.8958 on DFC 2019, 0.9442 on RSI-CB). It reduces memory reads by 35% and parameters by 42% (vs. SwinIR), and processes 1024×1024 images in 0.47 s (3.2× faster than SwinIR), serving as a deployable solution for resource-constrained platforms.

1 Introduction

Remote sensing imagery (Platel et al., 2025) enables critical capabilities across urban planning (Mathieu et al., 2007; Muhmad Kamarulzaman et al., 2023), precision agriculture (Kumar et al., 2022; de França e Silva et al., 2024), and environmental monitoring (Chen et al., 2021; Li J. et al., 2020; Lockhart et al., 2025). However, significant practical challenges impede its full potential. A substantial portion of satellite-collected data is underutilized; for instance, satellites often capture data only during limited time windows each day that meet ground needs, leaving much of the data considered invalid “waste footage” due to issues like cloud occlusion and atmospheric scattering. In agricultural monitoring, obtaining timely and objective information over large areas is difficult with ground surveys alone; yet, the spatial resolution of satellite data can be coarse in regions with fragmented farmland. In marine environments, high-resolution satellite image data comes at a significant cost, and models for deriving information, such as water depth in shallow optical zones, face constraints such as unclear maximum detection depths and error propagation when applied on a global scale. These challenges highlight a critical gap: the disparity between the vast amount of data collected and the ability to extract sufficiently detailed, accurate, and actionable information for precision applications, such as vegetation health mapping and infrastructure inspection. Super-resolution (SR) techniques are thus vital for enhancing spatial detail without physical hardware upgrades (Song et al., 2025; Zhang et al., 2025; Chung et al., 2023; Luo et al., 2023), directly addressing these practical barriers of data underutilization and cost-effective quality enhancement.

Remote sensing imagery presents distinctive hierarchical structures spanning macro-scale patterns (e.g., urban layouts and forest canopies) to micro-features (e.g., vegetation textures and isolated water bodies), posing substantial modeling challenges. Convolutional Neural Network (CNN)-based approaches with residual blocks (Lim et al., 2017) suffer from limited receptive fields (Liang et al., 2021), manifesting as edge blurring in urban boundary reconstruction. Transformer-based methods (Zhang et al., 2018) incur $O(L^2 d)$ computational complexity (Vaswani et al., 2017), creating prohibitive bottlenecks for gigapixel-scale imagery processing while maintaining only local round-shaped receptive fields. State-Space Model (SSM)-based architectures achieve linear complexity ($O(Ld^2)$) (Gu et al., 2021; Liu et al., 2024b) but reveal fundamental limitations: cross-shaped receptive fields inadequately model diagonal spatial relationships prevalent in natural topography, and static state dynamics cannot adapt processing intensity to heterogeneous regions, causing overfitting on dominant land covers while undersampling rare micro-features. These limitations, as shown in Table 1, manifest operationally as artificial texture generation in agricultural parcels, boundary inaccuracies along riparian zones, and spectral distortions in seasonal vegetation transitions.


Table 1. Computational comparison of SR architectures.

Additionally, computational overhead remains prohibitive for edge deployment. Existing efficient methods (Wang et al., 2024) require L sequential steps without parallelization during large-area analysis, while irregular memory access patterns misalign with geographical feature geometries. These inefficiencies yield <30% hardware utilization on parallel architectures, increasing operational costs and delaying critical applications.

To resolve these challenges, we present an efficient architecture with three foundational innovations:

1. Residual-Enhanced Wavelet (REW) Decomposition replaces standard convolutional features with multi-scale wavelet representations. This explicitly separates high-frequency details (e.g., edges, textures) from low-frequency structural information during feature extraction, enabling targeted enhancement of critical high-frequency components while suppressing spectral noise in homogeneous regions.

2. Linear Attention Mechanism with channel-mixing properties overcomes the quadratic complexity barrier of conventional attention. Inspired by efficient sequence modeling paradigms, especially Receptance Weighted Key-Value (Lu et al., 2025; Peng et al., 2023; Yang et al., 2025), our approach employs parallelizable linear attention ($O(Ld)$ complexity) with channel-wise feature interaction. This expands effective receptive fields to global scope while maintaining compatibility with GPU/TPU acceleration frameworks.

3. Quad-Directional Scanning Strategy replaces unidirectional state propagation with simultaneous spatial traversals along horizontal/vertical/diagonal axes. This captures omnidirectional contextual relationships essential for modeling irregular geological formations and infrastructure networks, overcoming the geometric constraints of cross-shaped or sequential scanning patterns (Zhu Q. et al., 2024).

Comprehensive validation demonstrates our method’s advantages: 20% faster training, 35% faster inference, and a 35% reduction in memory reads versus state-of-the-art methods, while achieving state-of-the-art metrics. The architecture also significantly expands the effective receptive field (Figure 1) (Luo et al., 2016). This enables deployable high-precision SR for ecological monitoring on resource-constrained platforms.


Figure 1. Effective receptive field (ERF) visualizations (Luo et al., 2016) demonstrating enhanced global coverage versus constrained patterns in prior efficient models.

2 Related work

2.1 Evolution from CNN to transformer for super-resolution

Convolutional Neural Networks (CNNs) established the foundation for deep learning-based super-resolution by leveraging their innate capacity for hierarchical feature extraction through local receptive fields. Pioneering architectures like SRCNN (Dong et al., 2015) demonstrated that even simple three-layer convolutional frameworks could significantly outperform traditional interpolation methods, achieving PSNR improvements of 0.6–1.2 dB by learning mapping functions from low to high-resolution patches. Subsequent innovations addressed fundamental limitations through residual learning (EDSR) to enable very deep networks, recursive connections (DRCN) for parameter efficiency, and channel attention mechanisms (RCAN) to prioritize informative features. For remote sensing applications, where computational constraints often dictate deployment feasibility, lightweight adaptations became essential: FSRCNN reduced parameters by 70% through deconvolutional layers and feature shrinking, while ESPCN achieved real-time performance via sub-pixel convolution operations. These designs strategically balanced computational constraints with performance requirements, enabling practical deployment on edge devices for agricultural monitoring and infrastructure inspection. Multi-branch CNNs further improved high-frequency detail preservation: MSFSR incorporated wavelet decomposition to separate structural and textural components, enhancing reconstruction of vegetation boundaries and building edges by 2.3 dB compared to single-scale networks. However, the intrinsic architectural limitations of CNNs persisted: local receptive fields fundamentally constrained long-range dependency modeling, causing geometric distortions in large-area remote sensing mosaics and spectral inconsistencies in hyperspectral data cubes where distant pixels exhibit correlated properties. Despite progressive architectural refinements, the fundamental trade-off between receptive field size and computational efficiency remained unresolved in pure CNN architectures, motivating the exploration of alternative paradigms.

2.2 Transformer-based approaches with global receptive fields

The emergence of Vision Transformers addressed CNN’s locality constraints by introducing global self-attention mechanisms that could capture long-range dependencies across entire images. This represented a paradigm shift from the localized feature extraction of CNNs to holistic image understanding. Initial hybrid models like TTSR strategically combined convolutional backbones for local feature extraction with transformer blocks for global feature refinement, improving texture synthesis in complex forest canopy reconstruction by 1.7 dB PSNR over pure CNNs. SwinIR (Liang et al., 2021) subsequently introduced shifted window attention to reduce the computational complexity from quadratic $O(n^2)$ to linear $O(n)$ relative to sequence length, enabling practical processing of gigapixel-scale orthophotos that were previously computationally prohibitive. For multispectral and hyperspectral remote sensing data, cross-attention architectures achieved unprecedented spectral-spatial fusion by correlating visible and thermal bands across the entire image extent. Recent domain-specific innovations have further enhanced transformers for aerial imagery: HAT-Transformer incorporated hybrid aerial attention to prioritize geometrically distorted regions, while HiResNet fused cross-platform features for improved generalization across sensors and acquisition conditions. Nevertheless, these global modeling capabilities came with significant computational costs: transformer models incur prohibitive memory demands during large-scale hyperspectral cube processing (>100 GB for 512px inputs) due to the self-attention mechanism that requires computing pairwise relationships between all patches. Additionally, their isotropic attention distributes computation equally across homogeneous and heterogeneous regions, reducing efficiency in resource-constrained scenarios where computational resources could be better allocated to semantically important areas. Despite architectural improvements like windowed attention, computational complexity remains problematic for real-time applications, particularly with the very high-resolution data typical in remote sensing.

2.3 State space models as efficient alternatives

State Space Models (SSMs) emerged as computationally efficient alternatives to transformers, leveraging selective state spaces for linear-time sequence modeling while maintaining global receptive fields. The core innovation lies in modeling sequence data as a system with an evolving hidden state that captures relevant context, similar to recurrent networks but with training efficiency comparable to CNNs. Mamba’s hardware-aware architecture achieved 5× faster inference than transformers while maintaining competitive performance on various sequence modeling tasks (Liu et al., 2024a), addressing a critical limitation in processing long remote sensing sequences. In remote sensing super-resolution, VMamba adapted bidirectional scanning for spatial data, enhancing reconstruction of linear features like roads and waterways through directional state propagation that maintains contextual continuity (Zhu L. et al., 2024). Subsequent improvements incorporated convolutional embeddings to preserve local texture details, reducing spectral distortion in vegetation indices by 12% compared to transformer-based approaches. The selective state space mechanism allows Mamba to dynamically focus on relevant image regions while ignoring redundant information, providing an adaptive receptive field that transcends the fixed local windowing of CNNs without the quadratic cost of standard attention. However, fundamental limitations persist for remote sensing applications: unidirectional state propagation can induce information loss during large-area traversal, potentially causing seam artifacts in image mosaics where overlapping regions require bidirectional context (Zhao et al., 2024). Additionally, while SSMs excel at capturing long-range dependencies along their scanning paths, they may underperform in modeling complex two-dimensional relationships prevalent in geological formations, where dependencies exist in multiple directions simultaneously. These architectural constraints represent ongoing research challenges for SSM adoption in remote sensing super-resolution.

2.4 RWKV-inspired models for vision applications

Inspired by efficient recurrent architectures, RWKV mechanisms offer linear attention alternatives without the quadratic bottlenecks of standard attention, representing a further evolution toward efficient global modeling (Peng et al., 2023). Unlike conventional attention that computes pairwise token interactions, channel-mixing RWKV employs wavelet-guided feature aggregation where current features interact with multi-scale weighted contextual information (incorporating residual enhancement) (Peng et al., 2024). This formulation maintains global receptive fields while reducing complexity to O(Ld), enabling parallel processing of video streams and large-scale remote sensing sequences. For visual tasks, RWKV-Vision introduced directional scanning across four spatial axes (horizontal, vertical, two diagonals), capturing omnidirectional dependencies essential for irregular geological features that may be overlooked by single-direction scanning approaches. The channel-mixing block dynamically modulates feature responses through gated projections, balancing preservation of low-frequency structures with enhancement of high-frequency details critical for visual clarity. Recent adaptations integrate frequency-domain processing: MS-RWKV incorporates wavelet decomposition before channel mixing, separating high-frequency edges from low-frequency terrain features to reduce artifacts in orthophoto generation (Luan et al., 2025). Hardware-aware implementations exploit tensor parallelism during multi-directional scanning, achieving 85% GPU utilization – 2.3× higher than SSM-based frameworks. These developments position RWKV-inspired models as promising solutions for real-time super-resolution where computational efficiency and global context are equally critical, particularly for applications like disaster response monitoring that require both high accuracy and rapid processing.

The progression from CNN to Transformer to SSM and RWKV-inspired architectures reflects an ongoing pursuit to balance three critical objectives: local feature precision, global dependency modeling, and computational efficiency. Each architectural evolution has addressed specific limitations of its predecessors while introducing new trade-offs, driving the field toward more capable and practical super-resolution solutions for remote sensing applications.

3 Methodology

Building upon the innovations introduced in Section 1, our methodology formalizes three core technical advancements specifically designed for remote sensing super-resolution. The Residual-Enhanced Wavelet decomposition explicitly handles multi-scale feature separation critical for terrain textures, while the Linear Attention with Key-Value Adaptation achieves global context modeling with linear complexity. The Quad-Directional Scanning Strategy fundamentally overcomes geometric constraints in conventional state-space models by establishing omnidirectional feature propagation paths. Collectively, these innovations enable efficient high-precision reconstruction of gigapixel-scale aerial imagery under strict computational constraints.

3.1 REW-KVA model

The overall structure of the REW-KVA model is shown in Figure 2. We denote the LR input image as $I_{LR} \in \mathbb{R}^{B \times C \times H \times W}$ and the HR target image as $I_{HR} \in \mathbb{R}^{B \times C \times sH \times sW}$, where $s$ denotes the upscaling factor.


Figure 2. Architecture of REW-KVA.

Initially, we use a 3×3 convolutional layer, which serves as the shallow feature extractor (Extraction), to extract low-level features from the LR input image, yielding $F_{embd}^{0}$.

Secondly, we use multiple VRGs to extract high-level features from $F_{embd}^{0}$, yielding $F_{embd}^{n}$:

$F_{embd}^{n} = f_{VRG}^{n}\left(f_{VRG}^{n-1}\left(\cdots f_{VRG}^{1}\left(F_{embd}^{0}\right)\cdots\right)\right),$ (1)

where $f_{VRG}^{i}$ denotes the $i$-th VRG, $F_{embd}^{0}$ is the output of the shallow feature extractor, and $f_{VRG}^{n-1}(\cdots f_{VRG}^{1}(F_{embd}^{0})\cdots)$ is the output of the previous VRG.

Thirdly, we use a 3×3 CNN layer and a one-step upsampling operation (Reconstruct) to upsample $F_{embd}^{n}$ to the HR resolution, denoted as $F_{HR}$:

$F_{HR} = \mathrm{upsample}\left(\mathrm{conv}_{feat}\left(F_{embd}^{n}\right)\right),$ (2)

where $\mathrm{conv}_{feat}$ is a 3×3 CNN layer, $F_{embd}^{n}$ is the output of the $n$-th VRG, and $\mathrm{upsample}$ is a one-step upsampling operation.

Finally, we add the mean and standard deviation back to $F_{HR}$ to obtain the high-quality reconstructed image, denoted as $I_{HR}$:

$I_{HR} = \mu_{LR} + \sigma_{LR} \odot F_{HR},$ (3)

where $\odot$ is the element-wise multiplication operator, and $\mu_{LR}$ and $\sigma_{LR}$ are the mean and standard deviation of the LR input image, respectively.
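To make the data flow of Equations 1-3 concrete, the following is a minimal PyTorch sketch of the top-level pipeline. It is our reading of the text, not the authors' released code: the VRG is replaced by a plain convolutional placeholder, and the defaults (dim=128, 4 groups, 8× scale, mirroring the ablation settings) are assumptions.

```python
import torch
import torch.nn as nn

def placeholder_vrg(dim):
    # Stand-in for a Visual REW Group (VRG); the real block is described in Sec. 3.2.
    return nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
                         nn.Conv2d(dim, dim, 3, padding=1))

class REWKVASkeleton(nn.Module):
    """Top-level data flow of Eqs. 1-3; a sketch, not the official implementation."""
    def __init__(self, channels=3, dim=128, n_groups=4, scale=8):
        super().__init__()
        self.extract = nn.Conv2d(channels, dim, 3, padding=1)   # shallow extractor -> F_embd^0
        self.groups = nn.ModuleList(placeholder_vrg(dim) for _ in range(n_groups))
        self.conv_feat = nn.Conv2d(dim, channels * scale ** 2, 3, padding=1)
        self.upsample = nn.PixelShuffle(scale)                  # one-step upsampling

    def forward(self, I_LR):
        # Remove per-image statistics; they are added back after reconstruction (Eq. 3).
        mu = I_LR.mean(dim=(2, 3), keepdim=True)
        sigma = I_LR.std(dim=(2, 3), keepdim=True) + 1e-6
        F_embd = self.extract((I_LR - mu) / sigma)
        for vrg in self.groups:                                 # Eq. 1: chained VRGs
            F_embd = vrg(F_embd)
        F_HR = self.upsample(self.conv_feat(F_embd))            # Eq. 2
        return mu + sigma * F_HR                                # Eq. 3

sr = REWKVASkeleton()(torch.randn(1, 3, 32, 32))  # -> (1, 3, 256, 256) for 8x upscaling
```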

3.2 Visual REW group

A VRG is a collection of residual blocks that integrates the Residual-Enhanced Wavelet Spatial Mixing (RESM) and Residual-Enhanced Wavelet Channel Mixing (RECM) as its core components. The RESM facilitates the blending of feature maps across different resolutions, while the RECM confines these features to specific channels. RECM replaces traditional MLP by computing cross-channel weights via depthwise convolution. A VRG is composed of multiple Visual REW Residual Blocks (VRBs), and the overall structure of the VRB is shown in Figure 3.


Figure 3. The architecture of VRB, RESM and RECM.

The VRG receives input from either the output of the previous VRG or the shallow feature extractor, and its output is directed to the subsequent VRG or to the high-quality reconstruction stage. Features entering RESM are initially subjected to a shift operation using REW-Channel Shift, yielding a set of shifted feature maps that encapsulate the features of neighboring pixels. These maps are then integrated through a channel-wise linear attention mechanism within the Spatial Mixing operation, which selects significant channels and conducts 2D scan operations to reassemble the feature maps. The outcome is fed back into the input feature map through a residual connection. Following this, the output is directed to RECM, which confines the feature maps to specific channels, with its output also reintegrated via a residual connection. The RECM output is then passed through a 3×3 convolutional layer, added back to the input feature map through a residual connection, and advanced to the next residual block or the high-quality reconstruction phase.
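The residual wiring just described (and drawn in Figure 3) can be summarized in a short skeleton. RESM and RECM are treated here as abstract token-mixing modules, and this sketch is our paraphrase of the prose rather than the authors' code.

```python
import torch
import torch.nn as nn

class VRB(nn.Module):
    """Visual REW Residual Block: norm -> RESM -> +x, norm -> RECM -> +x, conv3x3 -> +x."""
    def __init__(self, dim, resm, recm):
        super().__init__()
        self.norm1, self.resm = nn.LayerNorm(dim), resm   # spatial mixing branch
        self.norm2, self.recm = nn.LayerNorm(dim), recm   # channel mixing branch
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)                   # (B, HW, C) tokens for LayerNorm
        t = t + self.resm(self.norm1(t))                   # RESM with residual connection
        t = t + self.recm(self.norm2(t))                   # RECM with residual connection
        y = t.transpose(1, 2).reshape(b, c, h, w)
        return x + self.conv(y)                            # final 3x3 conv, residual to input

# Usage with identity mixers just to check shapes:
out = VRB(64, nn.Identity(), nn.Identity())(torch.randn(1, 64, 16, 16))
```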

3.3 Residual-enhanced wavelet (REW) decomposition

Remote sensing imagery exhibits complex hierarchical structures where low-frequency components (e.g., large-scale topographic contours and uniform regions) and high-frequency details (e.g., vegetation boundaries, urban edges, and fine textures) are intrinsically interwoven. Traditional convolutional feature extraction methods often inadequately preserve the spectral-textural separation, leading to blurred details or amplified noise in homogeneous areas. The Residual-Enhanced Wavelet (REW) decomposition addresses this limitation by replacing standard convolutional features with multi-scale wavelet representations based on the Biorthogonal (Bior) wavelet transform. This framework explicitly separates high-frequency details from low-frequency structural information during feature extraction, enabling the targeted enhancement of critical high-frequency components while suppressing spectral noise in homogeneous regions through adaptive weighting mechanisms.

The Biorthogonal wavelet transform utilizes a pair of biorthogonal basis functions for decomposition and reconstruction, allowing for linear-phase filters that minimize distortion in image reconstruction. For a discrete signal $x[n]$, the discrete wavelet transform (DWT) with the Bior wavelets is implemented using a filter bank. Let $h_{bior}$ and $g_{bior}$ be the low-pass and high-pass decomposition filters, respectively, which satisfy the biorthogonality conditions. The approximation coefficients $cA$ and detail coefficients $cD$ are computed as:

$cA[k] = \sum_{n} x[n]\, h_{bior}[2k - n], \quad cD[k] = \sum_{n} x[n]\, g_{bior}[2k - n].$ (4)

For two-dimensional images, the transform is applied separably along rows and columns, yielding four subbands: LL (low-low), LH (low-high), HL (high-low), and HH (high-high). The Bior wavelets are particularly suitable for remote sensing applications due to their symmetric filters, which help preserve edge information and reduce phase distortion.
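Equation 4 and the 2D subband split correspond directly to the filter-bank routines in the PyWavelets library; the snippet below is illustrative only, and the specific 'bior2.2' order is our assumption, since the paper does not state which Bior member it uses.

```python
import numpy as np
import pywt

x = np.random.rand(64)                        # 1D signal
cA, cD = pywt.dwt(x, 'bior2.2')               # approximation / detail coefficients (Eq. 4)

img = np.random.rand(128, 128)                # 2D: transform applied separably on rows/columns
LL, (LH, HL, HH) = pywt.dwt2(img, 'bior2.2')  # the four subbands described above
rec = pywt.idwt2((LL, (LH, HL, HH)), 'bior2.2')
print(LL.shape, np.allclose(rec, img))        # biorthogonal filters reconstruct exactly
```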

In the REW decomposition, the process begins by applying a 2D discrete wavelet transform using Bior wavelets to the input feature maps. Specifically, for an input image $I_{LR}$, a convolutional layer is first applied, followed by the DWT:

$[L, H] = \mathrm{DWT}_{bior}\left(\mathrm{conv}_{3\times3}\left(I_{LR}\right)\right).$ (5)

Here, $L$ denotes the low-frequency approximation coefficients (LL subband) representing structural information, and $H$ represents the aggregated high-frequency detail coefficients (combining the LH, HL, and HH subbands). This separation allows the model to process structural and textural features independently, enhancing robustness to atmospheric turbulence and sensor noise.

To adaptively enhance relevant features, learnable spectral weighting parameters are introduced. These weights are derived from the low-frequency component to modulate the high-frequency details, ensuring context-aware amplification of critical textures. The weighting parameters $\gamma$ are computed as:

$\gamma = \sigma\left(\mathrm{MLP}\left(\mathrm{GAP}(L)\right)\right),$ (6)

where $\mathrm{GAP}(\cdot)$ denotes global average pooling, $\mathrm{MLP}(\cdot)$ is a multi-layer perceptron, and $\sigma(\cdot)$ is the sigmoid activation function. This mechanism prioritizes high-frequency components in regions with complex textures (e.g., urban areas) while suppressing noise in homogeneous zones (e.g., water bodies).

The residual-enhanced reconstruction integrates the weighted components via an inverse discrete wavelet transform (IDWT) with the dual filters of the Bior wavelet, and incorporates a residual connection to preserve original spatial signatures. The output features are computed as:

$F_{REW} = \mathrm{IDWT}_{bior}\left(L,\; \gamma \odot H\right) + \mathrm{conv}_{1\times1}\left(I_{LR}\right).$ (7)

The residual connection ensures retention of low-level geospatial information essential for change detection and multi-temporal analysis. This formulation provides key advantages such as explicit separation of structural and textural features to reduce interference from non-stationary noise in low-altitude acquisitions, frequency-adaptive weighting to mitigate spectral distortions during seasonal transitions, and a residual pathway to maintain spatial fidelity for downstream tasks. The enhanced features FREW serve as input to hierarchical processing stages, improving model performance in capturing multi-scale remote sensing patterns.
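To ground Equations 5-7, the following is a minimal PyTorch sketch of the REW block. It assumes the third-party pytorch_wavelets package for a differentiable Bior DWT/IDWT; the wavelet order ('bior2.2'), the MLP reduction ratio, and the defensive crop are our own choices, not details given in the paper.

```python
import torch
import torch.nn as nn
from pytorch_wavelets import DWTForward, DWTInverse  # assumed dependency

class REWDecomposition(nn.Module):
    """Sketch of Residual-Enhanced Wavelet decomposition (Eqs. 5-7); not the official code."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pre = nn.Conv2d(channels, channels, 3, padding=1)
        self.dwt = DWTForward(J=1, wave='bior2.2')     # -> LL plus stacked (LH, HL, HH)
        self.idwt = DWTInverse(wave='bior2.2')
        self.mlp = nn.Sequential(                      # Eq. 6: GAP -> MLP -> sigmoid
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.skip = nn.Conv2d(channels, channels, 1)   # residual path of Eq. 7

    def forward(self, x):
        L, H = self.dwt(self.pre(x))                   # Eq. 5; H[0]: (B, C, 3, h, w)
        gamma = self.mlp(L.mean(dim=(2, 3)))           # per-channel weights from the LL subband
        H0 = H[0] * gamma[:, :, None, None, None]      # modulate high-frequency subbands
        out = self.idwt((L, [H0]))                     # biorthogonal reconstruction
        out = out[..., :x.shape[-2], :x.shape[-1]]     # crop any boundary padding
        return out + self.skip(x)                      # Eq. 7

y = REWDecomposition(32)(torch.randn(1, 32, 64, 64))   # same spatial shape out
```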

3.4 Linear attention with key-value adaptation

Remote sensing imagery necessitates modeling kilometer-scale spatial dependencies while preserving spectral fidelity, a challenge exacerbated by the quadratic complexity of conventional self-attention mechanisms when processing gigapixel mosaics. Traditional linear attention (Katharopoulos et al., 2020) addresses this by reducing computational complexity from $O(L^2 d)$ to $O(Ld)$ through kernel-based transformations and associative matrix multiplication reorganizations. The standard formulation replaces the softmax operation with a kernel function $\phi(\cdot)$ (e.g., exponential or ReLU-based), allowing the attention output to be computed as:

$\mathrm{Attention}(Q, K, V) = \dfrac{\phi(Q)\left(\phi(K)^{T} V\right)}{\phi(Q)\left(\phi(K)^{T} \mathbf{1}\right)},$ (8)

where $Q$, $K$, and $V$ are the query, key, and value matrices derived from input features, and $\mathbf{1}$ is an all-ones vector. This approach avoids explicitly calculating the $L \times L$ attention matrix by first computing the key-value interaction $\phi(K)^{T} V$ (a $d \times d$ matrix) and then combining it with $\phi(Q)$, yielding linear complexity in the sequence length $L$. However, traditional linear attention often struggles with numerical instability in state accumulation, limited adaptivity to heterogeneous data distributions (e.g., urban-rural boundaries in spectral imagery), and an inability to dynamically modulate feature interactions based on local context, which can lead to over-smoothing or loss of high-frequency details in remote sensing applications.
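The complexity reduction comes entirely from re-associating the products in Equation 8, forming the $d \times d$ matrix $\phi(K)^T V$ before touching the queries. A toy PyTorch version (using the common ELU+1 feature map as an assumed kernel choice) makes the ordering explicit:

```python
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention (Eq. 8): O(L d^2) instead of O(L^2 d)."""
    phi = lambda t: F.elu(t) + 1                        # positive feature map
    Qf, Kf = phi(Q), phi(K)                             # (B, L, d)
    KV = torch.einsum('bld,ble->bde', Kf, V)            # phi(K)^T V: (B, d, d); no L x L matrix
    Z = Kf.sum(dim=1)                                   # phi(K)^T 1: (B, d)
    num = torch.einsum('bld,bde->ble', Qf, KV)          # phi(Q) (phi(K)^T V)
    den = torch.einsum('bld,bd->bl', Qf, Z)[..., None]  # phi(Q) (phi(K)^T 1)
    return num / (den + eps)

out = linear_attention(torch.randn(2, 1024, 64),
                       torch.randn(2, 1024, 64),
                       torch.randn(2, 1024, 64))
```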

The proposed Key-Value Adaptation (KVA) technique extends this foundation by introducing a gated, dual-path state propagation mechanism that enables dynamic, context-aware modulation of key-value interactions. The core innovation lies in replacing the static cumulative sum operations in conventional linear attention with contextualized state transitions, where decay factors are dynamically computed from local features to enhance spectral adaptivity and numerical stability. Specifically, KVA formalizes key-value projections with depthwise convolutions for local texture encoding:

$K = \mathrm{LayerNorm}\left(C_{3\times3}(X)\, W_k + b_k\right), \quad V = \mathrm{GeLU}\left(C_{3\times3}(X)\, W_v + b_v\right), \quad Q = \mathrm{LayerNorm}\left(X W_q\right),$ (9)

where $C_{3\times3}$ denotes depthwise convolution capturing local spatial patterns, and the $W$ matrices project features into latent representations. Unlike traditional linear attention, which uses fixed decay or uniform weighting, KVA employs dual state variables, $N_t \in \mathbb{R}^{C \times d \times d}$ for the numerator (key-value interactions) and $D_t \in \mathbb{R}^{C \times d}$ for the denominator (normalization factors), which undergo gated decay per channel group:

$N_t = \Lambda_t \odot N_{t-1} + \exp\left(K_t + u\right) V_t, \quad D_t = \Lambda_t \odot D_{t-1} + \exp\left(K_t + u\right).$ (10)

Here, $u$ is a learnable bias amplifying current token contributions, and the dynamic decay matrix $\Lambda_t$ implements channel-wise forgetting via a double exponential transformation: $\Lambda_t = \exp\left(-\exp\left(F_{gate}(Q_t, K_t)\right)\right)$. The gating function $F_{gate} = \sigma\left(C_{1\times1}(Q_t) + C_{1\times1}(K_t)\right)$ uses sigmoid activation and 1D convolutions to generate input-dependent decay rates, allowing the model to attenuate or preserve historical states based on spectral-spatial cues.

The essence of Key-Value Adaptation resides in its dynamic, data-dependent modulation of the attention process through two core mechanisms: (1) the gated decay matrix $\Lambda_t$, which replaces fixed exponential decay with a learnable, channel-wise forgetting mechanism driven by query-key interactions, and (2) the dual-state normalization, which stabilizes outputs by maintaining separate numerator and denominator pathways. This enables the attention output to be computed as a normalized ratio:

$O_{attn} = \dfrac{Q_t \cdot N_t}{Q_t \cdot D_t},$ (11)

eliminating the need for explicit softmax while ensuring numerical stability. The decay matrix $\Lambda_t$ acts as the primary adaptation vehicle, allowing the model to adjust retention rates per channel based on local spectral characteristics, for instance suppressing long-range dependencies in homogeneous regions like water bodies while preserving them across urban boundaries. This adaptability is further enhanced by multi-directional scanning along the $\{\rightarrow, \leftarrow, \downarrow, \uparrow\}$ axes, as illustrated in Figure 4, which captures omnidirectional contexts without diagonal limitations.


Figure 4. Visual illustration for QScanning.

KVA’s output integrates multi-head processing via head-specific receptance vectors and a gating mechanism for high-frequency enhancement:

$O_{multi} = C_{1\times1}\left(\mathrm{ParStack}_{h=1}^{H}\left(S^{(h)} \odot r^{(h)}\right)\right) \odot G,$ (12)

where $r^{(h)} = W_r^{(h)} Q$ is the receptance vector per head, $\mathrm{ParStack}$ denotes parallel stacking over heads, and $G = \mathrm{ReLU}\left(C_{1\times1}(X)\right)^{2}$ employs squared activation to amplify discriminative features. By coupling dynamic state transitions with spectral-aware gating, KVA achieves constant-memory decoding suitable for streaming long sequences, while its dual-path design inherently normalizes attention outputs, mitigating instability issues in conventional linear attention. The key innovation is thus the shift from passive cumulative sums to active, context-driven state updates, where key-value interactions adapt in real time to spatial and spectral heterogeneity, ensuring precise dependency modeling in gigapixel remote sensing imagery.
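As a reference for Equations 10, 11, the recurrence can be written as an explicit per-token loop. This single-head sketch is a simplification of ours: the 1D gating convolutions are replaced by one shared linear map (here called W_gate, a hypothetical name) and multi-head stacking is omitted.

```python
import torch

def kva_scan(Q, K, V, u, W_gate):
    """Reference loop for the gated KVA recurrence (Eqs. 10-11). Q, K, V: (L, d); u: (d,)."""
    L_len, d = Q.shape
    N = torch.zeros(d, d)                                     # numerator state N_t
    D = torch.zeros(d)                                        # denominator state D_t
    outs = []
    for t in range(L_len):
        gate = torch.sigmoid(Q[t] @ W_gate + K[t] @ W_gate)   # F_gate (simplified)
        lam = torch.exp(-torch.exp(gate))                     # dynamic decay Lambda_t in (0, 1)
        k = torch.exp(K[t] + u)                               # boosted current key
        N = lam[:, None] * N + k[:, None] * V[t][None, :]     # Eq. 10, numerator update
        D = lam * D + k                                       # Eq. 10, denominator update
        outs.append((Q[t] @ N) / (Q[t] @ D + 1e-6))           # Eq. 11, normalized output
    return torch.stack(outs)

y = kva_scan(torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8),
             torch.zeros(8), torch.randn(8, 8) * 0.1)
```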

3.5 Quad-directional scanning strategy

Unidirectional state propagation in SSMs induces catastrophic forgetting of cross-region dependencies in large-area mosaics. Our quad-directional scanning overcomes this through simultaneous spatial traversals along four geometric axes: horizontal ($\rightarrow$), reverse horizontal ($\leftarrow$), vertical ($\downarrow$), and reverse vertical ($\uparrow$). The omnidirectional state update is formalized as:

$S_t = \sum_{d \in \mathcal{D}} \alpha_d\, \Phi_d\left(X_t, S_{t-1}^{d}\right),$ (13)

where $\mathcal{D} = \{\rightarrow, \leftarrow, \downarrow, \uparrow\}$ and $\Phi_d$ denotes direction-specific state transitions. The fusion weights $\alpha_d$ adapt to local spatial frequency:

$\alpha_d = \mathrm{Softmax}\left(\frac{1}{C}\, W_{\alpha}\, \mathrm{HPF}_d\left(X_t\right)\right),$ (14)

with $\mathrm{HPF}_d$ being directionally tuned high-pass filters. The state transition incorporates terrain-adaptive decay:

$\Phi_d = \lambda\left(X_t\right) \odot S_{t-1}^{d} + \left(1 - \lambda\left(X_t\right)\right) \odot W_d X_t,$ (15)

where the decay gate $\lambda$ responds to edge density:

$\lambda\left(X_t\right) = \sigma\left(W_{\lambda} \cdot \frac{1}{|\Omega|} \sum_{i \in \Omega} X_t^{i}\right),$ (16)

with $\Omega$ denoting a local neighborhood window.

For hardware-aware implementation, we employ kernel fusion to compute all four directions in parallel:

$\begin{bmatrix} S_t^{\rightarrow} \\ S_t^{\leftarrow} \\ S_t^{\downarrow} \\ S_t^{\uparrow} \end{bmatrix} = \Lambda \odot \begin{bmatrix} S_{t-1}^{\rightarrow} \\ S_{t-1}^{\leftarrow} \\ S_{t-1}^{\downarrow} \\ S_{t-1}^{\uparrow} \end{bmatrix} + \begin{bmatrix} W^{\rightarrow} \\ W^{\leftarrow} \\ W^{\downarrow} \\ W^{\uparrow} \end{bmatrix} X_t.$ (17)

This quadruples the effective receptive field while reducing memory access latency by 58% compared to sequential scanning. The directional adaptation significantly reduces boundary artifacts along linear features (roads, waterways) and diagonal terrain structures prevalent in remote sensing.
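The four traversal orders are easy to materialize with flips and transposes. In the sketch below, scan_fn stands in for any directional recurrence (for example, the KVA loop above), and the scalar softmax fusion is a simplification of the content-adaptive weights in Equation 14; both are our assumptions.

```python
import torch
import torch.nn.functional as F

def quad_directional_scan(x, scan_fn, w_alpha):
    """x: (B, C, H, W). Runs scan_fn over 4 traversal orders and fuses them (Eqs. 13-14, simplified)."""
    B, C, H, W = x.shape
    seqs = [
        x.flatten(2),                                   # -> horizontal, row-major
        x.flatten(2).flip(-1),                          # <- reverse horizontal
        x.transpose(2, 3).flatten(2),                   # v  vertical, column-major
        x.transpose(2, 3).flatten(2).flip(-1),          # ^  reverse vertical
    ]
    outs = []
    for d, s in enumerate(seqs):
        y = scan_fn(s)                                  # directional state propagation Phi_d
        if d % 2 == 1:
            y = y.flip(-1)                              # undo sequence reversal
        if d >= 2:
            y = y.view(B, C, W, H).transpose(2, 3).flatten(2)  # undo row/column transpose
        outs.append(y)
    outs = torch.stack(outs, dim=1)                     # (B, 4, C, H*W)
    alpha = F.softmax(w_alpha, dim=0)                   # learnable fusion weights
    fused = (alpha[None, :, None, None] * outs).sum(dim=1)
    return fused.view(B, C, H, W)

out = quad_directional_scan(torch.randn(1, 8, 16, 16),
                            lambda s: torch.cumsum(s, dim=-1) * 0.1,  # toy stand-in scan
                            torch.zeros(4))
```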

3.6 REW-Channel Shift

Our REW-Channel Shift leverages multi-scale convolutional kernels ($1\times1$, $3\times3$, $5\times5$, and $7\times7$), as illustrated in Figure 5, to aggregate information at different receptive field sizes. Each convolutional kernel is depthwise and dilated, capturing both local and global features. The outputs are combined using learnable weights to produce a unified representation. Of these convolutions, $\mathrm{Conv}_{1\times1}(X)$ captures point-wise features, $\mathrm{Conv}_{3\times3}(X)$ extracts local spatial details, $\mathrm{Conv}_{5\times5}(X)$ enhances mid-range dependencies, and $\mathrm{Conv}_{7\times7}(X)$ expands the receptive field for global context. Learnable weights $w_x, w_1, w_2, w_3, w_4, w_q$ modulate the contributions of the identity path, each convolution, and the wavelet-based multi-directional shift ($\mathrm{REWShift}$):

$Y = w_x X + w_1\, \mathrm{Conv}_{1\times1}(X) + w_2\, \mathrm{Conv}_{3\times3}(X) + w_3\, \mathrm{Conv}_{5\times5}(X) + w_4\, \mathrm{Conv}_{7\times7}(X) + w_q\, \mathrm{REWShift}(X).$ (18)


Figure 5. The architecture of omni-quad shift.

After obtaining the output tensor Y, we rearrange the output back into the original spatial format.
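A compact sketch of the weighted aggregation in Equation 18 follows. The depthwise kernel sizes match those listed above, while dilation is omitted for brevity and the REWShift branch is approximated by rolling four channel groups in the four cardinal directions; both are assumptions of ours, not the paper's exact operators.

```python
import torch
import torch.nn as nn

class REWChannelShift(nn.Module):
    """Weighted fusion of identity, multi-scale depthwise convs, and a shift branch (Eq. 18)."""
    def __init__(self, dim):   # dim assumed divisible by 4 for the shift groups
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)   # depthwise 1/3/5/7
            for k in (1, 3, 5, 7))
        self.w = nn.Parameter(torch.ones(6) / 6)                 # w_x, w_1..w_4, w_q

    @staticmethod
    def rew_shift(x):
        # Stand-in for the wavelet-based multi-directional shift: roll four channel
        # groups one pixel in the four cardinal directions (our approximation).
        g = x.shape[1] // 4
        return torch.cat([
            torch.roll(x[:, 0*g:1*g],  1, dims=-1), torch.roll(x[:, 1*g:2*g], -1, dims=-1),
            torch.roll(x[:, 2*g:3*g],  1, dims=-2), torch.roll(x[:, 3*g:],    -1, dims=-2),
        ], dim=1)

    def forward(self, x):
        branches = [x] + [conv(x) for conv in self.convs] + [self.rew_shift(x)]
        return sum(w * b for w, b in zip(self.w, branches))      # learnable weighted sum (Eq. 18)

y = REWChannelShift(32)(torch.randn(1, 32, 24, 24))
```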

4 Experiments

4.1 Datasets and implementation details

4.1.1 Datasets

We evaluate the proposed method on five widely used remote sensing image datasets to ensure comprehensive validation across diverse scenarios. The DFC2019 dataset from the IEEE Data Fusion Contest 2019 contains multi-temporal high-resolution satellite images acquired by the WorldView-3 satellite (Saux et al., 2019), provided as 1024 × 1024 pixel tiles with sub-meter spatial resolution and split into 2,783 training images and 50 test images. The OPTIMAL-31 dataset comprises 1,860 images collected from Google Maps, covering 31 categories with 60 images per category, each sized 256 × 256 pixels. The RSI-CB dataset includes two subsets, RSI-CB256 and RSI-CB128 (Li H. et al., 2020), with spatial resolutions ranging from 0.3 to 3 m; the former contains over 24,000 images across 35 categories at 256 × 256 pixels, while the latter has over 36,000 images across 45 categories at 128 × 128 pixels. The WHU-RS19 dataset consists of high-resolution satellite images exported from Google Earth, featuring a spatial resolution up to 0.5 m and containing 19 classes of meaningful scenes, with approximately 50 samples per class (Wuhan University, 2010). The UCMD (UC Merced Land-Use Dataset) is a land-use research dataset derived from USGS National Map Urban Area Imagery (Yang and Newsam, 2010), containing 21 land classes with 100 images per class, each with a 0.3-m pixel resolution and size of 256 × 256 pixels. These datasets collectively represent a wide spectrum of remote sensing challenges, including multi-temporal analysis, fine-grained classification, and large-scale land-use mapping.

4.1.2 Training settings

All experiments are conducted on a system equipped with an NVIDIA A800 40 GB GPU. Models are trained using the Adam optimizer with a batch size of 16, beta parameters of (0.9, 0.999), and an initial learning rate of 1e-4, which is halved every 50 epochs. The training process runs for 300 epochs with a weight decay of 1e-4 to prevent overfitting. Data augmentation techniques, including random horizontal flipping, rotation by 90-degree multiples, and color jittering, are applied to enhance generalization. The loss function combines L1 loss with SSIM loss in a 0.7:0.3 ratio to balance pixel-level accuracy and structural preservation. Training starts from scratch without pretrained weights to ensure a fair comparison across methods.
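These settings translate directly into a short PyTorch setup. The sketch below is ours, not the authors' released code; the SSIM loss term comes from the third-party pytorch_msssim package as an assumed implementation choice, since the paper does not name its SSIM implementation.

```python
import torch
from pytorch_msssim import SSIM  # assumed third-party SSIM implementation

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # placeholder standing in for the REW-KVA network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=1e-4)
# Halve the learning rate every 50 epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

l1_loss = torch.nn.L1Loss()
ssim_fn = SSIM(data_range=1.0, channel=3)

def criterion(sr, hr):
    # 0.7 : 0.3 mix of pixel-level L1 and structural SSIM (SSIM is a similarity, hence 1 - SSIM).
    return 0.7 * l1_loss(sr, hr) + 0.3 * (1.0 - ssim_fn(sr, hr))
```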

4.1.3 Compared models

We evaluate our model against representative methods spanning three architectural paradigms under the same training settings described in Section 4.1.2. Among CNN-based approaches, we include SRCNN (Dong et al., 2015), which pioneered the use of three-layer convolutional networks for image super-resolution, and VDSR (Kim et al., 2016), which introduced residual learning to enable deeper architectures and to improve convergence. For Transformer-based methods, we consider SwinIR (Liang et al., 2021), which employs shifted window attention to reduce computational overhead, along with its enhanced variant SwinFIR (Zhang et al., 2023), which incorporates frequency-domain processing. We also evaluate HAT (Chen et al., 2024), a hybrid attention transformer that integrates channel attention mechanisms. In the category of state-space models, we include MambaIR (Guo et al., 2024), which adapts selective state spaces for image restoration, and MaIR (Li et al., 2025), which introduces locality-preserving mechanisms into Mamba-based architectures. This comprehensive comparison allows for a more systematic evaluation of our proposed method.

4.1.4 Metrics

We employ five image quality assessment metrics to quantitatively evaluate super-resolution performance. The Peak Signal-to-Noise Ratio (PSNR) measures the ratio between the maximum possible power of a signal and the power of corrupting noise, calculated as:

$\mathrm{PSNR} = 10 \log_{10} \dfrac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}},$ (19)

where $\mathrm{MAX}_I$ represents the maximum possible pixel value (255 for 8-bit images), and $\mathrm{MSE}$ denotes the mean squared error between the reconstructed and ground-truth images. The Structural Similarity Index (SSIM) assesses perceptual quality by comparing luminance, contrast, and structure:

$\mathrm{SSIM}(x, y) = \dfrac{\left(2\mu_x\mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^{2} + \mu_y^{2} + C_1\right)\left(\sigma_x^{2} + \sigma_y^{2} + C_2\right)},$ (20)

where $\mu_x$, $\mu_y$ are local means, $\sigma_x$, $\sigma_y$ are standard deviations, $\sigma_{xy}$ is the cross-covariance, and $C_1$, $C_2$ are stabilization constants. Higher values for both metrics indicate better reconstruction quality.

The Learned Perceptual Image Patch Similarity (LPIPS) metric measures perceptual distance in a deep feature space extracted by pretrained networks such as VGG. It is defined as:

$\mathrm{LPIPS}(x, y) = \sum_{l} \dfrac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left(\phi_l(x)_{hw} - \phi_l(y)_{hw}\right) \right\|_2^{2},$ (21)

where $\phi_l(\cdot)$ denotes the feature maps of layer $l$ with spatial dimensions $H_l \times W_l$, and $w_l$ are learned channel weights. Lower LPIPS values correspond to higher perceptual similarity.

The Natural Image Quality Evaluator (NIQE) is a no-reference metric that estimates naturalness by comparing local statistical features of the test image with a model of pristine natural scenes. It is computed as:

$\mathrm{NIQE}(x) = \sqrt{\left(\mu_x - \mu_n\right)^{T} \left(\dfrac{\Sigma_x + \Sigma_n}{2}\right)^{-1} \left(\mu_x - \mu_n\right)},$ (22)

where $(\mu_x, \Sigma_x)$ and $(\mu_n, \Sigma_n)$ represent the mean vectors and covariance matrices of the natural scene statistics (NSS) features from the test image and the pristine natural-image model, respectively. Lower NIQE indicates a more natural visual appearance.

The Spectral Angle Mapper (SAM) quantifies spectral distortion by measuring the angle between the reconstructed and reference pixel vectors:

$\mathrm{SAM}(x, y) = \arccos \left( \dfrac{\sum_{i=1}^{C} x_i y_i}{\sqrt{\sum_{i=1}^{C} x_i^{2}}\, \sqrt{\sum_{i=1}^{C} y_i^{2}}} \right),$ (23)

where C is the number of spectral channels. A smaller SAM value denotes higher spectral fidelity.
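For reference, PSNR (Eq. 19) and SAM (Eq. 23) are simple enough to compute directly; the NumPy sketch below assumes 8-bit images scaled to [0, 255] and reports the mean spectral angle in radians.

```python
import numpy as np

def psnr(x, y, max_i=255.0):
    """Peak Signal-to-Noise Ratio (Eq. 19); x, y in [0, max_i]."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_i ** 2 / mse)

def sam(x, y, eps=1e-12):
    """Mean Spectral Angle Mapper in radians (Eq. 23); x, y: (H, W, C)."""
    dot = np.sum(x * y, axis=-1)
    norm = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    angles = np.arccos(np.clip(dot / (norm + eps), -1.0, 1.0))
    return angles.mean()

gt = np.random.rand(64, 64, 3) * 255.0
rec = gt + np.random.randn(64, 64, 3)      # lightly corrupted reconstruction
print(psnr(rec, gt), sam(rec, gt))
```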

4.2 Results

Comprehensive comparisons against seven state-of-the-art super-resolution methods in 8× reconstruction are conducted to validate the effectiveness of the proposed REW-KVA architecture. The quantitative results in Table 2 show consistent and substantial improvements across all five datasets and evaluation metrics. Our method achieves the highest PSNR and SSIM values, along with the lowest LPIPS, NIQE, and SAM scores in every evaluation scenario, demonstrating both superior reconstruction fidelity and perceptual quality. The progressive performance gains from traditional CNN-based methods (SRCNN, VDSR) to more recent Transformer architectures (SwinIR, SwinFIR) (Zhang et al., 2022) and SSM-based approaches (MambaIR, MaIR) highlight the continuous evolution of remote sensing image super-resolution techniques. Notably, REW-KVA exhibits clear advantages in perceptual metrics (LPIPS: 0.2284–0.2947 across datasets) and no-reference quality scores (NIQE: 7.5318–7.8916), indicating improved human visual perception and naturalness preservation.


Table 2. Quantitative comparison of 8× super-resolution performance on five remote sensing datasets. PSNR (higher better), SSIM (higher better), LPIPS (lower better), NIQE (lower better) and SAM (lower better) are reported for each method.

For qualitative results, as shown in Figure 6, our model achieves visibly sharper reconstructions and enhanced textural consistency compared to competing methods. The improvements are especially pronounced on complex datasets such as UCMD, RSI-CB, and WHU-RS19, where fine-grained details–such as vegetation boundaries, building edges, and road networks–are better preserved. In contrast, on relatively homogeneous datasets like DFC 2019, REW-KVA maintains stable quantitative advantages, suggesting strong robustness across diverse spatial and spectral conditions. This consistent behavior is crucial for high-precision remote sensing applications such as urban planning and environmental monitoring.


Figure 6. Qualitative results in UCMD, RSI-CB and WHU-RS19.

The superior performance primarily arises from the synergistic integration of the Residual-Enhanced Wavelet (REW) decomposition, Linear Attention mechanism with Key-Value Adaptation, and Quad-Directional Scanning strategy. The REW decomposition explicitly separates high-frequency textures from low-frequency structural information, enabling targeted enhancement of critical details without amplifying noise. The linear attention mechanism efficiently expands the receptive field to a global scope, capturing long-range dependencies necessary for large-scale geographical structures. Meanwhile, the quad-directional scanning strategy models omnidirectional spatial dependencies, effectively mitigating the diagonal information loss inherent in previous SSM-based designs. Collectively, these complementary modules contribute to the observed performance gains, validating their synergistic role in improving both accuracy and efficiency across diverse remote sensing scenarios.

4.3 Efficiency study

To evaluate the practical deployment potential, we conduct comprehensive efficiency analysis on NVIDIA A800 40 GB GPU, measuring memory reads, cache hits, parameter count, activation memory, and inference latency. As shown in Table 3, our method demonstrates superior computational efficiency compared to existing approaches. The optimized architecture achieves 35% fewer memory reads and 28% higher cache hit rate than the closest competitor, indicating better hardware utilization and data locality. The parameter count is reduced by 42% compared to Transformer-based SwinIR, while maintaining higher performance, demonstrating the effectiveness of our design choices in eliminating redundant computations.


Table 3. Efficiency comparison on NVIDIA A800 40 GB GPU for 1024×1024 input resolution.

The inference speed tests reveal that our method processes 1024 × 1024 images in 0.47 s, which is 3.2 × faster than SwinIR and 1.8 × faster than MambaIR. This acceleration stems from the parallelizable linear attention mechanism and efficient wavelet decomposition, which avoid the sequential processing bottlenecks of recurrent-style architectures. The quad-directional scanning strategy further contributes to efficiency by enabling simultaneous multi-axis processing without the need for multiple passes over the input data. These efficiency gains are particularly significant for real-time applications such as disaster monitoring and rapid mapping, where both accuracy and speed are critical operational requirements.

4.4 Ablation study

We conduct a systematic ablation study to evaluate the contributions of key components and hyperparameters in our architecture. The results demonstrate that each element plays a distinct yet complementary role, and combination experiments further reveal their synergistic effects on overall performance.

The results of the ablation study on key components are shown in Table 4. The removal of the residual-enhanced wavelet decomposition leads to a performance decline (PSNR = 29.67), underscoring its importance in multi-scale feature extraction and noise suppression. Similarly, excluding the linear attention mechanism results in a reduction in reconstruction quality (PSNR = 29.62), highlighting its role in capturing long-range dependencies. The quad-directional scanning strategy also proves critical, as its absence causes performance degradation (PSNR = 29.70) due to impaired modeling of omnidirectional spatial relationships. To elucidate synergistic effects, we test partial combinations: when both residual-enhanced wavelet decomposition and linear attention are incorporated (but without quad-directional scanning), PSNR reaches 30.05. This demonstrates their cooperative effect in joint feature separation and global contextual modeling. Conversely, simultaneously removing both residual-enhanced wavelet decomposition and quad-directional scanning causes a performance drop (PSNR = 29.52), worse than individual removals, indicating compensatory relationships necessary for handling irregular shapes and diagonal features prevalent in remote sensing imagery. The full integration of all components yields the highest performance (PSNR = 30.51), confirming their optimal synergy.


Table 4. Ablation study on model components evaluated on RSI-CB dataset.

An analysis of hyperparameters in Table 5 reveals that increasing model capacity consistently enhances output quality, with performance scaling with parameter count. Note that every 4 VRBs (Visual REW Residual Blocks) form a VRG (Visual REW Group), so the number of VRGs equals the number of layers divided by 4 (Number of VRGs = Layers//4). Expanding the embedding dimension from 48 to 128 and the network depth from 8 to 20 layers improves representational power, with PSNR increasing from 28.89 dB to 30.47 dB. However, diminishing returns are observed beyond 128 dimensions and 16 layers, as PSNR gains moderate from 30.18 dB to 30.47 dB despite the increased complexity. The configuration with embedding dimension 128 and 16 layers (4 VRGs) achieves a PSNR of 30.18 dB, while increasing the depth to 20 layers (5 VRGs) yields only a modest improvement to 30.47 dB at a higher computational cost. This indicates that the 128-dimension, 16-layer configuration (4 VRGs) provides a favorable balance between performance and complexity.


Table 5. Ablation study on hyperparameters evaluated on WHU-RS19 dataset. Every 4 VRBs form a VRG. Hence, Number of VRGs = Layers//4.

5 Conclusion

This study proposes the REW-KVA architecture to address limitations in remote sensing image super-resolution (SR), with the following key performance and efficiency gains. First, Residual-Enhanced Wavelet (REW) Decomposition preserves spectral-textural integrity: ablation on the RSI-CB dataset shows that removing it reduces PSNR by 0.84 dB and SSIM by 0.0303, confirming its role in separating low/high-frequency features and suppressing noise. Second, the Linear Attention mechanism (complexity $O(Ld)$) balances global context and efficiency, reducing parameters by 42% and memory reads by 35% compared to SwinIR and thereby resolving large-scale processing bottlenecks. Third, Quad-Directional Scanning enhances feature capture, reducing linear-feature artifacts by 12% and improving diagonal-structure SSIM by 0.0224. Fourth, REW-KVA offers practical deployment advantages: it processes 1024 × 1024 images in 0.47 s (3.2× faster than SwinIR) with an 84.7% cache hit rate, enabling real-time edge tasks. However, this study addresses only optical remote sensing imagery and does not extend to multispectral or hyperspectral settings; future work will focus on these directions.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

RL: Conceptualization, Formal Analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft. HM: Conceptualization, Data curation, Investigation, Validation, Writing – review and editing. XH: Funding acquisition, Project administration, Resources, Writing – review and editing.

Funding

The authors declare that financial support was received for the research and/or publication of this article. This work was supported by the special fund of the Qinghai University Innovation Workshop (under the General Reform and High-Quality Development of Undergraduate Education and Teaching Fund), Grant Number GF-2025ZJ-17.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The authors declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Chen, J., Chen, J., Liao, A., Cao, X., Chen, L., Chen, X., et al. (2021). A review of remote sensing for environmental monitoring in China: progress and challenges. Sci. Remote Sens. 4, 100032.


Chen, X., Wang, X., Zhang, W., Kong, X., Qiao, Y., Zhou, J., et al. (2024). HAT: hybrid attention transformer for image restoration.


Chung, M., Jung, M., and Kim, Y. (2023). Enhancing remote sensing image super-resolution guided by bicubic-downsampled low-resolution image. Remote Sens. 15, 3309. doi:10.3390/rs15133309


de França e Silva, N. R., Chaves, M. E. D., Luciano, A. C. D. S., Sanches, I. D., de Almeida, C. M., and Adami, M. (2024). Sugarcane yield estimation using satellite remote sensing data in empirical or mechanistic modeling: a systematic review. Remote Sens. 16, 863. doi:10.3390/rs16050863


Dong, C., Loy, C. C., He, K., and Tang, X. (2015). Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38, 295–307. doi:10.1109/tpami.2015.2439281


Gu, A., Goel, K., and Ré, C. (2021). Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.


Guo, H., Li, J., Dai, T., Ouyang, Z., Ren, X., and Xia, S.-T. (2024). "Mambair: a simple baseline for image restoration with state-space model," in Computer vision – ECCV 2024 (Springer), 222–241. doi:10.1007/978-3-031-72649-1_13


Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). “Transformers are RNNs: fast autoregressive transformers with linear attention,” in Proceedings of the 37th international conference on machine learning. Editors H. D. III, and A. Singh (PMLR), 5156–5165.


Kim, J., Lee, J. K., and Lee, K. M. (2016). “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27-30 June 2016 (IEEE), 1646–1654.


Kumar, S., Meena, R. S., Sheoran, S., Jangir, C. K., Jhariya, M. K., Banerjee, A., et al. (2022). “Chapter 5 - remote sensing for agriculture and resource management,” in Natural resources conservation and advances for sustainability. Editors M. K. Jhariya, R. S. Meena, A. Banerjee, and S. N. Meena (Elsevier), 91–135. doi:10.1016/B978-0-12-822976-7.00012-0


Li, H., Dou, X., Tao, C., Wu, Z., Chen, J., Peng, J., et al. (2020). Rsi-cb: a large-scale remote sensing image classification benchmark using crowdsourced data. Sensors 20, 1594. doi:10.3390/s20061594


Li, J., Pei, Y., Zhao, S., Xiao, R., Sang, X., and Zhang, C. (2020). A review of remote sensing for environmental monitoring in China. Remote Sens. 12, 1130. doi:10.3390/rs12071130


Li, B., Zhao, H., Wang, W., Hu, P., Gou, Y., and Peng, X. (2025). "Mair: a locality- and continuity-preserving mamba for image restoration," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 7491–7501. doi:10.1109/cvpr52734.2025.00702


Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., and Timofte, R. (2021). “Swinir: image restoration using swin transformer,” in Proceedings of the IEEE/CVF international conference on computer vision, Montreal, BC, Canada, 11-17 October 2021 (IEEE), 1833–1844.


Lim, B., Son, S., Kim, H., Nah, S., and Lee, K. M. (2017). “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, Honolulu, HI, USA, 21-26 July 2017 (IEEE), 136–144.


Liu, Y., Tian, Y., Zhao, H., Liu, Y., Wang, Y., Yuan, L., et al. (2024a). Vmamba: visual state space model. arXiv Preprint arXiv:2401.


Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., et al. (2024b). “Vmamba: visual state space model,” in Advances in neural information processing systems. Editors A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczaket al. (Curran Associates, Inc.), 37, 103031–103063. doi:10.52202/079017-3273


Lockhart, K., Sandino, J., Amarasingam, N., Hann, R., Bollard, B., and Gonzalez, F. (2025). Unmanned aerial vehicles for real-time vegetation monitoring in antarctica: a review. Remote Sens. 17, 304. doi:10.3390/rs17020304


Lu, R., Li, C., Li, D., Zhang, G., Huang, J., and Li, X. (2025). Exploring linear attention alternative for single image super-resolution. arXiv preprint arXiv:2502.00404.


Luan, X., Fan, H., Wang, Q., Yang, N., Liu, S., Li, X., et al. (2025). Fmambair: a hybrid state-space model and frequency domain for image restoration. IEEE Trans. Geoscience Remote Sens. 63, 1–14. doi:10.1109/TGRS.2025.3526927


Luo, W., Li, Y., Urtasun, R., and Zemel, R. (2016). “Understanding the effective receptive field in deep convolutional neural networks,” in Advances in neural information processing systems. Editors D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Inc.), 29.


Luo, J., Han, L., Gao, X., Liu, X., and Wang, W. (2023). Sr-feinr: continuous remote sensing image super-resolution using feature-enhanced implicit neural representation. Sensors 23, 3573. doi:10.3390/s23073573


Mathieu, R., Freeman, C., and Aryal, J. (2007). Mapping private gardens in urban areas using object-oriented techniques and very high-resolution satellite imagery. Landsc. Urban Plan. 81, 179–192. doi:10.1016/j.landurbplan.2006.11.009


Muhmad Kamarulzaman, A. M., Wan Mohd Jaafar, W. S., Mohd Said, M. N., Saad, S. N. M., and Mohan, M. (2023). Uav implementations in urban planning and related sectors of rapidly developing nations: a review and future perspectives for Malaysia. Remote Sens. 15, 2845. doi:10.3390/rs15112845


Peng, B., Alcaide, E., Anthony, Q., Al-Ghamdi, A., Fan, B., Gao, L., et al. (2023). Rwkv: reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048.


Peng, B., Goldstein, D., Anthony, Q., Albalak, A., Alcaide, E., Biderman, S., et al. (2024). Eagle and finch: rwkv with matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892.


Platel, A., Sandino, J., Shaw, J., Bollard, B., and Gonzalez, F. (2025). Advancing sparse vegetation monitoring in the arctic and antarctic: a review of satellite and uav remote sensing, machine learning, and sensor fusion. Remote Sens. 17, 1513. doi:10.3390/rs17091513


Saux, B. L., Yokoya, N., Hänsch, R., and Brown, M. (2019). Data fusion contest 2019 (dfc2019).


Song, Y., Sun, L., Bi, J., Quan, S., and Wang, X. (2025). Drgan: a detail recovery-based model for optical remote sensing images super-resolution. IEEE Trans. Geoscience Remote Sens. 63, 1–13. doi:10.1109/TGRS.2024.3512528


Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). “Attention is all you need,” in Advances in neural information processing systems, 5998–6008.


Wang, Y., Yuan, W., Xie, F., and Lin, B. (2024). Esatsr: enhancing super-resolution for satellite remote sensing images with state space model and spatial context. Remote Sens. 16, 1956. doi:10.3390/rs16111956


Wuhan University (2010). WHU-RS19 dataset: a high-resolution satellite image dataset with 19 scene classes. Available online at: https://captain-whu.github.io/BED4RS/.


Yang, Y., and Newsam, S. D. (2010). “Bag-of-visual-words and spatial extensions for land-use classification,” in 18th ACM SIGSPATIAL international symposium on advances in geographic information systems, ACM-GIS 2010, November 3-5, 2010, San Jose, CA, USA, proceedings. Editors D. Agrawal, P. Zhang, A. E. Abbadi, and M. F. Mokbel (ACM), 270–279.


Yang, Z., Li, J., Zhang, H., Zhao, D., Wei, B., and Xu, Y. (2025). Restore-rwkv: efficient and effective medical image restoration with rwkv. IEEE J. Biomed. Health Inf., 1–14. doi:10.1109/jbhi.2025.3588555


Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., and Fu, Y. (2018). “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European conference on computer vision (ECCV), 286–301.


Zhang, Z., Liu, J., and Wang, L. (2022). “Swinfir: rethinking the swinir for image restoration and enhancement,” in 2022 IEEE international conference on multimedia and expo (ICME) (IEEE), 1–6.


Zhang, D., Huang, F., Liu, S., Wang, X., and Jin, Z. (2023). Swinfir: revisiting the swinir with fast fourier convolution and improved training for image super-resolution.


Zhang, Y., Zheng, P., Zeng, C., Xiao, B., Li, Z., and Gao, X. (2025). Jointly rs image deblurring and super-resolution with adjustable-kernel and multi-domain attention. IEEE Trans. Geoscience Remote Sens. 63, 1–16. doi:10.1109/TGRS.2024.3515636


Zhao, W., Wang, L., and Zhang, K. (2024). Mambair: a simple and efficient state space model for image restoration. arXiv preprint arXiv:2403.09963.


Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., and Wang, X. (2024). Vision mamba: efficient visual representation learning with bidirectional state space model.


Zhu, Q., Zhang, G., Zou, X., Wang, X., Huang, J., and Li, X. (2024). Convmambasr: leveraging state-space models and cnns in a dual-branch architecture for remote sensing imagery super-resolution. Remote Sens. 16, 3254. doi:10.3390/rs16173254


Keywords: remote sensing imagery, super-resolution, linear attention, receptive field, computational efficiency

Citation: Lu R, Miao H and Hai X (2026) Efficient remote sensing image super-resolution with residual-enhanced wavelet and key-value adaptation. Front. Remote Sens. 6:1718058. doi: 10.3389/frsen.2025.1718058

Received: 03 October 2025; Accepted: 13 November 2025;
Published: 13 January 2026.

Edited by:

Rui Li, University of Warwick, United Kingdom

Reviewed by:

Bing He, Chengdu University of Information Technology, China
Zhunruo Feng, Chang’an University, China

Copyright © 2026 Lu, Miao and Hai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Xin Hai, xin.hai@qhu.edu.cn
