
ORIGINAL RESEARCH article

Front. Remote Sens., 12 January 2026

Sec. Multi- and Hyper-Spectral Imaging

Volume 6 - 2025 | https://doi.org/10.3389/frsen.2025.1712816

SMDe: enhancing hyperspectral image denoising through a spatial-spectral modulated network

Ke Yang1, Weixiang Yuan1, Beibei Fang1, Huiwu Xia1, Xiaoxue Xing1,2* and Weiwei Shang1
  • 1College of Electronic and Information Engineering, Changchun University, Changchun, Jilin, China
  • 2Key Laboratory of Intelligent Rehabilitation and Barrier-free for the Disabled, Ministry of Education, Changchun University, Changchun, Jilin, China

Hyperspectral image denoising is crucial for restoring effective information from noisy data and plays a significant role in various downstream applications. Deep learning-based methods have become the mainstream research direction due to their ability to handle complex noise. However, the spatial feature extraction of most existing methods is not comprehensive enough, and the adoption of a fixed spectral reconstruction mode does not fully utilize the spectral information of the original image. To address these issues, we propose a Spatial-Spectral Modulated Network for hyperspectral image denoising (SMDe). It consists of a spatial feature extraction network and a spectral modulation module. For the spatial feature extraction, we construct a hybrid network that combines Mamba and Transformer layers, which can effectively capture global and local spatial information. For the spectral modulation module, we design a discrimination strategy that adaptively preserves or reconstructs spectral information from the original image spectra. This information is then used to modulate spatial features, enhancing the spectral fidelity of the denoised image. Experiments on synthetic datasets show that SMDe outperforms other advanced methods in most noisy scenarios, not only restoring image details but also maintaining excellent spectral consistency. In cross-dataset and real-data evaluations, the denoising results of SMDe also demonstrate strong competitiveness.

1 Introduction

Compared with RGB images, hyperspectral images provide abundant spectral information. This unique characteristic has led to their extensive applications in remote sensing (Li et al., 2024; Arun and Akila, 2023), material identification (Wang et al., 2019; Kang et al., 2022), industrial inspection (Długosz et al., 2023), food quality analysis (Wang et al., 2017; Ahmed et al., 2025), and medical diagnosis (Ma et al., 2024; Saeed et al., 2025). In these applications, the effective utilization of spectral information plays a critical role. For example, in agricultural crop analysis, spectral features are widely exploited for fine-grained crop or seed classification (Chen et al., 2025; Wang et al., 2025). However, during the imaging process, factors including atmospheric disturbance, dark current, and sensor defects inevitably introduce noise, thereby degrading image quality and limiting the reliable extraction and utilization of valuable information. In order to restore information from degraded data, hyperspectral image denoising methods have attracted significant research attention.

Among hyperspectral image denoising methods, traditional approaches typically exploit the intrinsic properties of hyperspectral data to construct degradation models based on handcrafted priors. Typical priors include non-local similarity (Jia et al., 2014; Wang and Li, 2014), spatial–spectral correlation (Wang and Xie, 2018; Fu et al., 2016), low-rankness (Sumarsono and Du, 2015; Lin et al., 2024), and sparsity (Tang and Zhou, 2018; Akhtar et al., 2014). Although traditional methods can achieve image restoration in some noisy scenes, they have limitations when dealing with complex types of noise. In addition, these methods often have complex modeling processes and are difficult to optimize.

In recent years, deep learning–based denoising methods have developed rapidly. These approaches can be broadly divided into CNN-based methods, Transformer-based methods, and Mamba-based methods. CNN-based methods extract image features through convolutional kernels for denoising modeling. For example, HSIDCNN (Yuan et al., 2019) employs three-dimensional convolutions of different sizes in both spatial and spectral dimensions, enabling the network to extract spatial and spectral features and thereby learn the mapping from noisy images to clean hyperspectral images. QRNN3D (Wei et al., 2021) also utilizes three-dimensional convolutions for feature extraction and incorporates quasi-recurrent units to model global spectral correlations. In addition to CNN-based methods, Transformer-based methods have become a research focus owing to the strong contextual modeling capability of the self-attention mechanism (Dosovitskiy et al., 2021). For instance, SwinIR (Liang et al., 2021) employs window self-attention (WSA) to build a series of residual blocks, which has been successfully applied to image restoration tasks and achieved state-of-the-art performance. For hyperspectral image denoising, SST (Li et al., 2023) employs WSA and global spectral self-attention (GSSA) to capture both spatial and spectral positional dependencies, and has achieved remarkable results. HSTNet (Yadav et al., 2025) combines 3D CNNs with a spectral transformer to jointly exploit local spectral–spatial feature extraction and long-range spectral dependency modeling, thereby improving hyperspectral image denoising performance. As for Mamba-based methods, they have recently attracted considerable attention due to their global receptive fields with linear complexity (Gu and Dao, 2024; Chen et al., 2024; Zhang et al., 2024). For example, SSUMamba (Fu et al., 2024) extends the original Mamba model with six scanning modes, enabling multi-directional modeling of spatial–spectral correlations in hyperspectral images. MambaIRv2 (Guo et al., 2025), on the other hand, rearranges pixels by exploiting the spatial correlation of images and feeds the global information of spatial positions into the state space model, requiring only one sequence scan, and achieves state-of-the-art performance in RGB image restoration.

Although existing hyperspectral image denoising methods have achieved promising performance, effectively modeling the complex interaction between spatial structures and spectral information remains a critical challenge. Window-based Transformer methods restrict the input to a predefined window size, preventing distant spatial information from directly interacting (Liu et al., 2021). On the other hand, Mamba, as a one-dimensional state-space sequence model, is effective in capturing long-range dependencies but lacks explicit mechanisms to model fine-grained local spatial structures in images. Simply altering the scanning order provides limited improvement for preserving detailed spatial information (Yu and Erichson, 2025; Qu et al., 2025). Moreover, most deep learning–based denoising approaches adopt fixed network parameters after training, implicitly applying the same spectral reconstruction pattern to different inputs and spatial regions. Such a strategy fails to fully exploit the input-dependent spectral characteristics of hyperspectral images.

To address these issues, this paper proposes a spatial–spectral modulation network for hyperspectral image denoising (SMDe). The proposed method integrates Mamba and Transformer to jointly capture global dependencies and local spatial structures. Moreover, we introduce a spectral modulation module that explicitly leverages the original input spectra to adaptively modulate spatial features in a patch-wise manner. By establishing a direct interaction between spatial structures and spectral information, SMDe improves the spatial and spectral consistency of the reconstructed hyperspectral image. The main contributions of this work are summarized as follows:

• We propose SMDe, a Spatial-Spectral Modulated Network for hyperspectral image denoising, which captures comprehensive spatial information while fully leveraging the spectral information from the original images.

• We construct a hybrid spatial feature extraction module combining Mamba and Transformer, enabling the model to capture comprehensive spatial features while modeling long-range dependencies and preserving local information.

• A spectral modulation module is designed to discriminate the input spectral vector and adaptively select a retention or reconstruction strategy, thereby enabling effective use of the spectral information in the original image.

• Experimental results on both synthetic and real hyperspectral data demonstrate the effectiveness of the proposed method for hyperspectral image denoising under various noise conditions.

2 Methodology

2.1 Overall network architecture

The proposed SMDe network consists of two main parts: a spatial feature extraction branch composed of residual networks, and a spectral modulation module. As shown in Figure 1, during spatial feature extraction, the noisy hyperspectral image $X_{in} \in \mathbb{R}^{H \times W \times D}$ is first projected into the feature space through a convolutional layer, generating the shallow feature $F_0 \in \mathbb{R}^{H \times W \times C}$, where $D$ is the number of spectral bands and $C$ denotes the number of feature channels. A series of spatial feature extraction blocks (SFEB) are then applied, and the final deep feature representation $Y \in \mathbb{R}^{H \times W \times C}$ is obtained via a residual connection. Finally, the feature channels are mapped back to the spectral band dimension through another convolutional layer, preparing the representation for subsequent spectral modulation. The process can be formulated as shown in Equations 1–4:

$F_0 = \mathrm{Conv}(X_{in})$  (1)
$F_i = \mathrm{SFEB}(F_{i-1}), \quad i = 1, \ldots, N$  (2)
$Y = \mathrm{Conv}(F_N) + F_0$  (3)
$Fea = \mathrm{Conv}(Y)$  (4)
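
To make the data flow of Equations 1–4 concrete, the following is a minimal PyTorch sketch of the spatial feature extraction branch. The channel width and the two-convolution stand-in for the SFEB are illustrative assumptions; the actual SFEB is defined in Section 2.2.

```python
import torch
import torch.nn as nn

class SpatialFeatureExtraction(nn.Module):
    """Sketch of Equations 1-4: shallow conv projection, N SFEBs,
    a long residual connection, and projection back to D spectral bands.
    The two-conv body below is only a stand-in for the real SFEB."""
    def __init__(self, bands=31, channels=96, num_blocks=6):
        super().__init__()
        self.head = nn.Conv2d(bands, channels, 3, padding=1)        # Eq. 1
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.GELU(),
                          nn.Conv2d(channels, channels, 3, padding=1))
            for _ in range(num_blocks)])
        self.body_tail = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_bands = nn.Conv2d(channels, bands, 3, padding=1)    # Eq. 4

    def forward(self, x_in):                   # x_in: (B, D, H, W)
        f0 = self.head(x_in)
        f = f0
        for blk in self.blocks:                # Eq. 2: F_i = SFEB(F_{i-1})
            f = blk(f) + f                     # residual stand-in for SFEB
        y = self.body_tail(f) + f0             # Eq. 3: long skip connection
        return self.to_bands(y)                # Fea: (B, D, H, W)
```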

where $Fea \in \mathbb{R}^{H \times W \times D}$ represents the feature map obtained after adjusting the number of channels back to the number of spectral bands. Next, $Fea$ and $X_{in}$ are fed into the spectral modulation module (SMM). First, the two are directly added to produce the feature to be modulated, denoted as $Fea_{in}$. Both $X_{in}$ and $Fea_{in}$ are then sliced into corresponding image patches. Each patch from $X_{in}$ is processed by the spectral strategy block (SSB) to generate the spectral vector $SPE$ for modulation. The spectral vector is then used to modulate the corresponding feature patch, and the spatial positions of the patches are restored to reconstruct the final output $Z_{out} \in \mathbb{R}^{H \times W \times D}$. The procedure can be formulated as shown in Equations 5–9:

$Fea_{in} = X_{in} + Fea$  (5)
$X_{win}, Fea_{win} = \mathrm{Window\_partition}(X_{in}, Fea_{in})$  (6)
$SPE = \mathrm{SSB}(X_{win})$  (7)
$Fea_{win}' = \mathrm{Linear}(\mathrm{Linear}(Fea_{win}) \odot SPE) + Fea_{win}$  (8)
$Z_{out} = \mathrm{Window\_reverse}(Fea_{win}')$  (9)
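
The window-based modulation of Equations 5–9, together with the fusion step of Equation 10 below, can be sketched in PyTorch as follows. The window size, the pooled-spectrum stand-in for the SSB, and the Hadamard-product reading of Equation 8 are assumptions drawn from the description above, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """(B, D, H, W) -> (B*nW, D, ws, ws); assumes ws divides H and W."""
    B, D, H, W = x.shape
    x = x.view(B, D, H // ws, ws, W // ws, ws)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, D, ws, ws)

def window_reverse(win, ws, H, W):
    """Inverse of window_partition."""
    B = win.shape[0] // ((H // ws) * (W // ws))
    D = win.shape[1]
    x = win.view(B, H // ws, W // ws, D, ws, ws)
    return x.permute(0, 3, 1, 4, 2, 5).reshape(B, D, H, W)

class SpectralModulation(nn.Module):
    """Sketch of Equations 5-10. `ssb` maps each (D, ws, ws) patch to a
    spectral vector (see Section 2.3); a pooled spectrum stands in here."""
    def __init__(self, bands=31, ws=16, ssb=None):
        super().__init__()
        self.ws = ws
        self.ssb = ssb if ssb is not None else (lambda p: p.mean(dim=(2, 3)))
        self.lin1 = nn.Linear(bands, bands)
        self.lin2 = nn.Linear(bands, bands)
        self.fuse = nn.Conv2d(2 * bands, bands, 3, padding=1)   # Eq. 10

    def forward(self, x_in, fea):              # both (B, D, H, W)
        B, D, H, W = x_in.shape
        fea_in = x_in + fea                                     # Eq. 5
        x_win = window_partition(x_in, self.ws)                 # Eq. 6
        f_win = window_partition(fea_in, self.ws)
        spe = self.ssb(x_win)                                   # Eq. 7: (N, D)
        f = f_win.permute(0, 2, 3, 1)                           # (N, ws, ws, D)
        f_mod = self.lin2(self.lin1(f) * spe[:, None, None, :]) + f  # Eq. 8
        z_out = window_reverse(f_mod.permute(0, 3, 1, 2), self.ws, H, W)  # Eq. 9
        return self.fuse(torch.cat([fea_in, z_out], dim=1))    # Eq. 10
```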


Figure 1. Overall architecture of the proposed SMDe network. Given a noisy hyperspectral image as input, the spatial feature extraction module (a) is first employed to extract spatial features from the input image. Then, the spectral modulation module (b) divides both the extracted spatial feature maps and the original input into corresponding image patches. For each patch, a spectral strategy block is used to generate a spectral modulation vector, which adaptively modulates the spatial features. Finally, all modulated patches are concatenated to reconstruct the denoised hyperspectral image.

Finally, the output $Z_{out}$ and the input feature $Fea_{in}$ are concatenated along the channel dimension, and feature fusion is performed through a convolutional layer to obtain the final denoised hyperspectral image $Z \in \mathbb{R}^{H \times W \times D}$, as described in Equation 10. The overall workflow is presented in Algorithm 1.

$Z = \mathrm{Conv}(\mathrm{Concat}(Fea_{in}, Z_{out}))$  (10)

2.2 Spatial feature extraction block

As shown in Figure 2, the SFEB is a residual block composed of six spatial feature extraction layers (SFEL). Each layer contains a Mamba layer and a Transformer layer. The Mamba layer employs attentive state space model (ASSM) (Guo et al., 2025) to capture long-range dependencies in the image, while the core of the Transformer layer is window multihead self-attention (WMSA) (Liu et al., 2021), which enhances local information within each window. In addition, the ConvFFN (Fan et al., 2023) is applied as the feedforward network to further strengthen local feature representation.


Figure 2. The structure of the SFEB, which is a residual block composed of six sequential SFELs. Each layer consists of a Mamba layer and a Transformer layer.

The Mamba layer is formulated as described in Equations 11, 12:

$\hat{F}_m = \mathrm{ASSM}(\mathrm{LN}(F_m))$  (11)
$F_m = \mathrm{ConvFFN}(\mathrm{LN}(\hat{F}_m))$  (12)

where ASSM uses the SGN (Guo et al., 2025) module to unfold the image into a 1D sequence according to semantic proximity, and then performs sequence scanning. The sequence modeling process is expressed as described in Equations 13, 14:

$h_i = \bar{A} h_{i-1} + \bar{B} x_i$  (13)
$y_i = (C + P) h_i + D x_i$  (14)

where $\bar{A} = \exp(\Delta A)$ is the state transition matrix, $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$ is the input matrix, $C$ is the output matrix, $D$ is the feedforward matrix, and $P$ denotes the dynamic prompt. $P$ is generated by the pre-defined prompt pool constructed with the SGN module, aggregating global information of the input image. During the state-space scanning process, $P$ provides each position with globally aware guidance, enabling the model to capture long-range dependencies effectively in a single scan.
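
A naive, loop-based rendering of the prompt-aware scan in Equations 13, 14 is given below for clarity. Practical Mamba implementations replace the Python loop with a hardware-efficient parallel scan, and the per-position shape of the prompt $P$ here is an assumption.

```python
import torch

def prompt_ssm_scan(x, A_bar, B_bar, C, P, D):
    """Sketch of Equations 13-14 for a single sequence.
    x:     (L, d_in)            input sequence (image unfolded to 1D)
    A_bar: (d_state, d_state)   discretized state matrix
    B_bar: (d_state, d_in)      discretized input matrix
    C:     (d_out, d_state)     output matrix
    P:     (L, d_out, d_state)  per-position dynamic prompt (assumed shape)
    D:     (d_out, d_in)        feedthrough matrix
    The explicit loop is for exposition only."""
    L = x.shape[0]
    h = torch.zeros(A_bar.shape[0])
    ys = []
    for i in range(L):
        h = A_bar @ h + B_bar @ x[i]          # Eq. 13: state update
        ys.append((C + P[i]) @ h + D @ x[i])  # Eq. 14: prompt-aware readout
    return torch.stack(ys)                    # (L, d_out)
```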

The Transformer layer is formulated as described in Equations 15, 16:

$\hat{F}_t = \mathrm{WMSA}(\mathrm{LN}(F_t))$  (15)
$F_t = \mathrm{ConvFFN}(\mathrm{LN}(\hat{F}_t))$  (16)

where WMSA models the spatial relationships within local windows. The attention computation is given as described in Equation 17:

$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\dfrac{Q K^{\top}}{\sqrt{d_h}}\right) V$  (17)

where $Q$, $K$, and $V$ are the query, key, and value obtained via linear projection of the input, and $d_h$ is the dimension of each attention head.
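
As an illustration, the per-window attention of Equation 17 can be sketched as follows, with `w_qkv` and `w_out` standing in for learned projection matrices and windows already flattened to token sequences.

```python
import torch
import torch.nn.functional as F

def window_attention(x, w_qkv, w_out, num_heads):
    """Sketch of Eq. 17 applied per window.
    x: (N, T, C), N windows of T = window_size**2 tokens each;
    w_qkv: (C, 3*C) and w_out: (C, C) are assumed learned projections."""
    N, T, C = x.shape
    dh = C // num_heads                              # per-head dimension
    qkv = (x @ w_qkv).view(N, T, 3, num_heads, dh)
    q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each (N, heads, T, dh)
    attn = F.softmax(q @ k.transpose(-2, -1) / dh**0.5, dim=-1)  # Eq. 17
    out = (attn @ v).transpose(1, 2).reshape(N, T, C)
    return out @ w_out
```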

The ConvFFN block is formulated as described in Equations 18, 19:

$\hat{F} = \mathrm{GELU}(\mathrm{Linear}(F))$  (18)
$F = \mathrm{Linear}(\mathrm{GELU}(\mathrm{DWConv}(\hat{F})) + \hat{F})$  (19)

where DWConv denotes a depthwise separable convolution with a kernel size of 5×5.
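
The following is a sketch of the ConvFFN in Equations 18, 19, assuming tokens are reshaped to a feature map for the depthwise convolution; the hidden width is an arbitrary choice.

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """Sketch of Equations 18-19: point-wise expansion, a 5x5 depthwise
    convolution with an inner skip connection, then point-wise projection."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 5,
                                padding=2, groups=hidden_dim)  # depthwise 5x5
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.act = nn.GELU()

    def forward(self, x, H, W):                # x: (B, H*W, dim) tokens
        f = self.act(self.fc1(x))                             # Eq. 18
        B, L, C = f.shape
        g = f.transpose(1, 2).view(B, C, H, W)                # tokens -> map
        g = self.dwconv(g).flatten(2).transpose(1, 2)         # DWConv
        return self.fc2(self.act(g) + f)                      # Eq. 19
```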

2.3 Spectral strategy block

As shown in Figure 3, to adaptively select between retaining or reconstructing the regional pooling spectral vectors, the SSB first applies a forward difference along the channel dimension to the pooled spectral vector spe. The resulting difference vector diff is then concatenated with spe and fed into a discriminator to obtain a discrimination value α. Finally, the pooled spectral vectors are processed differently according to the value of α. The procedure can be formulated as described in Equations 20, 21:

$\alpha = \mathrm{Discriminator}(\mathrm{Concat}(spe, diff))$  (20)
$SPE = \begin{cases} \mathrm{Linear}(\mathrm{BiGRU}(spe)), & \text{if } \alpha \le \mathrm{Threshold} \\ spe, & \text{if } \alpha > \mathrm{Threshold} \end{cases}$  (21)

where the discriminator is a three-layer fully connected network. The first two layers use LeakyReLU as the activation function, while the final layer employs Sigmoid to output the discrimination value. An empirical threshold of 0.5 is used. If the discrimination value exceeds the threshold, the original spectral vector is retained; otherwise, a bidirectional GRU maps the spectral vectors to the feature space, followed by a linear layer to reconstruct the spectral vectors.
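The retain-or-reconstruct logic of Equations 20, 21 can be sketched as below; the discriminator hidden width, LeakyReLU slope, and GRU size are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class SpectralStrategyBlock(nn.Module):
    """Sketch of the SSB: a discriminator scores each pooled spectral
    vector; low-scoring spectra are rebuilt by a BiGRU + Linear, while
    high-scoring spectra are retained unchanged (Eq. 21)."""
    def __init__(self, bands=31, hidden=64, threshold=0.5):
        super().__init__()
        self.threshold = threshold
        self.discriminator = nn.Sequential(    # input: [spe, diff]
            nn.Linear(2 * bands - 1, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1), nn.Sigmoid())
        self.bigru = nn.GRU(1, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)

    def forward(self, spe):                    # spe: (N, D) pooled spectra
        diff = spe[:, 1:] - spe[:, :-1]        # forward difference, (N, D-1)
        alpha = self.discriminator(torch.cat([spe, diff], dim=1))  # Eq. 20
        seq, _ = self.bigru(spe.unsqueeze(-1)) # treat bands as a sequence
        recon = self.proj(seq).squeeze(-1)     # reconstructed spectra, (N, D)
        keep = (alpha > self.threshold).float()
        return keep * spe + (1 - keep) * recon # Eq. 21: retain vs rebuild
```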


Figure 3. The SSB determines whether to retain or reconstruct the input spectral vectors, which are then used for subsequent modulation.

Algorithm 1.

3 Experimental setup

3.1 Datasets

The ICVL (Arad and Ben-Shahar, 2016) hyperspectral image dataset contains 201 images with a spectral range of 400 nm–700 nm. Each image has 31 bands with a wavelength interval of 10 nm. Following the settings in QRNN3D (Wei et al., 2021), we select 100 images from the ICVL hyperspectral image dataset to build the training set, extract image blocks of size 64×64, and perform data augmentation using rotation, flipping, and scaling. Furthermore, we directly conduct synthetic experiments on the CAVE (Yasuma et al., 2010) dataset and the Harvard (Chakrabarti and Zickler, 2011) dataset using the parameters trained on the ICVL dataset. Both the CAVE and Harvard datasets have 31 bands. The spectral range of the CAVE dataset is from 400 nm to 700 nm, while that of the Harvard dataset is from 420 nm to 720 nm. To evaluate denoising performance in real-world conditions, we also use the Urban (Kalman and Bassett, 1997) dataset for testing.
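As an illustration of the training-set preparation described above, the following sketch crops 64×64 blocks and applies rotation and flip augmentation (scaling is omitted for brevity); the block stride and other details are assumptions.

```python
import torch

def augment_patches(hsi, patch=64, stride=64):
    """Crop 64x64 spatial blocks from an HSI cube (D, H, W) and apply
    rotation/flip augmentation; a sketch of the data preparation above."""
    D, H, W = hsi.shape
    out = []
    for top in range(0, H - patch + 1, stride):
        for left in range(0, W - patch + 1, stride):
            block = hsi[:, top:top + patch, left:left + patch]
            for k in range(4):                       # 0/90/180/270 rotations
                rot = torch.rot90(block, k, dims=(1, 2))
                out.append(rot)
                out.append(torch.flip(rot, dims=(2,)))  # horizontal flip
    return torch.stack(out)                          # (N, D, 64, 64)
```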

3.2 Synthetic noise denoising experiments

We define nine types of synthetic noise scenarios, listed below (a noise-synthesis sketch follows the list):

Cases 1–4: Gaussian white noise with zero mean and standard deviations of 30, 50, 70, and a blind setting (randomly selected from 30 to 70).

Case 5: Different bands are corrupted by Gaussian noise with randomly selected deviations ranging from 10 to 70. This type of noise is also known as non-i.i.d. noise.

Case 6: Based on Case 5, add stripe noise to 5%–15% of the columns in one-third of the randomly selected bands.

Case 7: Based on Case 5, add deadline noise to 5%–15% of the columns in one-third of the randomly selected bands.

Case 8: Based on Case 5, add impulse noise ranging in intensity from 0.1 to 0.7 to one-third of the randomly selected bands.

Case 9: Each band is randomly corrupted by at least one of the noise types in Cases 5–8.
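
The following sketch illustrates how the Case 5–8 degradations can be synthesized; the intensity conventions (noise expressed on a [0, 1] scale, stripe amplitude, salt-and-pepper impulses) are assumptions, and the paper's exact protocol may differ.

```python
import torch

def add_complex_noise(clean, sigma_range=(10, 70), frac=1 / 3):
    """Sketch of the Case 5-8 degradations on a (D, H, W) cube in [0, 1]."""
    D, H, W = clean.shape
    noisy = clean.clone()

    # Case 5: non-i.i.d. Gaussian noise, one random sigma per band.
    sigma = torch.empty(D).uniform_(*sigma_range) / 255.0
    noisy += torch.randn_like(clean) * sigma[:, None, None]

    def random_bands():
        return torch.randperm(D)[: int(D * frac)].tolist()

    def random_cols():
        n = int(W * torch.empty(1).uniform_(0.05, 0.15).item())
        return torch.randperm(W)[:n]

    for b in random_bands():                    # Case 6: stripe noise
        noisy[b, :, random_cols()] += 0.25 * (torch.rand(1).item() - 0.5)
    for b in random_bands():                    # Case 7: deadlines (zeroed columns)
        noisy[b, :, random_cols()] = 0.0
    for b in random_bands():                    # Case 8: impulse noise
        p = torch.empty(1).uniform_(0.1, 0.7).item()
        mask = torch.rand(H, W) < p             # salt-and-pepper positions
        noisy[b][mask] = torch.randint(0, 2, (int(mask.sum()),)).float()
    return noisy.clamp(0.0, 1.0)
```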

For evaluation, 50 images from the ICVL test set are used for the first four Gaussian noise scenarios, while the remaining 51 images are used for the last five complex noise scenarios. The CAVE dataset is employed for synthetic experiments in the first four Gaussian noise scenarios, whereas the Harvard dataset is used for the last five complex noise scenarios. All test images are of size 512×512. For comparison, SMDe is evaluated against six state-of-the-art methods, including HSIDCNN (Yuan et al., 2019), QRNN3D (Wei et al., 2021), T3SC (Bodrito et al., 2021), SST (Li et al., 2023), SSUMamba (Fu et al., 2024), and MambaIRv2 (Guo et al., 2025).

3.3 Real-world denoising experiments

The Urban dataset contains 210 bands with a spectral range of 400–2500 nm and a spectral resolution of 10 nm. Due to water vapor absorption and atmospheric effects, some bands in the dataset suffer from real noise degradation. For testing, we extract hyperspectral image patches containing noisy bands and sample 31 bands from them. The spatial resolution of each patch is 304 × 304.

3.4 Network training

We implement our model using the PyTorch framework and optimize it with the AdamW optimizer by minimizing the mean squared error (MSE) between the outputs and the ground truths. A multi-stage learning rate scheduling strategy is adopted, where the learning rate is initialized at $1\times10^{-4}$ and gradually decayed to $1\times10^{-6}$. The batch size is set to 8, and incomplete batches are discarded during training. Data shuffling is performed at the end of each epoch. All models are trained for 100 epochs on a single NVIDIA RTX 4060 Ti GPU (16 GB). Notably, MambaIRv2, originally designed for RGB image restoration, is adapted to hyperspectral image denoising by adjusting the number of input channels.
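
A minimal sketch of this training configuration follows; `model` and `train_set` stand in for SMDe and the augmented ICVL patches, and the milestone epochs of the schedule are illustrative, since the paper specifies only the initial and final learning rates.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader

def train(model, train_set, epochs=100, device="cuda"):
    # shuffle=True reshuffles every epoch; drop_last discards incomplete batches
    loader = DataLoader(train_set, batch_size=8, shuffle=True, drop_last=True)
    opt = AdamW(model.parameters(), lr=1e-4)
    # two decays of 10x take 1e-4 down to 1e-6; milestones are assumed
    sched = MultiStepLR(opt, milestones=[50, 80], gamma=0.1)
    loss_fn = torch.nn.MSELoss()
    model.to(device).train()
    for epoch in range(epochs):
        for noisy, clean in loader:
            noisy, clean = noisy.to(device), clean.to(device)
            opt.zero_grad()
            loss = loss_fn(model(noisy), clean)   # MSE to ground truth
            loss.backward()
            opt.step()
        sched.step()
```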

4 Results and discussions

4.1 Denoising results for synthetic noise on the ICVL dataset

To evaluate the denoising performance of the proposed method, we conduct experiments on nine synthetic noise scenarios of the ICVL dataset. The objective metrics are reported in Table 1. Figures 4, 5 present the results and corresponding spectral curves in the Gaussian blind denoising scenario (Case 4), while Figures 6, 7 show the results and spectral curves in the complex noise scenario (Case 9).


Table 1. Denoising results with different noise on ICVL. Cases 1–4 correspond to Gaussian noise with σ = 30, 50, 70, and blind, respectively. Cases 5–9 represent non-i.i.d., stripe, deadline, impulse, and mixture complex noise. The best results in each row are highlighted in bold, and the second-best results are underlined.


Figure 4. Denoising visualization comparison for simulated Gaussian noise (case 4) on the ICVL. The pseudocolor image consists of bands (9, 14, 31). Zoom in for a better view of the difference.


Figure 5. Reflectance curve of Gaussian denoising results at point (450, 450) on the ICVL.


Figure 6. Denoising visualization comparison for simulated complex noise (case 9) on the ICVL. The pseudocolor image consists of bands (5, 16, 29). Zoom in for a better view of the difference.


Figure 7. Reflectance curve of complex denoising results at point (450, 450) on the ICVL.

As shown in Table 1, SMDe achieves the best results in Cases 2, 4, 6, 8, and 9, and obtains the second-best results in the remaining scenarios. Compared with the second-best method MambaIRv2, SMDe achieves an average PSNR improvement of 0.2 dB across the nine noise scenarios. Moreover, SMDe consistently outperforms all competing methods in terms of the SAM metric across all scenarios, indicating that SMDe maintains good spectral consistency under different denoising conditions.
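
For reference, the two metrics discussed here can be computed as in the sketch below; PSNR averaging conventions vary across papers, so this is one common choice rather than the exact evaluation code.

```python
import torch

def psnr(x, y, data_range=1.0):
    """Band-averaged PSNR between (D, H, W) cubes."""
    mse = ((x - y) ** 2).mean(dim=(1, 2))            # per-band MSE
    return (10 * torch.log10(data_range ** 2 / mse)).mean()

def sam(x, y, eps=1e-8):
    """Mean spectral angle (radians) between per-pixel spectra."""
    dot = (x * y).sum(dim=0)
    denom = x.norm(dim=0) * y.norm(dim=0) + eps
    return torch.acos((dot / denom).clamp(-1, 1)).mean()
```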

For Gaussian blind denoising (Case 4), as shown in Figure 4, all methods achieve strong denoising results. In particular, the spectral response curves in Figure 5 show that our method produces results closest to the ground truth. For complex noise denoising (Case 9), Figure 6 illustrates that HSIDCNN, QRNN3D, and T3SC suffer from excessive smoothing, while SST, SSUMamba, and MambaIRv2 exhibit obvious stripes. In contrast, SMDe provides the best visual quality. As shown in Figure 7, the spectral curves of SMDe and MambaIRv2 are the closest to the ground truth. In summary, the denoising results on Gaussian and complex noise demonstrate that SMDe can effectively restore image details while maintaining good spectral consistency across different noise scenarios.

4.2 Denoising results for synthetic noise on the CAVE dataset

To evaluate cross-dataset generalization, we conduct experiments on the CAVE dataset with four synthetic Gaussian noise scenarios, where all methods were tested using the parameters trained on the ICVL dataset. As shown in Table 2, the objective metrics demonstrate that for noise levels of 30 (Case 1), 50 (Case 2), and blind denoising (Case 4), all models achieve comparable performance, with SST obtaining the best result for Case 1 and SMDe achieving the best results for Case 2 and Case 4. For noise level 70 (Case 3), the performance of all methods except SMDe and MambaIRv2 drops significantly. Specifically, compared with Case 2 (noise level 50), SMDe and MambaIRv2 show a moderate PSNR decrease of about 1 dB, whereas the other methods suffer a much larger drop of approximately 3 dB.


Table 2. Denoising results with different noise on CAVE. Cases 1–4: Gaussian noise with σ = 30, 50, 70, and blind. The best results in each row are in bold; the second-best results are underlined.

As shown in Figure 8, for the denoising results of Case 3, SMDe preserves clear structural details, whereas other methods suffer from over-smoothing and blurred boundaries. Furthermore, as observed in Figure 9, the spectral response curves of SMDe, QRNN3D, and SSUMamba are closer to the ground truth than those of the other methods. Overall, these results indicate that SMDe remains highly competitive for denoising on datasets different from the training domain.


Figure 8. Denoising visualization comparison for simulated Gaussian noise (case 3) on the CAVE. The pseudocolor image consists of bands (5, 17, 26). Zoom in for a better view of the difference.


Figure 9. Reflectance curve of Gaussian denoising results at point (200, 320) on the CAVE.

4.3 Denoising results for synthetic noise on the Harvard dataset

To further evaluate cross-dataset generalization, we conduct experiments on the Harvard dataset with five complex noise scenarios, where all methods were also tested using the parameters trained on the ICVL dataset. As shown in Table 3, SMDe achieves the highest PSNR in the non-i.i.d. (Case 5), deadline (Case 7), and impulse (Case 8) noise scenarios, and obtains the second-highest PSNR in the mixture noise scenario (Case 9). MambaIRv2 attains the highest PSNR for mixture noise (Case 9) while ranking second in the other scenarios. SST produces the best result for stripe noise (Case 6).


Table 3. Denoising results with different noise on Harvard. Cases 5–9: non-i.i.d., stripe, deadline, impulse, and mixture complex noise. The best results in each row are in bold; the second-best results are underlined.

Figure 10 shows the denoised images for Case 9. The results of HSIDCNN and T3SC appear over-smoothed, while QRNN3D still contains noticeable noise, and SSUMamba lacks vertical structural details. The results of SST, MambaIRv2, and SMDe are visually similar. Figure 11 presents the spectral response curves. It can be seen that the reconstructions by SMDe and SSUMamba are closest to the ground truth. Overall, these observations indicate that SMDe maintains strong denoising performance for complex noise across different datasets.


Figure 10. Denoising visualization comparison for simulated complex noise (case 9) on the Harvard. The pseudocolor image consists of bands (5, 14, 26). Zoom in for a better view of the difference.


Figure 11. Reflectance curve of complex denoising results at point (285, 180) on the Harvard.

4.4 Denoising results for real noise on the Urban dataset

To evaluate the effectiveness of SMDe in removing real noise, we conduct experiments on the Urban dataset. Figure 12 presents the denoising results of the 104th band. HSIDCNN and T3SC exhibit varying degrees of over-smoothing, while QRNN3D still contains noticeable noise. SSUMamba produces the brightest results, but the image boundaries are unclear. SMDe and MambaIRv2 preserve the image details most effectively.


Figure 12. Denoising visualization comparison for real noise on the Urban. Zoom in for a better view of the difference.

4.5 Ablation study

To validate the effectiveness of each component in our model, we conduct ablation experiments using the Case 9 noise test set of the ICVL dataset as the benchmark. We gradually add the Mamba layer, Transformer layer, and SMM to the model for training. To evaluate the effectiveness of the SSB, we also compare results with and without it. Table 4 reports the objective metrics of these experiments, showing that the final model achieves the best denoising performance.


Table 4. Results of the ablation study.

4.6 Computational complexity and runtime analysis

To further evaluate the efficiency and practicality of the proposed method, we analyze the computational complexity and runtime performance of different methods. All experiments are conducted on hyperspectral images of size 512×512×31 under the mixture noise scenario of the ICVL dataset. Table 5 reports the GFLOPs, number of parameters, inference time, and PSNR for each method.


Table 5. Comparisons of Params, GFLOPs, inference time, and PSNR of different methods for an input size of 512 × 512 × 31.

HSIDCNN and QRNN3D achieve relatively fast inference, but either incur higher computational cost or exhibit limited denoising performance. SST and SSUMamba demonstrate strong restoration capability at the expense of substantially increased computational complexity and runtime. In contrast, the proposed SMDe achieves the highest PSNR with relatively low GFLOPs and a moderate number of parameters, indicating a favorable balance between denoising performance and computational efficiency.
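
As a reference for reproducing such measurements, a sketch of a typical protocol follows; the warm-up and repetition counts are illustrative assumptions, a CUDA device is assumed available, and GFLOP counting is left to an external profiler.

```python
import time
import torch

def profile(model, bands=31, size=512, device="cuda", warmup=3, runs=10):
    """Parameter count and average inference time on a 512x512x31 cube."""
    model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(1, bands, size, size, device=device)
    with torch.no_grad():
        for _ in range(warmup):              # stabilize clocks and caches
            model(x)
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    return n_params, (time.time() - t0) / runs   # params, seconds per image
```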

5 Conclusion

In this paper, we proposed a spatial–spectral modulation network (SMDe) for hyperspectral image denoising, consisting of two main components: spatial feature extraction and spectral modulation. A residual network combining Mamba and Transformer is employed to extract spatial features, while a spectral modulation module adaptively determines whether spectral information requires reconstruction. The extracted spatial and spectral information is fused to generate the denoised image. Experimental results on both synthetic and real hyperspectral datasets demonstrate that SMDe outperforms state-of-the-art approaches in most noisy scenarios, effectively restoring image details.

Nevertheless, the current method relies on regional pooled spectra obtained via image window segmentation, which may limit the accuracy of spectral information extraction. In addition, the spectral modulation module may not fully capture subtle spectral variations in complex or highly noisy regions, potentially limiting the accuracy of spectral reconstruction. Future work will focus on integrating region-based semantic segmentation for patch division, exploring adaptive spectral reconstruction strategies, and improving the generalization and computational efficiency of the model to broaden its applicability to diverse hyperspectral datasets.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Author contributions

KY: Conceptualization, Data curation, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review and editing. WY: Data curation, Validation, Visualization, Writing – review and editing. BF: Formal Analysis, Writing – review and editing. HX: Investigation, Visualization, Writing – review and editing. XX: Funding acquisition, Supervision, Writing – review and editing. WS: Supervision, Writing – review and editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was partially supported by the Project of the Jilin Provincial Department of Science and Technology (No. 20220101133JC).

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.


Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frsen.2025.1712816/full#supplementary-material

References

Ahmed, M. T., Monjur, O., Khaliduzzaman, A., and Kamruzzaman, M. (2025). A comprehensive review of deep learning-based hyperspectral image reconstruction for agri-food quality appraisal. Artif. Intell. Rev. 58, 96. doi:10.1007/s10462-024-11090-w

Akhtar, N., Shafait, F., and Mian, A. (2014). Sparse spatio-spectral representation for hyperspectral image super-resolution. Comput. Vis. – ECCV 2014, 63–78. doi:10.1007/978-3-319-10584-0_5

Arad, B., and Ben-Shahar, O. (2016). “Sparse recovery of hyperspectral signal from natural rgb images,” in Computer Vision – 14th European Conference, ECCV 2016, Proceedings, 19–34. doi:10.1007/978-3-319-46478-7_2

Arun, A. S., and Akila, A. S. (2023). Land-cover classification with hyperspectral remote sensing image using cnn and spectral band selection. Remote Sens. Appl. Soc. Environ. 31, 100986. doi:10.1016/j.rsase.2023.100986

Bodrito, T., Zouaoui, A., Chanussot, J., and Mairal, J. (2021). A trainable spectral-spatial sparse coding model for hyperspectral image restoration. Adv. Neural Inf. Process. Syst. 34, 5430–5442. Available online at: https://arxiv.org/abs/2111.09708.

Chakrabarti, A., and Zickler, T. (2011). “Statistics of real-world hyperspectral images,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 193–200. doi:10.1109/cvpr.2011.5995660

Chen, K., Chen, B., Liu, C., Li, W., Zou, Z., and Shi, Z. (2024). Rsmamba: remote sensing image classification with state space model. IEEE Geoscience Remote Sens. Lett. 21, 1–5. doi:10.1109/LGRS.2024.3407111

Chen, G., Li, G., Jin, S., and Bai, L. (2025). Dacnet: depth-aware convolutional network for corn hyperspectral image classification. Eng. Res. Express 7, 045231. doi:10.1088/2631-8695/ae1368

Długosz, J., Dao, P. B., Staszewski, W. J., and Uhl, T. (2023). Damage detection in composite materials using hyperspectral imaging. Eur. Workshop Struct. Health Monit., 463–473. doi:10.1007/978-3-031-07258-1_48

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2021). “An image is worth 16x16 words: transformers for image recognition at scale,” in International Conference on Learning Representations.

Fan, Q., Huang, H., Guan, J., and He, R. (2023). Rethinking local perception in lightweight vision transformer. arXiv preprint.

Fu, Y., Zheng, Y., Sato, I., and Sato, Y. (2016). Exploiting spectral-spatial correlation for coded hyperspectral image restoration. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 3727–3736. doi:10.1109/CVPR.2016.405

Fu, G., Xiong, F., Lu, J., and Zhou, J. (2024). Ssumamba: spatial-spectral selective state space model for hyperspectral image denoising. IEEE Trans. Geosci. Remote Sens. 62, 1–14. doi:10.1109/TGRS.2024.3446812

Gu, A., and Dao, T. (2024). Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint.

Guo, H., Guo, Y., Zha, Y., Zhang, Y., Li, W., Dai, T., et al. (2025). “Mambairv2: attentive state space restoration,” in Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 28124–28133. doi:10.1109/cvpr52734.2025.02619

Jia, M., Gong, M., Zhang, E., Li, Y., and Jiao, L. (2014). Hyperspectral image classification based on nonlocal means with a novel class-relativity measurement. IEEE Geosci. Remote Sens. Lett. 11, 1300–1304. doi:10.1109/LGRS.2013.2292823

Kalman, L. S., and Bassett, E. M., III (1997). Classification and material identification in an urban environment using hydice hyperspectral data. Imaging Spectrom. III (SPIE) 3118, 57–68. doi:10.1117/12.283843

Kang, X., Wang, Z., Duan, P., and Wei, X. (2022). The potential of hyperspectral image classification for oil spill mapping. IEEE Trans. Geosci. Remote Sens. 60, 1–15. doi:10.1109/TGRS.2022.3205966

Li, M., Fu, Y., and Zhang, Y. (2023). Spatial-spectral transformer for hyperspectral image denoising. Proc. AAAI Conf. Artif. Intell. 37, 1368–1376. doi:10.1609/aaai.v37i1.25221

Li, Z., Chen, G., Li, G., Zhou, L., Pan, X., Zhao, W., et al. (2024). Dbanet: dual-branch attention network for hyperspectral remote sensing image classification. Comput. Electr. Eng. 118, 109269. doi:10.1016/j.compeleceng.2024.109269

Liang, J., Cao, J., Sun, G., Zhang, K., Gool, L. V., and Timofte, R. (2021). “Swinir: image restoration using swin transformer,” in 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 1833–1844. doi:10.1109/ICCVW54120.2021.00210

Lin, P., Sun, L., Wu, Y., and Ruan, W. (2024). Hyperspectral image denoising via correntropy-based nonconvex low-rank approximation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 6841–6859. doi:10.1109/JSTARS.2024.3373466

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., et al. (2021). “Swin transformer: hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 9992–10002. doi:10.1109/ICCV48922.2021.00986

Ma, L., Pruitt, K., and Fei, B. (2024). A hyperspectral surgical microscope with super-resolution reconstruction for intraoperative image guidance. Proc. SPIE Int. Soc. Opt. Eng. 12930, 129300Z. doi:10.1117/12.3008789

Qu, H., Ning, L., An, R., Fan, W., Derr, T., Liu, H., et al. (2025). A survey of Mamba. arXiv preprint.

Saeed, A., Hadoux, X., and van Wijngaarden, P. (2025). Hyperspectral retinal imaging biomarkers of ocular and systemic diseases. Eye 39, 667–672. doi:10.1038/s41433-024-03135-9

Sumarsono, A., and Du, Q. (2015). Hyperspectral image classification with low-rank subspace and sparse representation. Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2864–2867. doi:10.1109/IGARSS.2015.7326412

Tang, S., and Zhou, N. (2018). Local similarity regularized sparse representation for hyperspectral image super-resolution. Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 5120–5123. doi:10.1109/IGARSS.2018.8518168

Wang, R., and Li, H.-C. (2014). Nonlocal similarity regularization for sparse hyperspectral unmixing. Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2926–2929. doi:10.1109/IGARSS.2014.6947089

Wang, X., and Xie, W. (2018). “An adaptive spatio-spectral domain correlation parallel framework for hyperspectral image classification,” in Proc. IEEE Int. Conf. Signal Process. (ICSP), 350–354. doi:10.1109/ICSP.2018.8652407

Wang, X., Rohani, N., Manerikar, A., Katsagellos, A., Cossairt, O., and Alshurafa, N. (2017). “Distinguishing nigerian food items and calorie content with hyperspectral imaging,” in New trends image anal. Process. – ICIAP 2017, 462–470. doi:10.1007/978-3-319-70742-6_45

Wang, H., Wang, H., Yu, W., and Li, H. (2019). “Research on wood species recognition method based on hyperspectral image texture features,” in Proc. Int. Conf. Mech. Control Comput. Eng. (ICMCCE), 413–4133. doi:10.1109/ICMCCE48743.2019.00099

Wang, B., Chen, G., Wen, J., Li, L., Jin, S., Li, Y., et al. (2025). Ssatnet: spectral-spatial attention transformer for hyperspectral corn image classification. Front. Plant Sci. 15, 1458978. doi:10.3389/fpls.2024.1458978

Wei, K., Fu, Y., and Huang, H. (2021). 3-d quasi-recurrent neural network for hyperspectral image denoising. IEEE Trans. Neural Netw. Learn. Syst. 32, 363–375. doi:10.1109/TNNLS.2020.2978756

Yadav, D. P., Kumar, D., Jalal, A. S., and Sharma, B. (2025). Hyperspectral image denoising through hybrid spectral transformer network. Adv. Space Res. 76, 6673–6693. doi:10.1016/j.asr.2025.09.028

Yasuma, F., Mitsunaga, T., Iso, D., and Nayar, S. K. (2010). Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum. IEEE Trans. Image Process. 19, 2241–2253. doi:10.1109/TIP.2010.2046811

Yu, A., and Erichson, N. B. (2025). Block-biased mamba for long-range sequence processing. arXiv preprint.

Yuan, Q., Zhang, Q., Li, J., Shen, H., and Zhang, L. (2019). Hyperspectral image denoising employing a spatial–spectral deep residual convolutional neural network. IEEE Trans. Geosci. Remote Sens. 57, 1205–1218. doi:10.1109/TGRS.2018.2865197

Zhang, Y., He, X., Zhan, C., and Li, J. (2024). Visual state space model for image deraining with symmetrical scanning. Symmetry 16, 871. doi:10.3390/sym16070871

Keywords: deep learning, denoising, hyperspectral image, mamba, neural network, transformer

Citation: Yang K, Yuan W, Fang B, Xia H, Xing X and Shang W (2026) SMDe: enhancing hyperspectral image denoising through a spatial-spectral modulated network. Front. Remote Sens. 6:1712816. doi: 10.3389/frsen.2025.1712816

Received: 25 September 2025; Accepted: 24 December 2025;
Published: 12 January 2026.

Edited by:

Qiangqiang Yuan, Wuhan University, China

Reviewed by:

Dhirendra Prasad Yadav, GLA University, India
GongChao Chen, Henan Institute of Science and Technology, China

Copyright © 2026 Yang, Yuan, Fang, Xia, Xing and Shang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Xiaoxue Xing, xingxx@ccu.edu.cn
