ORIGINAL RESEARCH article

Front. Remote Sens., 18 November 2025

Sec. Data Fusion and Assimilation

Volume 6 - 2025 | https://doi.org/10.3389/frsen.2025.1703239

This article is part of the Research Topic: Advanced Artificial Intelligence for Remote Sensing: Methods and Applications in Earth and Environmental Monitoring.

Residual state-space networks with cross-scale fusion for efficient underwater vision reconstruction

Nei Xiong1 and Yuhan Zhang2*
  • 1School of Management Department, Capital Normal University, Beijing, China
  • 2Department of Medical Imaging, Zhuhai Campus, Zunyi Medical University, Zhuhai, China

Underwater vision is inherently difficult due to wavelength-dependent light absorption, non-uniform illumination, and scattering, which collectively reduce both perceptual quality and task utility. We propose a novel architecture (ResMambaNet) that addresses these challenges through explicit decoupling of chromatic and structural cues, residual state-space modeling, and cross-scale feature alignment. Specifically, a dual-branch design separately processes RGB and Lab representations, promoting complementary recovery of color and spatial structures. A residual state-space module is then employed to unify local convolutional priors with efficient long-range dependency modeling, avoiding the quadratic complexity of attention. Finally, a cross-attention–based fusion with adaptive normalization aligns multi-scale features for consistent restoration across diverse conditions. Experiments on standard benchmarks (EUVP and UIEB) show that the proposed approach establishes new state-of-the-art performance, improving colorfulness, contrast, and fidelity metrics by large margins, while maintaining only 0.5M parameters. These results demonstrate the effectiveness of residual state-space modeling as a principled framework for underwater image enhancement.

1 Introduction

Underwater image enhancement (UIE) constitutes a fundamental problem in computer vision, with broad implications for marine resource exploration, ecological monitoring, and autonomous underwater robotics (Berman et al., 2020). Compared with terrestrial vision, underwater platforms must operate under far more stringent visual perception requirements, where both the imaging process and the computational models are strongly affected by the unique physics of light propagation in water. Specifically, wavelength-dependent absorption, scattering, and spatially varying illumination severely degrade captured images, leading to pronounced color casts, low-light visibility, and loss of structural fidelity (Berman et al., 2020).

Existing UIE techniques are generally divided into two major categories: traditional model-based methods and data-driven deep learning approaches (Yuan et al., 2025). Traditional methods exploit interpretable optical models (e.g., the Jaffe–McGlamery formulation) in combination with color correction, histogram equalization, or multi-scale fusion strategies. While effective in shallow or low-turbidity scenarios, their reliance on simplified assumptions often results in poor robustness under complex aquatic conditions (Karypidis et al., 2022; Hu et al., 2022). Deep learning methods, on the other hand, learn end-to-end mappings via architectures such as WaterNet (Li et al., 2019), U-Net–based autoencoders (Hashisho et al., 2019), and more recent variants that incorporate attention mechanisms or lightweight designs. These approaches significantly enhance sharpness and perceptual quality, showing promising generalization in challenging scenarios (Hashisho et al., 2019; Li et al., 2019; Zhu et al., 2025). However, they typically demand large-scale annotated datasets and computational resources, while their black-box nature limits interpretability (see, e.g., Zhu et al., 2025). To address these shortcomings, a growing line of research integrates physical priors into deep architectures by embedding underwater imaging models as explicit constraints or learnable components, thereby improving both effectiveness and interpretability (Tao et al., 2024).

From the standpoint of traditional physics-based modeling, factors that degrade underwater image quality (e.g., light-intensity attenuation curves and backscattering components) are explicitly parameterized and compensated through model-driven calibration. For example, early dehazing formulations and the dark channel prior were adapted to underwater imaging for color correction and turbidity removal. In contrast, non-physical approaches enhance imagery by directly learning an image-to-image mapping, most commonly via deep learning or other machine-learning models (Zhang et al., 2023). In recent years, deep learning methods grounded in convolutional neural networks (CNNs) have been widely adopted for underwater image enhancement. Li et al. (2020a) proposed a multi-input/single-output (MISO) architecture that feeds both traditionally preprocessed images and raw underwater captures into a CNN, leveraging shallow-level multi-source cues to improve color rendition and contrast (Zhang et al., 2023). While CNN-based approaches can be effective, their limitations in complex underwater environments are well documented (Xu et al., 2025). To address these issues, researchers have incorporated Transformer-style channel self-attention and pixel-fusion mechanisms to capture global context while preserving local details (Shen et al., 2023). Considering the costs of human intervention and annotation, unsupervised or semi-supervised techniques based on generative adversarial networks (GANs) have emerged as a prominent direction. Through generator–discriminator adversarial training, these methods can learn robust enhancement mappings in the absence of paired clean references by exploiting learned priors. For instance, Islam et al. (2020) trained adversarial networks with unpaired underwater/clean imagery to strengthen style learning, and Li et al. (2018) used WaterGAN to synthesize underwater training data before training CNNs to improve sharpness and clarity (Zhang et al., 2023). Despite their ability to recover realistic details and color distributions, GANs are prone to instability and mode collapse during training, yielding inconsistent outputs. To mitigate these drawbacks, diffusion models have recently been introduced into the UIE literature. Shi and Wang (2024) proposed a Content-Preserving Diffusion Model (CPDM) built on a pre-trained backbone; at each diffusion step, they inject the difference between the raw underwater image and its noisy counterpart as a conditioning signal to compensate low-level features toward the source, thereby preserving original information while improving adaptability to complex underwater degradations. Within the above body of work, relatively few studies have jointly optimized image quality and the computational budget required for edge deployment. Some researchers have explored model compression techniques—such as network pruning and knowledge distillation—to shrink model size and enable effective operation on resource-constrained platforms. However, in practice, these lightweight variants have not consistently demonstrated markedly superior visual performance compared with more complex counterparts. Consequently, a key next-step objective, and the focus of this study, is to improve underwater image quality while simultaneously reducing the computational burden at the edge. To address edge resource constraints, this paper proposes the Res-Mamba network.
In contrast to existing Mamba-based methods, Res-Mamba integrates the following three primary innovations:

1. Architectural Innovation. ResMambaNet adopts a three-stage, dual-branch, multi-modal pathway from shallow to deep layers. In the shallow stage, RGB and Lab inputs are processed in parallel to extract complementary cues. In the middle and deep stages, each branch performs residual state-space sequence modeling and upsampling, ensuring faithful multi-scale propagation from local detail to global context. This design enables early separation and handling of color and fine structures while maintaining efficient feature fusion and spatial consistency at deeper levels, thereby markedly improving enhancement under spatially non-uniform illumination and turbid underwater conditions.

2. Res-Mamba Module. At each stage, we replace conventional purely convolutional or Transformer blocks with the proposed Res-Mamba module to realize efficient long-range dependency modeling via a state-space sequence model (SSM). Integrated through residual connections, Res-Mamba preserves local convolutional features while capturing global context with linear-time complexity (in sequence length). This design improves network scalability and substantially strengthens information exchange across channels and color spaces.

3. Fusion Mechanism with Cross-Attention. After shallow feature extraction, ResMambaNet—to the best of our knowledge, for the first time in underwater image enhancement—introduces a cross-attention mechanism to achieve complementary coupling between the RGB and Lab shallow features. We then apply Adaptive Instance Normalization (AdaIN) for alignment and fusion, performing element-wise fusion and channel-wise normalization at the shallow, middle, and deep stages. Finally, dedicated fusion sub-networks (shallow_fuse, mid_fuse, and deep_fuse) refine the aggregated features. This dynamic dual-stage fusion strategy synchronizes color and structural alignment and achieves an effective balance between detail preservation and color fidelity across multiple scales.

2 Related works

2.1 Underwater image enhancement

Underwater imagery is unavoidably degraded during acquisition by physical effects such as absorption and scattering, leading to pronounced color casts, low contrast, and blurred fine details. Prior studies have shown that attenuation in the red band is markedly higher than in the blue–green spectrum, causing a ubiquitous bluish–green color shift, concurrent luminance reduction, and weakened textures and edges. These degradations not only diminish perceptual image quality but also substantially impair the performance of downstream tasks, including underwater object detection, depth estimation, and semantic segmentation. To address these issues, current research on underwater image enhancement can be broadly categorized into two families: traditional (model-based) approaches and deep learning–based methods (Almutiry et al., 2024; Yuan et al., 2025). Traditional approaches can be further subdivided into heuristic, non-physical methods and physics-based image-formation models (Li et al., 2025). The former directly manipulate pixel intensities—e.g., histogram equalization, Retinex, and multi-scale/image fusion—to improve perceived contrast and saturation (Galdran et al., 2015; Guan et al., 2023). Although these techniques are computationally efficient and thus attractive for resource-constrained edge devices, they do not account for wavelength- and space-dependent attenuation in water and are therefore prone to under- or over-enhancement, limiting their effectiveness in complex underwater scenes (Raveendran et al., 2021). By contrast, physics-based methods—such as the Underwater Dark Channel Prior (UDCP) (Galdran et al., 2015) and illumination/attenuation models—explicitly model absorption and scattering and restore images by estimating scene parameters. Leveraging environmental physics, they can recover details and color to a certain extent; however, they typically require intricate parameter estimation (e.g., transmission and global illumination) and are sensitive to scene assumptions, which undermines stability in real-world applications (Berman et al., 2020). Moreover, the associated computational burden further complicates deployment on edge platforms with limited resources (Zhang et al., 2025). In recent years, deep learning–based methods have achieved notable advances in underwater image enhancement (Islam et al., 2020). Convolutional neural network (CNN) models—such as the multi-scale Water-Net—effectively integrate diverse enhancement operations to restore image quality (Li et al., 2020a; Li et al., 2021). Several CNN variants further exploit features from multiple color spaces (e.g., RGB and HSV) and employ attention mechanisms for feature fusion, thereby improving luminance and chromatic enhancement (Li et al., 2021; Cong et al., 2024). Generative adversarial networks (GANs) have additionally boosted perceptual quality; for example, physics-guided approaches such as PUGAN leverage imaging priors to steer restoration (Fu et al., 2022a). Nevertheless, these models often involve complex architectures with large parameter counts and substantial computational demands, which hinder real-time deployment on resource-constrained edge devices (Huang et al., 2023). With the emergence of Transformer architectures, a number of studies have capitalized on self-attention's strong modeling capacity to significantly improve enhancement performance—for instance, U-shaped Transformers (Peng et al., 2023) and Phaseformer (Khan et al.).
However, the quadratic computational complexity of standard self-attention leads to considerable overhead, limiting efficiency on edge platforms (Li et al., 2025). Diffusion-based approaches have also gained traction; for example, CLIP-UIE (Liu et al.) combines diffusion modeling with CLIP-domain knowledge and adopts rapid fine-tuning to improve enhancement quality (Lu et al., 2023). Yet diffusion models are intrinsically compute-intensive and parameter-heavy, which substantially constrains their applicability in edge-computing scenarios with tight resource budgets (Du et al., 2025; Tang et al., 2023). In summary, although underwater image enhancement (UIE) has achieved a series of notable advances, substantial challenges persist in resource consumption, computational efficiency, and suitability for deployment on edge devices (Tang et al., 2023). How to effectively reduce algorithmic complexity and parameter counts while maintaining—or even improving—enhancement quality, thereby enabling better adaptation to edge-computing environments, remains a pressing open problem in current UIE research (Zhang et al., 2024).

2.2 Edge-computing–based underwater image enhancement

As noted above, recent deep learning–based approaches to underwater image enhancement (e.g., CNNs, GANs, Transformers, and diffusion models) have achieved strong quantitative and subjective performance on standard benchmarks such as UIEB and LSUI (Cong et al., 2024; Guan et al., 2023). However, these methods typically presume high-performance servers or GPU-equipped environments, overlooking the stringent constraints on computation, power consumption, and real-time responsiveness in practical underwater operations (Hamdan et al., 2020; Mittal, 2024). Edge computing addresses this gap by offloading inference to embedded devices situated near the data source, thereby enabling localized processing and rapid response. This paradigm mitigates the latency and bandwidth bottlenecks inherent to cloud-centric workflows while simultaneously promoting privacy preservation and improving system robustness. Although research on underwater image enhancement (UIE) under edge constraints has started relatively recently, several studies have demonstrated the feasibility of deploying deep models in low-compute settings. Spanos et al. presented at ISPRS 2024 a physics-guided, real-time restoration framework on Jetson Nano, where TensorRT-based quantization and graph optimizations increased the throughput to 3–4 FPS while keeping power below 10 W, effectively balancing computational efficiency and enhancement quality (Antoniou et al., 2024). Tian et al. (2025) designed a lightweight Adaptive Trans-ResUNet++ that integrates separable convolutions, attention pruning, and depthwise-separable residual units; compared with a conventional TransUNet, the parameter count was reduced by approximately 60% while preserving comparable PSNR/SSIM on embedded platforms. In addition, an engineering case for underwater cultural heritage monitoring (Shi et al., 2024) combined seabed optical priors with coordinated scheduling across edge devices to achieve in-situ enhancement with remote synchronization, thereby validating the performance and stability of edge–cloud collaboration in field deployments. Building on the above advances, edge computing has, to some extent, mitigated the latency, bandwidth, and power constraints that hinder real-time performance and inflate resource consumption in underwater image enhancement. Nevertheless, research specifically tailored to lightweight enhancement algorithms and resource-efficient management strategies for underwater settings remains limited. In particular, there is insufficient exploration of methods to further reduce algorithmic complexity, optimize network architectures, and balance computational load for deployment under stringent edge constraints.

3 Methods

Underwater image degradation typically stems from two largely independent factors: (i) structural information loss in the spatial domain (e.g., blurred edges or vanished textures) and (ii) perceptual deterioration (e.g., color distortion and reduced contrast). Existing enhancement pipelines often conflate these two mechanisms within a single unified model, which can undermine both generalization and robustness. To address this limitation, we adopt an independent modeling strategy that decomposes the task into more targeted submodules and endows each with an appropriate inductive bias, thereby improving the model's generalization ability and interpretability. To effectively elevate perceptual quality under underwater conditions, we design three core modules, namely the Res-Mamba module, the Cross-Melt-A attention module, and the ELA-Norm module, each tailored to a distinct degradation factor in underwater imaging and supported by clear theoretical motivations and architectural choices. The following subsections detail the mathematical formulations and implementation specifics of each module.

3.1 Res-Mamba module

Underwater images typically exhibit a combination of global degradations induced by light propagation (e.g., color casts and contrast attenuation) and local detail losses (e.g., edge blurring and texture weakening). Conventional convolutional neural networks (CNNs) are limited in modeling long-range dependencies, whereas Transformer self-attention can capture global relations but incurs prohibitive quadratic complexity. To reconcile these trade-offs, we construct a Res-Mamba block within a state-space modeling (SSM) framework and integrate it into a residual architecture, yielding the Res-Mamba module. The SSM backbone enables linear-time modeling of long-range interactions across spatial tokens, delivering global consistency, while the residual pathway explicitly fuses local high-frequency cues—preserved by lightweight convolutional operators—into the main stream. This design jointly enhances global perceptual coherence and local structural fidelity under underwater imaging conditions, achieving a balanced improvement in both large-scale color/contrast correction and fine-detail restoration. Given a degraded underwater image $I \in \mathbb{R}^{H \times W \times 3}$, we first unfold its pixels in row-major order to obtain the sequence $\{u(k)\}_{k=1}^{N}$, where $N = H \times W$. We regard the enhancement process as a nonlinear mapping from the degraded sequence $u(k)$ to the reconstructed sequence $y(k)$, which can be modeled by continuous-time state-space equations as follows:

$$\frac{dz(t)}{dt} = F z(t) + G u(t), \qquad y(t) = H z(t) + J u(t). \tag{1}$$

Here, z(t) denotes the continuous latent state vector; F and G describe the state dynamics and input coupling, respectively; and H and J are output projection matrices that map the state and input to the output y(t) (Equation 1). This linear time-invariant (LTI) system characterizes the degradation and restoration processes in underwater imaging: z(t) can be interpreted as a lag term capturing the medium’s response to variations in light intensity; the eigenvalues of F correspond to propagation attenuation rates; and G and H quantify the influence of the input signal on the state and the output. Since images are inherently discretely sampled, we discretize the continuous model using a zero-order hold (ZOH). Let the sampling period be Δ; within each discrete step the input is held constant. The resulting discrete-time form is:

$$z_{k+1} = \Phi z_k + \Psi u_k, \qquad y_k = H z_k + J u_k. \tag{2}$$

Here, $k$ denotes the discrete-time step index; $\Phi = e^{F\Delta}$ is the state-transition matrix; and $\Psi = \int_{0}^{\Delta} e^{F\tau} G \, d\tau$ is the discrete input-effect matrix. An appropriate choice of $\Delta$ ensures that the discrete system closely approximates the original continuous model. The above SSM enables long-range dependencies over the pixel sequence to be modeled with linear-time complexity (Equation 2). For color images, we further instantiate the SSM independently for each channel—i.e., for $c \in \{R, G, B\}$ we use parameters $(F_c, G_c, H_c, J_c)$—thereby accounting for wavelength-dependent attenuation in water. The Res-Mamba block integrates the above SSM into a convolutional network. In implementation, we first apply a convolutional transform to the input features to obtain an initial representation, then perform SSM-based sequence modeling, and finally map back to the feature space via convolution and add a residual connection. Let the input feature map be $X \in \mathbb{R}^{C \times H \times W}$; the computation of the Res-Mamba module can be expressed as:

$$H_1 = f_{\mathrm{conv1}}(X), \qquad H_2 = \mathcal{M}(H_1), \qquad Y = X + f_{\mathrm{conv2}}(H_2). \tag{3}$$

Here, $f_{\mathrm{conv1}}(\cdot)$ and $f_{\mathrm{conv2}}(\cdot)$ denote convolutional operators (each followed by a nonlinear activation), and $\mathcal{M}$ denotes the Mamba operator, which feeds the feature map $H_1$ into the SSM in pixel-sequence form and maps the reconstructed sequence back to the original 2D feature grid. Equation 3 embodies a residual-learning paradigm: the $\mathcal{M}$ module performs dynamic sequence modeling via a state-recursive mechanism to capture long-range spatial dependencies and global structural patterns in feature space, while the residual connection $Y = X + (\cdot)$ preserves low-frequency background content and structural priors, superimposing the high-order feature increments produced by $\mathcal{M}$. This design restores global consistency in degraded regions without disrupting local details or texture continuity, thereby balancing semantic fidelity with perceptual naturalness. In sum, the Res-Mamba module fuses the SSM's global modeling capacity with a convolutional residual scaffold, enabling efficient correction of global degradation trends and reinforcement of fine-texture reconstruction in underwater imagery.
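Before turning to the module's hyperparameters, the zero-order-hold discretization of Equations 1, 2 can be made concrete with a short PyTorch sketch. This is not the authors' code: the state dimension, dynamics matrix, and input values are illustrative assumptions, and the closed-form expression for $\Psi$ assumes an invertible $F$.

```python
import torch

def zoh_discretize(F, G, delta):
    """Zero-order-hold discretization: Phi = exp(F*delta) and
    Psi = integral_0^delta exp(F*tau) G dtau = F^{-1} (Phi - I) G (valid for invertible F)."""
    n = F.shape[0]
    Phi = torch.matrix_exp(F * delta)
    Psi = torch.linalg.solve(F, (Phi - torch.eye(n)) @ G)
    return Phi, Psi

def ssm_scan(u, F, G, H, J, delta=0.1):
    """Run the recurrence of Equation 2, z_{k+1} = Phi z_k + Psi u_k and
    y_k = H z_k + J u_k, over a flattened pixel sequence u of shape (N, d)."""
    Phi, Psi = zoh_discretize(F, G, delta)
    z = torch.zeros(F.shape[0])
    ys = []
    for u_k in u:                       # one step per pixel: linear in N = H*W
        ys.append(H @ z + J @ u_k)      # output from the current state
        z = Phi @ z + Psi @ u_k         # state update
    return torch.stack(ys)

# toy usage: 4-dimensional latent state, one scalar value per pixel
Ns, d = 4, 1
F = -torch.eye(Ns) + 0.1 * torch.randn(Ns, Ns)   # stable-ish dynamics (assumption)
G = torch.randn(Ns, d)
H = torch.randn(d, Ns)
J = torch.zeros(d, d)
u = torch.rand(16 * 16, d)                       # a 16x16 image flattened row-major
y = ssm_scan(u, F, G, H, J)
print(y.shape)                                   # torch.Size([256, 1])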

The Res-Mamba module relies on several key hyperparameters of the underlying SSM. The latent state size $N_s$ controls the capacity to capture long-range spatial dependencies, while the discrete sampling period $\Delta$ determines the fidelity of the discrete approximation to the continuous-time dynamics. The pixel sequence length $L = H \times W$ defines the spatial context available to the SSM, and separate channel-wise matrices $(F_c, G_c, H_c, J_c)$ allow modeling of wavelength-dependent attenuation for each color channel $c \in \{R, G, B\}$. Finally, the surrounding convolutional layers—parameterized by the number of channels $C$ and kernel sizes $k_1 \times k_1$ and $k_2 \times k_2$—fuse the SSM output with local features, balancing global correction with fine-detail preservation. Together, these hyperparameters define the trade-off between global degradation modeling and local structural fidelity in underwater image enhancement. In our implementation, we set $N_s = 128$, $\Delta = 0.1$, $C = 64$, and $k_1 = k_2 = 3$.
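A minimal PyTorch sketch of a block with the structure of Equation 3 is shown below. It is not the released implementation: the Mamba operator $\mathcal{M}$ is replaced here by a simple per-channel, diagonal state-space recurrence (the real selective-scan operator uses input-dependent parameters and a fused kernel), while the channel count, kernel sizes, and state size follow the settings stated above.

```python
import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """Stand-in for the Mamba operator M: a per-channel, diagonal linear state-space
    scan over the row-major pixel sequence (a simplification of the selective scan)."""
    def __init__(self, channels, state_size=128, delta=0.1):
        super().__init__()
        self.log_a = nn.Parameter(torch.full((channels, state_size), -0.5))  # decay rates a = exp(log_a)
        self.b = nn.Parameter(torch.randn(channels, state_size) * 0.1)
        self.c = nn.Parameter(torch.randn(channels, state_size) * 0.1)
        self.d = nn.Parameter(torch.ones(channels))
        self.delta = delta

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        u = x.flatten(2)                                    # (B, C, N), N = H*W
        a = torch.exp(self.log_a)                           # positive decay, F = -a (diagonal)
        phi = torch.exp(-a * self.delta)                    # ZOH state-transition (C, S)
        psi = (1.0 - phi) / a * self.b                      # exact ZOH input matrix for diagonal F
        z = x.new_zeros(B, C, phi.shape[-1])
        ys = []
        for k in range(u.shape[-1]):                        # linear in sequence length
            uk = u[..., k]                                  # (B, C)
            ys.append((z * self.c).sum(-1) + self.d * uk)   # y_k = H z_k + J u_k
            z = z * phi + psi * uk.unsqueeze(-1)            # z_{k+1} = Phi z_k + Psi u_k
        return torch.stack(ys, dim=-1).view(B, C, H, W)

class ResMambaBlock(nn.Module):
    """Equation 3: Y = X + conv2(M(conv1(X)))."""
    def __init__(self, channels=64, state_size=128):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.ssm = SimpleSSM(channels, state_size)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(self.ssm(self.conv1(x)))

# toy usage
block = ResMambaBlock(channels=64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])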

3.2 Cross-melt-A module

We first refine the conventional AdaIN module and design a spatially adaptive feature-fusion block. The core idea is to use feature-map information to dynamically predict convolution kernels, enabling fine-grained convolution in the spatial domain. In a dual-branch architecture, we obtain two streams that respectively emphasize color/contrast and texture detail. The key to improving underwater image quality is to fuse these complementary cues to produce an output that is both color-faithful and detail-sharp. Simple summation or concatenation of the branch outputs is insufficient and may allow one cue to overshadow the other. Therefore, we introduce the Cross-Melt-A mechanism: during fusion, one branch serves as a query to attend to the other, adaptively computing importance weights. This allows selective fusion based on the correlation between color and detail features, mitigating information conflicts and highlighting enhancements in key regions. Let the feature map from the detail-enhancement branch be $F_d \in \mathbb{R}^{H \times W \times C_d}$ and that from the color-enhancement branch be $F_c \in \mathbb{R}^{H \times W \times C_c}$. For attention computation, we first flatten both into sequence representations: let $N = H \times W$, reshape $F_d$ to $D \in \mathbb{R}^{N \times C_d}$ and $F_c$ to $C \in \mathbb{R}^{N \times C_c}$. We then apply linear projections to $D$ and $C$ to obtain the query, key, and value matrices.

$$Q = D W_Q, \qquad K = C W_K, \qquad V = C W_V. \tag{4}$$

where $W_Q \in \mathbb{R}^{C_d \times d}$, $W_K \in \mathbb{R}^{C_c \times d}$, and $W_V \in \mathbb{R}^{C_c \times d}$ are learnable projection matrices, and $d$ denotes the dimensionality of the attention latent space. We then compute the attention map of the detail branch over the color branch using scaled dot-product attention:

$$A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V. \tag{5}$$

The resulting matrix $A \in \mathbb{R}^{N \times d}$ encodes the information aggregated for the detail features from the color features, where $A(p,:)$ denotes the weighted information vector at position $p$ obtained from the global color context. Finally, we fuse the attention output with the original detail features to form a unified representation.

$$D' = D + A. \tag{6}$$

Here, $D'$ denotes the fused feature sequence (dimension $N \times d$). Equation 6 implements residual cross-modal fusion: it preserves the original detail-branch features $D$ while superimposing the complementary information $A$ from the color branch. This means the network retains the local high-frequency cues from the detail stream and, at the same time, injects global color and contrast priors provided by the color stream, thereby enhancing contrast and color fidelity in target regions. For example, in a dim area that nevertheless contains important targets, the color branch may provide a coarse outline and chromatic distribution, whereas the detail branch supplies edge textures; after cross-attention fusion, the representation simultaneously emphasizes edge sharpness and color accuracy, yielding a more reliable basis for subsequent reconstruction. It is worth noting that the above procedure implements unidirectional attention fusion, wherein the detail branch acts as the query and the color branch provides keys and values. By symmetry, one may introduce the reverse configuration—using color features as queries and detail features as values—to further balance inter-branch information exchange when necessary. However, within our framework, we find that unidirectional cross-attention is sufficient to markedly improve fusion while maintaining manageable model complexity. Through the Cross-Melt-A module, the network adaptively modulates the relative contributions of color and detail enhancement, preserving global chromatic consistency while emphasizing salient structural details. This fusion strategy yields underwater images that are perceptually more natural and exhibit clearer layering.
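A minimal single-head PyTorch sketch of Equations 4–6 follows. It is an illustrative reconstruction rather than the authors' code: the detail features are additionally projected to the attention width $d$ so that the residual addition in Equation 6 is dimensionally consistent, and the full $N \times N$ attention map is only practical on the downsampled shallow features where the module is applied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossMeltAttention(nn.Module):
    """Unidirectional cross-attention fusion: the detail branch queries the color
    branch (Eqs. 4-5) and the attended color context is added residually (Eq. 6)."""
    def __init__(self, c_detail, c_color, d=64):
        super().__init__()
        self.wq = nn.Linear(c_detail, d, bias=False)
        self.wk = nn.Linear(c_color, d, bias=False)
        self.wv = nn.Linear(c_color, d, bias=False)
        self.proj_d = nn.Linear(c_detail, d, bias=False)  # aligns D with A for Eq. 6 (assumption)
        self.scale = d ** -0.5

    def forward(self, f_detail, f_color):          # (B, Cd, H, W), (B, Cc, H, W)
        B, _, H, W = f_detail.shape
        D = f_detail.flatten(2).transpose(1, 2)    # (B, N, Cd)
        C = f_color.flatten(2).transpose(1, 2)     # (B, N, Cc)
        Q, K, V = self.wq(D), self.wk(C), self.wv(C)
        attn = F.softmax(Q @ K.transpose(1, 2) * self.scale, dim=-1)   # (B, N, N)
        A = attn @ V                               # Eq. 5, aggregated color context (B, N, d)
        D_fused = self.proj_d(D) + A               # Eq. 6, residual cross-modal fusion
        return D_fused.transpose(1, 2).reshape(B, -1, H, W)

# toy usage on small shallow features
fuse = CrossMeltAttention(c_detail=32, c_color=32, d=64)
print(fuse(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16)).shape)  # (1, 64, 16, 16)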

3.3 Normalization module

To simultaneously correct global color bias and strengthen local details on underwater edge devices, we design the ELA-norm module. It comprises four sequential sub-steps—instance normalization, Efficient Local Attention (ELA), cross-channel attention fusion, and transposed-convolution upsampling—thereby forming a closed-loop pipeline from global style correction to detail reconstruction. First, to address substantial variations in tone and illumination styles across different waters, ELA-norm applies instance normalization (Equations 7, 8) to each channel of every image immediately after shallow feature extraction:

$$\mu_c = \frac{1}{HW}\sum_{i,j} F_{c,ij}, \qquad \sigma_c^2 = \frac{1}{HW}\sum_{i,j}\left(F_{c,ij} - \mu_c\right)^2, \tag{7}$$
$$\hat{F}_{c,ij} = \frac{F_{c,ij} - \mu_c}{\sqrt{\sigma_c^2 + \varepsilon}}\,\gamma_c + \beta_c. \tag{8}$$

where $\gamma_c$ and $\beta_c$ are learnable affine parameters. When $(\gamma_c, \beta_c) = (1, 0)$, the output channel attains zero mean and unit variance, effectively removing global color bias and providing a unified feature scale for subsequent structural learning. Subsequently, ELA-norm introduces Efficient Local Attention (ELA), applying global average pooling along the horizontal and vertical directions (Equation 9) on the fused feature map $F \in \mathbb{R}^{C \times H \times W}$.

$$p_c^{h}(i) = \frac{1}{W}\sum_{j} F_{c,ij}, \qquad p_c^{v}(j) = \frac{1}{H}\sum_{i} F_{c,ij}. \tag{9}$$

Next, lightweight 1D convolutions followed by ReLU and Sigmoid are applied to produce the row- and column-wise attention weights $M_c^{h} \in \mathbb{R}^{H}$ and $M_c^{v} \in \mathbb{R}^{W}$, which are then broadcast along channels to reweight the feature map (Equation 10).

$$\tilde{F}_{c,ij} = F_{c,ij} \cdot M_c^{h}(i) \cdot M_c^{v}(j). \tag{10}$$
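Concretely, Equations 9, 10 correspond to the following PyTorch sketch; the 1-D kernel size and the depthwise (per-channel) form of the convolutions are assumptions not stated in the paper.

```python
import torch
import torch.nn as nn

class EfficientLocalAttention(nn.Module):
    """Sketch of the ELA step: directional average pooling (Eq. 9), lightweight 1-D
    convolutions with ReLU/Sigmoid, and row/column reweighting (Eq. 10)."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        pad = kernel_size // 2
        self.conv_h = nn.Conv1d(channels, channels, kernel_size, padding=pad, groups=channels)
        self.conv_v = nn.Conv1d(channels, channels, kernel_size, padding=pad, groups=channels)

    def forward(self, x):                    # x: (B, C, H, W)
        p_h = x.mean(dim=3)                  # Eq. 9: average over width  -> (B, C, H)
        p_v = x.mean(dim=2)                  # Eq. 9: average over height -> (B, C, W)
        m_h = torch.sigmoid(torch.relu(self.conv_h(p_h)))   # row weights,    (B, C, H)
        m_v = torch.sigmoid(torch.relu(self.conv_v(p_v)))   # column weights, (B, C, W)
        return x * m_h.unsqueeze(3) * m_v.unsqueeze(2)      # Eq. 10: broadcast reweighting

# toy usage
ela = EfficientLocalAttention(channels=64)
print(ela(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```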

This step captures long-range spatial dependencies across entire rows and columns without incurring a significant computational burden, thereby highlighting salient edges and target regions while suppressing background noise. To further fuse multi-source features, ELA-norm embeds a Cross-Attention Fuse module. Let the two mapped feature streams (e.g., shallow vs. deep, or RGB vs. Lab) be $\tilde{F}_A$ and $\tilde{F}_B$. Each is first channel-recombined via a $1 \times 1$ convolution; then global average pooling followed by a two-layer bottleneck–Sigmoid is applied to produce the channel-attention vectors $a$ and $b$ (Equation 11), where $\sigma$ denotes the sigmoid activation that yields the per-channel coefficients.

$$a = \sigma\!\left(W_2^{A}\,\mathrm{ReLU}\!\left(W_1^{A}\,\mathrm{GAP}(\tilde{F}_A)\right)\right), \qquad b = \sigma\!\left(W_2^{B}\,\mathrm{ReLU}\!\left(W_1^{B}\,\mathrm{GAP}(\tilde{F}_B)\right)\right), \qquad a, b \in \mathbb{R}^{C}. \tag{11}$$

Finally, channel-wise fusion (Equation 12) is performed with a learnable parameter γ ensuring that the network adaptively allocates contributions between the two branches:

$$F_{\mathrm{fuse}} = \gamma\,(\tilde{F}_A \cdot a) + (1 - \gamma)\,(\tilde{F}_B \cdot b). \tag{12}$$
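A compact sketch of this channel-attention fusion (Equations 11, 12) is given below; the bottleneck ratio, the use of 1 × 1 convolutions for the channel recombination, and the scalar form of the learnable weight $\gamma$ are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionFuse(nn.Module):
    """Squeeze-style channel attention on each stream (Eq. 11), then a learnable
    convex combination of the reweighted streams (Eq. 12)."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.remap_a = nn.Conv2d(channels, channels, 1)   # 1x1 channel recombination
        self.remap_b = nn.Conv2d(channels, channels, 1)
        def gate():
            return nn.Sequential(nn.AdaptiveAvgPool2d(1),                 # GAP
                                 nn.Conv2d(channels, channels // r, 1), nn.ReLU(),
                                 nn.Conv2d(channels // r, channels, 1), nn.Sigmoid())
        self.gate_a, self.gate_b = gate(), gate()
        self.gamma = nn.Parameter(torch.tensor(0.5))      # learnable branch weight

    def forward(self, fa, fb):                            # two (B, C, H, W) streams
        fa, fb = self.remap_a(fa), self.remap_b(fb)
        a, b = self.gate_a(fa), self.gate_b(fb)           # Eq. 11: channel vectors
        return self.gamma * (fa * a) + (1 - self.gamma) * (fb * b)   # Eq. 12

# toy usage
fuse = CrossAttentionFuse(channels=64)
print(fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)).shape)
```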

After fusion, ELA-norm upsamples the feature map to the original resolution using transposed convolution (deconvolution). Unlike bilinear interpolation, deconvolution learns a convolution kernel $k$, applied with stride $s$ (Equation 13):

$$O(x, y) = \sum_{i,j} \tilde{F}_{i,j}\; k\!\left(x - i s,\; y - j s\right). \tag{13}$$

At high-frequency locations, the learned kernel adaptively amplifies edge features during training, while in flat regions it performs smooth interpolation; together with the final global residual $I + O$ and pixel clipping, this achieves precise enhancement at both local and global scales. Overall, ELA-norm coordinates four mathematically grounded components—normalization, spatial attention, cross-channel attention, and learnable upsampling—to eliminate global color-bias discrepancies across underwater images and, on resource-constrained edge devices, to ensure efficient detail reconstruction and real-time performance.
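To close the loop, the learnable upsampling of Equation 13 together with the global residual and pixel clipping can be sketched as follows; the 2× stride, kernel size, and the 3-channel output projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UpsampleAndResidual(nn.Module):
    """Sketch of the final ELA-norm stage: learnable 2x upsampling via transposed
    convolution (Eq. 13), a global residual with the input image, and clipping to [0, 1]."""
    def __init__(self, channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1)
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, feats, image):          # feats: (B, C, H/2, W/2); image: (B, 3, H, W)
        o = self.to_rgb(self.up(feats))       # learned interpolation, Eq. 13
        return (image + o).clamp(0.0, 1.0)    # global residual I + O, then pixel clipping

# toy usage
head = UpsampleAndResidual(channels=64)
print(head(torch.randn(1, 64, 128, 128), torch.rand(1, 3, 256, 256)).shape)  # (1, 3, 256, 256)
```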

4 Experimental analysis and discussion

To objectively evaluate the proposed model, we conduct experiments on the EUVP (Islam et al., 2020) and UIEB (Li et al., 2020b) datasets. This section details the experimental setting, dataset descriptions, evaluation metrics, and experimental results.

4.1 Experimental environment and setup details

This study employs the PyTorch 1.13 deep learning framework, built on CUDA 11.7 and accelerated with the cuDNN 8.5 library. All training procedures are conducted within a Conda-managed virtual environment. The central processing unit (CPU) used in the experiments is an Intel® Xeon® W-2255 with 10 physical cores and 20 threads, featuring a maximum clock frequency of 3.70 GHz.

The model was trained using the Adam optimizer with an initial learning rate of 0.01, which was decayed by a factor of 0.3 every 30 epochs. A total of 100 epochs were performed with a mini-batch size of 4. During training, checkpoints were saved every 10 epochs to enable subsequent recovery and evaluation. The experiments employ two public datasets, EUVP and UIEB. A unified preprocessing pipeline resizes all input images to 256×256 and converts them to tensor format. For EUVP, the training–test split is 8:2, whereas UIEB provides an independent training and test partition.
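For reference, the training configuration described above corresponds to the following PyTorch sketch. The model and data are random stand-ins (a single convolution and synthetic tensors), and the L1 reconstruction loss is an assumption; the optimizer, schedule, batch size, input size, and checkpoint frequency follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

model = nn.Conv2d(3, 3, 3, padding=1)                      # placeholder for ResMambaNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Adam, initial learning rate 0.01
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.3)  # x0.3 every 30 epochs

# stand-in for the resized 256x256 degraded/reference image pairs
pairs = TensorDataset(torch.rand(8, 3, 256, 256), torch.rand(8, 3, 256, 256))
loader = DataLoader(pairs, batch_size=4, shuffle=True)     # mini-batch size 4

for epoch in range(100):                                   # 100 epochs in total
    for degraded, reference in loader:
        optimizer.zero_grad()
        loss = F.l1_loss(model(degraded), reference)       # loss choice is an assumption
        loss.backward()
        optimizer.step()
    scheduler.step()
    if (epoch + 1) % 10 == 0:                              # checkpoint every 10 epochs
        torch.save(model.state_dict(), f"checkpoint_epoch{epoch + 1:03d}.pth")
```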

4.2 Evaluation metrics

Complex underwater imaging conditions often lead to color distortion, detail blurring, and reduced contrast. Given these heterogeneous degradations, existing evaluation protocols may struggle to comprehensively characterize an enhancement algorithm's performance across structural restoration, pixel fidelity, and perceptual quality. Accordingly, we adopt common no-reference quality indices for underwater image enhancement, including UCIQE, UIQM, CCF, and FDUM. In ablation studies, SSIM and PSNR are additionally reported as auxiliary references. SSIM primarily assesses the recovery of spatial structural information, effectively reflecting an algorithm's ability to restore local textures and fine details (Chen et al., 2021; Liu et al., 2025; Yan et al., 2025). PSNR focuses on overall pixel-level distortion by quantifying the average error between enhanced and reference images, thereby indicating performance in noise suppression and detail fidelity (Han et al., 2023).
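As a concrete reference for how these two full-reference scores are computed in the ablation study, a scikit-image-based snippet is shown below (an illustrative implementation choice; the paper does not state which library it uses).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Images are float arrays in [0, 1] with shape (H, W, 3); random data stands in
# for an enhanced result and its reference.
enhanced = np.random.rand(256, 256, 3)
reference = np.random.rand(256, 256, 3)

psnr = peak_signal_noise_ratio(reference, enhanced, data_range=1.0)
ssim = structural_similarity(reference, enhanced, data_range=1.0, channel_axis=-1)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.3f}")
```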

For no-reference evaluation, UCIQE linearly combines chroma difference, saturation, and contrast in the CIELab space to capture global color shift and contrast changes (Raveendran et al., 2021). UIQM standardizes and weights three components—colorfulness, sharpness, and contrast—yielding a measure more aligned with human visual perception. CCF extends the UIQM framework by incorporating a haze component derived from the dark channel prior, improving sensitivity in highly turbid scenes (Hou et al., 2024). FDUM employs frequency-domain transforms to quantify high-frequency detail fidelity and, together with dark-channel–based contrast correction and multivariate regression weighting, can finely capture artifacts and texture loss after enhancement.

4.3 Results analysis

In this section, to assess the effectiveness of the ResMambaNet introduced in this paper, we carried out both quantitative and qualitative experiments. A variety of currently available underwater image enhancement methods were chosen for comparison, such as Fusion (Ancuti et al., 2017), NU2Net (Guo et al., 2023), DiffWater (Guan et al., 2023), DMWater (Tang et al., 2023), U-Shape (Peng et al., 2023), Shallow-UWnet (Naik et al., 2021), UColor (Li et al., 2021), PUIE-Net (Fu et al., 2022b), HCLR-Net (Zhou et al., 2024), UW-DiffPhys (Bach et al., 2024), and Waterdiff (Guan et al., 2024). These methods cover widely recognized mainstream techniques for underwater image enhancement, along with algorithms based on advanced neural network architectures.

The quantitative results on the EUVP dataset are summarized in Table 1, where red, blue, and black denote the first-, second-, and third-ranked scores across metrics, respectively. Horizontally, our proposed method (Ours) achieves optimal performance in three out of four evaluated metrics. On the UCIQE metric, both Ours and HCLR-Net attain a score of 0.618 (red), marginally surpassing NU2Net (0.606, blue) and PUIE-Net (0.594, black). For the CCF metric, Ours secures 38.981 (red), significantly outperforming the second-best HCLR-Net (36.929, blue). In UIQM, Ours records 1.45 (blue), while UW-DiffPhys leads with 1.542 (red), followed by HCLR-Net at 1.411 (black). The sole exception is FDUM, where Ours scores 0.554 (black), trailing closely behind HCLR-Net's 0.555 (blue) by a narrow margin of 0.001. Longitudinally, Ours consistently ranks within the top three across all metrics (two first places, one second, one third), demonstrating robust competitiveness across multidimensional evaluation criteria.

Table 1
www.frontiersin.org

Table 1. Numerical evaluation of the proposed approach against cutting-edge methods using the EUVP dataset. Bold indicates the best result for each metric; red denotes the highest score, blue the second highest, and black the third highest.

Quantitative assessments on the UIEB dataset (Table 2) reveal pronounced discrepancies among compared methods. Horizontally, Ours achieves the highest UCIQE score (0.614, red), slightly exceeding HCLR-Net (0.613, blue) and DMWater (0.603, black). In UIQM, the diffusion-based UW-DiffPhys_mean claims the lead (1.519, red), followed by DiffWater (1.424, blue) and Ours (1.379, black).

Table 2
www.frontiersin.org

Table 2. Numerical evaluation of the proposed approach against recent diffusion-based or Transformer-based methods using the UIEB dataset. Red denotes the highest score, blue the second highest, and black the third highest.

For color fidelity (CCF), Ours tops the ranking (35.43, red), outperforming UW-DiffPhys_mean (34.23, blue) and DMWater (27.062, black) by approximately 33.5%, showcasing superior color restoration and contrast enhancement capabilities. As illustrated in Figures 1 and 2, compared to DiffWater and DMWater, ours generates results closer to reference images in chroma and contrast, better preserving inherent image content. In detail fidelity (FDUM), UW-DiffPhys_mean leads (0.734, red), with Ours ranking third (0.629, black) and DiffWater second (0.676, blue). Overall, the top three positions across metrics are dominated by these methods, reflecting their relative strengths in distinct evaluation dimensions. Compared to unprocessed raw underwater images (Raw baseline), nearly all models yield improvements. Raw baseline values for UCIQE, UIQM, CCF, and FDUM are 0.520, 1.157, 20.514, and 0.447, respectively. Excluding extreme cases like Shallow-UWnet, all enhancement methods surpass these values. Early approaches such as Fusion (2018) and UColor offer limited gains: Fusion improves UIQM by 16% over Raw but slightly reduces CCF, whereas UColor mitigates color casts yet decreases CCF by 11%, falling below Raw levels.

Figure 1
Figure 1. Qualitative analysis of the proposed approach in comparison with leading-edge methods on the UIEB dataset.

Figure 2
Figure 2. Qualitative analysis of the proposed approach in comparison with leading-edge methods on the EUVP dataset.

This indicates trade-offs inherent in conventional algorithms. In contrast, advanced deep learning methods deliver more balanced improvements across all four metrics. For instance, HCLR-Net elevates UCIQE by 18% (0.520 → 0.613), UIQM by 19%, CCF by 73%, and FDUM by 41%; DiffWater substantially boosts UIQM and FDUM (+23% and +51%, respectively); UW-DiffPhys_mean excels in UIQM, CCF, and FDUM (+31.28%, +66.8%, +64.2%) alongside a 16% UCIQE gain and 31% CCF improvement, emphasizing its dominance in color and detail domains; DMWater enhances UCIQE/UIQM/CCF by 16%/18%/32% but only achieves a 24% FDUM increase. Notably, Ours exhibits the most substantial and comprehensive improvements relative to Raw: UCIQE increases from 0.520 to 0.614 (+18%), UIQM from 1.157 to 1.379 (+19%), CCF from 20.514 to 35.428 (+74%), and FDUM from 0.447 to 0.629 (+41%). Thus, while most enhancement models improve objective quality over Raw, Ours delivers the largest and most holistic advancements. Although the proposed method demonstrates strong performance in most scenarios and can effectively remove color casts while enhancing the clarity of underwater images, certain limitations remain. As shown in Figure 1, residual haze pixels are still visible in the first row, and the overall color tone in the seventh row exhibits a slight deviation. These artifacts may stem from the limited feature extraction capability of the shallow layers, which constrains the network's ability to fully capture fine-grained and low-saliency information.

On the EUVP dataset, Ours ranks among the top three on all four no-reference quality metrics: UCIQE reaches 0.618 (tied with HCLR-Net for first), UIQM stands at 1.450 (second), CCF peaks at 38.981 (first), and FDUM registers 0.554 (third, just below HCLR-Net's 0.555). By contrast, NU2Net performs adequately in UCIQE (0.61) and FDUM (0.52) but lags in CCF and UIQM; legacy methods like Fusion (2018) and U-Shape score low across most metrics (e.g., U-Shape's UCIQE: 0.408), indicating limited enhancement capacity. Notably, UW-DiffPhys ranks first in UIQM and FDUM but underperforms overall compared to Ours. Broadly, methods synergistically optimizing color correction (high UCIQE/CCF) and detail sharpening (high UIQM/FDUM) yield superior composite quality, with Ours serving as a paradigmatic exemplar.

Rankings shift somewhat on the UIEB dataset. Ours retains top positions in UCIQE (0.614) and CCF (35.428), validating its sustained color and contrast correction capabilities across diverse underwater scenes. As shown in Figure 1, when processing severely color-distorted underwater images, Ours not only removes haze but also effectively restores true dynamics of aquatic scenes (Figure 2). Taking the "blue-like red algae" example in Figure 3 (top-down), more severe color casts amplify Ours' superiority over alternatives. In UIQM, diffusion-based UW-DiffPhys_mean claims the highest score (1.519, red), followed by DiffWater (1.424, blue) and Ours (1.379, black). UW-DiffPhys_mean's leadership in UIQM (1.519) and FDUM (0.734) highlights its advantage in sharpness and detail enhancement on this dataset. DiffWater follows with 1.424 (UIQM, second) and 0.676 (FDUM, second), while Ours trails at 1.379 (UIQM, third) and 0.629 (FDUM, third), nonetheless outperforming most other methods. Some methods (e.g., DMWater) show marked improvements on UIEB versus EUVP, potentially reflecting diffusion models' adaptability to complex lighting and haze conditions; conversely, color-balancing–focused methods such as Fusion (2018) and UColor remain mid-tier (e.g., Fusion's CCF: 20.053), exposing limitations under low visibility. Overall, Ours maintains consistent advantages in color fidelity and contrast, whereas diffusion/fusion counterparts prioritize sharpness and detail, illustrating algorithm-dataset alignment disparities.

Figure 3
Figure 3. The structural diagram of ResMambaNet proposed in this paper. This framework processes the image through two branches, color and detail, respectively.

Longitudinally analyzing Ours' cross-metric performance: UCIQE–Ranked first on both EUVP and UIEB, evidencing exceptional global color bias correction and contrast enhancement; UIQM–Peaks on EUVP (1.450) but ranks third on UIEB (1.379), behind DiffWater, highlighting a balance between color preservation and sharpness; CCF–Dominates both datasets, proving excellence in underwater color cast removal and natural color recovery; FDUM–Third on EUVP (0.554) and third on UIEB (0.629), matching top-tier methods (e.g., HCLR-Net, DiffWater) in detail/texture fidelity.

Averaging ranks across four metrics, Ours achieves an average rank of 1.75 on EUVP (two firsts, one second, one third) and 2.0 on UIEB (two firsts, two thirds), both lowest among competitors, signifying sustained high performance across metrics and datasets. Close contenders include HCLR-Net (avg. rank 2.0 on EUVP) and UW-DiffPhys_mean (avg. rank 1.33–1.5 on UIEB), while other methods exhibit higher average ranks (≥3.5), indicative of greater performance volatility. The computational cost and parameter count of our method are shown in Table 3; the method balances computing power and performance well.

Table 3
www.frontiersin.org

Table 3. Comparison of different UIE models in terms of GFLOPs (G) and parameters (M). A downward arrow indicates that lower values are better. Bold indicates the best result.

In summary, Ours attains state-of-the-art or near-state-of-the-art performance in four critical dimensions: color correction, haze suppression, contrast enhancement, and detail fidelity. It demonstrates strong consistency and stability across both EUVP and UIEB datasets. Compared to HCLR-Net and UW-DiffPhys_mean, Ours frequently leads in color fidelity (CCF) and global color-contrast integration (UCIQE), while closely following in sharpness (UIQM) and detail preservation (FDUM), thereby striking an optimal balance between visual quality and quantitative metrics.

4.4 Ablation experiments

As shown in Table 4, the ablation study validates the contribution of both the cross-melt attention fusion strategy and the Res-Mamba module. In particular, the fusion between the RGB and Lab branches is achieved through a two-stage mechanism: first, each branch is enhanced by its own Res-Mamba block to capture long-range dependencies, and then the features are integrated by cross-melt attention, which aligns the complementary color (Lab) and structural (RGB) representations across scales. The fused features are further refined by AdaIN-based normalization, ensuring adaptive balance between the two modalities. Removing this RGB–Lab fusion step causes the largest performance drop, especially on UCIQE (from 0.618 to 0.595) and SSIM (from 0.763 to 0.736), highlighting the necessity of multi-branch alignment for perceptual quality enhancement. Excluding the Res-Mamba module leads to a moderate decrease in PSNR (21.17 → 20.90) and SSIM (0.763 → 0.741), underscoring the role of long-range dependency modeling before fusion. The full model consistently achieves the best balance across UCIQE, PSNR, and SSIM, while incurring only marginally higher GFLOPs (253.53 vs. 252). These results confirm that RGB–Lab fusion via cross-melt attention and Res-Mamba blocks are complementary, jointly strengthening both accuracy and efficiency in ResMambaNet.

Table 4
www.frontiersin.org

Table 4. Ablation study of various modules and loss functions on the EUVP dataset.

5 Conclusion

This work presented ResMambaNet, a lightweight framework for underwater image enhancement. The network integrates three key designs: a dual-branch pathway that processes RGB and Lab features in parallel and progressively fuses them to decouple color and structure; a Res-Mamba module that couples local convolution with linear-time state-space modeling for efficient long-range dependency capture; and a cross-attention + AdaIN fusion strategy across multiple scales to align color statistics and structural cues. Comprehensive experiments on EUVP and UIEB demonstrate that ResMambaNet achieves state-of-the-art or near state-of-the-art performance. It consistently improves UCIQE, UIQM, CCF, and FDUM by 4%, 19%, 74%, and 41%, respectively, while maintaining compactness with only 0.46M parameters and 253.53 GFLOPs (Table 3). These results confirm that principled fusion of color/structure cues and efficient long-range modeling deliver substantive perceptual benefits at low computational cost. In summary, ResMambaNet advances the accuracy–efficiency trade-off of underwater image enhancement and provides a practical foundation for real-time underwater perception in exploration, inspection, and robotics.

For future work, one promising direction is to extend ResMambaNet from still images to video sequences, where temporal consistency and real-time constraints are critical for underwater robotics. Another avenue is to investigate adaptive training strategies that generalize across varying water types and lighting conditions, reducing the need for dataset-specific fine-tuning. Moreover, integrating ResMambaNet into multi-modal systems (e.g., combining optical images with sonar or LiDAR data) could further enhance robustness in challenging underwater environments.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

NX: Data curation, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review and editing. YZ: Formal Analysis, Investigation, Visualization, Writing – original draft, Writing – review and editing.

Funding

The author(s) declare that no financial support was received for the research and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Almutiry, O., Iqbal, K., Hussain, S., Mahmood, A., and Dhahri, H. (2024). Underwater images contrast enhancement and its challenges: a survey. Multimedia Tools Appl. 83, 15125–15150. doi:10.1007/s11042-021-10626-4

Ancuti, C. O., Ancuti, C., De Vleeschouwer, C., and Bekaert, P. (2017). Color balance and fusion for underwater image enhancement. IEEE Trans. Image Process. 27, 379–393. doi:10.1109/tip.2017.2759252

Antoniou, C., Spanos, S., Vellas, S., Ntouskos, V., and Karantzalos, K. (2024). Streamur: physics-Informed near real-time underwater image restoration. Int. Archives Photogrammetry, Remote Sens. Spatial Inf. Sci. 48, 1–8. doi:10.5194/isprs-archives-xlviii-3-2024-1-2024

Bach, N. G., Tran, C. M., Kamioka, E., and Tan, P. X. (2024). Underwater image enhancement with physical-based denoising diffusion implicit models. arXiv. doi:10.48550/arXiv.2409.18476

Berman, D., Levy, D., Avidan, S., and Treibitz, T. (2020). Underwater single image color restoration using haze-lines and a new quantitative dataset. IEEE Trans. Pattern Analysis Mach. Intell. 43 (8), 2822–2837. doi:10.1109/TPAMI.2020.2977624

Chen, X., Zhang, P., Quan, L., Yi, C., and Lu, C. (2021). Underwater image enhancement based on deep learning and image formation model. arXiv Prepr. arXiv:2101.00991. Available online at: https://api.semanticscholar.org/CorpusID:230433899.

Cong, X., Zhao, Y., Gui, J., Hou, J., and Tao, D. (2024). A comprehensive survey on underwater image enhancement based on deep learning. arXiv Prepr. arXiv:2405.19684.

Du, D., Li, E., Si, L., Zhai, W., Xu, F., Niu, J., et al. (2025). Uiedp: boosting underwater image enhancement with diffusion prior. Expert Syst. Appl. 259, 125271. doi:10.1016/j.eswa.2024.125271

Fu, Z., Wang, W., Huang, Y., Ding, X., and Ma, K.-K. (2022a). “Uncertainty inspired underwater image enhancement,” in European conference on computer vision (Springer), 465–482.

Fu, Z., Wang, W., Huang, Y., Ding, X., and Ma, K.-K. (2022b). Uncertainty inspired underwater image enhancement. Cham: Proc. Eur. Conf. Comput. Vis., 465–482.

Galdran, A., Pardo, D., Picón, A., and Alvarez-Gila, A. (2015). Automatic red-channel underwater image restoration. J. Vis. Commun. Image Represent. 26, 132–145. doi:10.1016/j.jvcir.2014.11.006

Guan, M., Xu, H., Jiang, G., Yu, M., Chen, Y., Luo, T., et al. (2023). Diffwater: underwater image enhancement based on conditional denoising diffusion probabilistic model. IEEE J. Sel. Top. Appl. Earth Observations Remote Sens. 17, 2319–2335. doi:10.1109/jstars.2023.3344453

Guo, C., Wu, R., Jin, X., Han, L., Zhang, W., Chai, Z., et al. (2023). Underwater ranker: learn which is better and how to be better. Proc. AAAI Conf. Artif. Intell. 37, 702–709. doi:10.1609/aaai.v37i1.25147

Hamdan, S., Ayyash, M., and Almajali, S. (2020). Edge-computing architectures for internet of things applications: a survey. Sensors 20, 6441. doi:10.3390/s20226441

Han, J., Zhou, J., Wang, L., Wang, Y., and Ding, Z. (2023). Fe-gan: fast and efficient underwater image enhancement model based on conditional gan. Electronics 12, 1227. doi:10.3390/electronics12051227

Hashisho, Y., Albadawi, M., Krause, T., and von Lukas, U. F. (2019). “Underwater color restoration using u-net denoising autoencoder,” in 2019 11th international symposium on image and signal processing and analysis (ISPA) (IEEE), 117–122.

Hou, G., Zhang, S., Lu, T., Li, Y., Pan, Z., and Huang, B. (2024). No-reference quality assessment for underwater images. Comput. Electr. Eng. 118, 109293. doi:10.1016/j.compeleceng.2024.109293

Hu, K., Weng, C., Zhang, Y., Jin, J., and Xia, Q. (2022). An overview of underwater vision enhancement: from traditional methods to recent deep learning. J. Mar. Sci. Eng. 10, 241. doi:10.3390/jmse10020241

Huang, S., Wang, K., Liu, H., Chen, J., and Li, Y. (2023). “Contrastive semi-supervised learning for underwater image restoration via reliable bank,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18145–18155. doi:10.1109/cvpr52729.2023.01740

Islam, M. J., Xia, Y., and Sattar, J. (2020). Fast underwater image enhancement for improved visual perception. IEEE Robot. Autom. Lett. 5, 3227–3234. doi:10.1109/lra.2020.2974710

Karypidis, E., Mouslech, S. G., Skoulariki, K., and Gazis, A. (2022). Comparison analysis of traditional machine learning and deep learning techniques for data and image classification, arXiv preprint arXiv:2204.05983. WSEAS Trans. Math. 21, 122–130. doi:10.37394/23206.2022.21.19

Li, C., Guo, C., Ren, W., Cong, R., Hou, J., Kwong, S., et al. (2019). An underwater image enhancement benchmark dataset and beyond. IEEE Trans. Image Process. 29, 4376–4389. doi:10.1109/tip.2019.2955241

Li, C., Anwar, S., and Porikli, F. (2020a). Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognit. 98, 107038. doi:10.1016/j.patcog.2019.107038

Li, C., Guo, C., Ren, W., Cong, R., Hou, J., Kwong, S., et al. (2020b). An underwater image enhancement benchmark dataset and beyond. IEEE Trans. Image Process. 29, 4376–4389. doi:10.1109/TIP.2019.2955241

Li, C., Anwar, S., Hou, J., Cong, R., Guo, C., and Ren, W. (2021). Underwater image enhancement via medium transmission-guided multi-color space embedding. IEEE Trans. Image Process. 30, 4985–5000. doi:10.1109/tip.2021.3076367

Li, F., Li, W., Zheng, J., Wang, L., and Xi, Y. (2025). Contrastive feature disentanglement via physical priors for underwater image enhancement. Remote Sens. 17, 759. doi:10.3390/rs17050759

Li, J., Skinner, K. A., Eustice, R. M., and Johnson-Roberson, M. (2018). WaterGAN: unsupervised generative network to enable real-time color correction of monocular underwater images. IEEE Robot. Autom. Lett. 3 (1), 387–394. doi:10.1109/LRA.2017.2730363

Liu, W., Xu, J., He, S., Chen, Y., Zhang, X., Shu, H., et al. (2025). Underwater-image enhancement based on maximum information-channel correction and edge-preserving filtering. Symmetry 17, 725. doi:10.3390/sym17050725

Lu, S., Guan, F., Zhang, H., and Lai, H. (2023). Underwater image enhancement method based on denoising diffusion probabilistic model. J. Vis. Commun. Image Represent. 96, 103926. doi:10.1016/j.jvcir.2023.103926

Mittal, P. (2024). A comprehensive survey of deep learning-based lightweight object detection models for edge devices. Artif. Intell. Rev. 57, 242. doi:10.1007/s10462-024-10877-1

Guan, M., Xu, H., and Luo, T. (2024). Waterdiff: underwater image enhancement latent diffusion model based on mamba and transmission map guidance.

Naik, A., Swarnakar, A., and Mittal, K. (2021). “Shallow-uwnet: compressed model for underwater image enhancement (student abstract),” in Proceedings of the AAAI Conference on Artificial Intelligence, 15853–15854. doi:10.1609/aaai.v35i18.17923

Peng, L., Zhu, C., and Bian, L. (2023). U-shape transformer for underwater image enhancement. IEEE Trans. Image Process. 32, 3066–3079. doi:10.1109/tip.2023.3276332

Raveendran, S., Patil, M. D., and Birajdar, G. K. (2021). Underwater image enhancement: a comprehensive review, recent trends, challenges and applications. Artif. Intell. Rev. 54, 5413–5467. doi:10.1007/s10462-021-10025-z

Shen, Z., Xu, H., Luo, T., Song, Y., and He, Z. (2023). Udaformer: underwater image enhancement based on dual attention transformer. Comput. and Graph. 111, 77–88. doi:10.1016/j.cag.2023.01.009

Shi, X., and Wang, Y.-G. (2024). Cpdm: content-preserving diffusion model for underwater image enhancement. Sci. Rep. 14, 31309. doi:10.1038/s41598-024-82803-y

Shi, T., Cai, Z., Li, J., and Gao, H. (2024). Optimize the age of useful information in edge-assisted energy-harvesting sensor networks. ACM Trans. Sens. Netw. 20, 1–26. doi:10.1145/3640342

Tang, Y., Kawasaki, H., and Iwaguchi, T. (2023). “Underwater image enhancement by transformer-based diffusion model with non-uniform sampling for skip strategy,” in Proceedings of the 31st ACM International Conference on Multimedia, 5419–5427. doi:10.1145/3581783.3612378

Tao, Y., Tang, J., Zhao, X., Zhou, C., Wang, C., and Zhao, Z. (2024). Multi-scale network with attention mechanism for underwater image enhancement. Neurocomputing 595, 127926. doi:10.1016/j.neucom.2024.127926

Tian, Y., Yao, K., and Yu, X. (2025). An adaptive underwater image enhancement framework via multi-domain fusion and color compensation. arXiv Prepr. arXiv:2503.03640.

Xu, C., Zhou, W., Huang, Z., Zhang, Y., Zhang, Y., Wang, W., et al. (2025). Fusion-based graph neural networks for synergistic underwater image enhancement. Inf. Fusion 117, 102857. doi:10.1016/j.inffus.2024.102857

Yan, J., Hu, H., Wang, Y., Nawaz, M. W., Ur Rehman Junejo, N., Guo, E., et al. (2025). Underwater image enhancement via multiscale disentanglement strategy. Sci. Rep. 15, 6076. doi:10.1038/s41598-025-89109-7

Yuan, J., Zhang, Y., and Cai, Z. (2025). Underwater scene enhancement via adaptive color analysis and multispace fusion. IEEE J. Ocean. Eng., 1–13. doi:10.1109/JOE.2025.3591405

Zhang, S., Zhao, S., An, D., Li, D., and Zhao, R. (2023). Mdnet: a fusion generative adversarial network for underwater image enhancement. J. Mar. Sci. Eng. 11, 1183. doi:10.3390/jmse11061183

Zhang, Y., Yuan, J., and Cai, Z. (2024). Dcgf: diffusion-color guided framework for underwater image enhancement. IEEE Trans. Geoscience Remote Sens. 63, 1–12. doi:10.1109/tgrs.2024.3522685

Zhang, Y., Yu, X., and Cai, Z. (2025). Uwmambanet: dual-branch underwater image reconstruction based on w-shaped mamba. Mathematics 13, 2153. doi:10.3390/math13132153

Zhou, J., Sun, J., Li, C., Jiang, Q., Zhou, M., Lam, K.-M., et al. (2024). Hclr-net: hybrid contrastive learning regularization with locally randomized perturbation for underwater image enhancement. Int. J. Comput. Vis. 132, 4132–4156. doi:10.1007/s11263-024-01987-y

Zhu, S., Geng, Z., Xie, Y., Zhang, Z., Yan, H., Zhou, X., et al. (2025). New underwater image enhancement algorithm based on improved u-net. Water 17, 808. doi:10.3390/w17060808

Keywords: underwater vision reconstruction, residual state-space modeling, dual-branch feature decoupling, cross-attention fusion, computational imaging

Citation: Xiong N and Zhang Y (2025) Residual state-space networks with cross-scale fusion for efficient underwater vision reconstruction. Front. Remote Sens. 6:1703239. doi: 10.3389/frsen.2025.1703239

Received: 11 September 2025; Accepted: 15 October 2025;
Published: 18 November 2025.

Edited by:

Guangliang Cheng, University of Liverpool, United Kingdom

Reviewed by:

Wang Guangbiao, Harbin Engineering University, China
Ting-Bing Xu, Beihang University, China

Copyright © 2025 Xiong and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yuhan Zhang, zhangyuhan_tj@foxmail.com
