
ORIGINAL RESEARCH article

Front. Environ. Sci., 02 January 2026

Sec. Environmental Informatics and Remote Sensing

Volume 13 - 2025 | https://doi.org/10.3389/fenvs.2025.1689855

This article is part of the Research Topic: Underwater Visual Signal Processing in the Data-Driven Era.

A machine learning-driven framework for enhancing underwater visual signal processing in marine ecosystem economic monitoring and anthropogenic impact assessment

Wang Minglong1*, Zhu Feng1 and Jian Hu2
  • 1School of Computer Science, Xijing University, Xi’an, Shaanxi, China
  • 2School of Information Engineering, North China University of Technology, Beijing, China

Introduction: Recent advancements in underwater monitoring technologies have highlighted the critical need for intelligent systems capable of addressing the unique visual challenges of marine environments. Optical distortions, ecological variability, and dynamic biological behaviors pose significant obstacles to conventional image processing methods, often leading to suboptimal signal interpretations that undermine environmental monitoring and assessments of anthropogenic impacts. Traditional methodologies, primarily adapted from terrestrial computer vision, fail to adequately account for spectral attenuation, scattering effects, and the ecological semantics that are intrinsic to underwater scenes, thereby limiting their effectiveness in tasks such as marine species tracking, seafloor habitat mapping, and anomaly detection.

Methods: To overcome these limitations, we introduce a machine learning-based framework that integrates physics-aware visual modeling with ecological adaptivity. This framework comprises the bio-optical attenuation neural extractor (BOANE) and the context-aware marine signal enhancement (CAMSE) modules. The BOANE employs spectral-adaptive convolutional units and depth-aware feature modulation to correct radiance distortions and encode biologically relevant visual information. CAMSE enhances this by dynamically adjusting parameters based on real-time ecological priors and optical conditions, incorporating flow-stabilized feature alignment, confidence-aware semantic filtering, and biologically informed regularization.

Results and Discussion: The experimental results demonstrate substantial improvements in signal clarity, temporal consistency, and ecological interpretability on challenging underwater datasets, establishing a robust approach for data-driven underwater visual signal processing. By embedding optical physics and ecological semantics into the computational pipeline, this framework sets a new standard for adaptive, semantically aware analysis of marine imagery, thus enabling high-fidelity monitoring of marine ecosystems in complex and variable underwater environments.

1 Introduction

Marine ecosystems and the assessment of anthropogenic impacts are crucial for maintaining environmental sustainability, preserving biodiversity, and guiding policy decisions. Underwater visual signal processing plays a fundamental role in this domain by enabling the identification, classification, and tracking of marine species, pollutants, and human activities. However, challenges such as light absorption, scattering, color distortion, and limited visibility severely degrade the image quality, making accurate analysis difficult (Fu et al., 2024). These factors not only reduce detection and classification accuracy but also hinder long-term ecological studies and real-time monitoring (Luxem et al., 2022). Furthermore, the need for scalable, automated systems that can operate effectively across diverse underwater environments is growing due to the increasing volume of visual data generated by underwater sensors and autonomous vehicles (Hendricks et al., 2020). A robust, adaptive visual processing framework is therefore essential—not only to enhance the fidelity of visual information but also to enable comprehensive environmental assessments that can inform conservation strategies and impact mitigation (Johnston et al., 2018).

Initial efforts to improve underwater imaging focused on manually designed algorithms tailored to specific challenges such as color correction, edge detection, and object recognition. Techniques such as histogram equalization, retinex-based methods, and light attenuation modeling were commonly employed to enhance the image quality (Wang et al., 2023). These approaches offered the advantage of interpretability, as their decision-making processes could be traced and refined by domain experts. However, their performance was often limited by their inability to adapt to the diverse and dynamic conditions of underwater environments, such as varying turbidity and lighting (Gronwald et al., 2021). Additionally, the reliance on handcrafted feature extraction restricted their scalability when applied to large and heterogeneous datasets, highlighting the need for more flexible and adaptive solutions (Tang et al., 2019).

To address these limitations, researchers began leveraging statistical models capable of learning patterns from structured data. Algorithms such as support vector machines, decision trees, and k-nearest neighbors were employed for tasks such as marine species classification and anomaly detection (Quellec et al., 2017). These methods introduced a degree of adaptability by utilizing features that captured the texture, shape, and color patterns specific to underwater imagery. Ensemble techniques and semi-supervised learning further enhanced their utility, particularly in scenarios with limited labeled data (Waldén et al., 2015). Despite these advancements, the effectiveness of these models was constrained by their dependence on manual feature engineering and their sensitivity to noise and distortions that are inherent in underwater images (Ma et al., 2024). As underwater visual scenes grew increasingly complex, the demand for more autonomous and expressive feature extraction methods became evident (Liu et al., 2023).

In recent years, deep learning has emerged as a transformative approach for underwater visual signal processing, offering end-to-end learning capabilities that eliminate the need for manual feature extraction. Convolutional neural networks (CNNs), autoencoders, and generative adversarial networks (GANs) have been successfully applied to tasks such as image enhancement, object detection, and semantic segmentation (Montgomery et al., 2016). Pretrained models such as ResNet, VGG, and U-Net have further improved performance by transferring knowledge from large-scale datasets, enabling applications such as coral reef mapping and fish species classification (Zheng and Zhang, 2022). Although these models have demonstrated remarkable success, they often require substantial computational resources and large labeled datasets for fine-tuning (Wan et al., 2021). Moreover, challenges related to domain adaptation and explainability persist, underscoring the need for solutions that balance performance, generalizability, and interpretability in critical environmental monitoring applications (Liu et al., 2020).

Motivated by the rigidity of symbolic AI, the reliance of classical machine learning on manual features, and the data and computational demands of deep learning, we propose a unified framework that integrates advanced machine learning paradigms with underwater-specific adaptations to improve visual signal processing. This approach aims to combine the robustness of deep learning with the interpretability and adaptability of hybrid models to address the unique challenges of underwater environments. It not only enhances the image quality and feature extraction under varying light and turbidity conditions but also improves the classification and anomaly detection accuracy in multi-domain marine datasets. By incorporating domain adaptation techniques, multimodal learning, and efficient network architectures, the framework can operate effectively across diverse marine settings. Furthermore, it allows for scalable deployment on edge devices used in autonomous underwater vehicles. The ultimate goal is to bridge the gap between high-performance visual analysis and practical applicability in ecological monitoring and anthropogenic impact assessment.

The main contributions of this work are summarized as follows:

  • The framework introduces a hybrid architecture that fuses data-driven learning with domain-specific signal priors to enhance underwater image quality and interpretation accuracy.
  • It is designed for high adaptability, supporting multiple underwater scenarios with varying visibility and lighting, and it is optimized for real-time processing on resource-constrained devices.
  • Experimental results show significant improvements in image clarity, object detection rates, and classification performance across heterogeneous marine datasets.

2 Related work

2.1 Data-driven enhancement techniques

Data-driven enhancement techniques have become pivotal in underwater visual signal processing, particularly for marine ecosystem monitoring and anthropogenic impact assessment. Early methodologies relied on traditional image processing techniques, including histogram equalization and adaptive filtering, to address challenges such as light absorption and scattering (Waldén et al., 2015). These approaches were later augmented by learning-based models that leverage large datasets of underwater imagery to infer corrective transformations (Ma et al., 2024). Deep learning architectures, such as convolutional neural networks and autoencoders, have demonstrated efficacy in restoring images affected by turbidity and wavelength-dependent attenuation (Liu et al., 2023). Generative adversarial networks have further advanced underwater dehazing by incorporating physics-based priors into loss functions, ensuring structural integrity of benthic features while suppressing noise (Zheng and Zhang, 2022). Large-scale datasets such as UIEB and RUIE, collected across diverse marine habitats, have provided benchmarks for evaluating the enhancement models (Wan et al., 2021). Synthetic augmentation techniques, including random attenuation coefficients and scattering kernel simulations, have been employed to improve model resilience under varying conditions (Liu et al., 2020). Deep residual networks equipped with attention mechanisms have shown promise in focusing on ecologically significant regions, such as coral reef structures and fish schools (Shen et al., 2019). Multiscale architectures that combine low-level corrective filters with high-level semantic encoders have enabled the reconstruction of enhanced imagery that is both physically plausible and semantically consistent (Serner et al., 2018). Unsupervised and self-supervised learning paradigms have emerged as alternatives when paired ground truth data are unavailable, utilizing noisy pseudo-labels or cycle-consistency losses to guide enhancement objectives (Shih, 2017). Domain adaptation techniques have further minimized domain shifts, enabling the application of trained models across diverse oceanic regions (Stuelcken et al., 2016). Empirical studies have evaluated enhancement effectiveness using perceptual quality indices and task-specific metrics, such as detection accuracy, to ensure ecological relevance (Gardner et al., 2015). These advancements collectively underscore the importance of data-driven enhancement in underwater visual signal processing. Recent surveys provide comprehensive taxonomies of underwater color correction and enhancement methods, which contextualize the evolution of physics-based and data-driven strategies (Lai et al., 2025). Hybrid fusion approaches tailored to real-world degradation scenarios, such as MHF-UIE, offer complementary design insights to our enhancement modules (Xu et al., 2025).

2.2 Semantic interpretation models

Semantic interpretation models have played a critical role in advancing underwater imagery analysis for marine ecosystem monitoring. These models address challenges such as low contrast, scattering, and variability in object appearance by adapting convolutional neural networks pretrained on terrestrial datasets (He et al., 2024). Transfer learning pipelines have been utilized to accelerate convergence and improve feature abstraction despite domain discrepancies (Tyagi and Yadav, 2022). Domain adaptation techniques, including adversarial alignment, have been utilized to narrow the gap between terrestrial and underwater domains (Kitaguchi et al., 2021). Encoder–decoder architectures, such as U-Net and DeepLab, have been modified to incorporate depth cues, improving object boundary delineation in underwater environments (Tang et al., 2020). Labeled datasets, including the Eilat Reef Underwater Dataset and TrashDB, have supported supervised training for tasks such as coral type classification and debris detection (Goetze et al., 2019). Temporal consistency in video streams has been enhanced through recurrent modules and optical-flow-guided aggregation, reducing flicker and improving detection stability (Feng et al., 2018). Tracking frameworks, such as SORT and DeepSORT, have been adapted for underwater behavior analysis, enabling population estimation and anthropogenic event detection (Takahashi et al., 2017). Graph neural networks have been utilized to encode relationships among species and habitats, informing scene interpretation and enabling hierarchical taxonomy integration (Pipkin et al., 2016). Semantic outputs have been combined with environmental variables, such as salinity and turbidity, to model species distribution and assess the human impact (Ramanathan et al., 2015). Weakly supervised approaches utilizing image-level tags and bounding boxes have reduced annotation burdens while maintaining robust generalization (Zhu et al., 2022). These developments in semantic interpretation have significantly enhanced the fidelity and utility of underwater visual monitoring frameworks.

2.3 Integrated monitoring frameworks

Integrated monitoring frameworks represent a holistic approach of combining enhancement, semantic interpretation, and anthropogenic impact analysis for marine ecosystem applications. These frameworks often cascade enhancement modules with interpretation networks, enabling refined semantic outputs for tasks such as species abundance estimation and habitat change detection (Waldén et al., 2015). Multitask learning architectures integrate loss functions from enhancement and interpretation, fostering synergies that improve both restorative clarity and interpretative accuracy (Ma et al., 2024). Sensor fusion techniques incorporating optical imagery with sonar or hyperspectral data have provided complementary spectral detail for robust assessments of benthic morphology and biomass distribution (Liu et al., 2023). Context-aware models conditioned on environmental covariates, such as depth and salinity, have demonstrated adaptive behavior in underwater scenarios (Zheng and Zhang, 2022). Dynamic scheduling frameworks onboard mobile platforms have optimized real-time evaluation, triggering enhancement activation based on image quality thresholds (Wan et al., 2021). Anthropogenic impact analysis modules have quantified metrics such as debris accumulation rates and bleaching severity scores, which are contextualized with environmental trends (Liu et al., 2020). Longitudinal deployments have generated time-series data for predictive models, informing management interventions such as marine protected area zoning (Shen et al., 2019). Cloud-edge collaborative architectures have balanced computational loads, performing enhancement and interpretation locally while aggregating results in cloud servers (Serner et al., 2018). Citizen science data streams, which interoperate via standardized APIs, have further enriched monitoring networks (Shih, 2017). Reproducible pipelines and open-source codebases have facilitated collaborative extension and standardization across marine science communities (Stuelcken et al., 2016). These integrated frameworks, combining cascade architectures, multimodal fusion, adaptive pipelines, and collaborative infrastructures, have enabled scalable and impactful marine ecosystem monitoring systems (Gardner et al., 2015).

3 Methods

3.1 Overview

The underwater visual domain poses significant challenges for computational perception due to the unique optical properties of the aquatic medium, the biological variability of marine ecosystems, and the dynamic nature of underwater environments. In this paper, we present a novel framework for underwater visual signal processing that directly tackles these challenges by modeling spectral attenuation, spatial variability, and task-specific semantics that are specific to underwater environments.

The proposed framework is systematically developed in the subsequent sections. Section 3.2 formalizes the problem space by defining the geometric and radiometric properties of underwater imagery. This includes the mathematical representation of key visual signal modalities, such as color attenuation profiles, ambient scattering patterns, and biological motion fields, which collectively characterize the input distribution. Section 3.3 presents the bio-optical attenuation neural extractor (BOANE), a model architecture tailored to the underwater domain. The BOANE employs physically informed convolutional operators parameterized by locally inferred water column properties, integrates latent representations of bio-optical variability, and incorporates structured priors derived from marine ecological principles. A distinguishing feature of the BOANE is its spectral-adaptive convolution modules, which dynamically recalibrate feature extraction pipelines based on ambient spectral distortion cues. Section 3.4 introduces the context-aware marine signal enhancement (CAMSE) strategy, an inference-time optimization mechanism that leverages real-time visual priors, including seafloor topology, plankton density, and turbidity indices, to guide the model’s attention and adapt its feature calibration layers. CAMSE enables the model to dynamically adjust its parameters in situ, facilitating robust interpretation of biologically relevant cues, such as bioluminescence patterns and coral fluorescence signatures, under varying illumination conditions.

This modular pipeline encompassing the formal problem definition, model architecture, and adaptive strategy establishes a comprehensive methodology for underwater visual signal extraction. By embedding the principles of underwater light propagation, marine bio-optics, and ecological semantics directly into the computational framework, the proposed approach transcends the limitations of terrestrial vision models. The remainder of this paper elaborates on each component in detail. Section 3.2 provides the mathematical formulation of underwater image formation and ecological signal representation. Section 3.3 details the BOANE architecture, which learns latent representations of spectral attenuation and habitat-specific visual patterns. Section 3.4 describes the CAMSE strategy, which operationalizes ecological adaptivity for downstream tasks, including species tracking, habitat mapping, and behavioral analysis of marine organisms. In this manuscript, several technical terms and acronyms are introduced to describe the proposed framework and its components. To improve clarity and ensure consistency, we define and standardize the use of these terms in Table 1. It provides a glossary of the key terms and their corresponding acronyms.


Table 1. Glossary of technical terms and acronyms.

3.2 Preliminaries

Underwater visual signal processing involves addressing the challenges posed by the interaction of light with water and biological substrates. These interactions result in significant spectral distortion, scattering, and attenuation, which complicate the interpretation of underwater images. This section introduces the mathematical framework for underwater image formation, geometric representation, motion cues, and semantic modeling, which collectively define the problem space.

The underwater image formation model, geometric transformations, and semantic priors are formalized to provide a foundation for subsequent methodological developments.

3.2.1 Underwater image formation model

Let $I:\Omega \to \mathbb{R}^3$ represent an RGB underwater image defined over the pixel domain $\Omega \subset \mathbb{R}^2$. The observed color $I(x)$ at pixel $x \in \Omega$ is modeled as follows (Equation 1):

$$I(x) = J(x)\,t(x) + B(x)\bigl(1 - t(x)\bigr), \tag{1}$$

where $J(x)$ denotes the scene radiance (true color), $B(x)$ represents the backscattered light, and $t(x)$ is the transmission map. The transmission map is defined as follows (Equation 2):

$$t(x) = \exp\bigl(-\beta_d\, d(x)\bigr), \tag{2}$$

where $\beta_d$ is the attenuation coefficient, which depends on the wavelength of light, and $d(x)$ is the depth at pixel $x$. This model captures the effects of light attenuation and scattering in underwater environments.

3.2.2 Spectral decomposition

The attenuation coefficient $\beta_d$ varies with wavelength $\lambda$, leading to a wavelength-specific transmission map, which is defined as follows (Equation 3):

$$t(x,\lambda) = \exp\bigl(-\beta(\lambda)\, d(x)\bigr). \tag{3}$$

Substituting this into the image formation model yields the following (Equation 4):

$$I_\lambda(x) = J_\lambda(x)\exp\bigl(-\beta(\lambda)\, d(x)\bigr) + B_\lambda(x)\Bigl(1 - \exp\bigl(-\beta(\lambda)\, d(x)\bigr)\Bigr). \tag{4}$$

This formulation enables the disentanglement of spectral absorption and scattering, which are critical for accurately modeling underwater visual signals.
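For readers who wish to prototype this forward model, the following NumPy sketch renders the wavelength-dependent formation process of Equations 1–4; the function name and the attenuation and backscatter values are illustrative assumptions, not parameters taken from the datasets used in this study.

```python
import numpy as np

def simulate_underwater_image(J, B, beta, depth):
    """Minimal forward model of Equations 1-4 (illustrative only).

    J     : (H, W, 3) scene radiance (true colour).
    B     : (3,) or (H, W, 3) backscattered (veiling) light per channel.
    beta  : (3,) per-channel attenuation coefficients beta(lambda).
    depth : (H, W) per-pixel range d(x) in metres.
    """
    # Wavelength-specific transmission t(x, lambda) = exp(-beta(lambda) * d(x))
    t = np.exp(-depth[..., None] * np.asarray(beta)[None, None, :])
    # Observed image: attenuated radiance plus backscatter (Equation 4)
    return J * t + np.asarray(B) * (1.0 - t)

# Example: red light attenuates faster than blue, producing the familiar colour cast.
H, W = 64, 64
J = np.random.rand(H, W, 3)
depth = np.full((H, W), 5.0)                 # 5 m water column
beta = np.array([0.60, 0.20, 0.10])          # R, G, B attenuation (illustrative values)
B = np.array([0.05, 0.25, 0.35])             # bluish veiling light
I = simulate_underwater_image(J, B, beta, depth)
```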

3.2.3 Geometric representation

The underwater scene is represented as a continuous surface $S \subset \mathbb{R}^3$, with semantic labels $L:S \to \{1,\dots,C\}$ assigned to $C$ ecological classes. The camera pose $T \in SE(3)$ defines the transformation from the scene to the image plane. The projection of a 3D point $X \in S$ onto the image plane is given by Equation 5:

$$\pi(X) = K\,[R \mid t]\,X, \tag{5}$$

where $K$ is the intrinsic camera matrix and $R$ and $t$ represent the rotation and translation of the camera, respectively. This geometric representation links the 3D structure of the scene to its 2D image projection.
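The projection of Equation 5 can be written compactly as a few lines of linear algebra. The sketch below is a generic pinhole projection with an arbitrary intrinsic matrix; the numerical values are placeholders rather than calibration parameters from any camera used in this work.

```python
import numpy as np

def project_point(X, K, R, t):
    """Pinhole projection pi(X) = K [R | t] X of Equation 5 (illustrative).

    X : (3,) 3D point on the scene surface S.
    K : (3, 3) intrinsic matrix, R : (3, 3) rotation, t : (3,) translation.
    Returns the 2D pixel location after perspective division.
    """
    X_cam = R @ X + t              # world -> camera frame
    x_hom = K @ X_cam              # camera frame -> homogeneous image coordinates
    return x_hom[:2] / x_hom[2]    # perspective division

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
print(project_point(np.array([0.2, -0.1, 3.0]), K, np.eye(3), np.zeros(3)))
```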

3.2.4 Optical flow and motion cues

Biological dynamics in underwater environments are often captured through motion signals. Let $I_t$ and $I_{t+\Delta t}$ denote two temporally adjacent frames. The optical flow field $u(x)$ satisfies the brightness constancy constraint (Equation 6):

$$I_t(x) = I_{t+\Delta t}\bigl(x + u(x)\bigr) + \eta(x), \tag{6}$$

where $\eta(x)$ accounts for the noise and illumination inconsistencies. The divergence of the optical flow field, defined as (Equation 7)

$$\nabla \cdot u(x) = \frac{\partial u_x}{\partial x} + \frac{\partial u_y}{\partial y}, \tag{7}$$

can be used to localize biological motion, such as the movement of marine organisms.
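A minimal implementation of the divergence operator in Equation 7 is given below; the threshold used to flag moving-organism candidates is an illustrative value, not one prescribed by the framework.

```python
import numpy as np

def flow_divergence(u):
    """Divergence of a dense flow field (Equation 7), used to flag biological motion.

    u : (H, W, 2) optical flow with u[..., 0] = u_x and u[..., 1] = u_y.
    """
    dux_dx = np.gradient(u[..., 0], axis=1)   # d u_x / d x (columns)
    duy_dy = np.gradient(u[..., 1], axis=0)   # d u_y / d y (rows)
    return dux_dx + duy_dy

# Regions where |div u| exceeds a threshold can be proposed as moving-organism candidates.
u = np.random.randn(120, 160, 2) * 0.1
motion_mask = np.abs(flow_divergence(u)) > 0.15
```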

3.2.5 Visual signal priors

To model the statistical properties of underwater ecosystems, a prior distribution over visual patterns is introduced (Equation 8):

$$P(I \mid S, L, d) = \prod_{x \in \Omega} P\bigl(I(x) \mid L(x), d(x)\bigr), \tag{8}$$

and conditional independence of pixel intensities is assumed given the depth and semantic labels. This probabilistic framework facilitates the integration of depth and semantic information into the modeling process.

3.2.6 Semantic embedding

A feature extractor $\Phi:\Omega \to \mathbb{R}^D$ maps each pixel to a $D$-dimensional latent space. The posterior probability of a semantic class $c$ at pixel $x$ is computed as follows (Equation 9):

$$P(c \mid x) = \frac{\exp\bigl(w_c^{\top}\Phi(x)\bigr)}{\sum_{c'} \exp\bigl(w_{c'}^{\top}\Phi(x)\bigr)}, \tag{9}$$

where $w_c$ is the weight vector associated with class $c$. This embedding enables the classification of pixels into ecological categories.
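The per-pixel classifier of Equation 9 reduces to a softmax over linear scores, as the following sketch shows; the feature dimension and the five-class setup are arbitrary examples rather than the taxonomy used in our experiments.

```python
import numpy as np

def class_posterior(phi, W):
    """Per-pixel class posterior of Equation 9 (illustrative sketch).

    phi : (H, W, D) pixel-wise feature embeddings Phi(x).
    W   : (C, D) class weight vectors w_c.
    Returns (H, W, C) softmax probabilities P(c | x).
    """
    logits = phi @ W.T                                   # w_c^T Phi(x) for every class
    logits -= logits.max(axis=-1, keepdims=True)         # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

phi = np.random.randn(32, 32, 16)
W = np.random.randn(5, 16)                               # 5 hypothetical ecological classes
P = class_posterior(phi, W)                              # shape (32, 32, 5)
```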

3.2.7 Depth-aware calibration

To account for spectral distortions caused by depth, a calibration function γ(x) is introduced (Equation 10):

$$\gamma(x) = \psi\bigl(I(x), d(x)\bigr), \tag{10}$$

where ψ is a neural module that adjusts feature sensitivity based on the local depth and color context. This calibration improves the robustness of feature extraction in underwater environments.

3.2.8 Ecological interaction fields

Interactions between co-occurring marine species or coral structures are modeled using a semantic interaction potential (Equation 11):

$$U(x, x') = A_{L(x), L(x')}\,\exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right), \tag{11}$$

where $A \in \mathbb{R}^{C \times C}$ is an affinity matrix that encodes the relationships between semantic classes, and $\sigma$ controls the spatial extent of interactions.

3.2.9 Problem objective

The objective is to estimate the set $\{J, d, L\}$ that best explains the observed image $I$ under the underwater image formation model (Equation 12):

$$\{\hat{J}, \hat{d}, \hat{L}\} = \arg\max_{J, d, L}\; P(I \mid J, d)\, P(J)\, P(d)\, P(L). \tag{12}$$

The priors P(J), P(d), and P(L) can be instantiated using learned models or ecological knowledge specific to the marine domain.

3.3 Bio-optical attenuation neural extractor (BOANE)

To address the unique complexities of underwater visual signal processing, a novel model named the bio-optical attenuation neural extractor (BOANE) is proposed. The BOANE is a physically informed, depth-aware convolutional architecture designed to learn robust feature representations from spectrally distorted marine images. Unlike traditional vision models that assume color constancy and homogeneous illumination, the BOANE integrates spectral, spatial, and ecological priors through a multi-stream computational pathway (as shown in Figure 1).


Figure 1. Updated schematic diagram of the bio-optical attenuation neural extractor (BOANE). The diagram illustrates the flow of the proposed model from input RGB images to spectral-aware convolutional units and ecological–semantic integration.

Spectral-aware convolutional framework: the BOANE introduces a spectral-aware convolutional framework to address the challenges posed by wavelength-dependent attenuation in underwater environments.

The network architecture benefits from targeted hyperparameter tuning, which has been shown to significantly influence the performance in medical image analysis domains (Li et al., 2025). As the input, the model takes an RGB image $I:\Omega \to \mathbb{R}^3$, an optional depth prior $d:\Omega \to \mathbb{R}^{+}$, and a location-based attenuation map $\beta:\Omega \to \mathbb{R}^3$. The first stage of the network estimates a corrected radiance field $\hat{J}$ from the distorted image (Equation 13):

$$\hat{J}(x) = I(x) \oslash \bigl(\exp(-\beta(x)\, d(x)) + \epsilon\bigr), \tag{13}$$

where $\oslash$ denotes element-wise division and $\epsilon$ is a small constant to avoid instability. This corrected radiance field serves as the input to the spectral compensation module.

To learn wavelength-sensitive features, the BOANE employs spectral-aware convolutional units (SACUs), which are defined as follows (Equation 14):

$$F^{(l)}(x) = \sum_{\lambda \in \{R,G,B\}} K_\lambda^{(l)} * \bigl(\hat{J}_\lambda \cdot \alpha_\lambda(x)\bigr), \tag{14}$$

where $\alpha_\lambda(x) = \exp\bigl(-\beta_\lambda(x)\, d(x)\bigr)$ is the attenuation weighting and $K_\lambda^{(l)}$ is the convolutional kernel at layer $l$ for channel $\lambda$. This operation generates depth-adjusted spectral feature maps that are sensitive to the attenuation properties of different wavelengths.

Each feature map is further modulated using a learned function $\Gamma^{(l)}$ conditioned on the spatial context (Equation 15):

$$\tilde{F}^{(l)}(x) = \Gamma^{(l)}\bigl(F^{(l)}(x), d(x), \beta(x)\bigr) = F^{(l)}(x) \cdot \sigma\bigl(W^{(l)}[d(x), \beta(x)] + b^{(l)}\bigr), \tag{15}$$

where $\sigma$ is a sigmoid activation and $W^{(l)}, b^{(l)}$ are trainable parameters. This modulation allows the model to adaptively regulate feature magnitudes based on environmental distortions.
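To make the interaction of Equations 13–15 concrete, the following PyTorch sketch implements one spectral-aware convolutional unit with attenuation-corrected radiance, per-channel kernels, and depth/attenuation gating. The layer widths, kernel sizes, and the 1×1 gating convolution are our own simplifying assumptions; the published BOANE configuration may differ in these details.

```python
import torch
import torch.nn as nn

class SpectralAwareConvUnit(nn.Module):
    """Sketch of one SACU stage (Equations 13-15); layer sizes are illustrative."""

    def __init__(self, out_channels=32, eps=1e-3):
        super().__init__()
        self.eps = eps
        # One kernel bank per colour channel, K_lambda^(l) in Equation 14.
        self.kernels = nn.ModuleList(nn.Conv2d(1, out_channels, 3, padding=1) for _ in range(3))
        # Gating network Gamma^(l): maps [d(x), beta(x)] (4 maps) to per-channel gains.
        self.gate = nn.Conv2d(4, out_channels, 1)

    def forward(self, image, beta, depth):
        # image: (B,3,H,W), beta: (B,3,H,W), depth: (B,1,H,W)
        alpha = torch.exp(-beta * depth)                      # attenuation weighting
        j_hat = image / (alpha + self.eps)                    # corrected radiance (Eq. 13)
        feats = sum(k(j_hat[:, c:c+1] * alpha[:, c:c+1])      # spectral-aware conv (Eq. 14)
                    for c, k in enumerate(self.kernels))
        gain = torch.sigmoid(self.gate(torch.cat([depth, beta], dim=1)))
        return feats * gain                                   # depth/attenuation modulation (Eq. 15)

sacu = SpectralAwareConvUnit()
x = torch.rand(2, 3, 64, 64)
out = sacu(x, beta=torch.rand(2, 3, 64, 64) * 0.5, depth=torch.rand(2, 1, 64, 64) * 10)
```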

Ecological and semantic integration: the ecological and semantic integration component of the BOANE incorporates biologically relevant priors to enhance feature representation. A semantic embedding branch integrates a coarse ecological structure via class-conditioned latent codes $e_c \in \mathbb{R}^D$, where $c$ denotes the marine class label. The combined ecological prior is computed as follows (Equation 16):

$$\Psi(x) = \sum_{c=1}^{C} P(c \mid x)\, e_c, \tag{16}$$

where $P(c \mid x)$ is derived from a preliminary classifier trained on annotated marine scenes. The total latent representation at location $x$ becomes (Equation 17):

$$\Phi(x) = \mathrm{Concat}\bigl(\tilde{F}^{(l)}(x), \Psi(x)\bigr). \tag{17}$$

To model a long-range semantic structure, the BOANE constructs a dynamic graph $G = (V, E)$ over $\Phi(x)$, where the nodes are superpixel centroids and the edges are built using the affinity (Equation 18):

$$A_{ij} = \exp\!\left(-\frac{\lVert \Phi(x_i) - \Phi(x_j) \rVert^2}{2\tau}\right), \tag{18}$$

with $\tau$ as a temperature hyperparameter. Graph convolution is applied as follows (Equation 19):

$$\Phi'(x_i) = \sum_{j \in \mathcal{N}(i)} \frac{A_{ij}}{\sum_k A_{ik}}\, W_g\, \Phi(x_j), \tag{19}$$

where $W_g$ is a learnable weight matrix. This graph-based reasoning explicitly captures the spatial and semantic relationships in the scene.
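The graph construction and message passing of Equations 18–19 can be sketched in a few lines, as shown below; the number of superpixels, the feature dimension, and the use of a plain weight tensor in place of a trained layer are illustrative assumptions.

```python
import torch

def graph_reasoning(phi_nodes, Wg, tau=1.0):
    """Affinity graph and one normalized graph-convolution step (Equations 18-19).

    phi_nodes : (N, D) features of N superpixel centroids.
    Wg        : (D, D) weight matrix (a plain tensor here for brevity).
    """
    # Pairwise squared distances -> affinity A_ij = exp(-||phi_i - phi_j||^2 / (2 * tau))
    d2 = torch.cdist(phi_nodes, phi_nodes).pow(2)
    A = torch.exp(-d2 / (2.0 * tau))
    # Row-normalized message passing: phi'_i = sum_j (A_ij / sum_k A_ik) Wg phi_j
    A_norm = A / A.sum(dim=1, keepdim=True)
    return A_norm @ (phi_nodes @ Wg.T)

phi_nodes = torch.randn(50, 64)          # 50 hypothetical superpixels, 64-dim embeddings
Wg = torch.randn(64, 64) * 0.1
phi_updated = graph_reasoning(phi_nodes, Wg, tau=2.0)
```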

Multi-scale attenuation decoding: the multi-scale attenuation decoding mechanism reconstructs semantically conditioned predictions at multiple resolutions. For semantic decoding, the logit score is computed as follows (Equation 20):

$$s_c(x) = w_c^{\top}\Phi(x) + b_c, \tag{20}$$

whereas for radiance reconstruction, a spectral inverse decoder is utilized (Equation 21):

$$\hat{J}_{\mathrm{final}}(x) = \sum_{l} \zeta_l\, \mathrm{Up}\bigl(\tilde{F}^{(l)}(x)\bigr), \tag{21}$$

where $\zeta_l$ are learnable fusion weights and $\mathrm{Up}(\cdot)$ denotes resolution upsampling. This multi-scale approach ensures that the model captures both fine-grained and global features.

To enforce semantic–geometric consistency across distorted views, the BOANE applies a cross-frame latent consistency constraint (Equation 22):

$$\mathcal{L}_{\mathrm{cons}} = \sum_{x} \bigl\lVert \Phi(x) - \Phi\bigl(x + u(x)\bigr) \bigr\rVert^2, \tag{22}$$

where u(x) is the estimated optical flow from adjacent frames. This constraint ensures that the learned features remain stable across varying perspectives and distortions.
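The decoding and consistency terms of Equations 21–22 are illustrated in the following sketch, which fuses multi-resolution features with fixed example weights and compares latents across flow-warped frames; the bilinear upsampling and normalized-grid warping are reasonable stand-ins rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def multiscale_fusion(feature_maps, weights, size):
    """Fuse per-layer features into a full-resolution prediction (Equation 21)."""
    return sum(w * F.interpolate(f, size=size, mode="bilinear", align_corners=False)
               for w, f in zip(weights, feature_maps))

def consistency_loss(phi_t, phi_tp1, flow):
    """Cross-frame latent consistency of Equation 22: compare Phi(x) with Phi(x + u(x)).

    phi_t, phi_tp1 : (B, C, H, W) latent maps of adjacent frames.
    flow           : (B, H, W, 2) forward flow in normalized [-1, 1] grid coordinates.
    """
    B, _, H, W = phi_t.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    base_grid = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)
    warped = F.grid_sample(phi_tp1, base_grid + flow, align_corners=False)
    return (phi_t - warped).pow(2).mean()

f1, f2 = torch.rand(1, 8, 16, 16), torch.rand(1, 8, 32, 32)
fused = multiscale_fusion([f1, f2], weights=[0.4, 0.6], size=(64, 64))
loss = consistency_loss(torch.rand(1, 8, 64, 64), torch.rand(1, 8, 64, 64),
                        torch.zeros(1, 64, 64, 2))
```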

The BOANE integrates these components into a unified architecture, leveraging spectral-aware convolution, ecological priors, and multi-scale decoding to address the challenges of underwater visual signal processing.

3.4 Context-aware marine signal enhancement (CAMSE)

To ensure robust and ecologically faithful underwater visual understanding, we propose the CAMSE strategy. CAMSE is a dynamic inference procedure layered atop the BOANE model that enhances interpretability and stability under uncertain aquatic conditions. The core idea is to adapt the inference pipeline at runtime based on contextual signals—optical properties, ecological priors, and motion fields—without retraining the backbone model. This section formalizes the components of CAMSE, presenting it as a constrained optimization framework with biological and geometric adaptation (as shown in Figure 2).


Figure 2. Updated schematic diagram of the context-aware marine signal enhancement (CAMSE). The diagram shows the flow from contextual vector generation, gating mechanism, motion-compensated alignment, and spectral refinement to ecological regularization, and the final inference objective.

Dynamic contextual integration: CAMSE introduces a mechanism to dynamically integrate contextual priors into the inference process, leveraging optical, biological, and temporal signals. Let C(x) be a contextual vector at location x composed of depth d(x), attenuation β(x), ecological latent Ψ(x), and optical flow u(x) (Equation 23):

$$C(x) = \bigl[d(x), \beta(x), \Psi(x), u(x)\bigr]. \tag{23}$$

The activation of BOANE’s inference modules is conditioned through a gating vector g(x) (Equation 24):

$$g(x) = \sigma\bigl(W_c\, C(x) + b_c\bigr), \tag{24}$$

where $W_c$ and $b_c$ are learned context modulation weights and $\sigma$ is the sigmoid function. This gating vector adjusts all downstream logits and reconstruction heads. Additionally, CAMSE applies a temporal smoothing of latent representations via motion-compensated alignment (Equation 25):

$$\tilde{\Phi}_t(x) = \alpha\, \Phi_t(x) + (1 - \alpha)\, \Phi_{t-1}\bigl(x - u_{t-1 \to t}(x)\bigr), \tag{25}$$

where $\alpha \in [0, 1]$ balances the current and past features and $u_{t-1 \to t}$ is the forward optical flow. This stabilizes predictions over time.
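A compact PyTorch sketch of the gating and motion-compensated smoothing in Equations 23–25 follows; the 1×1 convolution standing in for $W_c$ and $b_c$, the channel counts, and the fixed $\alpha$ are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Sketch of CAMSE's runtime gating and temporal smoothing (Equations 23-25)."""

    def __init__(self, context_channels, feature_channels, alpha=0.7):
        super().__init__()
        self.alpha = alpha
        self.proj = nn.Conv2d(context_channels, feature_channels, kernel_size=1)

    def forward(self, context, phi_t, phi_prev_warped):
        # g(x) = sigma(W_c C(x) + b_c): per-pixel gate from the contextual vector (Eq. 24)
        gate = torch.sigmoid(self.proj(context))
        # Motion-compensated exponential smoothing of latents (Eq. 25);
        # phi_prev_warped is Phi_{t-1} already warped by the forward flow u_{t-1 -> t}.
        phi_smooth = self.alpha * phi_t + (1.0 - self.alpha) * phi_prev_warped
        return gate * phi_smooth

# The context stacks depth, attenuation, ecological latent, and flow maps along channels.
gate_module = ContextGate(context_channels=9, feature_channels=32)
out = gate_module(torch.rand(1, 9, 64, 64), torch.rand(1, 32, 64, 64), torch.rand(1, 32, 64, 64))
```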

Ecological regularization framework: CAMSE incorporates ecological priors to ensure biologically plausible outputs.

Design principles related to deep learning robustness under noisy or distorted inputs also inform our ecological regularization and temporal filtering mechanisms (Batool et al., 2025). A structured regularization term $\mathcal{L}_{\mathrm{eco}}$ penalizes local semantic inconsistency based on learned ecological embeddings $e_c$ (Equation 26):

$$\mathcal{L}_{\mathrm{eco}} = \sum_{x}\sum_{x'} \mathbb{1}\bigl[\lVert x - x' \rVert < \delta\bigr]\, \bigl\lVert e_{L(x)} - e_{L(x')} \bigr\rVert^2, \tag{26}$$

where $\delta$ defines the neighborhood radius. CAMSE also enforces consistent ecological labeling over sequences using a temporal label consistency constraint $\mathcal{L}_{\mathrm{temp}}$ (Equation 27):

$$\mathcal{L}_{\mathrm{temp}} = \sum_{x} \mathrm{KL}\!\left(P_t(c \mid x)\,\big\|\,P_{t-1}\bigl(c \mid x - u(x)\bigr)\right), \tag{27}$$

where KL denotes the Kullback–Leibler divergence. This penalty encourages smooth temporal transitions in semantic predictions.
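The two regularizers can be prototyped as below. The ecological term approximates Equation 26 with a first-order neighbour smoothness penalty (a simplification of the full neighbourhood sum over $\lVert x - x' \rVert < \delta$), and the temporal term computes the KL divergence of Equation 27 between current and flow-warped previous posteriors.

```python
import torch
import torch.nn.functional as F

def ecological_regularizer(embeddings):
    """Approximate Equation 26: penalize disagreement between horizontally and
    vertically adjacent pixel embeddings e_{L(x)} (a simplifying assumption).

    embeddings : (B, D, H, W) map of class embeddings gathered per pixel.
    """
    dx = (embeddings[..., :, 1:] - embeddings[..., :, :-1]).pow(2).mean()
    dy = (embeddings[..., 1:, :] - embeddings[..., :-1, :]).pow(2).mean()
    return dx + dy

def temporal_label_consistency(logits_t, logits_prev_warped):
    """KL term of Equation 27, KL(P_t || P_{t-1}), between current and flow-warped
    previous class posteriors."""
    log_p_t = F.log_softmax(logits_t, dim=1)
    log_p_prev = F.log_softmax(logits_prev_warped, dim=1)
    p_t = log_p_t.exp()
    # KL(P_t || P_{t-1}) = sum_c P_t (log P_t - log P_{t-1}), averaged over pixels.
    return (p_t * (log_p_t - log_p_prev)).sum(dim=1).mean()

emb = torch.rand(1, 16, 64, 64)
l_eco = ecological_regularizer(emb)
l_temp = temporal_label_consistency(torch.randn(1, 5, 64, 64), torch.randn(1, 5, 64, 64))
```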

Spectral refinement and confidence filtering: CAMSE performs spectral residual correction by fitting a local attenuation-consistent linear model (Equation 28):

$$I(x) \approx A(x)\, \hat{J}(x) + b(x), \tag{28}$$

and it computes the residuals (Equation 29):

$$r(x) = I(x) - \bigl(A(x)\, \hat{J}(x) + b(x)\bigr), \tag{29}$$

which are used to adjust the final output (Equation 30):

$$\hat{J}_{\mathrm{CAMSE}}(x) = \hat{J}(x) + \eta\, r(x), \tag{30}$$

where η is a residual correction weight tuned via validation. CAMSE also introduces a confidence mask γ(x) to filter low-confidence locations (Equation 31):

$$\gamma(x) = \max_{c}\, \mathrm{softmax}\bigl(s_c(x)\bigr), \tag{31}$$

and it refines predictions using (Equation 32):

$$\tilde{s}_c(x) = s_c(x)\cdot \mathbb{1}\bigl[\gamma(x) > \tau\bigr], \tag{32}$$

where τ is a threshold tuned by species detection accuracy on validation sets.
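The residual correction and confidence filtering of Equations 28–32 amount to a few tensor operations, sketched below; the values of $\eta$ and $\tau$ are placeholders, whereas the paper tunes them on validation data.

```python
import torch

def spectral_residual_correction(I, J_hat, A, b, eta=0.3):
    """Residual refinement of Equations 28-30: assume a local linear model
    I(x) ~ A(x) * J_hat(x) + b(x) and feed the residual back with weight eta."""
    residual = I - (A * J_hat + b)          # Equation 29
    return J_hat + eta * residual           # Equation 30

def confidence_filter(logits, tau=0.6):
    """Confidence masking of Equations 31-32: zero out logits where the maximum
    class probability falls below the threshold tau."""
    gamma = torch.softmax(logits, dim=1).max(dim=1, keepdim=True).values   # Equation 31
    return logits * (gamma > tau).float()                                  # Equation 32

I = torch.rand(1, 3, 64, 64)
J_hat = torch.rand(1, 3, 64, 64)
J_refined = spectral_residual_correction(I, J_hat, A=torch.ones_like(I), b=torch.zeros_like(I))
filtered_logits = confidence_filter(torch.randn(1, 5, 64, 64))
```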

The final inference objective under CAMSE becomes (Equation 33):

$$\min_{\hat{J}, L}\; \mathcal{L}_{\mathrm{pred}} + \lambda_1 \mathcal{L}_{\mathrm{eco}} + \lambda_2 \mathcal{L}_{\mathrm{temp}} + \lambda_3 \mathcal{L}_{\mathrm{cons}}, \tag{33}$$

where $\mathcal{L}_{\mathrm{pred}}$ combines the standard cross-entropy and reconstruction losses and $\lambda_i$ are the regularization coefficients.

4 Experimental setup

4.1 Dataset

The underwater image quality assessment dataset (Jalil et al., 2025) consists of a wide range of underwater images captured under varying environmental and lighting conditions, designed to evaluate and benchmark image quality assessment models in challenging marine settings. It includes images affected by factors such as color distortion, low contrast, turbidity, and light absorption, which are common in real-world underwater scenarios. Each image is annotated with human perceptual scores obtained through subjective studies, enabling supervised learning for quality estimation models. The dataset covers multiple water bodies and depths, providing rich diversity in visual content, and it is widely used in tasks such as enhancement model evaluation, perceptual quality regression, and human visual system modeling for underwater conditions. The images vary significantly in clarity, texture, and illumination, which makes this dataset a reliable resource for training and evaluating both classical and deep-learning-based quality metrics for underwater imagery. This dataset serves as a foundational benchmark for studies addressing visual degradation in underwater media and plays a critical role in developing methods for autonomous underwater vehicle vision and remote-sensing applications in marine research.

Marine biodiversity visual monitoring (Yuan, 2024) provides a comprehensive visual archive of marine life collected through automated and manual monitoring devices placed across coral reefs, seabeds, and pelagic zones. The dataset aims to facilitate biodiversity indexing, species recognition, and ecological behavior modeling by offering annotated video frames and still images of marine organisms. Each sample in the dataset is tagged with taxonomy-level metadata, including the species, genus, and occasionally behavior labels such as feeding, schooling, or territorial interaction. It includes temporal sequences to support video-based recognition and is structured to reflect spatial–temporal distributions of species across multiple marine regions. Variations in lighting, turbidity, and motion blur challenge the recognition models, thereby supporting the development of robust visual systems for real-time underwater monitoring. This dataset has been instrumental in automated marine ecology, enabling systems to assist in environmental conservation, biodiversity assessment, and longitudinal marine ecosystem studies, especially under changing climate and anthropogenic pressures.

The anthropogenic impact detection dataset (Ashok and Latha, 2023) is curated to capture the visual manifestations of human-induced disturbances in oceanic environments, including marine pollution, coral bleaching, plastic accumulation, trawling scars, and other forms of degradation. The dataset includes both synthetic and real-world images captured through remotely operated vehicles, aerial drones, and underwater cameras, which are annotated with semantic labels indicating the type and severity of impact. It supports multi-label classification, segmentation, and change detection tasks. Variations in visual scenes span different geographic locations, time periods, and weather conditions to ensure robustness and generalization in model development. The annotations are derived from marine ecology experts and validated with geospatial data. This dataset plays a crucial role in environmental surveillance systems that monitor and quantify anthropogenic stressors, aiding policy-making and international marine protection strategies. Its complexity and diversity are particularly valuable for developing models capable of detecting subtle visual cues associated with ecological degradation underwater.

The oceanic signal processing benchmark dataset (Ratilal et al., 2022) is a multimodal dataset comprising visual, acoustic, and sensor-based recordings from underwater environments, aimed at supporting research in underwater signal enhancement, fusion, and interpretation. It includes synchronized visual data, sonar returns, and environmental readings such as salinity, temperature, and pressure, which are vital for context-aware processing. The visual data cover multiple underwater terrains, including rocky reefs, sandy beds, and pelagic zones under varying visibility conditions. Each scene is aligned with corresponding acoustic signatures to enable joint modeling of visual–acoustic correlations. The dataset supports tasks such as underwater object detection, signal de-noising, acoustic event recognition, and domain adaptation across sensing modalities. It is a benchmark for testing multi-sensor fusion architectures and signal restoration pipelines under oceanic constraints. The combination of environmental metadata and synchronized signals makes it particularly suitable for developing end-to-end AI systems for real-time underwater navigation, communication, and observation in scientific and industrial applications.

4.2 Experimental details

All experiments were conducted using PyTorch 2.0 with CUDA acceleration enabled on NVIDIA A100 GPUs. The model is trained using the AdamW optimizer with an initial learning rate of $1\times10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a weight decay of 0.01. A cosine annealing scheduler is applied to adjust the learning rate dynamically during training. The batch size is set to 16 per GPU, and gradient accumulation is used to simulate a total batch size of 64. All models are trained for 100 epochs with early stopping based on validation performance to avoid overfitting. For each dataset, a standard 70–15–15 split is used for training, validation, and testing, respectively. During training, we apply random horizontal flipping, color jittering, and Gaussian blur as data augmentation techniques, which are particularly effective for underwater imagery due to varying visual conditions. Input images are resized to 256×256 and normalized using dataset-specific statistics. For evaluation metrics, we adopt the peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), mean opinion score (MOS), and learned perceptual image patch similarity (LPIPS) to cover both objective and perceptual image quality. For classification tasks such as impact detection and biodiversity recognition, we use the Top-1 accuracy, F1-score, and mean average precision (mAP) as metrics.

The backbone of our model is a modified Swin Transformer that has been pretrained on ImageNet-21K and fine-tuned on the respective datasets. We incorporate a lightweight attention refinement module after each transformer block to enhance feature expressiveness under degraded visual conditions. The fusion strategy for multimodal data in the oceanic signal processing benchmark dataset uses a cross-modal transformer architecture that allows mutual attention between visual and acoustic inputs. During inference, test-time augmentation (TTA) is utilized by averaging predictions over five transformed versions of each input image. Hyperparameter tuning is performed through grid search over the learning rate, attention depth, and regularization strength. To ensure reproducibility, we fix all random seeds and document the exact versions of software dependencies. The implementation is compatible with both single-node multi-GPU training and distributed multi-node environments using PyTorch’s native DDP (distributed data parallel) interface.

Ablation studies are conducted on each dataset individually by removing the core components such as the attention module, pretraining, and the multi-scale fusion block to quantify their contribution. We also conduct cross-dataset evaluations to assess the generalization capability by training on one dataset and testing on another. For the marine biodiversity visual monitoring dataset, special care is taken to balance class distributions using weighted sampling, given the long-tail nature of marine species occurrences. In the anthropogenic impact detection dataset, synthetic data augmentation through GAN-generated samples is used to enrich low-frequency categories. All experiments are repeated three times with different seeds, and we report the average results along with the standard deviations to ensure robustness. Our complete training and evaluation pipeline is publicly available and containerized for compatibility and replicability across platforms.
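A minimal sketch of the optimization setup described above is given below; the placeholder model and the accumulation-step bookkeeping are illustrative, and the actual training loop, data loaders, and DDP launch are omitted.

```python
import random
import numpy as np
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def set_seed(seed=42):
    """Fix all random seeds, as done in the reproducibility protocol described above."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def build_optimization(model, epochs=100):
    """Optimizer and schedule matching the reported settings (lr 1e-4, betas 0.9/0.999,
    weight decay 0.01, cosine annealing). Accumulating gradients over 4 steps emulates an
    effective batch size of 64 with a per-GPU batch of 16."""
    optimizer = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler, 4  # optimizer, scheduler, gradient-accumulation steps

set_seed()
model = torch.nn.Conv2d(3, 8, 3)            # placeholder module for illustration
optimizer, scheduler, accum_steps = build_optimization(model)
```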

4.3 Comparison with SOTA methods

The proposed approach is compared with several state-of-the-art (SOTA) video understanding methods using all four benchmark datasets. As shown in Table 2, our method achieves substantial improvements over existing models in both the Underwater Image Quality Assessment Dataset and the Marine Biodiversity Visual Monitoring Dataset. On the Underwater Image Quality Assessment Dataset, our approach yields the highest accuracy (89.48%), recall (87.92%), F1-score (86.80%), and AUC (90.42%), outperforming the best baseline (VideoMAE) by 2.71% in accuracy and 2.41% in AUC. This performance gain can be attributed to our adaptive attention refinement module, which is specifically designed to handle visual distortions such as turbidity, color shift, and illumination variation that frequently occur in underwater environments. Compared to TimeSformer and ViViT, which also use transformer-based architectures, our model introduces more effective spatiotemporal representations by incorporating multi-scale feature fusion and modality-aware normalization. For the Marine Biodiversity Visual Monitoring Dataset, our model achieves 91.05% accuracy, surpassing all other methods with a notable margin of 3.16% over VideoMAE. The improvements in the F1-score and recall are similarly significant, indicating our model’s ability to handle the long-tail distribution of species and interclass visual ambiguity. This is reinforced by our dynamic sampling strategy and class-aware contrastive regularization, which together mitigate species imbalance and fine-grained feature confusion. The stronger AUC scores indicate better generalization across visually diverse marine environments. Although traditional models such as SlowFast and I3D maintain competitive performance, they are limited by rigid temporal modeling and insufficient fine-scale adaptability. Our approach, leveraging transformer-based temporal attention and context-enhanced modules, demonstrates resilience under varying underwater noise and inter-frame motion artifacts. This affirms the importance of domain-specific enhancements rather than direct application of general-purpose video models.


Table 2. Comparison of our model with SOTA on the underwater image quality assessment and marine biodiversity visual monitoring datasets.

Extending our analysis to the anthropogenic impact detection and oceanic signal processing benchmark datasets in Table 3, our method again establishes a clear performance lead. On the anthropogenic impact dataset, which contains complex scenes involving pollution, coral bleaching, and structural damage, our method attains 88.81% accuracy and an F1-score of 85.94%, surpassing the closest competitor VideoMAE by 2.87% in accuracy. These gains are largely due to our architecture’s ability to emphasize localized visual disruptions and subtle structural anomalies, which is enabled by our hierarchical attention design. The recall improvement to 86.65% reflects the model’s capacity to identify even small-scale human-induced alterations, which is a critical requirement for marine conservation monitoring. Our cross-modal noise suppression mechanism also plays a crucial role, especially for visually ambiguous instances such as biofouling or low-light plastic presence. On the oceanic signal processing dataset, which introduces acoustic–visual fusion challenges, our model reaches 90.10% accuracy, with a significant AUC of 90.57%, outperforming all baselines. This performance is driven by our cross-modal transformer module, which jointly learns representations from synchronized visual and acoustic signals. Traditional video-based models such as I3D and X3D struggle in this scenario due to their inability to align and integrate multi-modal cues effectively. In contrast, our framework dynamically adjusts inter-modality attention based on environmental metadata, such as salinity and depth, allowing better contextual awareness. Moreover, our use of curriculum learning to progressively expose the model to increasingly noisy and occluded data helps ensure robustness, especially when operating under varying underwater acoustic conditions. In both datasets, our model maintains low variance across runs, indicating stable optimization and consistent generalization, further validating the effectiveness of our training strategy and architectural design.


Table 3. Comparison of our model with SOTA methods on the anthropogenic impact detection and oceanic signal processing benchmark datasets.

The consistent improvements across datasets can be attributed to the core innovations of our method, as outlined in the accompanying method specification. Specifically, our adaptive feature extraction mechanism, modality-aware attention design, and domain-specific augmentation pipeline collectively contribute to the superior performance. Ablation results (discussed separately) confirm that removing these components leads to noticeable drops in all metrics. For instance, the absence of the multi-scale fusion block leads to degraded precision in identifying complex anthropogenic impacts, whereas discarding the acoustic–visual alignment module results in a marked decline in the oceanic dataset performance. These insights reinforce the necessity of our integrated design. Compared to prior methods, our system does not rely solely on pretraining or large-scale backbone expansion but instead emphasizes architectural efficiency and contextual adaptability. Furthermore, the improvements are not dataset-specific; cross-dataset generalization tests reveal minimal performance drops, suggesting high transferability. In summary, the results in Tables 2, 3 strongly demonstrate the effectiveness and generalizability of our method across diverse and challenging underwater video tasks, achieving a new SOTA on all the evaluated benchmarks.

As shown in Figure 3, the models are compared across several key metrics on different datasets, highlighting the effectiveness of the proposed approach.


Figure 3. Comparison of all metrics across models and datasets. The performance comparison shows the scores (accuracy, recall, F1, and AUC) for various models across different underwater datasets.

4.4 Ablation study

To evaluate the contributions of the proposed components, we conducted ablation studies by systematically removing three key innovations: the BOANE, the CAMSE, and the ecological regularization framework (ERF). The results are presented in Tables 4, 5. Removing the BOANE (w/o BOANE) leads to the most significant performance degradation across all datasets. For instance, on the underwater dataset, accuracy drops from 89.48% to 85.76%, and the F1-score decreases by 4.12%. This demonstrates the critical role of the BOANE in addressing spectral distortions and depth-dependent attenuation, which are unique to underwater environments. The absence of the BOANE also results in weaker feature representations, as evidenced by reduced recall and AUC metrics on the marine biodiversity dataset.


Table 4. Ablation study results across underwater and marine biodiversity datasets.


Table 5. Ablation study results across anthropogenic and oceanic signal processing datasets.

Disabling CAMSE (w/o CAMSE) results in a moderate but consistent decline in performance. On the anthropogenic impact dataset, accuracy decreases from 88.81% to 86.12%, and the AUC drops by 2.34%. This indicates that CAMSE effectively enhances temporal stability and contextual adaptation, particularly in dynamic underwater conditions. The oceanic signal processing dataset shows similar trends, with notable reductions in recall and the F1-score, highlighting the importance of CAMSE in integrating optical, biological, and temporal signals for robust inference.

The removal of the ERF (w/o ERF) leads to a smaller but measurable drop in performance. On the oceanic dataset, accuracy decreases from 90.10% to 88.47%, and the F1-score drops by 1.42%. This suggests that the ERF contributes to maintaining ecological consistency and semantic coherence, especially in tasks involving biodiversity monitoring. Although the performance degradation is less pronounced compared to that of the BOANE and CAMSE, the results confirm that the ERF enhances generalization by incorporating biologically plausible constraints. Collectively, the full model consistently outperforms all ablated versions, validating the complementary roles of these components in improving underwater vision tasks.

Figure 4 presents the ablation study results, which demonstrate the impact of different model components on performance across multiple underwater datasets.


Figure 4. Ablation study across datasets and metrics. The figure compares model performance with and without different components (BOANE, CAMSE, and ERF) on various underwater datasets.

To provide a deeper understanding of the ablation study results, the functional role of each module is discussed with reference to the specific challenges of underwater signal processing. The BOANE module addresses spectral distortion caused by wavelength-dependent attenuation. Its removal leads to notable decreases in accuracy and the F1-score, particularly on datasets with strong color shifts, indicating its importance in restoring true radiance and preserving structural clarity. The CAMSE module contributes to temporal consistency and semantic alignment by integrating ecological priors and motion compensation. When CAMSE is excluded, performance degradation is observed in tasks involving dynamic scenes, suggesting that its adaptive inference mechanism is essential for stabilizing predictions and filtering noise. The ERF introduces biologically grounded constraints that improve semantic coherence across spatial and temporal contexts. Without the ERF, the model produces semantically fragmented or ecologically implausible outputs, as reflected in the lowered F1-scores. These visual results in Figure 5 further confirm that each module plays a complementary role in addressing the multifaceted challenges of underwater image processing, and they collectively contribute to the high performance and interpretability of the proposed framework.


Figure 5. Qualitative comparison of ablation study results under underwater conditions. From left to right: degraded input, without BOANE, without CAMSE, without ERF, and the full model output. Removing BOANE causes spectral distortion; removing CAMSE introduces temporal instability; removing ERF reduces semantic consistency. The full model restores color fidelity, temporal stability, and ecological plausibility.

To validate the rationale behind the architectural design decisions and to enhance the robustness and interpretability of the BOANE and CAMSE modules, a series of empirical sensitivity studies were conducted. For the SACUs, single-branch, dual-branch, and multi-scale variants were evaluated. The dual-branch structure achieved the best trade-off between accuracy and computational efficiency, which is consistent with prior findings in spectral modeling literature. In the CAMSE module, gating thresholds were initialized using sigmoid activations and tuned across the range of 0.2–0.8; a value of 0.5 consistently delivered optimal filtering performance and temporal stability. For the ERF, a grid search was performed over the coefficients $\lambda_1$, $\lambda_2$, and $\lambda_3$, which control semantic consistency, temporal smoothness, and geometric alignment, respectively. The selected configuration ($\lambda_1 = 0.5$, $\lambda_2 = 0.3$, $\lambda_3 = 0.2$) demonstrated the best generalization ability without over-regularization, particularly in ecologically heterogeneous scenes. Table 6 presents comparative results under different SACU structures, and Table 7 shows the sensitivity analysis for ecological regularization coefficients. These experiments collectively confirm that the final parameter settings provide an optimal balance between task performance and computational efficiency across all evaluated benchmarks.


Table 6. Performance comparison under different SACU structures.


Table 7. Sensitivity analysis on ecological regularization coefficients.

To verify the statistical reliability of the observed performance gains, paired t-tests were performed between the proposed method and each baseline model across all datasets. These tests were conducted over five independent runs using key evaluation metrics, including PSNR, SSIM, accuracy, and F1-score. As summarized in Table 8, the improvements achieved by the proposed framework are statistically significant (p<0.01) in nearly all cases. These findings confirm that the performance gains are not attributable to random variation but are consistently reproducible under diverse underwater conditions. The statistical testing results thereby reinforce the robustness of the method and provide strong empirical support for its superiority over existing baseline models.


Table 8. Statistical significance (paired t-test) of the proposed method versus baselines. Each cell reports the p-value comparing our method to the corresponding baseline (five runs). Statistically significant results (p<0.01) are bolded.
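The paired t-test protocol can be reproduced with SciPy as sketched below; the per-run scores are hypothetical placeholders, not the values behind Table 8.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-run accuracies (five seeds) for our method and one baseline;
# the numbers are placeholders, not the values reported in Table 8.
ours = np.array([89.3, 89.6, 89.5, 89.4, 89.6])
baseline = np.array([86.7, 86.9, 86.8, 86.6, 87.0])

t_stat, p_value = ttest_rel(ours, baseline)   # paired t-test across matched runs
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant at 0.01: {p_value < 0.01}")
```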

To qualitatively assess the ecological interpretability of the proposed framework, additional visual comparisons have been included in Figure 6. These examples highlight key ecologically relevant features, including coral structures, fish schools, and anthropogenic debris. The input degraded image on the left demonstrates the visual limitations commonly encountered in underwater environments, where spectral distortion and low contrast hinder the visibility of fine structures and small-scale ecological elements. In contrast, the enhanced image produced by the proposed framework restores color fidelity and improves the spatial coherence of biologically meaningful regions. For instance, the coral texture becomes more distinct, the fish school in the upper region becomes clearly identifiable, and plastic-like debris in the background is more detectable. These improvements are crucial for practical applications such as biodiversity monitoring, habitat health assessment, and anthropogenic impact evaluation. By making these features more discernible, the model directly supports the interpretive tasks of marine biologists and ecologists who rely on visual data for quantitative analysis and decision-making. The inclusion of such qualitative examples underscores the practical utility of the proposed system in real-world underwater monitoring scenarios.

Figure 6. Qualitative ecological benefits of the proposed framework. Enhanced results highlight ecologically relevant features such as fish schools (top-left), coral structures (bottom-center), and debris (top-right), demonstrating improved interpretability for marine monitoring.

5 Discussion

Although the proposed framework achieves strong performance across multiple benchmark datasets, it is important to recognize a key limitation regarding dataset representativeness. The four datasets used in this study, while diverse in geography, depth, and ecological content, are primarily structured research collections acquired under moderately controlled or survey-grade imaging conditions. These datasets do not fully capture the complexity and variability of real-world underwater environments, particularly in operational scenarios such as deep-sea monitoring, autonomous exploration in turbid estuarine zones, or long-term deployments, where biofouling, particulate matter accumulation, or extreme lighting degradation is common. Such conditions introduce severe spectral and geometric distortions that can significantly impair model reliability and inference quality. As a result, the current experimental validation may not fully reflect the challenges encountered during the practical deployment of underwater vision systems in uncontrolled, highly degraded settings. This gap could limit the generalizability of the framework if left unaddressed.

To mitigate this limitation, future work will include the simulation of extreme underwater degradation conditions using synthetic image corruption pipelines that model turbidity, occlusions, low visibility, and fouling artifacts. This approach allows for scalable stress testing of the model under adverse conditions without requiring hard-to-acquire real-world data. In addition, testing will be extended to openly available low-quality underwater datasets collected under challenging operational constraints. The integration of domain adaptation strategies will also be pursued to minimize performance loss when transferring from controlled to real-world environments. Together, these enhancements will strengthen the framework’s ecological robustness, making it better suited for real-world applications such as habitat surveillance, pollution tracking, and biodiversity indexing in highly dynamic marine ecosystems.
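
As an indication of what such a synthetic corruption pipeline might look like, the sketch below applies a simple turbidity model to an RGB frame. The attenuation coefficients, veil color, and function name are illustrative assumptions, not a released implementation of the planned pipeline.

```python
# Minimal sketch of synthetic underwater degradation: wavelength-dependent attenuation,
# a backscatter "veil", and additive particulate noise, applied to an RGB image in [0, 1].
import numpy as np

def corrupt_underwater(img, turbidity=0.5, veil_color=(0.1, 0.35, 0.45),
                       noise_std=0.02, rng=None):
    rng = rng or np.random.default_rng(0)
    # Red attenuates fastest; blue/green persist longer with depth and turbidity.
    attenuation = np.array([0.4, 0.75, 0.9]) ** (1.0 + 2.0 * turbidity)
    direct = img * attenuation                         # attenuated direct signal
    veil = np.array(veil_color) * turbidity            # backscatter veiling light
    degraded = direct * (1.0 - turbidity) + veil       # simple image-formation mix
    degraded += rng.normal(0.0, noise_std, img.shape)  # suspended-particle noise
    return np.clip(degraded, 0.0, 1.0)

# Usage on a random stand-in frame; a real frame would be loaded from a dataset.
frame = np.random.default_rng(1).random((256, 256, 3))
stressed = corrupt_underwater(frame, turbidity=0.7)
```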

The integration of security-aware processing mechanisms represents another promising direction. Techniques such as memristive neural security (Gabr et al., 2024), chaotic encryption systems (Gabr et al., 2023), and hyperchaotic neuron-based cryptography (Alexan et al., 2023) may be explored to protect sensitive ecological data collected during underwater monitoring.

6 Conclusions and future work

In this study, we addressed the pressing challenges of underwater visual signal processing, a capability that is crucial for effective marine ecosystem monitoring and the assessment of anthropogenic impacts. Traditional image processing methods often fail in aquatic environments because of complex optical distortions and ecological variability. To overcome these limitations, we developed a machine learning-driven framework composed of two core components: the BOANE and the CAMSE module. The BOANE leverages spectral-adaptive convolutional units and depth-aware feature modulation to correct radiance distortions and highlight biologically relevant visual features. CAMSE builds upon this by incorporating real-time ecological priors and flow-stabilized alignment to enhance temporal and semantic consistency. Experimental results demonstrate substantial improvements in signal clarity, ecological interpretability, and robustness across dynamic underwater conditions, confirming the effectiveness of our approach.

Although the proposed framework achieves strong performance across diverse benchmarks, two key limitations remain. First, although the system adapts to known ecological priors, it may struggle when deployed in unfamiliar or rapidly changing marine environments where such priors are unavailable or inaccurate. Second, the dual-module architecture introduces considerable computational complexity, which may hinder real-time operation on resource-constrained platforms such as autonomous underwater vehicles (AUVs).

To address these challenges and move from a research prototype to a field-deployable system, several research directions will be pursued. One promising avenue is unsupervised domain adaptation to improve ecological generalizability: given the variability in water composition, illumination, and species distributions across geographic locations, the model must be aligned to new environments without requiring labeled data, and approaches such as adversarial adaptation, style transfer, and self-supervised learning will be explored to bridge the domain gap between training and deployment settings. In parallel, lightweight model compression strategies, including pruning, quantization-aware training, and knowledge distillation, will be used to reduce inference latency and memory usage, enabling efficient deployment on edge devices such as embedded GPU platforms. Finally, incorporating multimodal sensing (such as sonar, hyperspectral imaging, or acoustic data) can improve robustness in visually degraded or occluded underwater scenes, with cross-modal feature alignment and uncertainty-aware fusion as the key components of this effort. Together, these enhancements aim to improve the framework’s scalability, adaptability, and ecological reliability, better aligning it with the practical demands of real-time marine ecosystem monitoring and anthropogenic impact assessment in dynamic environments.
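
To illustrate one of the compression strategies mentioned above, the following is a minimal, hypothetical sketch of a knowledge-distillation objective in PyTorch; the temperature, loss weighting, and tensor shapes are illustrative assumptions and are not part of the described framework.

```python
# Minimal sketch of a knowledge-distillation objective: a compact student is trained to
# match the softened logits of the full teacher while still fitting the task labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.7):
    # Soft targets: KL divergence between temperature-scaled class distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random tensors standing in for real batches.
student_out = torch.randn(8, 5, requires_grad=True)   # 8 samples, 5 classes
teacher_out = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
loss = distillation_loss(student_out, teacher_out, targets)
loss.backward()
```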

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Author contributions

WM: Data curation, Conceptualization, Formal Analysis, Investigation, Funding acquisition, Methodology, Software, Validation, Writing – original draft, Writing – review and editing. ZF: Methodology, Supervision, Project administration, Validation, Resources, Visualization, Formal Analysis, Investigation, Data curation, Writing – original draft, Writing – review and editing. JH: Visualization, Supervision, Funding acquisition, Writing – original draft, Writing – review and editing.

Funding

The author(s) declare that financial support was not received for the research and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Alexan, W., Chen, Y.-L., Por, L. Y., and Gabr, M. (2023). Hyperchaotic maps and the single neuron model: a novel framework for chaos-based image encryption. Symmetry 15, 1081. doi:10.3390/sym15051081

Ashok, P., and Latha, B. (2023). Absorption of echo signal for underwater acoustic signal target system using hybrid of ensemble empirical mode with machine learning techniques. Multimedia Tools Appl. 82, 47291–47311. doi:10.1007/s11042-023-15543-2

Batool, S. N., Yang, J., Gilanie, G., Latif, A., Yasin, S., Ikram, A., et al. (2025). Forensic radiology: a robust approach to biological profile estimation from bone image analysis using deep learning. Biomed. Signal Process. Control 105, 107661. doi:10.1016/j.bspc.2025.107661

Ben, X., Ren, Y., Zhang, J., Wang, S.-J., Kpalma, K., Meng, W., et al. (2021). Video-based facial micro-expression analysis: a survey of datasets, features and algorithms. IEEE Trans. Pattern Analysis Mach. Intell. 44, 5826–5846. doi:10.1109/tpami.2021.3067464

Carlson, V. R., Sheehan, F., and Boden, B. (2016). Video analysis of anterior cruciate ligament (ACL) injuries: a systematic review. JBJS Rev. 4, e5. doi:10.2106/jbjs.rvw.15.00116

Cuevas, C., Quilón, D., and García, N. (2020). Techniques and applications for soccer video analysis: a survey. Multimedia Tools Appl. 79, 29685–29721. doi:10.1007/s11042-020-09409-0

den Hollander, S., Jones, B., Lambert, M., and Hendricks, S. (2018). The what and how of video analysis research in rugby union: a critical review. Sports Med. - Open 4, 27. doi:10.1186/s40798-018-0142-3

Feng, W., Ji, D., Wang, Y., Chang, S., Ren, H., and Gan, W. (2018). “Challenges on large scale surveillance video analysis,” in 2018 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW).

Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., et al. (2024). Video-mme: the first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. Computer Vision and Pattern Recognition. Available online at: https://openaccess.thecvf.com/content/CVPR2025/html/Fu_Video-MME_The_First-Ever_Comprehensive_Evaluation_Benchmark_of_Multi-modal_LLMs_in_CVPR_2025_paper.html.

Gabr, M., Korayem, Y., Chen, Y.-L., Yee, L., Ku, C. S., and Alexan, W. (2023). R 3—rescale, rotate, and randomize: a novel image cryptosystem utilizing chaotic and hyper-chaotic systems. IEEE Access 11, 119284–119312. doi:10.1109/access.2023.3326848

Gabr, M., Diab, A., Elshoush, H. T., Chen, Y.-L., Por, L. Y., Ku, C. S., et al. (2024). Data security utilizing a memristive coupled neural network in 3d models. IEEE Access 12, 116457–116477. doi:10.1109/access.2024.3447075

Gardner, A., Iverson, G., Quinn, T. N., Makdissi, M., Levi, C. R., Shultz, S., et al. (2015). A preliminary video analysis of concussion in the national rugby league. Brain Inj. 29, 1182–1185. doi:10.3109/02699052.2015.1034179

Goetze, J., Bond, T., McLean, D., Saunders, B., Langlois, T., Lindfield, S., et al. (2019). A field and video analysis guide for diver operated stereo-video. Methods Ecol. Evol. 10, 1083–1090. doi:10.1111/2041-210x.13189

Gronwald, T., Klein, C., Hoenig, T., Pietzonka, M., Bloch, H., Edouard, P., et al. (2021). Hamstring injury patterns in professional Male football (soccer): a systematic video analysis of 52 cases. Br. J. Sports Med. 56, 165–171. doi:10.1136/bjsports-2021-104769

He, B., Li, H., Jang, Y. K., Jia, M., Cao, X., Shah, A., et al. (2024). Ma-lmm: memory-Augmented large multimodal model for long-term video understanding. Computer Vision and Pattern Recognition. Available online at: https://openaccess.thecvf.com/content/CVPR2024/html/He_MA-LMM_Memory-Augmented_Large_Multimodal_Model_for_Long-Term_Video_Understanding_CVPR_2024_paper.html.

Hendricks, S., Till, K., den Hollander, S., Savage, T., Roberts, S., Tierney, G. J., et al. (2020). Consensus on a video analysis framework of descriptors and definitions by the rugby union video analysis consensus group. Br. J. Sports Med. 54, 566–572. doi:10.1136/bjsports-2019-101293

Jalil, B., Maggiani, L., and Valcarenghi, L. (2025). Convolutional neural networks and transformers-based techniques for underwater marine debris classification: a comparative study. IEEE J. Ocean. Eng. 50, 594–607. doi:10.1109/joe.2024.3436769

Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., et al. (2019). Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Med. Image Anal. Available online at: https://www.spiedigitallibrary.org/access-suspended.

Johnston, J. T., Mandelbaum, B., Schub, D., Rodeo, S., Matava, M., Silvers-Granelli, H. J., et al. (2018). Video analysis of anterior cruciate ligament tears in professional American football athletes. Am. J. Sports Med. 46, 862–868. doi:10.1177/0363546518756328

Kitaguchi, D., Takeshita, N., Matsuzaki, H., Igaki, T., Hasegawa, H., and Ito, M. (2021). Development and validation of a 3-dimensional convolutional neural network for automatic surgical skill assessment based on spatiotemporal video analysis. JAMA Netw. Open 4, e2120786. doi:10.1001/jamanetworkopen.2021.20786

Lai, Y. L., Ang, T. F., Bhatti, U. A., Ku, C. S., Han, Q., and Por, L. Y. (2025). Color correction methods for underwater image enhancement: a systematic literature review. PloS one 20, e0317306. doi:10.1371/journal.pone.0317306

Li, H., Govindarajan, V., Ang, T. F., Shaikh, Z. A., Ksibi, A., Chen, Y.-L., et al. (2025). Mspo: a machine learning hyperparameter optimization method for enhanced breast cancer image classification. Digit. Health 11, 20552076251361603. doi:10.1177/20552076251361603

Liu, W., Kang, G., Huang, P.-Y. B., Chang, X., Yu, L., Qian, Y., et al. (2020). “Argus: efficient activity detection system for extended video analysis,” in 2020 IEEE winter applications of computer vision workshops (WACVW).

Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., et al. (2023). Evalcrafter: benchmarking and evaluating large video generation models. Computer Vision and Pattern Recognition. Available online at: https://openaccess.thecvf.com/content/CVPR2024/html/Liu_EvalCrafter_Benchmarking_and_Evaluating_Large_Video_Generation_Models_CVPR_2024_paper.html.

Luxem, K., Sun, J. J., Bradley, S. P., Krishnan, K., Yttri, E. A., Zimmermann, J., et al. (2022). Open-source tools for behavioral video analysis: setup, methods, and best practices. eLife 12, e79305. doi:10.7554/elife.79305

Ma, X., Wang, Y., Jia, G., Chen, X., Liu, Z., Li, Y.-F., et al. (2024). Latte: latent diffusion transformer for video generation. Trans. Mach. Learn. Res. Available online at: https://arxiv.org/abs/2401.03048.

Montgomery, C., Blackburn, J., Withers, D., Tierney, G. J., Moran, C., and Simms, C. (2016). Mechanisms of acl injury in professional rugby union: a systematic video analysis of 36 cases. Br. J. Sports Med. 52, 994–1001. doi:10.1136/bjsports-2016-096425

Pipkin, A., Kotecki, K., Hetzel, S., and Heiderscheit, B. (2016). Reliability of a qualitative video analysis for running. J. Orthop. Sports Phys. Ther. 46, 556–561. doi:10.2519/jospt.2016.6280

Quellec, G., Cazuguel, G., Cochener, B., and Lamard, M. (2017). Multiple-instance learning for medical image and video analysis. IEEE Rev. Biomed. Eng. 10, 213–234. doi:10.1109/rbme.2017.2651164

Ramanathan, V., Tang, K., Mori, G., and Fei-Fei, L. (2015). “Learning temporal embeddings for complex video analysis,” in IEEE international conference on computer vision.

Ratilal, P., Seri, S. G., Mohebbi-Kalkhoran, H., Zhu, C., Schinault, M. E., Radermacher, M. K., et al. (2022). Continental shelf-scale passive ocean acoustic waveguide remote sensing of marine ecosystems, dynamics and directional soundscapes: sensing whales, fish, ships and other sound producers in near real-time. OCEANS 2022, Hampton Roads, 1–7. doi:10.1109/oceans47191.2022.9977054

Schurr, S. A., Marshall, A. N., Resch, J., and Saliba, S. (2017). Two-dimensional video analysis is comparable to 3d motion capture in lower extremity movement assessment. Int. J. Sports Phys. Ther. 12, 163–172. Available online at: https://www.mdpi.com/2077-0383/14/5/1687.

Serner, A., Mosler, A., Tol, J., Bahr, R., and Weir, A. (2018). Mechanisms of acute adductor longus injuries in Male football players: a systematic visual video analysis. Br. J. Sports Med. 53, 158–164. doi:10.1136/bjsports-2018-099246

Shen, H., Chen, L., Jin, Y., Zhao, L., Kong, B., Philipose, M., et al. (2019). “Nexus: a gpu cluster engine for accelerating dnn-based video analysis,” in Symposium on operating systems principles.

Shih, H. (2017). A survey of content-aware video analysis for sports. IEEE Trans. Circuits Syst. Video Technol. 28, 1212–1231. doi:10.1109/tcsvt.2017.2655624

Stuelcken, M., Mellifont, D., Gorman, A., and Sayers, M. (2016). Mechanisms of anterior cruciate ligament injuries in elite women’s netball: a systematic video analysis. J. Sports Sci. 34, 1516–1522. doi:10.1080/02640414.2015.1121285

Takahashi, N., Gygli, M., and Gool, L. V. (2017). Aenet: learning deep audio features for video analysis. IEEE Trans. Multimedia. Available online at: https://www.sciencedirect.com/science/article/pii/S1877050924008184.

Tang, Y., Ding, D., Rao, Y., Zheng, Y., Zhang, D., Zhao, L., et al. (2019). Coin: a large-scale dataset for comprehensive instructional video analysis. Comput. Vis. Pattern Recognit., 1207–1216. doi:10.1109/cvpr.2019.00130

Tang, Y., Lu, J., and Zhou, J. (2020). Comprehensive instructional video analysis: the coin dataset and performance evaluation. IEEE Trans. Pattern Analysis Mach. Intell. 43, 3138–3153. doi:10.1109/tpami.2020.2980824

Tyagi, S., and Yadav, D. (2022). A detailed analysis of image and video forgery detection techniques. Vis. Comput. 39, 813–833. doi:10.1007/s00371-021-02347-4

Waldén, M., Krosshaug, T., Bjørneboe, J., Andersen, T., Faul, O., and Hägglund, M. (2015). Three distinct mechanisms predominate in non-contact anterior cruciate ligament injuries in male professional football players: a systematic video analysis of 39 cases. Br. J. Sports Med. 49, 1452–1460. doi:10.1136/bjsports-2014-094573

Wan, S., Xu, X., Wang, T., and Gu, Z. (2021). An intelligent video analysis method for abnormal event detection in intelligent transportation systems. IEEE Trans. Intell. Transp. Syst. 22, 4487–4495. doi:10.1109/tits.2020.3017505

Wang, Z., Liu, C., Zhang, S., and Dou, Q. (2023). “Foundation model for endoscopy video analysis via large-scale self-supervised pre-train,” in International conference on medical image computing and computer-assisted intervention.

Xu, J., Kiah, M. L. M., Noor, R. M., Por, L. Y., and Wu, Y. (2025). Mhf-uie a multi-task hybrid fusion method for real-world underwater image enhancement. Sci. Rep. 15, 18131. doi:10.1038/s41598-025-02942-8

Yuan, L. (2024). Optimized 3d reconstruction in nearshore underwater environments: a cost-effective pre-processing strategy for neural radiance fields implementation. Appl. Comput. Eng. 38, 139–146. doi:10.54254/2755-2721/38/20230544

Zheng, Y., and Zhang, H. (2022). Video analysis in sports by lightweight object detection network under the background of sports industry development. Comput. Intell. Neurosci. 2022, 1–10. doi:10.1155/2022/3844770

Zhu, H., Wu, W., Zhu, W., Jiang, L., Tang, S., Zhang, L., et al. (2022). “Celebv-hq: a large-scale video facial attributes dataset,” in European conference on computer vision.

Keywords: underwater monitoring, machine learning framework, bio-optical attenuation neural extractor, context-aware marine signal enhancement, ecological adaptivity

Citation: Minglong W, Feng Z and Hu J (2026) A machine learning-driven framework for enhancing underwater visual signal processing in marine ecosystem economic monitoring and anthropogenic impact assessment. Front. Environ. Sci. 13:1689855. doi: 10.3389/fenvs.2025.1689855

Received: 21 August 2025; Accepted: 29 September 2025;
Published: 02 January 2026.

Edited by:

Huiyu Zhou, University of Leicester, United Kingdom

Reviewed by:

Por Lip Yee, University of Malaya, Malaysia
Kuo Jui HU, National Taiwan University of Science and Technology, Taiwan

Copyright © 2026 Minglong, Feng and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Wang Minglong, rpuuduei69506@outlook.com
