
TECHNOLOGY AND CODE article

Front. Mar. Sci., 12 January 2026

Sec. Ocean Observation

Volume 12 - 2025 | https://doi.org/10.3389/fmars.2025.1705996

This article is part of the Research Topic "Integrating Unmanned Platforms and Deep Learning Technologies for Enhanced Ocean Observation and Risk Mitigation in Ocean Engineering".

Feature-gradient enhanced transformer for accurate dissolved oxygen saturation forecasting in marine environments

Weiyan Tan*, Bing Geng and XiuGuang Bai
  • Research Department, Guangdong Service Center for Veterans, Guangzhou, Guangdong, China

With the increasing severity of marine environmental issues, dissolved oxygen saturation (O2%) forecasting is not only essential for water quality monitoring and ecological risk assessment, but also for deepening our understanding of how physical and biogeochemical processes jointly shape coastal oxygen dynamics. Using multi-year in situ records from five representative NOAA ocean observation stations spanning distinct hydrographic regimes, this study develops an improved time series forecasting framework based on the Transformer architecture and uses it as a lens to analyze the structure and predictability of O2% variability. A Feature Pyramid Space Transformation (FPST) is incorporated into the encoder to decompose the observed O2% time series into multiple temporal scales, enabling the identification of station-dependent contributions from long-term trends, synoptic variability, and short-term fluctuations. In the decoder, a Gradient Attention Mechanism (GAM) explicitly leverages temporal gradients to highlight sharp transitions and turning points in the observational record, thereby revealing how rapid changes and extreme episodes affect the local predictability of O2%. Experiments on the five buoy datasets show that the proposed framework achieves consistently improved forecasting performance over a range of baseline methods; for example, averaged across all stations, Mean Squared Error (MSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) are reduced while the coefficient of determination (R2) is increased, indicating a better fit between predicted and observed time series. Further ablation and input-subset experiments demonstrate that FPST and GAM provide complementary benefits and elucidate the relative importance of depth, temperature, and local oxygen concentration as drivers of O2% dynamics at different sites. Overall, the study offers both a robust forecasting tool and an observation-based characterization of the multi-scale structure and event-driven behavior of dissolved oxygen saturation in coastal ocean environments.

1 Introduction

With the intensification of global climate change and human activities, marine environmental monitoring has become increasingly important in both academic research and societal governance (Huang et al., 2023; Ma et al., 2025). Among various indicators, dissolved oxygen saturation (O2%) serves as a key metric for assessing water quality and ecosystem health. It reflects critical processes such as hypoxia, eutrophication, and acidification, and plays a crucial role in fisheries resource management and ecological risk warning. Accurately forecasting the dynamic variations of O2% based on time series data not only enhances our understanding of complex marine processes but also provides scientific support for marine environmental protection and sustainable development (Shao et al., 2023).

However, O2% time series exhibit significant non-stationarity, multi-scale characteristics, and abrupt variations, driven by multiple factors such as temperature, salinity, tides, and nutrient enrichment (Valera et al., 2020). Traditional time series approaches often struggle to simultaneously capture long-term trends and short-term abrupt changes, leading to limited predictive accuracy (Zhang et al., 2025). Although deep learning methods have made notable progress in modeling capabilities, they still face challenges in handling cross-scale dependencies and rapid fluctuations, while issues of interpretability and generalization remain to be addressed. In particular, existing studies mainly focus on improving prediction accuracy through architectural optimization or attention mechanisms, but rarely consider how to jointly model global temporal dependencies and fine-grained local dynamics within a unified framework. This gap limits the ability of current models to accurately characterize abrupt environmental transitions in O2% variations. Compared with existing models such as Informer (Zhou et al., 2021) and Autoformer (Wu et al., 2021), the proposed framework focuses more on fine-grained local dynamics and gradient-sensitive information, enabling more adaptive and precise representation of abrupt variations in marine environments. This highlights the model’s capability to bridge the gap between global trend learning and transient event detection in complex O2% sequences.

To tackle these challenges, this study introduces a Feature Pyramid Space Transformation (FPST) within a Transformer-based forecasting framework to enhance multi-scale topological modeling capacity, and further designs a Gradient Attention Mechanism (GAM) in the decoder to explicitly leverage gradient features for improved responsiveness to abrupt segments. This design aims to jointly capture the long-term trends and local dynamics of O2%, thereby achieving more robust and efficient forecasting.

The main contributions of this paper are summarized as follows:

● We propose an improved Transformer-based framework for marine environmental monitoring, introducing a Feature Pyramid Space Transformation (FPST) in the encoder to effectively capture the multi-scale characteristics of O2% time series.

● We design a Gradient Attention Mechanism (GAM) in the decoder, which explicitly models gradient information to enhance sensitivity to abrupt changes and turning points.

● We conduct extensive experiments on NOAA datasets from five representative stations, demonstrating that the proposed method consistently outperforms existing baselines in terms of MSE, MAE, MAPE, and R2.

● We perform comparative and ablation studies to validate the independent contributions of FPST and GAM in improving multi-scale modeling and abrupt change detection, showcasing the potential applicability of the proposed method for O2% forecasting and marine environmental monitoring.

2 Related work

2.1 Research on time series algorithms

In recent years, time series forecasting has gradually become an important research direction in deep learning, with many approaches improving the characterization of long-term dependencies by introducing novel modeling paradigms. Early representative work such as N-BEATS (Oreshkin et al., 2019) leveraged neural basis expansion to achieve interpretable time series decomposition and forecasting, demonstrating strong applicability in both univariate and multivariate tasks. In addition, the Temporal Fusion Transformer (TFT) (Lim et al., 2021) incorporated gating mechanisms and multimodal feature fusion to enable interpretable multi-step forecasting, laying the foundation for subsequent Transformer-based improvements.

As the application of Transformers to long sequence modeling has expanded, researchers have proposed a variety of architectural enhancements to improve efficiency and predictive performance in time series forecasting. Informer (Zhou et al., 2021) introduced a sparse attention mechanism and probabilistic sampling strategy to reduce computational complexity and strengthen long sequence modeling capabilities. Autoformer (Wu et al., 2021) employed an autocorrelation mechanism and series decomposition modules to effectively separate long-term trends from seasonal components. Fedformer (Zhou et al., 2022) further enhanced modeling performance by incorporating frequency-domain decomposition structures for long-term dependency capture.

In more recent studies, scholars have explored extensions of Transformers for time series tasks from perspectives such as two-dimensional modeling and exponential smoothing. TimesNet (Wu et al., 2022) transformed time series into two-dimensional variation representations to capture richer temporal features. Nie et al. (2022) proposed PatchTST ("A Time Series is Worth 64 Words"), which utilized an efficient patch-based Transformer architecture to improve both accuracy and stability in long-term forecasting. ETSformer (Woo et al., 2022) integrated exponential smoothing techniques with attention mechanisms to strengthen the modeling of trends and periodic signals. The continued evolution of these methods has driven the development of time series algorithms from structural innovations to cross-domain adaptations, providing valuable theoretical and practical insights for future research.

2.2 Interdisciplinary research on time series forecasting and ocean-related topics

In recent years, the prediction of dissolved oxygen saturation (O2%) has become an important focus in marine environmental research, as its accuracy is directly related to understanding the ecological metabolic balance and oxygen cycling processes in aquatic systems. Traditional methods, such as statistical or linear models including multiple regression and ARIMA, often fail to capture complex nonlinear and abrupt characteristics. With the advancement of deep learning, researchers have begun introducing neural networks and hybrid architectures to improve predictive performance. Liu et al. (2021) constructed a hybrid neural network model through multi-model integration and multi-factor analysis, achieving multidimensional modeling of marine dissolved oxygen time series. López-Andreu et al. (2023) applied a deep learning framework for joint prediction of chlorophyll-a and dissolved oxygen, verifying the effectiveness of deep models in forecasting marine biochemical parameters. Chen et al. (2024) developed a seawater dissolved oxygen prediction model based on an improved generative adversarial network (GAN), further enhancing robustness in non-stationary sequences. Meanwhile, Sundararaman and Shanmugam (2024) employed satellite remote sensing to estimate global surface ocean dissolved oxygen, providing macroscopic observational support for data-driven models.

In terms of model architecture innovation, Hao (2024) proposed a GRU–N-BEATS-based framework for dissolved oxygen prediction, demonstrating superior performance in capturing multi-scale temporal trends. Liu et al. (2024) combined a dynamic adaptive mechanism (DAM) with Bi-GRU for spatiotemporal dissolved oxygen prediction in marine ranching areas. Ma et al. (2024) constructed a CNN–GRU hybrid neural network to enhance local feature extraction capability, while Xu et al. (2025) further proposed a hybrid deep learning framework for real-time prediction and lightweight modeling of dissolved oxygen. These studies collectively indicate that deep learning-based O2% prediction has significant advantages in nonlinear modeling, spatiotemporal feature extraction, and multi-scale trend analysis. However, existing models still face challenges in cross-scale dependency and abrupt change response modeling, making it difficult to effectively capture sudden hypoxia events caused by coupled physical and biochemical processes. To address these limitations, this study introduces a Feature Pyramid Space Transformation and a Gradient Attention Mechanism to achieve high-precision modeling and optimized response to dynamic O2% variations.

3 Methodology

3.1 Overall model architecture

The proposed architecture follows the Transformer-based encoder–decoder paradigm, while introducing two innovative modules to enhance feature representation and prediction accuracy. In the encoder, the input oceanic time series is first processed through an embedding layer, and then modeled with a multi-head self-attention mechanism to capture long-range temporal dependencies. To further enrich the representation capacity, we design a Feature Pyramid Space Transformation module, which reconstructs the latent space in a multi-scale topological manner. This enables the model to simultaneously learn fine-grained variations and global trends, thereby improving its ability to capture the multi-resolution dynamics of O2% prediction.

In the decoder, the model integrates a Gradient Attention Mechanism, which explicitly leverages gradient information of the features. By extracting local variation signals and adaptively fusing them through a multilayer perceptron, the decoder can highlight feature responses in rapidly changing regions while maintaining stability in fluctuating environments. This gradient-guided attention process seamlessly complements the standard attention mechanism, making feature selection more adaptive. Overall, the synergy between encoder and decoder achieves an effective integration of multi-scale representation learning and gradient-sensitive feature optimization, providing a robust and efficient solution for O2% forecasting in marine environmental monitoring. The overall model architecture is shown in Figure 1.


Figure 1. Overall architecture of the proposed Transformer-based encoder–decoder framework for O2% forecasting. The encoder incorporates a Feature Pyramid Space Transformation for multi-scale representation learning, while the decoder employs a Gradient Attention Mechanism to enhance gradient-sensitive feature modeling and ensure stable predictions in dynamic marine environments.
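
To make the wiring above concrete, the following PyTorch sketch assembles the encoder–decoder described in this subsection. It is a minimal illustration, not the authors' released code: the FPST and GAM modules appear only as identity stubs (their bodies follow in Sections 3.2 and 3.3), and the hidden size, head count, and input dimensionality are assumptions rather than the exact configuration of Table 1.

```python
import torch
import torch.nn as nn

class FPSTStub(nn.Module):
    """Placeholder for the Feature Pyramid Space Transformation (Section 3.2)."""
    def forward(self, h):  # (B, T, d_h) -> (B, T, d_h)
        return h

class GAMStub(nn.Module):
    """Placeholder for the Gradient Attention Mechanism (Section 3.3)."""
    def forward(self, h):
        return h

class O2Forecaster(nn.Module):
    def __init__(self, d_in=4, d_h=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(d_in, d_h)            # input embedding layer
        self.self_attn = nn.MultiheadAttention(d_h, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_h)
        self.fpst = FPSTStub()                       # multi-scale reconstruction
        self.gam = GAMStub()                         # gradient-guided decoding
        self.head = nn.Linear(d_h, 1)                # projects to O2% per step

    def forward(self, x):                            # x: (B, T, d_in)
        h = self.embed(x)
        a, _ = self.self_attn(h, h, h)               # long-range dependencies
        h = self.norm(h + a)                         # add & norm
        h = self.fpst(h)                             # encoder-side FPST
        h = self.gam(h)                              # decoder-side GAM
        return self.head(h)                          # (B, T, 1)

y_hat = O2Forecaster()(torch.randn(8, 96, 4))        # 8 windows, 96 steps, 4 vars
```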

3.2 Feature pyramid space transformation

In the proposed model, the Feature Pyramid Space Transformation (FPST) is designed to enhance the encoder’s capacity to represent the dynamics of marine O2% across different temporal granularities through multi-scale structural reconstruction and topological mapping. This module first performs linear embedding on the input features, then generates multi-scale representations layer by layer, and finally employs topological transformation to achieve effective cross-scale fusion. The overall architecture of the FPST module is illustrated in Figure 2.


Figure 2. Schematic structure of the Feature Pyramid Space Transformation (FPST) module. This module constructs a pyramid representation through multi-scale downsampling, upsampling, and feature concatenation, and further incorporates topological space mapping to achieve cross-scale feature fusion, thereby generating more expressive temporal representations.

Let the original input time series be

$$X = \{x_1, x_2, \ldots, x_T\}, \quad x_t \in \mathbb{R}^{d} \tag{1}$$

where T denotes the number of time steps and d is the feature dimension. This formulation serves as the basis for subsequent multi-scale decomposition (Equation 1).

To unify the representation space, we first apply a linear embedding:

$$H^{(0)} = X W_e + b_e, \quad W_e \in \mathbb{R}^{d \times d_h} \tag{2}$$

where $d_h$ denotes the hidden dimension and $H^{(0)}$ represents the initial latent feature. This step ensures that the raw input is projected into a unified latent space (Equation 2).

Next, multi-scale downsampling is applied to construct the pyramid structure:

$$H^{(l)} = \mathrm{Down}_l\left(H^{(0)}\right), \quad l = 1, 2, \ldots, L \tag{3}$$

where $\mathrm{Down}_l(\cdot)$ denotes the downsampling operation at scale $l$, and $L$ is the total number of pyramid levels. This process produces multi-scale features ranging from fine-grained to coarse-grained representations (Equation 3).

At each scale, local patterns are further enhanced through convolutional mapping:

$$\tilde{H}^{(l)} = \sigma\left(H^{(l)} W^{(l)} + b^{(l)}\right) \tag{4}$$

where $\sigma(\cdot)$ denotes a nonlinear activation function, and $W^{(l)}$ is a scale-specific transformation matrix. This operation increases the discriminability of features at each level (Equation 4).

To achieve effective cross-scale propagation, a top-down feature transmission mechanism is employed:

$$F^{(l)} = \tilde{H}^{(l)} + \mathrm{Up}\left(F^{(l+1)}\right) \tag{5}$$

where $\mathrm{Up}(\cdot)$ represents an upsampling operation, and $F^{(l)}$ denotes the fused feature at scale $l$. This mechanism transmits high-level semantic information to lower-level fine-grained features (Equation 5).

Furthermore, to capture topological structural relationships, a graph-based transformation is introduced:

$$Z^{(l)} = \mathrm{GraphConv}\left(F^{(l)}, A^{(l)}\right) \tag{6}$$

where $A^{(l)}$ is the adjacency matrix constructed at scale $l$, used to model the topological dependencies among features. This step enables FPST to characterize non-Euclidean structural properties (Equation 6).

After topological mapping at all scales, multi-scale fusion is performed:

$$Z = \sum_{l=1}^{L} \alpha_l Z^{(l)}, \quad \sum_{l=1}^{L} \alpha_l = 1 \tag{7}$$

where $\alpha_l$ denotes adaptive fusion weights that are dynamically learned during training. This operation aggregates multi-scale topological features into a unified representation (Equation 7).

Finally, the pyramid feature representation is expressed as

$$H_{\mathrm{FPST}} = \phi(Z) \tag{8}$$

where $\phi(\cdot)$ represents normalization and residual mapping, ensuring both numerical stability and representational expressiveness (Equation 8).

In summary, FPST adopts a “multi-scale construction–topological mapping–cross-level fusion” strategy, which preserves local fine-grained variations while simultaneously enhancing global temporal dependency modeling. This significantly improves the predictive performance of marine O2% dynamics.
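
The following PyTorch sketch traces Equations 2–8 step by step. It is a minimal illustration rather than the authors' implementation: the average-pooling downsampler, the row-normalized chain-graph adjacency standing in for the unspecified $A^{(l)}$, and the number of pyramid levels are assumptions made for concreteness.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPST(nn.Module):
    def __init__(self, d_h=64, levels=3):
        super().__init__()
        self.levels = levels
        # Eq. (4): scale-specific nonlinear mappings (1-D convolutions here).
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_h, d_h, kernel_size=3, padding=1) for _ in range(levels + 1)])
        # Eq. (6): per-scale GraphConv weight matrices.
        self.graph_w = nn.ModuleList([nn.Linear(d_h, d_h) for _ in range(levels + 1)])
        self.alpha = nn.Parameter(torch.zeros(levels + 1))  # Eq. (7): fusion logits
        self.norm = nn.LayerNorm(d_h)

    def forward(self, h0):                  # h0: (B, T, d_h), the Eq. (2) embedding
        # Eq. (3): build the pyramid by repeated average-pooling along time.
        pyr = [h0]
        for _ in range(self.levels):
            pyr.append(F.avg_pool1d(pyr[-1].transpose(1, 2), 2).transpose(1, 2))
        # Eq. (4): enhance local patterns at every scale.
        pyr = [torch.relu(conv(h.transpose(1, 2)).transpose(1, 2))
               for conv, h in zip(self.convs, pyr)]
        # Eq. (5): top-down transmission, upsampling coarse features to finer scales.
        fused = [pyr[-1]]
        for h in reversed(pyr[:-1]):
            up = F.interpolate(fused[0].transpose(1, 2), size=h.shape[1],
                               mode="linear", align_corners=False).transpose(1, 2)
            fused.insert(0, h + up)
        # Eq. (6): graph convolution; a chain adjacency over time steps is an
        # illustrative stand-in for the scale-specific A^(l).
        outs = []
        for z, w in zip(fused, self.graph_w):
            t = z.shape[1]
            adj = torch.eye(t, device=z.device)
            idx = torch.arange(t - 1)
            adj[idx, idx + 1] = 1.0
            adj[idx + 1, idx] = 1.0
            adj = adj / adj.sum(-1, keepdim=True)        # row-normalize
            outs.append(torch.relu(w(adj @ z)))
        # Eq. (7): adaptive fusion after restoring every scale to length T.
        t_full = h0.shape[1]
        outs = [F.interpolate(o.transpose(1, 2), size=t_full, mode="linear",
                              align_corners=False).transpose(1, 2) for o in outs]
        weights = torch.softmax(self.alpha, dim=0)       # weights sum to 1
        z = sum(a * o for a, o in zip(weights, outs))
        return self.norm(z + h0)            # Eq. (8): normalization + residual
```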

3.3 Gradient Attention Mechanism

In the decoder, we propose the Gradient Attention Mechanism (GAM), whose core idea is to explicitly incorporate feature gradient information to enhance the adaptability of attention distribution. This allows the model to better capture abrupt changes and turning points in O2% sequence prediction. The mechanism consists of three steps: gradient extraction, gradient-weighted attention computation, and adaptive fusion. The overall module architecture is illustrated in Figure 3.


Figure 3. Schematic structure of the Gradient Attention Mechanism (GAM). This module explicitly computes feature gradients and integrates them with multilayer perceptron and convolutional mappings to realize gradient-guided adaptive feature weighting, thereby enhancing the capability of modeling abrupt dynamics.

First, the temporal gradients of the encoder’s hidden representations $H \in \mathbb{R}^{T \times d_h}$ are computed as

$$\nabla H_t = H_t - H_{t-1}, \quad t = 2, 3, \ldots, T \tag{9}$$

which explicitly extracts the rate of change between adjacent time steps, thereby highlighting the sensitive regions of O2% dynamics (Equation 9).

To mitigate noise amplification, the gradients are normalized:

$$\tilde{G}_t = \frac{\nabla H_t}{\|\nabla H_t\|_2 + \epsilon} \tag{10}$$

where $\epsilon = 10^{-8}$ is a small constant to prevent division by zero. This ensures stable gradient magnitudes and improves convergence during training (Equation 10).

Next, gradient-guided query, key, and value mappings are introduced:

$$Q = H W_Q + \tilde{G} U_Q, \quad K = H W_K + \tilde{G} U_K, \quad V = H W_V \tag{11}$$

where $W_Q, W_K, W_V, U_Q, U_K$ are learnable parameters. These formulations extend the standard attention by incorporating gradient-sensitive correction terms (Equation 11).

The gradient attention weights are then computed as

$$A = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_h}} + \beta \, \tilde{G} W_G\right) \tag{12}$$

where $\beta$ is a tunable hyperparameter and $W_G$ is a gradient projection matrix. This formulation directly injects gradient information into the attention weight distribution (Equation 12).

The updated feature representation is obtained through weighted attention:

$$H' = A V \tag{13}$$

which emphasizes the responses of rapidly changing regions on top of the original value vectors $V$ (Equation 13).

To further achieve adaptive feature fusion, a multilayer perceptron (MLP) is employed:

$$F = \mathrm{MLP}\left([H, H']\right) \tag{14}$$

where $[\cdot, \cdot]$ denotes concatenation and $F$ is the fused feature representation. This step unifies original features and gradient-enhanced features into a shared representation space (Equation 14).

In the output stage, residual connection and normalization are applied:

$$O = \mathrm{LayerNorm}(F + H) \tag{15}$$

ensuring numerical stability and facilitating deeper network training (Equation 15).

Finally, the fused representation is projected into the prediction space:

$$\hat{Y} = O W_o + b_o \tag{16}$$

where $W_o$ and $b_o$ are output layer parameters, and $\hat{Y}$ denotes the final predicted O2% sequence. This completes the mapping from gradient-augmented features to target sequence forecasting (Equation 16).

In summary, the Gradient Attention Mechanism explicitly incorporates gradient information, enabling the model to capture both stable trends and abrupt transitions in O2% dynamics. This results in more robust and fine-grained predictive capabilities for marine environmental monitoring tasks.
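
A compact PyTorch sketch of the mechanism, following Equations 9–15, is given below. The single-head attention form and the reading of the bias term $\beta\,\tilde{G} W_G$ as a per-key scalar are simplifying assumptions; the output projection of Equation 16 is left to the model head.

```python
import torch
import torch.nn as nn

class GradientAttention(nn.Module):
    def __init__(self, d_h=64, beta=0.5, eps=1e-8):
        super().__init__()
        self.beta, self.eps = beta, eps
        self.w_q = nn.Linear(d_h, d_h)
        self.w_k = nn.Linear(d_h, d_h)
        self.w_v = nn.Linear(d_h, d_h)
        self.u_q = nn.Linear(d_h, d_h, bias=False)   # gradient correction for Q
        self.u_k = nn.Linear(d_h, d_h, bias=False)   # gradient correction for K
        self.w_g = nn.Linear(d_h, 1, bias=False)     # Eq. (12): gradient projection
        self.mlp = nn.Sequential(nn.Linear(2 * d_h, d_h), nn.GELU(),
                                 nn.Linear(d_h, d_h))  # Eq. (14): fusion MLP
        self.norm = nn.LayerNorm(d_h)

    def forward(self, h):                            # h: (B, T, d_h)
        # Eq. (9): temporal differences; padding with the first step makes the
        # gradient there zero.
        grad = torch.diff(h, dim=1, prepend=h[:, :1])
        # Eq. (10): magnitude normalization to tame noisy spikes.
        g = grad / (grad.norm(dim=-1, keepdim=True) + self.eps)
        # Eq. (11): gradient-guided query/key, plain value.
        q = self.w_q(h) + self.u_q(g)
        k = self.w_k(h) + self.u_k(g)
        v = self.w_v(h)
        # Eq. (12): scaled dot-product logits plus a per-key gradient bias.
        logits = q @ k.transpose(1, 2) / h.shape[-1] ** 0.5
        logits = logits + self.beta * self.w_g(g).transpose(1, 2)
        attn = torch.softmax(logits, dim=-1)
        h_new = attn @ v                             # Eq. (13)
        f = self.mlp(torch.cat([h, h_new], dim=-1))  # Eq. (14)
        return self.norm(f + h)                      # Eq. (15): residual + norm
```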

3.4 Training objective

During training, our goal is to make the predicted sequence as close as possible to the true O2% dynamics. Let the input time series be $X = \{x_1, x_2, \ldots, x_T\}$, the model prediction be $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T\}$, and the corresponding ground-truth observations be $Y = \{y_1, y_2, \ldots, y_T\}$. In this study, we employ Mean Squared Error (MSE) as the primary optimization objective, which is defined as

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{T} \sum_{t=1}^{T} \left(y_t - \hat{y}_t\right)^2 \tag{17}$$

This loss function effectively quantifies the deviation between predictions and ground truth, while penalizing larger errors more heavily. It guides the model to maintain high accuracy in capturing abrupt variations of O2% dynamics, in addition to learning stable trends. Combined with the proposed Feature Pyramid Space Transformation and Gradient Attention Mechanism, the MSE loss provides a unified optimization target, enabling stable convergence and high-quality forecasting under multi-scale modeling and gradient-sensitive feature fusion.
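
As a concrete illustration, a minimal training step under this objective might look as follows; the model, data loader, and the learning-rate and weight-decay values are placeholders, not the exact experimental configuration.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer):
    """One pass over the training windows under the MSE objective of Eq. (17)."""
    model.train()
    criterion = nn.MSELoss()
    total = 0.0
    for x, y in loader:              # x: (B, T, d_in), y: (B, T, 1)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        total += loss.item() * x.shape[0]
    return total / len(loader.dataset)

# Example wiring; Section 5.5 reports AdamW as the best-performing optimizer,
# but the hyperparameter values here are illustrative only:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
```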

4 Dataset and experimental settings

4.1 Dataset

The dataset used in this study originates from the National Oceanic and Atmospheric Administration (NOAA) marine observation buoy network. Five representative observation stations were selected, covering the Great Lakes, the Pacific Ocean, the Gulf of Mexico, and estuarine regions. The dataset includes multiple marine environmental parameters such as dissolved oxygen saturation (O2%), temperature, salinity, and pH, which collectively reflect typical oxygen dynamics under seasonal variations and hydrodynamic forcing across different regions. These data provide a reliable foundation for training and validating the proposed time series forecasting model. The basic information of the five stations is summarized as follows:

● Station 45161 — Muskegon Buoy, Michigan, 43.185°N, 86.352°W

● Station 51045 — Hilo Bay Buoy, Hawaii, 19.734°N, 155.082°W

● Station bbsf1 — Everglades National Park, Florida, 25.472°N, 80.349°W

● Station dpha1 — Dauphin Island, Alabama, 30.251°N, 88.078°W

● Station seto3 — SATURN Estuarine Station #03, 46.200°N, 123.941°W

The NOAA dataset was selected due to its long-term temporal coverage, high spatial representativeness, and rigorous quality control standards applied during data collection and preprocessing. Each buoy is equipped with calibrated sensors that perform real-time monitoring with an average sampling frequency of 10–15 minutes, and measurement accuracy for key variables such as dissolved oxygen saturation and temperature reaches within ±0.1–0.2 units. This level of precision ensures the reliability and consistency of the time series, making the dataset particularly suitable for evaluating high-resolution forecasting models in marine environmental monitoring tasks.

4.2 Experimental settings

To validate the effectiveness of the proposed model, training and evaluation were conducted under a unified experimental environment. All experiments were carried out on a Linux-based server equipped with high-performance GPUs and sufficient memory resources to ensure the efficient operation of deep learning models. In terms of input configuration, only four variables—DEPTH, OTMP, O2PPM (dissolved oxygen concentration), and O2% (dissolved oxygen saturation)—were used as model features, while other recorded variables were excluded due to substantial missing data. All these variables are historical observations available at prediction time, ensuring that no data leakage from future information occurs in either training or evaluation. Regarding hyperparameter configuration, key parameters such as learning rate, batch size, hidden dimension, number of attention heads, and training epochs were carefully set. The detailed hardware/software environment and hyperparameter settings are summarized in Table 1.


Table 1. Experimental environment and hyperparameter settings.
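
To make the input configuration concrete, the sketch below assembles the four retained variables into leakage-free supervised windows. The CSV layout, the "time" column name, and the window and horizon lengths are illustrative assumptions; only the variable set (DEPTH, OTMP, O2PPM, O2%) follows the description above.

```python
import numpy as np
import pandas as pd

def make_windows(csv_path="station_45161.csv", lookback=96, horizon=24):
    """Build supervised windows from one station's record (hypothetical file)."""
    df = pd.read_csv(csv_path, parse_dates=["time"]).set_index("time")
    feats = df[["DEPTH", "OTMP", "O2PPM", "O2%"]].interpolate().dropna()
    x = feats.to_numpy(dtype=np.float32)
    xs, ys = [], []
    # Each window contains only past observations, so no future information
    # leaks into the inputs, matching the protocol described above.
    for i in range(len(x) - lookback - horizon + 1):
        xs.append(x[i:i + lookback])                            # history, 4 vars
        ys.append(x[i + lookback:i + lookback + horizon, 3:4])  # future O2%
    return np.stack(xs), np.stack(ys)
```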

4.3 Metric evaluation

In the experiments, multiple regression evaluation metrics were employed to comprehensively assess the prediction performance, including Mean Squared Error (MSE), Coefficient of Determination (R2), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). These metrics capture deviations between predictions and ground truth from different perspectives, ensuring a holistic evaluation of overall error, goodness of fit, robustness, and relative error levels.

4.3.1 Mean Squared Error

$$\mathrm{MSE} = \frac{1}{T} \sum_{t=1}^{T} \left(y_t - \hat{y}_t\right)^2 \tag{18}$$

MSE measures the squared deviation between predicted and true values, amplifying the impact of larger errors. It is particularly suitable for evaluating overall predictive accuracy. A smaller MSE indicates that the predictions are closer to the observations (Equation 18).

4.3.2 Coefficient of Determination (R2)

$$R^2 = 1 - \frac{\sum_{t=1}^{T} \left(y_t - \hat{y}_t\right)^2}{\sum_{t=1}^{T} \left(y_t - \bar{y}\right)^2} \tag{19}$$

R2 reflects the proportion of variance in the data explained by the model, with a value range of $(-\infty, 1]$. When R2 approaches 1, the model demonstrates strong explanatory power and good fit to the observed sequence (Equation 19).

4.3.3 Mean Absolute Error

$$\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left|y_t - \hat{y}_t\right| \tag{20}$$

MAE calculates the average absolute deviation between predictions and true values, serving as an indicator of the model’s overall error level. Compared to MSE, MAE is less sensitive to outliers and better reflects model robustness under typical conditions (Equation 20).

4.3.4 Mean Absolute Percentage Error

$$\mathrm{MAPE} = \frac{100\%}{T} \sum_{t=1}^{T} \left|\frac{y_t - \hat{y}_t}{y_t}\right| \tag{21}$$

MAPE characterizes the prediction error as a percentage relative to the true values, offering intuitive interpretability. Lower MAPE values indicate smaller relative errors across varying magnitudes, making it particularly useful for cross-region or cross-variable comparisons (Equation 21).
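
The four metrics follow directly from Equations 18–21; a minimal NumPy sketch is given below for reference.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute the four metrics of Equations 18-21 for flat arrays."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)                                              # Eq. (18)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # Eq. (19)
    mae = np.mean(np.abs(err))                                           # Eq. (20)
    mape = 100.0 * np.mean(np.abs(err / y_true))                         # Eq. (21)
    return {"MSE": mse, "MAE": mae, "MAPE": mape, "R2": r2}
```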

5 Experimental results

5.1 Comparison of experimental results with other models

To comprehensively validate the effectiveness of the proposed method, we conducted comparative experiments against a variety of representative baseline models. Specifically, the comparison includes the traditional machine learning model XGBoost (Chen and Guestrin, 2016), a shallow neural network (MLP) (Schmidhuber, 2015), recurrent neural networks (LSTM) (Hochreiter and Schmidhuber, 1997), one-dimensional convolutional neural networks (1DCNN) (Fawaz, 2020), and Transformer-based (Vaswani et al., 2017) approaches that have been widely applied in time series modeling in recent years. In addition, advanced architectures with strong multi-scale modeling capabilities, such as TimeMixer (Wang et al., 2024), iTransformer (Liu et al., 2023), MSTL (Bandara et al., 2025), TimesNet (Wu et al., 2022), and Crossformer (Zhang and Yan, 2023) were also selected to provide diverse baselines across different categories of deep learning frameworks. By comparing with these models, the proposed approach can be evaluated in terms of overall accuracy, stability, and multi-scale feature extraction ability, thereby providing an objective assessment of its performance in marine O2% forecasting tasks. The experimental results are shown in Table 2.


Table 2. Comparison results across stations and baseline models (MSE/MAE/MAPE/R2).

From the overall results, Ours consistently achieves the best or tied-best performance across all five stations and four evaluation metrics, with the advantages being more pronounced at stations characterized by stronger multi-scale dynamics and more frequent abrupt changes. For instance, at station 45161, Ours reduces MSE by 10.58% compared to the best non-proposed baseline (MSTL), decreases MAE by 2.24%, lowers MAPE by 0.62%, and improves R2 by 0.012. At the estuarine station seto3, MSE decreases by 1.43%, MAE by 4.99%, MAPE by 2.35%, and R2 increases by 0.002. These improvements are consistent with the contributions of the Feature Pyramid Space Transformation (FPST) in the encoder, which reconstructs multi-scale structures, and the Gradient Attention Mechanism (GAM) in the decoder, which enhances responsiveness to turning points and rapid fluctuations, thereby demonstrating stronger capabilities in characterizing the non-stationary and multi-resolution dynamics of O2%.

At the oceanic and coastal stations (51045, bbsf1, dpha1), the improvements of Ours are relatively moderate but remain consistent. Specifically, at station 51045, MSE is reduced by 1.40%, MAE by 0.27%, MAPE by 0.77%, and R2 improves by 0.003. At bbsf1 and dpha1, MSE decreases slightly by 0.12% and 0.05%, respectively, while MAE, MAPE, and R2 also show synchronous improvements. This phenomenon suggests that in scenarios with stronger periodicity, higher signal-to-noise ratios, or when baseline models already capture dominant rhythms effectively, Ours mainly manifests advantages in robustness and convergence of tail errors. However, in stations with more complex hydrodynamics and more frequent abrupt changes, the combination of FPST and GAM yields more significant gains in accuracy and goodness of fit.

This paper further presents comparative experimental results on the number of training model parameters and training time, as shown in Table 3, which indicate that the models differ substantially in both parameter scale and training efficiency. Traditional methods such as XGBoost have negligible parameter counts and the shortest training time, while deep models generally have higher complexity. MLP, LSTM, and 1D CNN have smaller parameter counts and higher training efficiency, making them suitable for lightweight scenarios. Transformer series models (such as Crossformer and iTransformer) perform better in capturing long sequence dependencies, but their parameter count and training time increase significantly. In contrast, TimesNet achieves superior efficiency while maintaining low computational cost. The model proposed in this paper has a similar parameter count to Crossformer, but due to the efficient design of the FPST and GAM modules, the training time is controlled within 2.1 seconds, demonstrating a good balance between performance and efficiency.


Table 3. Model parameter size and training time comparison.

5.2 Ablation experiment results

This paper also gives the results of ablation experiments, which are shown in Table 4.


Table 4. Summary of ablation study results.

From an overall perspective, integrating the FPST and GAM modules yields significant and consistent performance improvements across all five monitoring stations. Compared with the Transformer baseline model, the proposed method reduces the MSE by approximately 24.4% at station 45161 (from 0.324 to 0.245), by 10.4% at station seto3 (from 0.231 to 0.207), and by 4.2% at station 51045 (from 5.518 to 5.287). Meanwhile, both MAE and MAPE decrease concurrently, and the R2 score improves steadily. This indicates that the proposed Feature Pyramid Space Transformation and Gradient Attention Mechanism can significantly enhance the model’s representation and fitting capability for O2% dynamic variations without altering the training objective. It is worth emphasizing that introducing either module individually (+FPST or +GAM) already yields substantial gains, while their combination achieves the best performance, highlighting the complementary advantages between structural hierarchy modeling and gradient-based feature enhancement.

From the perspective of station-specific performance, the magnitude of improvement is positively correlated with the complexity of ocean dynamics and the frequency of abrupt changes. At nearshore station 45161 and estuarine station seto3, where short-term fluctuations and multi-resolution structures are more prominent, FPST effectively facilitates cross-scale information flow, and GAM enhances the model’s sensitivity to sudden transitions, thereby substantially reducing tail-end prediction errors. In contrast, stations such as 51045, bbsf1, and dpha1 exhibit stronger periodicity or higher signal-to-noise ratios, where the baseline model already captures the main rhythm; the advantage of the proposed method is mainly reflected in the robustness of the predictions and the fine-grained improvement in R2. In summary, the ablation experiments verify that the proposed FPST and GAM modules collaboratively enhance predictive performance from the perspectives of multi-scale structural modeling and gradient variation perception, providing a more generalizable and interpretable solution for dynamic dissolved oxygen saturation forecasting under complex marine environments.

5.3 Comparison experimental results between true values and predicted values

This paper also compares the true values with the predicted values on the test data; the experimental results are shown in Figure 4.


Figure 4. Time series of dissolved oxygen saturation observed (blue line) and predicted (orange line) for the five NOAA time series stations: (a) SETO3, (b) 45161, (c) 51045, (d) bbsf1, and (e) dpha1. The y-axis denotes dissolved oxygen saturation (O2%), and the x-axis represents the observation time. The observed and predicted values closely overlap across all stations, indicating that the proposed model accurately captures the temporal dynamics of dissolved oxygen saturation.

The predicted curves across the five stations closely follow the observed time series, aligning well in both overall trends and dominant amplitude variations. At station 45161, the long-term gradual changes interspersed with local sudden drops are closely followed. At station 51045, the model tightly fits the medium- and short-period fluctuations superimposed on a relatively stable background. At stations bbsf1 and dpha1, pronounced periodic oscillations are captured with consistent peak–valley timing and minimal phase lag. At station seto3, although the impulses and abrupt rises display relatively large amplitudes, the model effectively tracks their envelopes and directional variations. Overall, the high degree of overlap between predictions and ground truth suggests that the model is capable of capturing both long-term trends and medium- to short-term structures across different stations.

The residual errors are mainly concentrated in segments characterized by rapid transitions and high variance. For example, slight underestimations appear at the sharp peaks of seto3 and the steep drops of bbsf1 and dpha1, while short delays are observed in the local hypoxic events of 45161. These phenomena are consistent with the characteristics of abrupt events and observational noise in marine monitoring. In summary, the visualization in Figure 4 aligns with the quantitative results of MSE, MAE, MAPE, and R2, demonstrating that the proposed model achieves robust and generalizable performance in multi-scale variation and abrupt change response, making it suitable for marine O2% monitoring and forecasting applications.

5.4 Scatter plot experiment results

To intuitively evaluate the consistency of model fitting at the sample level, a parity plot was generated for the test set, where each scatter point corresponds to a single-step or multi-step prediction of O2%. In the plot, the true values are placed on the horizontal axis and the predicted values on the vertical axis, with a reference line y = x superimposed to examine potential systematic bias. This visualization provides a complementary assessment to the previously reported MSE, MAE, MAPE, and R2 metrics. The experimental results are shown in Figure 5.


Figure 5. Scatter plots comparing observed and predicted dissolved oxygen saturation for the five NOAA time series stations: (a) SETO3, (b) 45161, (c) 51045, (d) bbsf1, and (e) dpha1. The x-axis denotes the observed dissolved oxygen saturation (O2%), and the y-axis denotes the predicted O2%. Blue points represent individual observations, the dashed line indicates the ideal 1:1 reference line (y = x), and the orange line shows the fitted regression relationship between observed and predicted values.

From the scatter plots, it can be observed that the points at each station are generally distributed along the reference line y = x, indicating that the model exhibits no obvious systematic bias overall. At stations 51045 and seto3, the scatter bands are narrow and closely overlap with the reference line, reflecting stable consistency. At station bbsf1, the linear relationship is also evident, though a few outliers appear in the medium to high range. At stations 45161 and dpha1, the scatter bands are relatively wider, primarily during periods characterized by stronger local fluctuations or more complex environmental drivers. Overall, the model is able to capture both the main trend and amplitude, and the predicted–observed relationships approximate linearity.

In terms of bias patterns, some stations show slight underestimation at the high-value end (with points falling below the y = x line, likely corresponding to local increases during warm seasons or eutrophication disturbances), while slight overestimation occurs at the low-value end (with points located above the reference line), reflecting a typical regression-to-the-mean effect. This is consistent with the mixed abrupt–stationary characteristics of O2%. The convergence of the scatter bands aligns with the previous conclusions from MSE, MAE, and R2, suggesting that the multi-scale representation (FPST) and gradient attention mechanism (GAM) effectively reduce dispersion in extreme or turning segments, although a few outliers remain visible during strong abrupt events, indicating that extreme processes remain a challenge for the model.

5.5 The impact of parameter sensitivity on experimental results

To systematically examine the robustness of the proposed model, parameter sensitivity experiments were conducted by varying key hyperparameters that may significantly influence training dynamics and predictive performance. The first set of experiments focuses on sensitivity to the number of attention heads; the experimental results are shown in Figure 6.


Figure 6. Attention head sensitivity experiment results (averaged across five stations).

From the attention-head sensitivity experiment, it can be observed that as the number of attention heads increases from 2 to 6, the error-related metrics consistently decrease while R2 steadily improves. This indicates that a larger representation capacity can work synergistically with the multi-scale topological reconstruction of FPST, significantly enhancing the joint characterization of long-term trends and short-term abrupt changes in O2%. At the same time, this configuration aligns well with the gradient-guided attention in GAM, enabling the model to achieve consistently high fitting accuracy and stability across all five stations. Considering both accuracy and computational cost, the six-head setting can be regarded as the preferred default configuration, as it ensures predictive performance while supporting efficient deployment and reuse in marine environmental monitoring scenarios.

This paper further presents the experimental results of different optimizers, as shown in Table 5, which indicate that AdamW with decoupled weight decay achieves the best average performance across all four metrics, reflecting that its adaptive learning rate and regularization mechanism align better with the training characteristics of FPST and GAM. This allows the model to robustly fit both low-frequency trends and high-frequency fluctuations of O2%. Compared with SGD, AdamW reduces MSE, MAE, and MAPE by approximately 14.8%, 12.1%, and 14.9%, respectively, while improving R2 by 0.048. It also yields slight improvements over Adam, such as an additional reduction of about 5.8% in MSE. Therefore, adopting AdamW as the optimizer in this study is well justified, and the results further demonstrate that although the metrics exhibit certain fluctuations, the proposed model maintains stability.


Table 5. Results of optimizer sensitivity experiments (averaged across five stations).

5.6 The impact of different gradient calculation methods on experimental results

This paper also uses Station 45161 as the experimental benchmark to examine the impact of different gradient computation methods. The results in Table 6 show that the choice of method has a significant influence on the model’s predictive performance. The first-order gradient method demonstrates a certain ability to capture local variations; however, due to the lack of gradient smoothing, the model is easily affected by noise in complex dynamic sequences, resulting in relatively high errors. The smoothed gradient method mitigates this issue by performing multiple sampling and averaging operations to reduce noise sensitivity, leading to slightly lower MSE and MAE than the first-order gradient. This indicates that appropriate smoothing operations help improve model stability.


Table 6. Comparison of different gradient computation methods in the GAM module.

Furthermore, the second-order gradient method leverages higher-order curvature information to enhance its ability to capture nonlinear trends, thus achieving better performance in terms of MSE and R2. In contrast, the proposed temporal difference gradient maintains gradient responsiveness while significantly reducing computational complexity and achieves the best results across all metrics. This demonstrates that the temporal difference strategy effectively balances accuracy and efficiency, enabling the model to perform more robustly in capturing both abrupt changes and trend variations.
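
To make the compared variants concrete, the sketch below gives one plausible PyTorch realization of each gradient scheme. The centered-difference reading of the first-order gradient and the smoothing window size are assumptions; only the temporal difference matches Equation 9 directly.

```python
import torch
import torch.nn.functional as F

def first_order(h):
    # Centered first-order gradient along time (no smoothing).
    return torch.gradient(h, dim=1)[0]

def smoothed(h, k=3):
    # Average the first-order gradient over a small window to damp noise.
    g = first_order(h).transpose(1, 2)                  # (B, d, T)
    g = F.pad(g, ((k - 1) // 2, k // 2), mode="replicate")
    return F.avg_pool1d(g, k, stride=1).transpose(1, 2)

def second_order(h):
    # Curvature information: gradient of the gradient.
    return torch.gradient(first_order(h), dim=1)[0]

def temporal_difference(h):
    # The variant adopted in GAM (Eq. 9): one tensor subtraction, no extra passes.
    return torch.diff(h, dim=1, prepend=h[:, :1])
```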

5.7 Effect of pyramid level L and gradient factor β

In the multi-scale feature modeling and gradient-guided attention mechanism, the pyramid level L and the gradient factor β are the key hyperparameters that determine the model’s representational capability and stability. To analyze the specific impact of these two factors on model performance, this section further designs experiments from two perspectives: structural depth and gradient weighting. Similarly, Station 45161 is taken as an example for the experiments, and the corresponding results are shown in Figure 7, which illustrates that both the pyramid level L and the gradient factor β have a significant impact on the model’s performance.


Figure 7. The effect of different L and different β on experimental results.

With the increase of L, the MSE first decreases and then increases, indicating that a moderate hierarchical structure can enhance the representation capability of multi-scale features, while an excessively deep pyramid may introduce redundant features and increase the risk of overfitting. Meanwhile, the variation of β exhibits a gentle unimodal trend, where the model achieves optimal performance at a moderate weighting. This suggests that an appropriate gradient guidance helps to strike a balance between feature enhancement and noise suppression, thereby further improving the model’s stability and predictive accuracy.

5.8 Prediction performance with different input feature subsets

To further clarify the relationship between input configuration and model performance, this subsection investigates how well dissolved oxygen saturation (O2%) can be predicted from different subsets of the available input variables. Starting from the full feature set, we construct several reduced input configurations by selectively combining individual physical predictors, while keeping the model architecture and training protocol fixed. This design allows us to quantify the marginal contribution of each variable to the reconstruction of O2% and to assess the robustness of the model under partially missing input conditions. By comparing the prediction performance across these feature subsets, we provide a more transparent interpretation of what the model predicts from which inputs.

Across all five NOAA time series stations, the results in Table 7 show a consistent pattern that using the full set of available predictors yields the best reconstruction performance for dissolved oxygen saturation, with the lowest error metrics and highest R2 values. When the input is restricted to O2% alone, prediction accuracy degrades noticeably, and the various two-variable subsets (O2%+DEPTH, O2%+OTMP, O2%+O2PPM) fall in between these two extremes, indicating that each additional physical variable provides complementary information that helps the model better capture local oxygen dynamics. The degree of performance degradation varies across stations, reflecting that the relative importance of depth, temperature, and dissolved oxygen concentration is not uniform but depends on site-specific hydrographic conditions. Overall, this comparison confirms that the proposed model benefits substantially from richer physical context and that omitting key input variables leads to systematically weaker reconstruction of O2% even though the model architecture and training protocol remain unchanged.


Table 7. Prediction performance of dissolved oxygen saturation (O2%) under different input feature subsets at the five NOAA time series stations.
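
Procedurally, the subset study amounts to retraining the same pipeline on each input configuration. A minimal sketch is shown below, where train_and_eval is a hypothetical placeholder for the fixed architecture and training protocol described in Sections 3 and 4.

```python
SUBSETS = {
    "O2% only":    ["O2%"],
    "O2% + DEPTH": ["O2%", "DEPTH"],
    "O2% + OTMP":  ["O2%", "OTMP"],
    "O2% + O2PPM": ["O2%", "O2PPM"],
    "full":        ["O2%", "DEPTH", "OTMP", "O2PPM"],
}

def run_subset_study(train_and_eval, station_csv):
    """Retrain the identical model on each feature subset and collect metrics."""
    results = {}
    for name, cols in SUBSETS.items():
        # Only the input columns change; architecture and training are fixed.
        results[name] = train_and_eval(station_csv, input_cols=cols)
    return results
```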

6 Conclusion

This study focuses on the task of O2% time series forecasting in marine environmental monitoring and, using multi-year in situ records from five representative NOAA stations, proposes an enhanced Transformer-based framework that integrates two key innovations: the Feature Pyramid Space Transformation and the Gradient Attention Mechanism. The FPST module, embedded in the encoder, is not only used to improve predictive performance, but also to decompose the observed dissolved oxygen saturation signals into hierarchical temporal components and to reveal station-dependent multi-scale structures and topological dependencies. The GAM module in the decoder enhances the sensitivity to abrupt variations and dynamic transitions in O2%, enabling a more explicit characterization of how sharp gradients, rapid events, and regime shifts manifest in the observational records. Experimental results on the five buoy sites confirm that the proposed approach consistently outperforms mainstream baselines in terms of MSE, MAE, MAPE, and R2, and the associated ablation and input-subset analyses further clarify the relative roles of depth, temperature, and local oxygen concentration as drivers of O2% variability in different hydrographic regimes.

Beyond the methodological contribution, the observationally grounded findings of this work hold significant implications for ocean environmental applications. By quantifying the multi-scale structure and gradient-driven behavior of dissolved oxygen saturation across contrasting coastal and shelf environments, the study provides new insight into the timing, persistence, and predictability of low-oxygen conditions and other stress-related fluctuations. Accurate forecasting of O2% variations derived from these analyses offers early indicators of hypoxia and ecosystem pressure, supporting proactive marine management, refinement of observing strategies, and the sustainable regulation of fisheries resources. The proposed framework can be integrated into real-time monitoring systems to enhance the early-warning capacity for low-oxygen events and to supply data-driven constraints for climate- and biogeochemistry-related assessment of coastal oxygen dynamics. In future work, the methodology can be extended to larger-scale, multimodal, and physically informed datasets to further improve model generalization and interpretability, thereby strengthening its role as a bridge between intelligent prediction tools and process-based understanding of the marine environment.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.

Author contributions

WT: Conceptualization, Data curation, Formal Analysis, Methodology, Writing – original draft, Writing – review & editing. BG: Conceptualization, Data curation, Investigation, Writing – review & editing. XB: Data curation, Formal Analysis, Investigation, Methodology, Writing – review & editing.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Acknowledgments

The authors would like to thank their colleagues and collaborators for their valuable support and constructive feedback during the preparation of this manuscript.

Conflict of interest

The authors declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.


Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Bandara K., Hyndman R. J., and Bergmeir C. (2025). MSTL: A seasonal-trend decomposition algorithm for time series with multiple seasonal patterns. Int. J. Operational Res. 52, 79–98. doi: 10.1504/IJOR.2025.143957

Chen T. and Guestrin C. (2016). "XGBoost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY: ACM), 785–794.

Chen Y., Zhang H., Xu C., Zhu Q., Cai M., and Yuan J. (2024). Research on seawater dissolved oxygen prediction model based on improved generative adversarial networks. Ocean Model. 191, 102404. doi: 10.1016/j.ocemod.2024.102404

Fawaz H. I. (2020). Deep learning for time series classification. arXiv preprint arXiv:2010.00567.

Hao Z. (2024). A dissolved oxygen prediction model based on GRU–N-BEATS. Front. Mar. Sci. 11, 1365047. doi: 10.3389/fmars.2024.1365047

Hochreiter S. and Schmidhuber J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780. doi: 10.1162/neco.1997.9.8.1735

Huang S., Shao J., Chen Y., Qi J., Wu S., Zhang F., et al. (2023). Reconstruction of dissolved oxygen in the Indian Ocean from 1980 to 2019 based on machine learning techniques. Front. Mar. Sci. 10, 1291232. doi: 10.3389/fmars.2023.1291232

Lim B., Arık S. Ö., Loeff N., and Pfister T. (2021). Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecasting 37, 1748–1764. doi: 10.1016/j.ijforecast.2021.03.012

Liu Y., Hu T., Zhang H., Wu H., Wang S., Ma L., et al. (2023). iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625.

Liu W., Wang J., Li Z., and Lu Q. (2024). ISSA optimized spatiotemporal prediction model of dissolved oxygen for marine ranching integrating DAM and Bi-GRU. Front. Mar. Sci. 11, 1473551. doi: 10.3389/fmars.2024.1473551

Liu H., Yang R., Duan Z., and Wu H. (2021). A hybrid neural network model for marine dissolved oxygen concentrations time-series forecasting based on multi-factor analysis and a multi-model ensemble. Engineering 7, 1751–1765. doi: 10.1016/j.eng.2020.10.023

López-Andreu F. J., Lopez-Morales J. A., Hernández-Guillen Z., Carrero-Rodrigo J. A., Sánchez-Alcaraz M., Atenza-Juárez J. F., et al. (2023). Deep learning-based time series forecasting models evaluation for the forecast of chlorophyll a and dissolved oxygen in the Mar Menor. J. Mar. Sci. Eng. 11, 1473. doi: 10.3390/jmse11071473

Ma Y., Fang Q., Xia S., and Zhou Y. (2024). Prediction of the dissolved oxygen content in aquaculture based on the CNN-GRU hybrid neural network. Water 16, 3547. doi: 10.3390/w16243547

Ma D., Zhao F., Zhu L., Li X., Wei J., Chen X., et al. (2025). Deep learning reveals hotspots of global oceanic oxygen changes from 2003 to 2020. Int. J. Appl. Earth Observation Geoinformation 136, 104363. doi: 10.1016/j.jag.2025.104363

Nie Y., Nguyen N. H., Sinthong P., and Kalagnanam J. (2022). A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730.

Oreshkin B. N., Carpov D., Chapados N., and Bengio Y. (2019). N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437.

Schmidhuber J. (2015). Deep learning in neural networks: An overview. Neural Networks 61, 85–117. doi: 10.1016/j.neunet.2014.09.003

Shao J., Huang S., Chen Y., Qi J., Wang Y., Wu S., et al. (2023). Satellite-based global sea surface oxygen mapping and interpretation with spatiotemporal machine learning. Environ. Sci. Technol. 58, 498–509. doi: 10.1021/acs.est.3c08833

Sundararaman H. K. K. and Shanmugam P. (2024). Estimates of the global ocean surface dissolved oxygen and macronutrients from satellite data. Remote Sens. Environ. 311, 114243. doi: 10.1016/j.rse.2024.114243

Valera M., Walter R. K., Bailey B. A., and Castillo J. E. (2020). Machine learning based predictions of dissolved oxygen in a small coastal embayment. J. Mar. Sci. Eng. 8, 1007. doi: 10.3390/jmse8121007

Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., et al. (2017). Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008.

Wang S., Wu H., Shi X., Hu T., Luo H., Ma L., et al. (2024). TimeMixer: Decomposable multiscale mixing for time series forecasting. arXiv preprint arXiv:2405.14616.

Woo G., Liu C., Sahoo D., Kumar A., and Hoi S. (2022). ETSformer: Exponential smoothing transformers for time-series forecasting. arXiv preprint arXiv:2202.01381.

Wu H., Hu T., Liu Y., Zhou H., Wang J., and Long M. (2022). TimesNet: Temporal 2D-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186.

Wu H., Xu J., Wang J., and Long M. (2021). Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 34, 22419–22430.

Xu L., Liu W., Chengqing C., Liu T., Gao X., Sohel F., et al. (2025). Hybrid deep learning framework for real-time DO prediction in aquaculture. Sci. Rep. 15, 24643. doi: 10.1038/s41598-025-10786-5

Zhang C., Meng Q., Ma X., Liu A., and Zhou F. (2025). Modeling bottom dissolved oxygen on the East China Sea shelf using interpretable machine learning. J. Mar. Sci. Eng. 13, 359. doi: 10.3390/jmse13020359

Zhang Y. and Yan J. (2023). "Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting," in The Eleventh International Conference on Learning Representations (ICLR).

Zhou T., Ma Z., Wen Q., Wang X., Sun L., and Jin R. (2022). "Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting," in Proceedings of the International Conference on Machine Learning (PMLR), 27268–27286.

Zhou H., Zhang S., Peng J., Zhang S., Li J., Xiong H., et al. (2021). "Informer: Beyond efficient transformer for long sequence time-series forecasting," in Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35 (AAAI Press), 11106–11115.

Keywords: dissolved oxygen saturation forecasting, marine environmental monitoring, multi-scale features, time series modeling, transformer

Citation: Tan W, Geng B and Bai X (2026) Feature-gradient enhanced transformer for accurate dissolved oxygen saturation forecasting in marine environments. Front. Mar. Sci. 12:1705996. doi: 10.3389/fmars.2025.1705996

Received: 15 September 2025; Revised: 28 November 2025; Accepted: 19 December 2025;
Published: 12 January 2026.

Edited by:

Chengbo Wang, Xidian University, China

Reviewed by:

Tien Anh Tran, Seoul National University, Republic of Korea
Liang Zhao, Henan University of Technology, China

Copyright © 2026 Tan, Geng and Bai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Weiyan Tan, tan_798571@163.com
