Interpretable machine learning for bridge-pier scour prediction and flood resilience

Khan, Adil; Ismael, Dalya

doi:10.3389/fbuil.2025.1731114

ORIGINAL RESEARCH article

Front. Built Environ., 30 January 2026

Sec. Dam Engineering and Design

Volume 11 - 2025 | https://doi.org/10.3389/fbuil.2025.1731114

This article is part of the Research Topic "Resilient Flood Protection Infrastructure: Adaptive Design, Analysis, and Innovative Solutions for Evolving Hazards".

Adil Khan ¹ ^†

Dalya Ismael ²*^†

¹ Department of Civil and Environmental Engineering, Old Dominion University, Norfolk, VA, United States
² Department of Engineering Technology, Old Dominion University, Norfolk, VA, United States

Bridge-pier scour is a leading cause of flood-induced bridge failure, yet practice still lacks transparent, physics-informed tools that link data-driven prediction with design guidance. This study develops an interpretable, physics-aware machine-learning framework to predict equilibrium scour depth and translate those predictions into actionable strategies for flood-resilient infrastructure. Using the 2014 U.S. Geological Survey Pier-Scour Database (569 laboratory cases), five models were trained and evaluated with K-fold cross-validation: Gradient Boosting, AdaBoost (Tree), XGBoost, Gaussian Process (RBF kernel), and Kernel Ridge (polynomial). Model performance was evaluated using R², RMSE, and MAE. Gradient Boosting performed best, achieving training and testing R² of 0.99 and 0.96, a near-ideal parity fit, and consistent accuracy across folds. Interpretability is provided by SHAP, whose attributions align with hydraulics; the pier width normal to flow accounts for 70.6% of the total importance in predicting scour depth. Predicted scour is mapped to four scenario envelopes that capture rare, peak, and sustained hydraulic extremes and yield clear design checks for flood resilience. A physics-based imputation scheme for sediment critical velocity and duration of flow is integrated in the framework so that missing inputs are handled in a hydraulically consistent way. The developed models are deployed in an interactive web app, allowing practitioners to obtain code-free scour predictions across all learners. Applied to the Knik River bridge and benchmarked against related work, the framework improves accuracy and provides actionable margins for design verification, maintenance prioritization, retrofit planning, emergency response, and transparent risk communication.

1 Introduction

Bridge failures severely disrupt national transportation networks and pose serious threats to hydraulic infrastructure and human life. Beyond the immediate injuries and fatalities, the loss of service can significantly hinder economic growth (Cook et al., 2015; Diaz et al., 2009). Historical examples include the 1907 Quebec Bridge collapse, which killed 75 workers during construction (Pearson and Delatte, 2006), and the 1967 Silver Bridge failure, which claimed 46 lives in service (Harik et al., 1990). More recent incidents include the 2007 Tuojiang Bridge collapse during construction, which caused 64 deaths, 22 injuries, and an estimated direct economic loss of 39.7 million yuan (Tang and Huang, 2024), and the 2018 failure of the Morandi (Polcevera) Viaduct in Italy, which resulted in 43 deaths and approximately 100 million yuan in losses (Morgese et al., 2020). According to Zhang et al. (2022), natural hazards account for more than 50% of all bridge failures. The increasing frequency and intensity of natural hazards tied to climate change, along with global population growth and urbanization, are amplifying risks to civil infrastructure worldwide.

Among these hazards, flooding remains the leading cause of bridge damage and failure (Argyroudis and Mitoulis, 2021; Ismael et al., 2024). Fu et al. (2012) and Xu et al. (2016) analyzed Chinese bridge collapses from 2000–2012 and found that nearly 46% were caused by floods. Similarly, a survey identified over 500 U.S. bridge failures between 1989 and 2000, with 48.31% attributed to flooding (Wardhana and Hadipriono, 2003). NOAA (2015) reported a 612% rise in hydraulic damage rates compared with the 1960s, reflecting the growing exposure of bridges to extreme hydrologic events. Hydraulic bridge failures are primarily driven by scour, flooding, or ice-floe actions, and the associated risk perceptions in construction and infrastructure settings affect how these hazards are managed (Ismael and Shealy, 2018). Because intensifying extreme rainfall and flood events associated with climate change are direct drivers of hydraulic failures, precipitation-related risks to bridge safety and performance are expected to grow (Nasr et al., 2021), highlighting how uncertainty and risk-based decision processes influence infrastructure vulnerability (Shealy et al., 2017). The AASHTO LRFD Bridge Design Specifications state that most bridge failures in the U.S. and elsewhere are due to scour, the dominant hydraulic effect on bridges (AASHTO, 1998). Scour develops during floods when high-velocity flows increase near-bed shear stresses that mobilize and remove sediment around foundations (USGS, 2016). Specifically, scour-related failures due to flooding are estimated to account for about 60% of bridge collapses nationwide (Wang et al., 2017), highlighting the need to explicitly incorporate scour effects into bridge design and evaluation.

Local scour refers to sediment erosion and transport that develop around hydraulic and marine structures under flowing water. When flow encounters a pier, streamlines separate and form a complex three-dimensional flow field characterized by a horseshoe vortex at the upstream face, a downward jet, and wake vortices downstream (Chen H. et al., 2025). These vortices significantly increase local bed shear stress relative to the approach flow, triggering sediment entrainment and progressive degradation of the bed topography (Ma et al., 2024). The resulting scour can undermine structural stability and has been linked to several major bridge collapses (Yang et al., 2018). As scour deepens, the exposed length of the foundation increases, reducing lateral stiffness, decreasing buckling resistance, and lowering the overall factor of safety (Anisha et al., 2022). Advancing the mechanistic understanding and quantification of local scour is therefore essential for ensuring the safe operation of hydraulic infrastructure and for designing bridges that can withstand extreme flood events.

In parallel, artificial intelligence (AI) and machine learning (ML) have emerged as transformative tools across engineering disciplines, driven by advances in data availability and computational power. Machine learning, a core branch of AI, can autonomously detect complex patterns in high-dimensional data and improve performance without explicit programming (Rahman and Chavan, 2025). ML has demonstrated success across diverse domains, including engineering education (Ismael, 2023), medical diagnostics (Asif et al., 2025), accounting (Magazzino and Haroon, 2025), chemistry (Seal et al., 2025), and civil engineering (Khatir et al., 2025). Its growing popularity in civil and hydraulic engineering is largely due to its ability to process large, heterogeneous datasets and model nonlinear relationships that are difficult to capture using traditional empirical or analytical methods (Aldoseri et al., 2023).

Recent developments in hydraulic engineering have accelerated the use of soft computing and ML algorithms for scour prediction (Akib et al., 2014). Pal et al. (2012) applied an M5 model tree to multi-dimensional datasets and achieved performance comparable to a back-propagation neural network, outperforming empirical formulas. Cheng et al. (2015) coupled an Evolutionary RBF Neural Network with an Artificial Bee Colony optimization algorithm, producing higher accuracy than both AI baselines and conventional equations. Choi et al. (2017) employed an adaptive neuro-fuzzy inference system (ANFIS) using five key variables (flow depth, pier width, critical velocity, sediment size, and mean velocity), achieving strong predictions of equilibrium scour depth compared with artificial neural networks and empirical relations.

Although these studies have advanced scour prediction, a critical gap remains in understanding the mechanistic importance and interaction of input variables, particularly under extreme hydraulic forcing. Most existing models focus on predictive accuracy but provide limited insight into the physical significance of features or their relationships with governing hydraulic laws. This study directly addresses that gap by integrating physics-informed constraints into data-driven modeling and applying interpretable machine learning techniques to reveal variable importance and dependencies consistent with hydraulic theory. This study also implements rigorous generalization assessment through K-fold cross-validation, supplemented by hold-out testing and targeted stress tests, to demonstrate accuracy and robustness in both training and testing regimes.

To translate recent advances in machine learning into engineering practice, this study develops a physics-informed framework for predicting bridge-pier scour depth. It further operationalizes these predictions into a flood-resilient design tool by defining extreme-condition scenario envelopes that capture rare, peak, and sustained hydraulic events. The framework quantifies the mechanistic relevance of key hydraulic variables through SHAP (SHapley Additive exPlanations) to ensure that data-driven predictions align with physical laws and interpretable scour mechanisms. The framework is demonstrated using the Knik River bridge piers in southcentral Alaska to estimate credible upper bounds on scour and to evaluate structural sufficiency under flood conditions. This approach enables early-phase design to incorporate defensible maximum scour depths, supporting more flood-resilient, cost-effective, and sustainable infrastructure. For existing assets, the same toolchain yields a quantitative scour-based flood-risk criterion to prioritize monitoring, proactive maintenance, or decommissioning where warranted. Overall, the approach enhances predictive reliability, explains why specific drivers matter, and bridges data-driven inference with first-principles hydraulics to support decision-oriented design, monitoring, and maintenance.

2 Materials and methods

Figure 1 presents a comprehensive summary of the methodology implemented in this investigation. It details the entire workflow, beginning with data collection and preprocessing, and proceeding through successive stages of model training, validation, interpretability analysis, and scenario generation. This schematic representation clarifies the sequential relationships among each methodological component and supports reproducibility of the study’s approach.

Figure 1. Schematic diagram of the study.

2.1 Data sources

The dataset used in this study is the 2014 U.S. Geological Survey Pier-Scour Database (PSDb-2014), compiled by Benedict and Caldwell (2014) through a systematic literature review of published laboratory and field measurements of pier-scour. For the present analysis, the laboratory subset consisting of 569 data points was utilized. These data encompass a wide range of hydrologic, sediment, and geometric conditions, representing diverse flow regimes and experimental configurations. The dataset is widely regarded as a benchmark source for scour prediction studies and provides a reliable foundation for reproducible model development and comparison with prior research.

2.2 Variables and notation

The equilibrium scour depth at a bridge pier is governed by three categories of parameters: (i) bed-material properties, (ii) water inflow conditions, and (iii) pier geometry (Pizarro et al., 2020). In alignment with this framework, the principal variables adopted in this study include the pier width normal to flow (b_n), approach flow velocity (V_o), sediment critical velocity (V_c), approach flow depth (y_o), median sediment size (D₅₀), geometric standard deviation of the sediment-size distribution (σ_g), and the duration of flow/scouring (T). Accordingly, the equilibrium scour depth at the pier (y_s) can be expressed as a function of these variables, as shown in Equation 1.

y_{s} = F (b_{n}, V_{o}, V_{c}, y_{o}, D_{50}, σ_{g}, T) (1)

2.3 Statistical evaluation of the dataset

The histograms in Figures 2A–H illustrate the distributional features summarized in Table 1, revealing the effective range of each variable. The pier width $b_{n}$ is tightly clustered at smaller sizes (0.05–3.00 ft; median 0.35 ft) with a strongly right-skewed tail. Approach and critical velocities ($V_{o}$ and $V_{c}$) share similar shapes: most $V_{o}$ values lie below 4 ft/s, while the two variables extend to 7 and 4.18 ft/s, with medians of 1.35 and 1.15 ft/s, respectively. Flow depth $y_{o}$ spans 0.07–6.23 ft (median 0.56 ft), concentrating in shallow flows. Median sediment size $D_{50}$ covers 0.22–7.80 mm (median 0.80 mm), heavily right-skewed toward fine–medium sands, while sediment gradation $\sigma_{g}$ is narrowly distributed (1.10–1.50; median 1.30), indicating well-sorted beds. Duration $T$ has the greatest range (29–54,780 min; median 2,998 min) with a strong tail. The target scour depth $y_{s}$ is right-skewed (0.01–4.63 ft; median 0.32 ft), with most events below 1 ft and a sparse upper tail. These distributional patterns in Figure 2 corroborate the summary statistics presented in Table 1.

Figure 2. Histograms of input and output features: (A) Pier width normal to flow b_n (ft), (B) Approach flow velocity V_o (ft/s), (C) Sediment critical velocity V_c (ft/s), (D) Approach flow depth y_o (ft), (E) Median sediment size D₅₀ (mm), (F) Geometric standard deviation of the sediment-size distribution σ_g, (G) Duration of flow T (min), and (H) Scour depth at the pier y_s (ft). The panels report the following means and standard deviations: (A) 0.35 and 0.46, (B) 1.68 and 1.03, (C) 1.45 and 0.72, (D) 0.88 and 0.80, (E) 1.23 and 1.41, (F) 1.45 and 0.67, (G) 4609.13 and 6339.31, (H) 0.45 and 0.46.

Table 1. Statistical summary.

Given the extensive positive skew across several variables ($b_{n}$, $V_{o}$, $V_{c}$, $y_{o}$, $D_{50}$, $T$, $y_{s}$) and heavy tails, particularly for $T$ and $y_{s}$, robust predictive models are recommended, as they tend to produce more stable and symmetric residuals (Yang et al., 2019). Although $\sigma_{g}$ shows limited marginal variability, potential interactions with $D_{50}$ and flow variables may influence scour response. Because $y_{o}$, $V_{o}$, and $D_{50}$ span wide ranges (Table 1), models should be validated across hydraulic intensities and sediment coarseness (Baranwal and Das, 2024a). Two dimensionless groups, flow intensity ($V_{o}/V_{c}$) and relative depth ($y_{o}/b_{n}$), are recommended for sensitivity analysis, as they can reduce scale effects and linearize correlations with $y_{s}$ (Hassan and Jalal, 2021). Furthermore, outlier-aware evaluation metrics (e.g., median-based and quantile-based statistics) and uncertainty reporting are applied to prevent rare, long-duration tests or large-scour cases from biasing risk-relevant predictions (Arachchige and Prendergast, 2024).
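The two dimensionless groups named above can be computed directly from the raw variables. The following is a minimal sketch rather than the study's code; the function name and the choice of the dataset medians as example inputs are assumptions made for illustration.

```python
def dimensionless_groups(v_o, v_c, y_o, b_n):
    """Return flow intensity V_o/V_c and relative depth y_o/b_n.

    Both velocities must share one unit (e.g. ft/s) and both lengths
    one unit (e.g. ft), so that the ratios are unit-free.
    """
    if v_c <= 0 or b_n <= 0:
        raise ValueError("V_c and b_n must be positive")
    return {"flow_intensity": v_o / v_c, "relative_depth": y_o / b_n}


# Example inputs: the dataset medians quoted in the text (illustrative only).
groups = dimensionless_groups(v_o=1.35, v_c=1.15, y_o=0.56, b_n=0.35)
print(groups)
```

A flow intensity above 1 means the approach velocity exceeds the threshold for sediment motion, which is the regime usually described as live-bed scour.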

The Pearson correlation heatmap (Figure 3) indicates two dominant patterns among the study variables. The critical velocity $V_{c}$ shows a strong positive correlation with median sediment size $D_{50}$ ($0.95$), and approach flow velocity $V_{o}$ relates to both ($0.44$ with $V_{c}$ and $0.43$ with $D_{50}$); similar correlations have been reported in other studies (Chen F. et al., 2025; de Lange et al., 2024). By contrast, the pier width normal to flow $b_{n}$ has a strong correlation with $y_{s}$ ($0.86$) and a substantial relationship with duration $T$ ($0.61$), while $T$ itself relates to $y_{s}$ ($0.66$), consistent with the correlations reported by recent studies (Nandi and Das, 2025a; Xu et al., 2025). Approach flow depth $y_{o}$ shows a moderate relation with $y_{s}$ ($0.53$) and weaker ties to the velocity block ($V_{o}$ and $V_{c}$), with correlations of 0.11 and 0.39, respectively. The sediment gradation $\sigma_{g}$ is largely independent of the other factors: it is slightly negatively correlated with $V_{c}$ and $D_{50}$ ($-0.13$) and only weakly with $y_{s}$ ($-0.12$) (Baranwal and Das, 2024b). Overall, the correlation structure suggests that geometric configuration ($b_{n}$ and $y_{o}$) and flow duration ($T$) are the primary predictors of scour depth, whereas velocity–size variables ($V_{o}$, $V_{c}$, and $D_{50}$) are strongly interrelated and should be treated carefully to mitigate multicollinearity in model development.

Figure 3. Pearson correlation heatmap for the laboratory dataset showing linear correlations among input variables and output feature y_s (ft).
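For readers who want to reproduce one cell of such a heatmap, the Pearson coefficient can be computed in a few lines. This is a sketch only; the toy values below are hypothetical and are not taken from PSDb-2014.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical toy values for pier width (ft) and scour depth (ft).
b_n = [0.10, 0.20, 0.35, 0.50, 1.00]
y_s = [0.12, 0.20, 0.33, 0.48, 0.95]
r = pearson_r(b_n, y_s)
print(round(r, 3))
```

With the data held in a pandas DataFrame, `DataFrame.corr(method="pearson")` returns the full matrix in one call.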

2.4 Preprocessing

All variables were standardized with consistent symbols and units, including b_n (ft), V_o (ft/s), V_c (ft/s), y_o (ft), D₅₀ (mm), σ_g (−), T (min), and y_s (ft), by trimming headers and mapping aliases; columns were then coerced to numeric types, with placeholders treated as NaN (Jacobson et al., 2024; Kang, 2013; Peng et al., 2023). For the laboratory dataset, rows containing missing values were removed to ensure that descriptive statistics and model training were based exclusively on observed measurements. The resulting cleaned dataset was then stored and used to develop the scour-depth prediction models (Kang, 2013; Peng et al., 2023).
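The cleaning steps described above (header trimming, alias mapping, numeric coercion, and row removal) might look as follows in pandas. This is a sketch under stated assumptions: the alias map, the placeholder tokens in the toy table, and the function name `clean_table` are hypothetical, since the study does not list them.

```python
import pandas as pd

# Hypothetical alias map; the study's actual header aliases are not given.
ALIASES = {"bn": "b_n", "vo": "V_o", "ys": "y_s"}


def clean_table(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    # Trim whitespace from headers, then map known aliases to canonical symbols.
    df.columns = [ALIASES.get(c.strip().lower(), c.strip()) for c in df.columns]
    # Coerce every column to numeric; unparseable placeholders become NaN.
    df = df.apply(pd.to_numeric, errors="coerce")
    # Laboratory data: keep only fully observed rows.
    return df.dropna().reset_index(drop=True)


# Toy input with messy headers and placeholder tokens ("--", "n/a").
raw = pd.DataFrame(
    {" bn ": ["0.35", "0.50", "--"],
     "Vo": ["1.35", "n/a", "2.0"],
     "ys": ["0.32", "0.40", "0.51"]}
)
clean = clean_table(raw)
print(clean)
```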

Because the compiled measurements originated from multiple laboratories, the raw distributions contain outliers and scale heterogeneity. We therefore applied normalization and standardization (feature scaling) before modeling to improve numerical conditioning and robustness. As evident in Table 1 and the histograms in Figures 2A–H, the dataset is imbalanced: certain value ranges are over-represented while others are limited or absent, which can bias learning and increase the risk of overfitting (Charilaou and Battat, 2022). The adopted preprocessing workflow, including consistent labeling, type coercion, outlier-aware scaling, and missing-data handling, was therefore implemented to reduce these effects and enhance generalization performance in the predictive modeling stage.

For the Knik River piers, all records were retained, and units were harmonized across variables. A new hydraulic descriptor, the Froude number, was engineered as given in Equation 2 (Chow, 1959).

Fr = \frac{V_{o}}{\sqrt{g y_{o}}} (2)

where $g = 32.174\ \mathrm{ft/s^{2}}$. Remaining gaps in {b_n, V_o, y_o, D₅₀, σ_g, y_s, Fr} were imputed using distance-weighted k-nearest neighbors (KNN) imputation to preserve local structure within the feature space.
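As an illustration of Equation 2 followed by distance-weighted KNN imputation, the sketch below uses scikit-learn's `KNNImputer` with `weights="distance"`. The pier records are hypothetical, and the neighbor count of 2 is an assumption; the study does not report the value of k.

```python
import numpy as np
from sklearn.impute import KNNImputer

G_FT_S2 = 32.174  # gravitational acceleration in ft/s^2, as in Equation 2


def froude_number(v_o, y_o):
    """Fr = V_o / sqrt(g * y_o), with V_o in ft/s and y_o in ft."""
    return v_o / np.sqrt(G_FT_S2 * y_o)


# Hypothetical records [b_n (ft), V_o (ft/s), y_o (ft)]; NaN marks a gap.
X = np.array([
    [5.0, 6.0, 10.0],
    [5.0, 6.5, 11.0],
    [5.5, np.nan, 10.5],
    [6.0, 7.0, 12.0],
])

# Engineer Fr first; it is NaN wherever V_o or y_o is missing.
fr = froude_number(X[:, 1], X[:, 2])
table = np.column_stack([X, fr])

# Fill remaining gaps from the nearest complete neighbors, weighting
# closer neighbors more heavily (k = 2 is an assumed value).
imputer = KNNImputer(n_neighbors=2, weights="distance")
filled = imputer.fit_transform(table)
print(filled.round(3))
```

Observed entries pass through unchanged; only the NaN cells are replaced by a weighted average of the corresponding values in the neighboring rows.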

Two variables frequently missing from the records, V_c and T, were not statistically imputed. Instead, they were estimated using physics-integrated formulas within three defined hydraulic scenarios (baseline, worst-case, and extreme-quantile) to enable risk-aware model evaluation. For modeling, tree-based algorithms operated on raw-scale variables, whereas kernel-based algorithms applied in-pipeline standardization. All cleaned and scenario-augmented datasets were archived to ensure full transparency and reproducibility.

2.5 Machine learning models

2.5.1 Gradient boosting

Gradient Boosting formulates supervised learning as a process of functional gradient descent, building a strong predictor by sequentially adding weak learners (typically shallow decision trees) fit to the negative gradients of a specified loss function (Friedman, 2001). This framework unifies boosting methods across differentiable objectives for both regression and classification tasks. Model generalization is regulated through shrinkage, tree-depth constraints, subsampling, and early stopping, which collectively limit variance and overfitting while allowing the model to capture nonlinear relationships and higher-order interactions with minimal feature engineering (Friedman, 2002).

Rooted in the principle of iteratively focusing on difficult-to-predict samples (Freund and Schapire, 1995), Gradient Boosting remains one of the most effective algorithms for structured tabular data where accuracy and robustness are critical. In this study, Gradient Boosting was selected as a primary benchmark due to its balance between interpretability and predictive power, making it well suited for capturing the nonlinear hydraulic interactions underlying bridge-pier scour.
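The regularization controls mentioned above (shrinkage, tree-depth constraints, subsampling, and early stopping) map onto parameters of scikit-learn's `GradientBoostingRegressor`. The sketch below uses synthetic data and assumed hyperparameter values; it is not the tuned configuration of this study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic, smooth nonlinear target (illustrative only, not scour data).
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 3.0, size=(300, 2))
y = 1.5 * X[:, 0] ** 0.7 * np.tanh(X[:, 1]) + rng.normal(0, 0.02, 300)

gbr = GradientBoostingRegressor(
    learning_rate=0.05,       # shrinkage applied to each tree's contribution
    max_depth=3,              # tree-depth constraint
    subsample=0.8,            # stochastic row subsampling per boosting round
    n_estimators=500,         # upper bound on the number of trees
    n_iter_no_change=20,      # early stopping patience
    validation_fraction=0.1,  # internal hold-out used for early stopping
    random_state=0,
)
gbr.fit(X[:240], y[:240])

r2_test = gbr.score(X[240:], y[240:])
print(gbr.n_estimators_, round(r2_test, 3))
```

The fitted attribute `n_estimators_` reports how many trees were actually kept once early stopping triggered.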

2.5.2 AdaBoost

AdaBoost (adaptive boosting) constructs a strong predictive model by sequentially combining multiple weak learners, typically shallow decision trees. Schapire (1990) demonstrated that boosting weak learners can yield a strong classifier. AdaBoost trains an initial tree, then reweights the training samples, reducing weights on correctly classified points and increasing weights on misclassified ones, so subsequent trees focus on the hard cases (Cao et al., 2013). This error-driven reweighting and retraining repeats over multiple rounds, and the final predictor is a weighted sum of all weak learners, with weights reflecting each learner’s accuracy, which together improve performance iteratively (Schapire, 2013).

2.5.3 XGBoost

XGBoost is a state-of-the-art gradient-boosting framework that ensembles regression trees to deliver efficient and accurate prediction on structured tabular data (Liang et al., 2020). It builds trees sequentially, with each new tree fitted to the residuals of the current model to minimize the overall objective, i.e., an additive, gradient-descent interpretation of boosting (Jin and Agrawal, 2003). The training process employs a second-order (Newton) approximation of the loss function with explicit L1/L2 regularization on leaf weights, supports stochastic subsampling of rows and columns, and is optimized for parallel and distributed computation. Learning typically terminates after a predefined number of trees or once additional iterations provide negligible improvement (Shahani et al., 2021).

XGBoost has been widely adopted for its combination of speed, accuracy, and scalability across engineering and scientific applications (Liang et al., 2020). In this study it was used to evaluate performance improvements gained from regularization and second-order optimization, and to test the model’s capability to capture complex nonlinear interactions among hydraulic and geometric variables influencing scour depth.

2.5.4 Gaussian process regression (RBF kernel)

Gaussian Process Regression (GPR) is a Bayesian, nonparametric modeling approach that defines a prior directly over functions and produces both point predictions and calibrated uncertainty estimates.

Using the radial basis function (RBF) kernel, GPR encodes smooth, stationary relationships in the data. Its key hyperparameters (signal variance, length scale, and noise level) are typically learned by maximizing the marginal likelihood, yielding models that automatically balance data fit against complexity (Rasmussen and Williams, 2005). RBF–GPR is well-suited to moderate-sized datasets where uncertainty quantification matters, though exact training scales cubically with the number of samples; sparse and variational methods mitigate this cost while preserving accuracy (Neal, 1996). These properties make RBF–GPR a principled baseline for nonlinear regression and a robust comparator to tree-based ensembles in scientific and engineering prediction tasks. In this study, GPR provides both high-fidelity predictions and interpretable uncertainty bounds essential for assessing confidence in scour-depth estimation under variable hydraulic conditions.
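A minimal sketch of RBF–GPR with in-pipeline standardization is given below. The kernel composition (constant times RBF plus a white-noise term), the initial hyperparameter values, and the synthetic data are assumptions for illustration; the study's exact kernel settings are those summarized in Table 2.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic one-dimensional data (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 5.0, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.05, 40)

# Signal variance * RBF(length scale) + noise level; all three are
# refined by maximizing the log marginal likelihood during fit().
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gpr = make_pipeline(
    StandardScaler(),
    GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0),
)
gpr.fit(X, y)

# One query inside the data range and one far outside it.
X_new = np.array([[2.5], [20.0]])
mean, std = gpr.predict(X_new, return_std=True)
print(mean.round(3), std.round(3))
```

The predictive standard deviation grows for the query far from the training data, which is the behavior that makes GPR useful for flagging extrapolation.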

2.5.5 Kernel ridge regression (polynomial kernel)

Kernel Ridge Regression (KRR) combines ridge regression’s L2 regularization with the kernel trick, providing a convex, closed-form solution in the dual space. This formulation enables the learning of nonlinear relationships without explicitly generating polynomial features (Gammermann, 2000). KRR captures interaction terms up to a specified degree using a polynomial (Poly) kernel, producing a global, smoothly varying fit with power-law extrapolation tendencies. Model control is achieved through the ridge penalty and kernel hyperparameters (degree, scale, offset), which are typically optimized via cross-validation. KRR is particularly effective for small- to mid-size datasets because it is stable, non-iterative, and avoids local minima. However, its dense $n \times n$ kernel matrix causes training and memory requirements to scale poorly with sample size (Hastie et al., 2009). In practice, Poly-KRR serves as a strong baseline when the underlying relationships are approximately polynomial or when explicit interpretability of interaction order is useful.
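To illustrate how a polynomial kernel captures interaction terms without explicit feature expansion, the sketch below fits scikit-learn's `KernelRidge` to a synthetic degree-2 target. The hyperparameter values (degree, `gamma`, `coef0`, `alpha`) are assumed for the example and are not the study's tuned values.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic target containing an interaction and a squared term.
rng = np.random.default_rng(2)
X = rng.uniform(0.5, 3.0, size=(60, 2))
y = 0.4 * X[:, 0] * X[:, 1] + 0.1 * X[:, 0] ** 2

# Kernel k(x, x') = (gamma * <x, x'> + coef0) ** degree; alpha is the ridge penalty.
krr = make_pipeline(
    StandardScaler(),
    KernelRidge(kernel="polynomial", degree=2, gamma=1.0, coef0=1.0, alpha=1e-3),
)
krr.fit(X, y)

max_abs_err = float(np.max(np.abs(krr.predict(X) - y)))
print(max_abs_err)
```

Because the target lies in the span of the degree-2 kernel, the fit is nearly exact; the small remaining error comes from the ridge penalty.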

2.6 Model development

After preprocessing, the standardized laboratory dataset was randomly partitioned into 80% training and 20% testing subsets, following recommendations by Bichri et al. (2024). A fixed seed ensured full reproducibility. To better preserve the empirical distribution of the target variable (y_s), stratified random splitting was performed using binned scour depths, ensuring that both training and test sets captured the complete range of scour magnitudes and covariate combinations. This strategy mitigates selection bias and supports more reliable model generalization (Kapoor and Narayanan, 2023).
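Stratifying a regression split on a binned target can be done by passing bin labels to `train_test_split`. In this sketch the use of five quantile bins, the seed value, and the synthetic right-skewed target are assumptions; the study does not state its binning rule.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic right-skewed target standing in for scour depth (illustrative).
rng = np.random.default_rng(42)
y_s = rng.lognormal(mean=-1.0, sigma=0.8, size=500)
X = rng.uniform(size=(500, 3))

# Quantile bins give equally populated strata even for a skewed target.
bins = pd.qcut(y_s, q=5, labels=False)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_s, test_size=0.20, stratify=bins, random_state=42
)
print(len(y_tr), len(y_te), np.median(y_tr).round(3), np.median(y_te).round(3))
```

Each quantile bin contributes the same share of samples to both subsets, so the test set covers small and large scour depths alike.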

Model training and any preprocessing operations that could introduce data leakage (e.g., scaling for kernel-based methods) were encapsulated within scikit-learn pipelines and fitted exclusively on the training data. The held-out test data were never used during training, hyperparameter tuning, or preprocessing (Wu et al., 2025). This setup provides an unbiased assessment of out-of-sample performance while ensuring that both splits represent diverse hydraulic, geometric, and sediment conditions (Hameed et al., 2025).

Model development and analysis were performed in Google Colab (Python 3.12.12, Linux-6.6.105+), using NumPy 2.0.2, pandas 2.2.2, scikit-learn 1.6.1, SHAP 0.50.0, and Matplotlib 3.10.0. A total of five machine-learning models were developed and benchmarked, with their hyperparameters optimized using Random Search (RandomizedSearchCV, 50 random trials). The models were Gradient Boosting, AdaBoost (tree-based), XGBoost, Gaussian Process Regression with an RBF kernel, and Kernel Ridge Regression with a polynomial kernel. Each model was implemented within a leak-safe pipeline: tree ensembles (Gradient Boosting, AdaBoost with shallow trees, and XGBoost) used raw feature scales, whereas kernel-based methods (GPR with RBF and KRR with Poly) standardized their inputs via StandardScaler. The five models, their algorithmic families, and principal hyperparameters are summarized in Table 2.
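A random search over a pipeline can be sketched as follows. The search ranges and the synthetic data are assumptions, and only 8 trials are drawn here to keep the example fast, whereas the study reports 50.

```python
import numpy as np
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic training data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 3.0, size=(200, 3))
y = X[:, 0] ** 0.8 + 0.3 * np.log1p(X[:, 1]) + rng.normal(0, 0.02, 200)

# Tree ensembles run on raw scales; a kernel model would add a
# StandardScaler step ahead of "model" inside the same pipeline.
pipe = Pipeline([("model", GradientBoostingRegressor(random_state=0))])

# Assumed search ranges, addressed with the "<step>__<param>" convention.
param_distributions = {
    "model__n_estimators": randint(50, 300),
    "model__learning_rate": loguniform(0.01, 0.3),
    "model__max_depth": randint(2, 5),
    "model__subsample": [0.7, 0.85, 1.0],
}

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=8, cv=5, scoring="r2", random_state=0
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because every step lives inside the pipeline, each cross-validation split refits all of them on its own training portion, which is what keeps the search leak-safe.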

Table 2. Hyperparameters of the models.

2.7 Model performance evaluation

Model performance was assessed using three commonly adopted statistical indicators: the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE), each evaluated for both training and testing subsets to verify generalization and detect potential overfitting. These performance metrics are widely used in the literature for validating predictive models (Khajavi et al., 2025; Madurwar et al., 2025; Ramujee and Praseeda, 2025).

The coefficient of determination (R²) quantifies the proportion of variance in the observed data explained by the model, ranging from $-\infty$ to 1, where a value of 1 indicates a perfect fit and 0 corresponds to a mean-only baseline (Bentegri et al., 2025). RMSE and MAE are non-negative error measures in the same units as the target variable, with 0 representing perfect prediction (Mohammed et al., 2025). RMSE is always greater than or equal to MAE, as it penalizes larger deviations more heavily.

Together, these three indicators provide a balanced evaluation framework: R² reflects the model’s explanatory power, RMSE captures overall prediction accuracy with sensitivity to large errors, and MAE represents typical error magnitude and robustness to outliers. The mathematical formulations for these evaluation metrics are presented below in Equations 3–5.

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}} (3)

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}} (4)

MAE = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - {\hat{y}}_{i}| (5)

where $y_{i}$ is the observed value for sample $i$, ${\hat{y}}_{i}$ is the predicted value for sample $i$, $\bar{y}$ is the mean of the observed values, and $n$ is the number of samples.
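Equations 3–5 translate directly into code. The following sketch uses a small made-up example chosen so that the arithmetic can be checked by hand; the function name is illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, MAE) following Equations 3-5."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / n
    return r2, rmse, mae


# Made-up values: three exact predictions and one error of 2.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 6.0]
r2, rmse, mae = regression_metrics(y_obs, y_hat)
print(r2, rmse, mae)
```

Here the residual sum of squares is 4 and the total sum of squares is 5, giving R² = 0.2, RMSE = 1.0, and MAE = 0.5. The single large error raises RMSE above MAE, as expected.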

2.8 K-fold cross validation

K-fold cross-validation partitions the available training data into K equally sized, non-overlapping folds and performs K repeated training and validation cycles. In each cycle, one fold serves as the validation set while the remaining (K-1) folds constitute the temporary training subset (Teodorescu and Obreja Braşoveanu, 2025; Wilimitis and Walsh, 2023). This rotation ensures that every observation is used for validation exactly once and for training (K-1) times, producing a distribution of performance scores that reflects model sensitivity to data variability and the bias–variance trade-off (Kapoor and Narayanan, 2023).

To estimate out-of-sample performance and minimize split-specific bias, 5-fold shuffled cross-validation was applied on the training set only (representing 80% of the total data), consistent with the approach used by Al-Shamasneh et al. (2025). The training data were stratified by scour depth ranges so that each fold maintained a similar distribution of the response variable, ensuring balanced representation across folds. A fixed random seed was employed to ensure repeatability.

For each fold, models were trained on 80% of the training data and validated on the remaining 20%. All preprocessing steps, including scaling for kernel-based methods, were encapsulated within the scikit-learn pipelines to prevent data leakage. Performance was recorded as R² and RMSE for each fold, allowing for a consistent comparison of model stability and predictive accuracy across the five folds.
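The fold-wise procedure can be sketched with `StratifiedKFold` applied to binned targets. The quantile binning, the default Gradient Boosting settings, and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training subset (illustrative only).
rng = np.random.default_rng(7)
X = rng.uniform(0.1, 3.0, size=(250, 2))
y = X[:, 0] ** 0.8 * (1 + 0.2 * X[:, 1]) + rng.normal(0, 0.02, 250)

# Stratify on binned targets so each fold sees the full range of depths.
bins = pd.qcut(y, q=5, labels=False)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

scores = []
for train_idx, val_idx in cv.split(X, bins):
    model = GradientBoostingRegressor(random_state=7)
    model.fit(X[train_idx], y[train_idx])  # 4 of 5 folds (80%)
    pred = model.predict(X[val_idx])       # held-out fold (20%)
    rmse = float(np.sqrt(mean_squared_error(y[val_idx], pred)))
    scores.append((float(r2_score(y[val_idx], pred)), rmse))

for k, (r2, rmse) in enumerate(scores, start=1):
    print(f"fold {k}: R2={r2:.3f}, RMSE={rmse:.3f}")
```

The spread of the five scores, rather than their mean alone, indicates how sensitive a model is to the particular split.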

2.9 Sensitivity analysis

Sensitivity analysis examines how variations in model inputs influence the predicted output. In this study, it was applied to quantify how changes in key hydraulic, geometric, and sediment variables affect the predicted equilibrium scour depth (y_s). The analysis employed SHAP (SHapley Additive exPlanations), a principled framework for both global and local model interpretability based on cooperative game theory.

SHAP treats each input feature as a “player” in a game whose contribution to the prediction is computed as the average of its marginal effects across all possible feature combinations. The resulting Shapley values satisfy the desirable properties of local accuracy, missingness, and consistency, making SHAP a unique and theoretically grounded additive explanation model (Aas et al., 2021). This framework unifies numerous prior feature-attribution methods and provides both local explanations (case-specific contributions that increase or decrease predictions) and global explanations through mean absolute SHAP values aggregated across samples.

TreeSHAP extends this formulation to decision trees and ensembles, enabling polynomial-time computation of exact Shapley values while also accounting for feature interaction effects. It provides practical visualization tools such as mean SHAP bar charts, beeswarm summaries, and dependence plots that capture local-to-global behavior without compromising fidelity to model predictions (Lundberg et al., 2020; Lundberg and Lee, 2018).

Overall, SHAP provides a theoretically sound and practically effective bridge between per-instance reasoning and global sensitivity of complex models. When applied with consideration for inter-variable dependence and supported by complementary diagnostics, it enables transparent, reproducible insights into how hydraulic, geometric, and sediment parameters collectively influence predicted scour (Alasmari et al., 2025).

2.10 Extreme-condition scour scenarios for flood resilience

Bridge scour arises from the interaction of hydraulic intensity, event duration, sediment mobility, and pier geometry. Field datasets rarely span the full range of conditions that govern safety, and key drivers such as the critical velocity for bed mobility (V_c) and the effective duration of mobility (T) are often missing (Belmokhtar et al., 2025; Shanmugam et al., 2025). A scenario framework allows us to impute defensible values where observations are incomplete by implementing physics-informed integration into the models. The framework further propagates credible extremes through a physics-guided predictor of scour depth (y_s) and summarizes risk per pier by taking the envelope across scenarios. Scenarios convert patchy records into decision-ready evidence about plausible and upper-bound scour. To operationalize the analysis, this study considers four application scenarios, Q99, WC-VcT, WC-Flow + Base, and WC-Flow + VcT.

The Knik River bridge in southcentral Alaska is examined as a case study to estimate credible upper bounds on pier scour and to cross-validate structural sufficiency under forecast and design-flood conditions. The proposed methodology estimates extreme pier scour under rare and persistent flood conditions by developing four physically consistent, pier-specific scenarios: Q99, WC-Flow, WC-VcT, and WC-Flow + VcT. Scour depth is predicted after recalculating the mobility (sediment motion) threshold velocity and event duration to reflect changes in hydraulic and sediment states. With this setup, $V_{c}$ and $T$ are folded into the scenario envelope through the physics-informed model integration and are validated in cases with missing $V_{c}$ or $T$ values. The framework can therefore both predict pier scour depth and impute the missing parameters in a physics-consistent way. The process begins with standardized inputs for each pier, including $b_{n}$, $V_{o}$, $V_{c}$, $y_{o}$, $D_{50}$, $σ_{g}$, $T$, and $y_{s}$. All data are converted to numeric values, with placeholders treated as missing and filled using physics-based formulas before the scenario modifications are applied.

Values of $V_{c}$ are estimated using a Shields-type formula that considers sediment gradation (Soulsby, 1997; van Rijn, 1984). The formula for calculating the critical velocity is given in Equation 6.

V_{c} = C_{uc} \sqrt{g \cdot D_{50, ft}} \cdot σ_{g}^{α} (6)

where $C_{uc} = 2.6$, $α = 0.15$, $g = 32.174\ \mathrm{ft/s^{2}}$, and $D_{50, ft} = D_{50, mm} \times 10^{- 3} \times 3.28084$.
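Equation 6 is straightforward to evaluate. The following minimal sketch uses the constants stated above; the function name `critical_velocity` and the example inputs (a 1 mm median grain size with $σ_{g} = 1.5$) are illustrative assumptions, not values taken from the case study.

```python
import math

G_FT_S2 = 32.174          # gravitational acceleration, ft/s^2
C_UC = 2.6                # coefficient C_uc in Equation 6
ALPHA = 0.15              # gradation exponent in Equation 6
MM_TO_FT = 1e-3 * 3.28084 # millimetres to feet

def critical_velocity(d50_mm, sigma_g):
    """Critical velocity V_c (ft/s) from Equation 6."""
    d50_ft = d50_mm * MM_TO_FT
    return C_UC * math.sqrt(G_FT_S2 * d50_ft) * sigma_g ** ALPHA

# Hypothetical inputs chosen only to show the order of magnitude.
vc_example = critical_velocity(d50_mm=1.0, sigma_g=1.5)
print(f"V_c = {vc_example:.3f} ft/s")
```

For these inputs the result is roughly 0.9 ft/s. Because $V_{c}$ scales with the square root of $D_{50}$, quadrupling the grain size doubles the imputed threshold, while gradation enters only weakly through the 0.15 exponent.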

Values of the duration of flow ($T$) are estimated from an advection-based time scale modified by the mobility ratio, using $T = 60 \left(\frac{y_{o}}{V_{o}}\right) \left(\frac{V_{o}}{V_{c}}\right)^{β}$ with $β = 1.25$, ensuring that the duration stays between 20 min and 3 weeks (Melville and Chiew, 1999). For the other inputs ($b_{n}, V_{o}, y_{o}, D_{50}, σ_{g}$), gaps are filled using robust median values. To avoid uniformity across the dataset, each pier’s rank within the empirical distributions of $V_{o}, y_{o}, D_{50}$, and $σ_{g}$ is calculated so that high-energy piers receive stronger scenario variation. After any variable changes, both $V_{c}$ and $T$ are recalculated to ensure consistency with the modified state. The updated threshold velocity is expressed in Equation 7 (Julien, 2010; Soulsby, 1997; van Rijn, 1984).

V_{c} \leftarrow V_{c, nominal} {(\frac{y_{o}}{y_{o 0}})}^{- 0.12} (7)

where $V_{c, nominal} = C_{uc} \sqrt{g \cdot D_{50, ft}} σ_{g}^{α}$ .

The event duration is adjusted accordingly using $T \leftarrow 60 \left(\frac{y_{o}}{V_{o}}\right) \left(\frac{V_{o}}{V_{c}}\right)^{1.35} \left(\frac{y_{o}}{y_{o 0}}\right)^{0.55}$. These relationships make the recalculated values dynamically consistent with the altered hydraulic and sediment conditions. The Q99 scenario increases each pier’s $V_{o}$, $y_{o}$, $D_{50}$, and $σ_{g}$ toward their 99th-percentile values, scaled by pier rank so that higher-energy piers experience stronger effects. This represents rare but plausible extremes while maintaining pier-level diversity. In the WC-Flow scenario, both $V_{o}$ and $y_{o}$ are multiplied by rank-dependent factors with small random noise, as given in Equations 8, 9 (Brandimarte et al., 2012; Das et al., 2021).

V_{o} \leftarrow V_{o} (1.45 + 0.60 r_{V_{o}}) (1 + 0.03 ε) and (8)

y_{o} \leftarrow y_{o} (1.35 + 0.45 r_{y_{o}}) (1 + 0.02 ε) (9)

Sediment characteristics are forced toward their upper distribution tails, and $V_{c}$ and $T$ are then recalculated to capture strong, short-duration hydrodynamic forcing. The WC-VcT scenario focuses on situations in which sediment motion is sustained through long-lasting floods or multiple storm events. It lowers $V_{c}$ and increases $T$ based on pier rank, as given in Equations 10, 11.

V_{c} \leftarrow V_{c} (0.65 - 0.15 r_{V_{o}}) and (10)

T \leftarrow T (3.0 + 3.0 r_{V_{o}}) (11)
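The multipliers in Equations 8–11 can be written as two small functions. The sketch below is illustrative and rests on assumptions that the text does not spell out: the ranks $r_{V_{o}}$ and $r_{y_{o}}$ are taken to be percentile ranks in [0, 1], the noise term $ε$ is taken to be a standard normal draw, and independent draws are allowed for Equations 8 and 9. The function names and the example pier values are hypothetical.

```python
import random

def wc_flow(v_o, y_o, r_vo, r_yo, eps_v=0.0, eps_y=0.0):
    """WC-Flow multipliers for approach velocity and depth (Equations 8 and 9)."""
    v_new = v_o * (1.45 + 0.60 * r_vo) * (1 + 0.03 * eps_v)
    y_new = y_o * (1.35 + 0.45 * r_yo) * (1 + 0.02 * eps_y)
    return v_new, y_new

def wc_vct(v_c, t_min, r_vo):
    """WC-VcT adjustments: lower V_c and longer T (Equations 10 and 11)."""
    return v_c * (0.65 - 0.15 * r_vo), t_min * (3.0 + 3.0 * r_vo)

# Hypothetical pier state: V_o = 4 ft/s, y_o = 10 ft, V_c = 2 ft/s, T = 100 min,
# sitting at the median of both rank distributions (r = 0.5).
print(wc_flow(4.0, 10.0, r_vo=0.5, r_yo=0.5))   # noise switched off
print(wc_vct(2.0, 100.0, r_vo=0.5))

# With noise: epsilon drawn from N(0, 1) (assumed distribution), seeded for repeatability.
rng = random.Random(0)
print(wc_flow(4.0, 10.0, 0.5, 0.5, eps_v=rng.gauss(0, 1), eps_y=rng.gauss(0, 1)))
```

With the noise switched off, a median-ranked pier sees its velocity multiplied by 1.75 and its depth by 1.575 under WC-Flow, while WC-VcT scales the critical velocity by 0.575 and the duration by 4.5. A pier at the top of the rank distribution (r = 1) receives factors of 2.05, 1.80, 0.50, and 6.0, respectively.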

Smaller adjustments are applied to $V_{o}$ and $y_{o}$ in this scenario to maintain variability across piers. The combined WC-Flow + VcT scenario merges both sets of conditions, applying the strong hydrodynamic changes from WC-Flow and the persistence adjustments from WC-VcT. After these combined multipliers are applied to $V_{o}$, $y_{o}$, $V_{c}$, and $T$, both $V_{c}$ and $T$ are recalculated to maintain physical consistency, producing the most conservative condition for resilience checks. Scour depth for each scenario is estimated using an ensemble of Gradient Boosting, AdaBoost (Tree), XGBoost, Gaussian Process (RBF), and Kernel Ridge (Polynomial) models. The input vector is $x = [b_{n}, V_{o}, V_{c}, y_{o}, D_{50}, σ_{g}, T]$. The physics-informed formulation used to predict pier scour depth is given by Equation 12 (Bruce and Melville, 2000).

{\hat{y}}_{s} = k b_{n} {Fr}^{m} y_{o}^{n} (12)

with $Fr = \frac{V_{o}}{\sqrt{g y_{o}}}$, $m = 0.38$, and $n = 0.33$.
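Equation 12 can be evaluated directly once the coefficient $k$ is known. The value of $k$ is not stated in this section, so the sketch below takes it as a required argument and uses $k = 1$ purely as a placeholder; the function names and the example inputs are likewise illustrative assumptions.

```python
import math

G_FT_S2 = 32.174  # gravitational acceleration, ft/s^2

def froude_number(v_o, y_o):
    """Approach-flow Froude number Fr = V_o / sqrt(g * y_o)."""
    return v_o / math.sqrt(G_FT_S2 * y_o)

def physics_scour(b_n, v_o, y_o, k, m=0.38, n=0.33):
    """Scour-depth estimate from Equation 12; k must be supplied by the user."""
    return k * b_n * froude_number(v_o, y_o) ** m * y_o ** n

# Hypothetical pier: 3 ft wide, 6 ft/s approach velocity, 10 ft approach depth.
fr = froude_number(6.0, 10.0)
ys_hat = physics_scour(b_n=3.0, v_o=6.0, y_o=10.0, k=1.0)  # k = 1 is a placeholder
print(f"Fr = {fr:.3f}, y_s estimate = {ys_hat:.3f} (units follow the choice of k)")
```

The form makes the structure of the relation easy to inspect: the estimate is linear in pier width, while velocity and depth enter through exponents well below one, which matches the dominance of $b_{n}$ reported in the SHAP analysis.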

For each pier, the maximum estimated scour depth across all scenarios, $y_{s}^{\max} = \max_{scenario} {\hat{y}}_{s}$ , is selected and paired with the scenario that produces it. In interpretation, the Q99 scenario provides hazard-based checks for extreme statistical conditions, while WC-Flow represents short, intense flood events. The WC-VcT scenario highlights the effects of prolonged sediment mobility under persistent floods, and WC-Flow + VcT defines the upper bound useful for countermeasure design and durability assessments. Since $V_{c}$ and $T$ are recomputed for each scenario using Equations 13, 14, the model reflects condition-specific hydraulics.

V_{c} \propto \sqrt{g D_{50}} σ_{g}^{α} {(y_{o} / y_{o 0})}^{- 0.12} and (13)

T \propto (\frac{y_{o}}{V_{o}}) {(\frac{V_{o}}{V_{c}})}^{1.35} {(\frac{y_{o}}{y_{o 0}})}^{0.55} (14)
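The recalculation step can be sketched as a single function that takes a modified hydraulic state and returns the updated $V_{c}$ and $T$. The sketch uses the constants of Equations 6 and 7 and the duration update with the leading factor of 60 given earlier in this section. It assumes that $y_{o 0}$ denotes the pier's approach depth before the scenario modification, which the text does not define explicitly; the function names and example values are hypothetical.

```python
import math

G_FT_S2 = 32.174
C_UC, ALPHA = 2.6, 0.15
MM_TO_FT = 1e-3 * 3.28084

def nominal_vc(d50_mm, sigma_g):
    """Nominal critical velocity (ft/s), as in Equation 6."""
    return C_UC * math.sqrt(G_FT_S2 * d50_mm * MM_TO_FT) * sigma_g ** ALPHA

def recompute_vc_t(v_o, y_o, y_o0, d50_mm, sigma_g):
    """Update V_c (Equation 7) and T for a modified state.

    y_o0 is assumed to be the reference (pre-scenario) approach depth.
    """
    depth_ratio = y_o / y_o0
    v_c = nominal_vc(d50_mm, sigma_g) * depth_ratio ** -0.12
    t_min = 60.0 * (y_o / v_o) * (v_o / v_c) ** 1.35 * depth_ratio ** 0.55
    return v_c, t_min

# Hypothetical pier: the scenario doubles the approach depth from 10 ft to 20 ft.
base = recompute_vc_t(v_o=5.0, y_o=10.0, y_o0=10.0, d50_mm=1.0, sigma_g=1.5)
deep = recompute_vc_t(v_o=5.0, y_o=20.0, y_o0=10.0, d50_mm=1.0, sigma_g=1.5)
print("reference state: V_c = %.3f ft/s, T = %.0f min" % base)
print("deeper flow:     V_c = %.3f ft/s, T = %.0f min" % deep)
```

When the depth is unchanged, the function returns the nominal critical velocity; deeper flow lowers the threshold through the exponent of -0.12 and lengthens the duration, which is the direction of change described for the intensified scenarios.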

By applying this methodology, all scenarios remain physically realistic. The resulting scour envelopes reveal whether peak flow intensity or prolonged duration dominates scour risk at each pier of the Knik River bridge. The resulting pier-specific values are summarized in Table 3(A), which lists $V_{c}$ (ft/s) by pier for each scenario, and Table 3(B), which lists $T$ (min) by pier for each scenario. Together, these tables show how mobility thresholds drop and durations lengthen as scenarios intensify. These patterns directly amplify predicted scour and help flag the piers most vulnerable under extreme, long-lasting floods.

Table 3. Comparison of V_c (ft/s) and T (min) at each pier under four flow and scour scenarios.

3 Results

3.1 Model performance

The developed models were evaluated based on the performance indicators, including R², RMSE, and MAE, which were also implemented by other researchers for ML model performance evaluation (Khoshvaght et al., 2025; Koçak, 2025; Shobayo et al., 2025). A higher R² (closer to 1) means the model explains more of the variation and fits the data better (Mamudu et al., 2025). Lower RMSE and MAE reflect higher accuracy, and values of 0 correspond to a perfectly fitting model (Chai and Draxler, 2014).
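For reference, the three indicators can be computed from paired observed and predicted values using their standard definitions, as in the short sketch below. The function name and the four sample values are illustrative only and are not taken from the dataset.

```python
import math

def regression_metrics(y_obs, y_pred):
    """Return R2, RMSE, and MAE for paired observed and predicted values."""
    n = len(y_obs)
    resid = [o - p for o, p in zip(y_obs, y_pred)]
    mean_obs = sum(y_obs) / n
    ss_res = sum(r * r for r in resid)                # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)  # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(r) for r in resid) / n,
    }

# Illustrative values in ft (not from the laboratory database).
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print({name: round(value, 3) for name, value in metrics.items()})
```

For these values the residual sum of squares is 0.10 against a total of 5.0, giving R² = 0.98, RMSE ≈ 0.158 ft, and MAE = 0.15 ft. RMSE exceeds MAE whenever the errors are unequal in size, because squaring weights the larger residuals more heavily.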

Across models, performance is uniformly strong, with tree ensembles leading, as illustrated in Figures 4A–E. Gradient Boosting delivers the best generalization, with the highest fit-line R² (0.98), and its parity plot shows predictions clustered tightly around the 45° line (Figure 4A). The best-fit line nearly overlaps the parity line over most of the range, indicating minimal systematic bias, with only a few high-value points pulling slightly above the diagonal. AdaBoost (Tree) closely mirrors this behavior, with a fit-line R² of 0.97, a similarly tidy spread, and a best-fit line that tracks the diagonal with small deviations at the upper end (Figure 4B). The XGBoost parity plot is largely well aligned; its best-fit line remains close to the 1:1 line but is influenced by a handful of larger residuals at high targets, which explains the higher test RMSE. Kernel methods provide stable baselines with smooth behavior. Gaussian Process (RBF) and Kernel Ridge (Poly), with fit-line R² values of 0.97 and 0.95, respectively, maintain compact clouds near the origin and a gradual, orderly dispersion as values grow. In both plots, the best-fit line sits just below the parity line at higher observed values, reflecting mild underprediction in the extreme range while remaining well calibrated through the bulk of the data.

Five scatter plots compare observed versus predicted values for different models, showing training, testing, ideal, and best fit lines. A: Gradient Boosting; B: AdaBoost (Tree); C: XGBoost; D: Gaussian Process (RBF); E: Kernel Ridge (Poly). Each plot includes a best fit equation and R² value, indicating the fit quality. Plots depict varying prediction accuracies and alignments.

Figure 4. Parity plots comparing observed versus predicted pier-scour depth y_s (ft) for five models: (A) Gradient Boosting, (B) AdaBoost (tree-based), (C) XGBoost, (D) Gaussian Process Regression (RBF), and (E) Kernel Ridge Regression (poly). Blue markers denote training samples and red markers testing samples; the dashed line indicates the ideal 1:1 relationship and the solid line the best-fit regression, with fitted equations and (R²) values shown in each panel.

Moreover, a model that neither overfits nor underfits should show only a small gap between training and testing metrics across all performance indicators (Aliferis and Simon, 2024; Emmert-Streib and Dehmer, 2019). Table 4 summarizes the training and testing performance of the developed models. Gradient Boosting and AdaBoost (Tree) exhibit small training–testing gaps and high testing R² (0.96), with modest increases in RMSE and MAE, indicating good generalization. XGBoost shows slight signs of overfitting: its training metrics are near-perfect (R² = 0.999), while its testing performance deteriorates more noticeably (R² = 0.939, with higher RMSE and MAE). The resulting difference in R² between training and testing is 0.06, indicating that the model still performs very well overall, with only a small number of test samples acting as outliers that fall noticeably below the 1:1 parity line. The kernel methods, Gaussian Process (RBF) and Kernel Ridge (Poly), yield the largest test errors and lower test R² (0.927–0.929), reflecting slightly weaker generalization. Overall, the parity plots and the summary table agree that the predictions are well centered with a tight residual structure for most targets, and the best-fit lines lie close to the ideal diagonal across models, most notably for Gradient Boosting and AdaBoost, indicating that these models capture the data well across the observed range.

Table 4. Performance of the developed models in training and testing.

3.2 Residuals comparison of the developed models in training and testing

Residual plots (observed minus predicted) are used to screen for structural non-linearity and outliers, and to compare behavior on training versus testing data as a check on generalization (Kumar et al., 2025; Sharma et al., 2025). Checking residuals on both the training and testing sets helps separate true signal from overfitting; models that generalize well show small, random-looking residuals in both. Figures 5A–E show that the residual histograms (observed minus predicted) for all five models are tightly centered near zero, with substantial overlap between the training (blue) and testing (red) distributions, indicating that none of the models is markedly overfitting. Gradient Boosting shows a compact, symmetric spread around zero, suggesting low bias and stable variance. AdaBoost (Tree) and Gaussian Process (RBF) exhibit a mild right-shift of the test distribution (slightly positive bias), but the displacement is small relative to their overall dispersion. XGBoost produces a narrow core with light right tails, implying good central accuracy with a few underpredicted cases. Kernel Ridge (Poly) has slightly broader tails than the boosting models, though its training–testing overlap remains high. Overall, the distributions are unimodal and roughly symmetric, residual magnitudes are modest, and the similarity of the training and testing shapes supports good generalization across the range of scour depths represented in the data.

Five histograms titled Residuals for Gradient Boosting, AdaBoost (Tree), XGBoost, Gaussian Process (RBF), and Kernel Ridge (Poly), labeled A to E. Histograms compare residuals for train and test datasets, with counts on the y-axis and residuals (observed minus predicted) on the x-axis. Each shows predominantly centered residuals around zero, indicating model accuracy. Train and test distributions are shown in blue and red, respectively.

Figure 5. Residual histograms for pier-scour depth predictions y_s (ft) over the laboratory dataset for five models: (A) Gradient Boosting, (B) AdaBoost (tree-based), (C) XGBoost, (D) Gaussian Process Regression (RBF), and (E) Kernel Ridge Regression (poly). Residuals are defined as observed minus predicted $y_{s}$ ; blue bars show training samples and red bars testing samples, allowing comparison of error spread and potential overfitting.

For the Gradient Boosting model, residuals $r = y_{obs} - y_{pred}$ were used to derive empirical prediction intervals. The 5th and 95th percentiles of the residuals are $- 0.0449$ ft and $0.0401$ ft on the training set and $- 0.0949$ ft and $0.1723$ ft on the testing set, respectively. These values imply an approximate 90% prediction interval of ${\hat{y}}_{s} + [- 0.095, 0.172]$ ft on unseen data, corresponding to a typical uncertainty band of about $\pm 0.13$ ft around each predicted scour depth. Together with the close agreement between train- and test-based bands, this narrow interval width indicates that the Gradient Boosting model achieves accurate and well-calibrated scour depth predictions with only modest uncertainty on unseen cases.
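The residual-percentile construction described above can be reproduced with a few lines of code. The sketch below is a minimal illustration: the helper names are hypothetical, the percentile routine uses linear interpolation between order statistics (an assumption about the method), and the 3.0 ft prediction is an arbitrary example. Only the band endpoints of -0.0949 ft and 0.1723 ft are taken from the text.

```python
def percentile(values, q):
    """q-th percentile (0-100) by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

def residual_band(residuals, coverage=0.90):
    """Lower and upper residual percentiles enclosing the requested coverage."""
    tail = (1.0 - coverage) / 2.0 * 100.0
    return percentile(residuals, tail), percentile(residuals, 100.0 - tail)

def prediction_interval(y_hat, band):
    """Shift a point prediction by the empirical residual band."""
    return y_hat + band[0], y_hat + band[1]

# Band reported for the Gradient Boosting testing residuals (ft).
reported_test_band = (-0.0949, 0.1723)
half_width = (reported_test_band[1] - reported_test_band[0]) / 2.0
interval = prediction_interval(3.0, reported_test_band)  # 3.0 ft is a hypothetical prediction
print(f"half-width = {half_width:.3f} ft, interval = ({interval[0]:.3f}, {interval[1]:.3f}) ft")
```

The half-width of the reported testing band is about 0.13 ft, which is the typical uncertainty quoted above. The band is asymmetric, so the interval extends further above the prediction than below it, reflecting the slight tendency toward underprediction on the testing set.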

3.3 Observed vs. predicted strength of models across all the data points

The actual versus predicted plots in Figures 6A–E show how well each model tracks measured scour depths across the whole sample index, a diagnostic also used in other studies (Nandi and Das, 2025b; Showkat et al., 2025). Overall, alignment is strong for Gradient Boosting and AdaBoost, whose test traces stay close to the observed values with only minor deviations in the upper tail, indicating good generalization from low to high values. XGBoost follows the pattern well but shows the overfitting tendency noted earlier: training points are reproduced almost exactly, whereas a few large y_s test spikes are slightly under-predicted. The kernel approaches, Gaussian Process (RBF) and Kernel Ridge (Poly), reproduce the central range stably while smoothing the highest peaks more than the boosting models, which is consistent with their slightly higher test errors. Across all models, errors are concentrated at the largest y_s events, as is expected in hydraulics, where extremes are infrequent and difficult to learn, whereas most mid-range depths are well captured (McInerney et al., 2020).

Five graphs compare actual versus predicted values for different models. (A) Gradient Boosting shows high accuracy with R-squared of 0.994 for train and 0.959 for test. (B) AdaBoost also performs well, with R-squared of 0.974 for train and 0.958 for test. (C) XGBoost has R-squared of 1.000 for train and 0.939 for test, indicating strong performance. (D) Gaussian Process achieves R-squared of 0.987 for train and 0.930 for test. (E) Kernel Ridge displays lower accuracy, with R-squared of 0.749 for train and 0.723 for test. Each graph plots sample index against value, differentiating actual and predicted data.

Figure 6. Actual versus predicted pier-scour depth y_s (ft) for the laboratory dataset under five models: (A) Gradient Boosting, (B) AdaBoost (tree-based), (C) XGBoost, (D) Gaussian Process Regression (RBF), and (E) Kernel Ridge Regression (poly). In each panel, green lines show observed and predicted values for the training set and purple lines for the testing set, with inset boxes reporting the corresponding train and test (R²).

3.4 Cross-validation of performance by K-fold validation

The current study uses K-fold cross-validation to estimate out-of-sample performance by splitting the data into $k$ folds, training on $k - 1$ folds, and evaluating on the held-out fold, then repeating so that each sample is tested once and averaging the scores. K-fold cross-validation has also been employed by various researchers to check their ML models for overfitting or underfitting, indicating its usefulness in model performance evaluation (Habal and Benbouras, 2025; Jamal and Ahmed, 2025; Zhao et al., 2025). As shown in Table 5, ensemble-based models perform better and more consistently under cross-validation. Gradient Boosting and XGBoost achieve higher R² while maintaining low, tightly clustered RMSE, with XGBoost having the best single-fold result (R² = 0.95) and Gradient Boosting demonstrating somewhat greater stability across the five folds (R² = 0.83–0.94). AdaBoost is competitive but more variable. The kernel models are sensitive to how the data are split: Gaussian Process varies considerably between folds and shows the clearest outliers, including one fold with low R² and high RMSE, while Kernel Ridge has some strong folds but is less consistent overall and shows a few moderate outliers. In contrast, the tree ensembles show compact error distributions; XGBoost and AdaBoost each exhibit an isolated high-error fold, whereas Gradient Boosting maintains the tightest spread with only minor deviations. Overall, Table 5 confirms that ensemble tree approaches provide the most robust generalization on this dataset, while the Gaussian Process and Kernel Ridge models are comparatively volatile.

Table 5. Performance of the developed models across each fold.

3.5 Global drivers of pier-scour depth

To identify the global drivers of pier-scour depth, SHAP (Tree SHAP) was applied to the trained models to decompose predictions into additive feature contributions and aggregate them across the dataset. This yields a ranked importance profile, supported by beeswarm and dependence plots that highlight which hydraulic, geometric, and sediment variables most consistently increase or decrease predicted scour. SHAP is now widely used to quantify global feature importance and has become a leading tool for model interpretation, with recent studies showing it provides consistent, dataset-wide insights into how predictors drive outputs (Cappelli et al., 2023; Cappelli and Grimaldi, 2023; Mushtaq et al., 2024).

Figures 7A, B show the SHAP feature-importance and SHAP beeswarm plots, which jointly demonstrate that the pier width normal to flow, b_n (ft), is by far the dominant predictor of scour depth y_s (ft), accounting for 70.6% of the model’s overall importance. Flow velocity V_o (ft/s), with 9.8%, and approach depth y_o (ft), with 7.5%, offer the next highest contributions, followed by event duration T (min) with 4.3% and sediment gradation (σ_g) with 4.0%. Critical velocity (V_c) and sediment size (D₅₀) have a minor global influence of 2.3% and 1.4%, respectively. The beeswarm plot in Figure 7B clarifies directionality and nonlinearity at the observation level. Large b_n values (warm points to the right) consistently increase predicted y_s, while small b_n values reduce it, an effect that is strong and monotonic and has also been reported by Baranwal and Das (2024a) and Fuladipanah et al. (2023). Likewise, higher V_o tends to shift predictions upward and lower V_o tends to shift them downward, in line with the established understanding that faster approach flow promotes scour. The effect of y_o is more moderate and slightly nonlinear; greater depths generally push y_s upward, but with a visible spread that suggests interactions with V_o and b_n. Longer T shows a mild positive trend (more exposure leads to more scour growth), consistent with Melville and Chiew (1999). Sediment gradation (σ_g) exhibits mixed local effects (both signs), consistent with gradation influencing scour depth and aligning well with the conclusions of Mir et al. (2018). In contrast, higher V_c typically reduces predicted scour (points with high V_c cluster on the negative SHAP side), reflecting that a bed requiring larger velocities to mobilize is less prone to scour under the same forcing, as reported by Arneson et al. (2012). D₅₀ effects are small and mostly negative in this sample, indicating limited incremental predictive power once b_n, V_o, and y_o are known. Taken together, Figures 7A, B indicate a physically consistent hierarchy: geometry (b_n) dominates, hydraulics (V_o, y_o, T) provide substantial but secondary control, and mobility metrics (σ_g, V_c, D₅₀) modulate scour at the margins. The tight, mostly one-sided SHAP pattern for b_n and V_o also suggests that the model learned stable, interpretable relationships rather than relying on spurious interactions.
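Percentage shares of this kind are commonly obtained by normalizing each feature's mean absolute SHAP value by the sum over all features. The sketch below illustrates that calculation under this assumption about how the shares were derived; the function name is hypothetical and the small SHAP matrix is invented for the example and is not taken from the fitted model.

```python
def importance_shares(shap_rows, names):
    """Percentage share of total mean |SHAP| for each feature."""
    n = len(shap_rows)
    mean_abs = [sum(abs(row[j]) for row in shap_rows) / n for j in range(len(names))]
    total = sum(mean_abs)
    return {name: 100.0 * m / total for name, m in zip(names, mean_abs)}

# Invented SHAP values for three observations and three features (ft).
names = ["b_n", "V_o", "y_o"]
shap_rows = [
    [0.60, -0.10, 0.05],
    [-0.80, 0.10, 0.05],
    [0.70, 0.10, -0.05],
]
shares = importance_shares(shap_rows, names)
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct:.1f}%")
```

Because absolute values are averaged, a feature that pushes predictions strongly in both directions still ranks as important. The sign and shape of the effect are read from the beeswarm and dependence plots rather than from the bar chart.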

(A) A bar chart showing SHAP feature importance for `y_s`, with `b_n` (ft) as the most important feature at 70.6%. Other features include `V_o` (ft/s) at 9.8% and `y_o` (ft) at 7.5%.(B) A SHAP beeswarm plot for `y_s`, displaying the impact of features on model output. `b_n` (ft) has the highest impact, with points color-coded by feature value, ranging from low (blue) to high (pink).(C) A scatter plot comparing SHAP importance with Pearson correlation for `y_s`. `b_n` (ft) shows the highest correlation and importance, while other features like `T` (min) and `V_c` (ft/s) show lower values.

Figure 7. SHAP-based interpretation of the Gradient Boosting model for pier-scour depth y_s (ft) using the laboratory dataset. (A) Global SHAP feature-importance bar plot showing each input’s share of total importance. (B) SHAP beeswarm plot, where point position gives SHAP value (impact on y_s) and color indicates feature value (blue = low, red = high). (C) Scatter plot of Pearson correlation with (y_s) versus SHAP importance, with the diagonal line marking agreement between correlation- and SHAP-based rankings.

To evaluate the consistency between classical correlation analysis and interpretable feature importance, a feature-correlation versus SHAP comparison plot is given in Figure 7C. The horizontal axis reports the Pearson correlation between each input and the target (y_s), and the vertical axis reports the global SHAP importance as a percentage of the total contribution. Overall, the two measures are consistent, with (b_n) exhibiting both the highest correlation and the largest SHAP importance, confirming its dominant influence on the model predictions. Variables such as (y_o) and (T) show moderate positive correlations and intermediate SHAP contributions, whereas weakly correlated inputs such as (D₅₀) and (V_c) contribute only marginally. The gradation parameter (σ_g) displays a slightly negative correlation with (y_s) but only modest SHAP importance, indicating that its effect is small and predominantly inverse. Overall, this comparison indicates that the Gradient Boosting model’s learned importance structure is broadly aligned with the underlying statistical relationships in the data.

3.6 Local interpretation of pier-scour depth

Figures 8A–D present SHAP dependence plots, where the x-axis corresponds to the value of the input feature and the y-axis to its SHAP value, representing the feature’s marginal contribution to the predicted scour depth (y_s). Point colors encode the value of a secondary feature (as indicated by the accompanying color bar), with cooler tones representing lower values and warmer tones higher values. Systematic color gradients that coincide with changes in SHAP values highlight potential interaction effects between the two features. Within these plots, pier width (b_n) is the dominant predictor. SHAP values increase almost monotonically with (b_n), as shown in Figure 8A, implying larger expected scour for wider piers (Fuladipanah et al., 2023). Approach flow velocity (V_o) exerts a strong positive effect that saturates at higher speeds, whereas approach flow depth (y_o) contributes positively but with greater dispersion, consistent with a secondary role modulated by geometry and hydraulics, as also reported by Dong et al. (2025). Event duration (T) shows a threshold response: SHAP values rise rapidly from short to moderate durations and then plateau, indicating diminishing marginal effects for long exposures. The color overlay clarifies interactions: higher (V_c) and larger (b_n) elevate the SHAP contributions of (V_o) and (y_o), while smaller (D₅₀) (finer sediment) aligns with larger SHAP values for (T), meaning duration is more consequential on easily mobilized beds. Together, these patterns support the importance order (b_n > V_o > y_o > T) and reveal nonlinear, partially saturating responses shaped jointly by geometry, hydraulics, and sediment properties.

Four scatter plots illustrate SHAP dependence for different features affecting $ y_s $ (in feet). (A) $ b_n $ (in feet) vs. SHAP values with $ V_c $ color scale. (B) $ V_0 $ (in feet/second) vs. SHAP values with $ b_n $ color scale. (C) $ y_0 $ (in feet) vs. SHAP values with $ b_n $ color scale. (D) $ T $ (in minutes) vs. SHAP values with $ D_{50} $ color scale. Each plot shows blue to red hues indicating value variations.

Figure 8. Panels (A–D) show SHAP dependence plots for the four most influential variables: (A) pier width normal to flow b_n (ft), (B) approach flow velocity V_o (ft/s), (C) approach flow depth y_o (ft), and (D) duration of flow T (min). Each point corresponds to one observation, with the x-axis giving the feature value and the y-axis giving its SHAP value, representing the feature’s contribution to the predicted scour depth y_s (ft).

3.7 Generated scenarios assessment

This section evaluates the generated scenarios, comparing predicted scour responses across stress-tested combinations of hydraulic, geometric, and sediment conditions to identify sensitivity patterns, potential worst cases, and the robustness of model conclusions. The predicted scour depth at Knik River bridge piers across all scenarios is presented in Table 6, with the frequency histograms in Figure 9A showing a clear ordering across scenarios. Among all the scenarios, the combined worst case (WC-Flow + VcT) dominates the upper range with a right-shifted distribution and a long upper tail, indicating both higher typical scour and more frequent extremes (several peaks around 4.7–4.9 ft). The WC-Flow scenario is still severe but lies slightly below the combined case, with a distribution centered at smaller depths and a shorter upper tail, consistent with peak hydrodynamic forcing without the extra scour development from extended duration. The WC-VcT scenario concentrates at lower depths with less spread, meaning longer periods above the mobility threshold increase scour, but the increases are smaller than those produced by peak-flow events. The Q99 scenario clusters narrowly around 4.0–4.3 ft and typically lies 0.3–0.6 ft below WC-Flow + VcT, marking a rare-but-plausible benchmark distinct from the engineered worst case.

Table 6. Predicted scour depth by pier for each scenario, $y_{s}$ (ft).

(A) A histogram showing frequency distribution of $ y_s $ in feet for different scenarios: Q99, WC-Flow, WC-Flow+VcT, WC-VcT. (B) A boxplot comparing $ y_s $ in feet across scenarios with median, mean, and interquartile range marked. (C) A bar graph displaying $ y_s $ in feet for different Pier IDs across scenarios.

Figure 9. Scenario-based predictions of pier-scour depth y_s (ft) for seven case-study piers under four hydraulic/sediment scenarios (Q99, WC-Flow, WC-VcT, WC-Flow + VcT) from the Gradient Boosting model. (A) Overlapping histograms of predicted y_s (ft) for each scenario, showing shifts in distribution. (B) Boxplots of y_s (ft) by scenario, with boxes for the interquartile range, whiskers for the full range, circles for outliers, and triangles for the mean. (C) Bar chart of y_s (ft) versus pier ID, comparing scenario-specific scour depth at each pier.

The summary statistics plot in Figure 9B illustrates the distributions of scour depth across the scenarios. Scenario WC-Flow + VcT yields the highest medians (3.8–4.3 ft) and the widest interquartile ranges (IQRs; 1.2–1.6 ft), confirming both elevated typical scour and greater variability. Scenario WC-Flow shows slightly lower medians (3.4–3.9 ft) and narrower IQRs (0.9–1.3 ft). In many cases, medians differ from the combined scenario by only 0.1–0.3 ft, indicating sites where peak velocity is the primary driver. In contrast, scenario WC-VcT produces distinctly lower medians (2.0–2.5 ft) with tighter IQRs (0.6–0.9 ft), highlighting a milder central tendency despite the role of event duration (T). Scenario Q99 remains high but stable, reinforcing its use as a decision threshold separating climatological extremes from design-envelope stress tests.

The per-pier comparison in Figure 9C reveals heterogeneous sensitivity. At several piers, WC-Flow nearly matches WC-Flow + VcT (differences of 0.1–0.3 ft), signaling locations where peak flow alone governs risk and which may therefore warrant rapid-response triggers tied to rising velocity. At other piers, the larger gap between WC-Flow + VcT and WC-Flow signals a strong persistence (V_c, T) effect; these sites are more vulnerable to extended floods or multi-storm sequences and warrant long-term monitoring and additional protective measures. In contrast, where WC-VcT matches or exceeds the other scenarios, what matters most is how long the strong flow lasts rather than the height of a single peak: longer periods of flow strong enough to move sediment can produce more scour even when the peak is not the largest. Tracking how long the flow remains above the sediment-mobility threshold is therefore essential, and this can be operationalized using the scenario envelope.

Across the figures, WC-Flow + VcT consistently produces the largest and most variable scour depths, WC-Flow is a close second at many piers, WC-VcT is lower but still consequential for duration-sensitive sites, and Q99 offers a compact, high-end benchmark below the engineered worst case. Together, these results support a tiered risk triage: scale design envelopes to the combined worst case, deploy rapid-trigger monitoring where peak flow dominates, and prioritize duration-aware mitigation where persistence drives risk. These results show that the proposed framework remains accurate and interpretable across diverse hydraulic conditions and scenario stress tests.
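The envelope and triage logic can be expressed compactly. In the sketch below, the per-pier predictions are invented for illustration and are not the values of Table 6; the function names, the pier labels, and the 0.3 ft cut-off (chosen to mirror the 0.1–0.3 ft differences discussed above) are assumptions.

```python
def scour_envelope(pred_by_scenario):
    """Return (maximum predicted scour depth, governing scenario) for one pier."""
    scenario = max(pred_by_scenario, key=pred_by_scenario.get)
    return pred_by_scenario[scenario], scenario

def triage(pred_by_scenario, gap_threshold_ft=0.3):
    """Label a pier by the gap between the combined case and WC-Flow (cut-off assumed)."""
    gap = pred_by_scenario["WC-Flow + VcT"] - pred_by_scenario["WC-Flow"]
    return "peak-flow dominated" if gap <= gap_threshold_ft else "duration-sensitive"

# Invented predictions in ft for two hypothetical piers.
piers = {
    "P1": {"Q99": 4.1, "WC-Flow": 4.5, "WC-VcT": 2.3, "WC-Flow + VcT": 4.7},
    "P2": {"Q99": 4.0, "WC-Flow": 3.6, "WC-VcT": 2.5, "WC-Flow + VcT": 4.4},
}
for pier_id, preds in piers.items():
    depth, scenario = scour_envelope(preds)
    print(f"{pier_id}: envelope = {depth} ft ({scenario}), class = {triage(preds)}")
```

In this illustration the first pier differs by about 0.2 ft between the two scenarios and is labeled peak-flow dominated, while the second differs by about 0.8 ft and is labeled duration-sensitive. In practice the cut-off would be set from the design tolerance of the structure rather than fixed in advance.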

3.8 Practical tool

To facilitate practical implementation, an interactive bridge pier scour prediction application was developed and made available on the Hugging Face platform. The tool combines five calibrated models (Gradient Boosting, AdaBoost, XGBoost, Gaussian Process Regression, and Kernel Ridge Regression) in a unified interface. Users enter the seven required input variables: pier width normal to flow (b_n), approach flow velocity (V_o), sediment critical velocity (V_c), approach flow depth (y_o), median sediment size (D₅₀), geometric standard deviation of the sediment-size distribution (σ_g), and the duration of flow/scouring (T), all constrained to their allowable ranges. For each set of input values, the application returns the predicted scour depth from every model, allowing direct comparison of model outputs. All trained model files as well as the Python scripts for model training, evaluation, and SHAP-based interpretation are included in the Hugging Face repository. This structure makes the tool accessible to practitioners without any prior knowledge of coding and supports transparency and reproducibility for researchers. The link to the developed application, model scripts, and SHAP analysis code is provided in the data availability section of the manuscript.
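The interaction pattern described above (check the seven inputs against their ranges, then query every model) can be sketched as follows. This is not the application's code: the bounds in `ALLOWED`, the function names, and the stand-in models are hypothetical placeholders, and a real deployment would load the trained model files instead.

```python
# Placeholder bounds for illustration only; the app's actual limits
# and units are not reproduced here.
ALLOWED = {
    "b_n": (0.01, 10.0), "V_o": (0.01, 20.0), "V_c": (0.01, 20.0),
    "y_o": (0.01, 50.0), "D50": (0.001, 200.0),
    "sigma_g": (1.0, 10.0), "T": (0.1, 1000.0),
}

def validate(inputs):
    """Raise ValueError if an input is missing or outside its range."""
    for name, (low, high) in ALLOWED.items():
        if name not in inputs:
            raise ValueError(f"missing input: {name}")
        if not low <= inputs[name] <= high:
            raise ValueError(f"{name}={inputs[name]} outside [{low}, {high}]")

def predict_all(models, inputs):
    """Return one prediction per model for a validated input set."""
    validate(inputs)
    return {name: predict(inputs) for name, predict in models.items()}

# Dummy stand-ins for the five trained learners (NOT real models).
models = {
    "Gradient Boosting": lambda x: 1.60 * x["b_n"],
    "AdaBoost": lambda x: 1.58 * x["b_n"],
    "XGBoost": lambda x: 1.62 * x["b_n"],
    "Gaussian Process": lambda x: 1.55 * x["b_n"],
    "Kernel Ridge": lambda x: 1.65 * x["b_n"],
}

example = {"b_n": 1.0, "V_o": 2.0, "V_c": 1.5, "y_o": 3.0,
           "D50": 1.0, "sigma_g": 1.3, "T": 24.0}
results = predict_all(models, example)
for name, depth in results.items():
    print(f"{name}: {depth:.2f}")
```

Rejecting out-of-range inputs before prediction keeps the models inside the region covered by their training data, which is the purpose of the range constraints in the interface.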

4 Discussion

4.1 Comparison with related work

Table 7 provides a quantitative comparison of the present models with established bridge-pier scour predictors spanning various methodological scopes. Among the developed tree-based models, Gradient Boosting achieves the best generalization, with the highest testing R² (0.959) and the lowest test RMSE (0.145). AdaBoost offers closely comparable performance (test R² 0.958, RMSE 0.146). Relative to the cylindrical-pier models reported in Fuladipanah et al. (2023), namely MARS, GEP, and the M5 model tree, the present tree-based regressors demonstrate higher test accuracy (testing R² for MARS, GEP, and M5 was 0.917, 0.872, and 0.698; the corresponding RMSE values were 0.090, 0.114, and 0.284, respectively). Similarly, among the complex-pier models developed by Tien Bui et al. (2020), the best performer (ANN) reached R² = 0.91 in training but dropped to 0.82 in testing. Overall, the current boosted-tree approaches yield a more balanced combination of predictive performance and model simplicity for predicting bridge-pier scour.

Table 7

Table 7. Related work comparison.
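For readers who want to place their own results alongside the comparison in Table 7, the three metrics used throughout the study follow directly from their standard definitions. The sketch below uses made-up measured and predicted values; the function name is introduced here for illustration.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, and MAE from paired observations."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "R2": 1.0 - ss_res / ss_tot,   # share of variance explained
        "RMSE": sqrt(ss_res / n),      # penalizes large errors more
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }

# Hypothetical measured vs. predicted scour depths (same units).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.9]

metrics = regression_metrics(measured, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

Note that RMSE and MAE carry the units of the target, so values from studies with different datasets or scaling are only loosely comparable, whereas R² is dimensionless.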

The SHAP analysis in this study reveals a clear hierarchy among predictors. Geometry, especially the width parameter $b_{n}$, has the largest and most consistent influence on scour depth. Hydraulic factors like approach flow velocity ($V_{o}$) and depth ($y_{o}$) also have positive impacts but show some nonlinearity. Mobility and exposure variables, such as duration ($T$), sediment gradation ($\sigma_{g}$), critical velocity ($V_{c}$), and sediment particle size ($D_{50}$), have smaller or mixed effects. These results align with some recent studies: Eini et al. (2023) found pier diameter ($D$) most important, with flow depth ($Y$) and velocity ($V$) secondary, and sediment size ($d_{50}$) and critical velocity negatively correlated. Emami et al. (2025) highlighted dimensionless time ($Ut/D$) and relative pier size ($D/D_{m}$) as key for time-dependent scour. Large-scale analyses also identify shape factor ($K_{1}$) and depth ratio ($y/b$) as top predictors, confirming the geometry-led control with hydraulics second and mobility minor (Piraei et al., 2025). While interpretations broadly align with prior work, the present study explicitly quantifies the effects of seven input features, whereas some studies restrict analysis to a minimal set, risking the exclusion of an influential variable whose true effect may be larger.
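A common way to turn SHAP attributions into a global ranking like the one above is to average the absolute attribution of each feature over all samples and normalize the result to percentages. The sketch below assumes that convention and uses a small made-up attribution matrix; it does not call the SHAP library and does not reproduce the study's values.

```python
def importance_shares(shap_rows, feature_names):
    """Convert per-sample SHAP values into percentage importance shares.

    shap_rows: list of rows, one per sample, one value per feature.
    """
    n = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n
        for j in range(len(feature_names))
    ]
    total = sum(mean_abs)
    return {name: 100.0 * m / total for name, m in zip(feature_names, mean_abs)}

# Hypothetical attributions for three samples and three features.
features = ["b_n", "V_o", "y_o"]
shap_rows = [
    [0.80, 0.10, -0.10],
    [-0.60, 0.20, 0.05],
    [0.70, -0.15, 0.10],
]

shares = importance_shares(shap_rows, features)
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {pct:.1f}%")
```

Because the shares are normalized, they describe each feature's portion of the total attribution magnitude, not a fraction of explained variance.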

A unique feature of the present study is the application of a comprehensive scenario envelope (Q99, WC-Flow, WC-VcT, WC-Flow + VcT) for stress testing, which supplies robust, decision-ready outputs that are not present in previous related work. Taken together, the ML models’ high out-of-sample accuracy, small training–testing performance gaps, and scenario-driven stress testing set this approach apart from prior studies, while aligning outcomes with core hydraulic principles and practical engineering expectations. Moreover, the present study developed an interactive practical tool for bridge pier–scour prediction, enabling direct use of the trained models by practitioners. As summarized in Table 7, the existing bridge-scour literature does not offer a comparable implementation-oriented tool alongside its modeling frameworks.

4.2 Engineering implementation

Scour depth is the lowering of the riverbed around a bridge pier and is a key parameter for bridge resilience. During severe floods, the stage and velocity rise quickly, forming intense downflow, horseshoe, and wake vortices around the pier. These vortices concentrate shear, mobilize the surrounding sediments, and carry them away, exposing foundations. If the scour depth grows beyond design limits, the pier’s capacity is reduced, and structural failure can occur (Lee and Hong, 2019). This makes scour depth central to flood-resilient design and operations: it must be monitored over time, with procedures to track scour growth and issue early warnings when thresholds are approached. Historically, scour has been the leading cause of bridge failures, with flood-driven high stages and velocities, the conditions explored in the current study, acting as the primary driver of scour development. Floods cannot be prevented, but their impacts can be managed. At the design stage, predicting the scour depth for the site from hydraulic conditions and credible flood forecasts is crucial. In operation, the priorities are identifying piers at higher risk, scheduling timely strengthening or protection, and, when necessary, temporarily removing a bridge from service to avoid catastrophic outcomes.

The current study goes beyond building a single predictor of pier-scour depth and undertakes a full model development, validation, and interpretation cycle aimed at practical deployment. Multiple machine learners were trained and tuned, and their performance was quantified on both training and testing splits using complementary metrics (R², RMSE, MAE) to expose accuracy, bias, and dispersion (Rana et al., 2025). To guard against optimistic estimates tied to a particular split, the study ran K-fold cross-validation, summarizing fold-wise scores and their variability. The fold-wise metrics make it possible to diagnose overfitting (high training but low testing scores, or large fold-to-fold variance) and underfitting (uniformly low scores) rather than relying on a single, whole-dataset score (White and Power, 2023). Residual and parity plots were reviewed alongside the metrics to verify that errors were pattern-free and that upper-tail behavior was understood. The study also emphasized model insights, not only predictions. The framework aggregated explanations across the fitted models and generated SHAP analyses to rank the global importance of the hydraulic, geometric, and sediment variables for predicting scour depth and to probe local dependence and interactions, e.g., how the effect of approach velocity changes with pier width or initial depth for a specific pier or duration (Nandi and Das, 2025a). These explanations link the learned relationships to hydraulics, help identify regime-dependent behavior (peak-dominated vs. duration-sensitive), and provide actionable levers for design and operations.
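The fold-wise diagnosis described above can be written as a small rule applied to the training and testing R² of each fold. The sketch below is one possible formulation: the thresholds `gap_tol`, `spread_tol`, and `floor` are assumed values chosen for illustration, and the fold scores are hypothetical.

```python
from statistics import mean, stdev

def diagnose(train_r2, test_r2, gap_tol=0.10, spread_tol=0.05, floor=0.70):
    """Classify fold-wise R^2 scores as under-, over-, or acceptably fitted.

    All three thresholds are illustrative assumptions, not values
    taken from the study.
    """
    gap = mean(train_r2) - mean(test_r2)   # train/test performance gap
    spread = stdev(test_r2)                # fold-to-fold variability
    if mean(train_r2) < floor and mean(test_r2) < floor:
        return "underfitting"
    if gap > gap_tol or spread > spread_tol:
        return "overfitting"
    return "acceptable"

# Hypothetical 5-fold scores for three models.
stable = diagnose([0.99] * 5, [0.96, 0.95, 0.97, 0.96, 0.95])
unstable = diagnose([0.99] * 5, [0.60, 0.90, 0.70, 0.95, 0.50])
weak = diagnose([0.50] * 5, [0.45] * 5)
print(stable, unstable, weak)
```

Reporting the gap and the spread together separates a model that memorizes the training folds from one that is simply sensitive to which samples land in the test fold.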

Moreover, the workflow was designed for field use. Starting from curated USGS records, the study applied targeted feature engineering (including physics-informed estimates of missing drivers such as V_c and T), trained and validated the models with transparent checks, and then wrapped the results in a scenario envelope that supports forecasting, triage of at-risk piers, and targeted monitoring or reinforcement. The emphasis is on reproducible, decision-ready outputs rather than record-keeping alone. Furthermore, the study outlines a forward path: incorporating additional toolkits (uncertainty quantification, conformal prediction, dynamic V_o/V_c exposure metrics), expanding site-specific scenarios as data improve, and deepening physics–ML integration to better represent near-threshold mobility and extreme events. Taken together, the approach demonstrates how machine learning and AI can be applied systematically and responsibly to strengthen flood-resilient bridge design and operations, while leaving clear hooks for future refinement.

To check whether hydraulic structures remain adequate under forecast and design flood conditions, the current study used the Knik River bridge piers in southcentral Alaska as a case study to establish conservative upper limits on pier scour. Scenario analysis translates predictive insights into actionable strategies for flood-resilient bridge design (Kosič et al., 2023). Each pier is tested against four critical stress scenarios: Q99 (representing rare high extremes of V_o, y_o, D₅₀, σ_g), WC-Flow (short, intense floods), WC-VcT (events with low V_c and long T), and WC-Flow + VcT (the most conservative envelope). For each scenario, input variations trigger re-computation of V_c and T, ensuring dynamic consistency in predictions. Design recommendations typically emerge in two forms. Sites where peak flows dominate are best managed with measures that dissipate velocity or split flow, such as guide banks and pier upgrades, while sites sensitive to duration require countermeasures focused on resisting prolonged flood exposure, such as strengthening the piers, deeper cutoffs, and toe protection (Lagasse et al., 2001). The WC-Flow + VcT scenario provides an upper benchmark for intervention: piers with predicted y_s above established safety limits under this scenario should be prioritized for structural upgrades or heightened flood monitoring. This risk-based framework allows authorities and practitioners to allocate engineering resources strategically by identifying the governing scour scenario for each pier. Sites vulnerable to peak flows and sites sensitive to event duration can be managed with scenario-tailored toolkits, optimizing expenditures and maximizing risk reduction. Importantly, over-design can be avoided where existing resilience is sufficient, as the scenario analysis clarifies which piers require immediate strengthening and which can be safely monitored without strengthening. This targeted approach provides a robust pathway for enhancing bridge safety and longevity, especially under future flood uncertainty.
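The four-scenario screening described above can be sketched as a function that builds the scenario inputs for a pier, evaluates a predictor on each, and compares the combined worst case with a safety limit. Several simplifications are assumed here and should be read as such: `toy_model` is a stand-in and not a scour equation, the Q99 scenario perturbs only V_o and y_o, V_c and T are overridden rather than re-computed, and all numbers (including `limit`) are hypothetical and unitless.

```python
def toy_model(x):
    """Stand-in predictor; NOT a scour equation or the study's model."""
    return (1.2 * x["b_n"] + 0.5 * x["V_o"] / x["V_c"]
            + 0.1 * x["y_o"] + 0.02 * x["T"])

def build_scenarios(base, ext):
    """Return the four scenario input sets for one pier."""
    flow = {"V_o": ext["V_o_max"], "y_o": ext["y_o_max"]}   # intense flood
    mobility = {"V_c": ext["V_c_min"], "T": ext["T_max"]}   # low V_c, long T
    return {
        "Q99": {**base, "V_o": ext["V_o_q99"], "y_o": ext["y_o_q99"]},
        "WC-Flow": {**base, **flow},
        "WC-VcT": {**base, **mobility},
        "WC-Flow + VcT": {**base, **flow, **mobility},
    }

def screen_pier(base, ext, predict, limit):
    """Predict scour per scenario and flag the pier if the envelope exceeds limit."""
    depths = {name: predict(x) for name, x in build_scenarios(base, ext).items()}
    return depths, depths["WC-Flow + VcT"] > limit

# Hypothetical pier and extremes.
base = {"b_n": 1.5, "V_o": 2.0, "V_c": 1.0, "y_o": 3.0, "T": 10.0}
ext = {"V_o_q99": 3.0, "y_o_q99": 5.0, "V_o_max": 3.5,
       "y_o_max": 6.0, "V_c_min": 0.8, "T_max": 48.0}

depths, flagged = screen_pier(base, ext, toy_model, limit=5.0)
for name, depth in depths.items():
    print(f"{name:>14}: {depth:.2f}")
print("prioritize for upgrade or monitoring:", flagged)
```

Swapping `toy_model` for a trained regressor and adding the physics-based updates of V_c and T would bring the sketch closer to the workflow described in the text.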

The study also uses a physics-informed scenario envelope to handle missing inputs and to turn forecasts into actionable risk. When key variables are unavailable at a site, most often the critical velocity (V_c) and the effective duration of mobility (T), they are computed from hydraulic relations, and those formulas are embedded inside the scenario framework. As a result, each scenario (e.g., peak-flow, duration-focused, combined worst case, or Q99) carries internally consistent values of approach flow velocity (V_o), approach flow depth (y_o), V_c, and T, allowing the model to estimate scour even where records are incomplete. Using forecast flood stage and velocity, the scenario envelope identifies which piers will face high intensity (V_o/V_c) and long exposure (large T), ranks them by predicted scour, and triggers early warnings for high-risk assets.
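As an illustration of physics-based imputation, the sketch below fills a missing V_c with the critical-velocity relation given in FHWA HEC-18, V_c = K_u y^(1/6) D^(1/3) with K_u = 6.19 in SI units, and then ranks piers by flow intensity V_o/V_c. Using this particular relation is an assumption made for the example; this section does not state which formula the framework embeds, and no relation for T is attempted. The pier records are hypothetical.

```python
def critical_velocity_si(y_o_m, d50_m, k_u=6.19):
    """HEC-18 critical velocity (m/s) for depth y_o (m) and grain size D50 (m)."""
    return k_u * y_o_m ** (1.0 / 6.0) * d50_m ** (1.0 / 3.0)

def rank_by_intensity(piers):
    """Impute V_c where it is missing and sort piers by V_o / V_c, highest first."""
    ranked = []
    for pier in piers:
        v_c = pier.get("V_c")
        if v_c is None:
            # Assumed imputation rule; see the hedge in the lead-in.
            v_c = critical_velocity_si(pier["y_o"], pier["D50"])
        ranked.append((pier["name"], pier["V_o"] / v_c))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical pier records in SI units (m, m/s).
piers = [
    {"name": "P1", "V_o": 1.5, "y_o": 3.0, "D50": 0.002, "V_c": None},
    {"name": "P2", "V_o": 0.8, "y_o": 2.0, "D50": 0.010, "V_c": 1.4},
    {"name": "P3", "V_o": 2.0, "y_o": 4.0, "D50": 0.001, "V_c": None},
]

ranking = rank_by_intensity(piers)
for name, intensity in ranking:
    regime = "live-bed" if intensity > 1.0 else "clear-water"
    print(f"{name}: V_o/V_c = {intensity:.2f} ({regime})")
```

Because the imputed V_c depends on y_o and D₅₀, any scenario that changes the flow depth automatically changes V_c as well, which is what keeps the scenario inputs internally consistent.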

The approach is transparent, safe, and sustainable because it ties imputation to physics rather than ad hoc guesswork, remains usable when monitoring gaps exist, and scales naturally to real-time operations. In this way, the study contributes to flood resilience by combining publicly available hydrologic data, physics-based estimation of missing drivers, and scenario-based analytics to guide timely protection, monitoring, and communication before hazardous conditions develop.

4.3 Limitations and scope

The models are based solely on laboratory data, and while laboratory experiments cover many scenarios, caution is needed when applying these results to field conditions. Factors like scale effects, armoring, debris or ice impacts, and live-bed sediment movement are less represented in lab settings compared to real-world environments. For the Knik River site, certain input variables, especially V_c and T, were either estimated or set according to scenario definitions, which introduces uncertainty to the predictions. The Q99 scenario draws on percentile ranks within the dataset, not on a hydrological basis; more realistic predictions could be achieved by linking these scenarios to basin-specific flood frequency and sediment mobility models. Lastly, the relatively limited variation in D₅₀ and σ_g data restricts the model’s ability to fully explore the influence of sediment gradation.

4.4 Future work

Field calibration and validation at bridges with high-quality monitoring should be prioritized, along with Bayesian or bootstrap uncertainty quantification for $y_{s}$ envelopes. Physics-guided ML (e.g., constraints from threshold mobility and time-to-equilibrium) can reduce extrapolation risk in sparsely sampled regimes. Replacing ad hoc extremes with probabilistic scenario generation tied to joint flood-sediment statistics would align envelopes with target reliability levels for flood-resilient design. Finally, integrating per-pier envelopes and uncertainty into risk-cost decision tools would help agencies prioritize mitigation and monitoring where it yields the greatest resilience gains.
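One of the options named above, bootstrap uncertainty quantification, can be sketched with a percentile bootstrap on a summary statistic of predicted scour. The example below is a generic illustration under stated assumptions: the depth values are hypothetical, the statistic is the median, and resampling predictions (rather than refitting models on resampled training data) is a simplification.

```python
import random
from statistics import median

def bootstrap_interval(values, stat=median, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a statistic of the sample."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(values)
    stats = sorted(
        stat([rng.choice(values) for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical predicted scour depths (ft) for one scenario.
depths = [3.8, 3.9, 4.0, 4.1, 4.1, 4.2, 4.3, 4.4, 4.6, 4.9]

low, high = bootstrap_interval(depths)
print(f"median = {median(depths):.2f} ft, 95% interval = [{low:.2f}, {high:.2f}] ft")
```

An interval of this kind could accompany each scenario envelope so that a design check is made against a range rather than a single predicted value.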

5 Conclusion

This study presents a transparent, physics-aware toolchain for predicting bridge-pier scour depth (y_s) and turning those predictions into clear guidance for flood-resilient design and operations. Trained on the PSDb-2014 laboratory data, all five developed models (Gradient Boosting, AdaBoost, XGBoost, Kernel Ridge (Poly), and Gaussian Process (RBF)) performed well in both training and testing. Specifically, the tree ensemble with Gradient Boosting generalizes well, having training and testing R² values of 0.99 and 0.96, respectively. Moreover, the tree ensemble models showed small, well-behaved residuals, which means they track measured scour depth closely in unseen cases. Also, the Gradient Boosting parity fit line ($y = 0.945x + 0.023$, fit $R^{2} = 0.978$) lies near the 1:1 line, indicating low bias and strong generalization. Collectively, these results provide a robust, decision-ready basis for design, operations, and risk management of bridge foundations under extreme hydraulics.

The developed models’ generalization was assessed using a 5-fold cross-validation approach within an external training–testing framework, where all data processing and physics-informed updates were performed strictly within each training fold to prevent data leakage. This procedure produced consistent, low-error outcomes across the folds, which closely aligned with the results on the held-out test data, indicating that the models are robust and suitable for real-world applications. Across folds, performance ranked Gradient Boosting > XGBoost > AdaBoost (Tree) > Kernel Ridge (Poly) > Gaussian Process (RBF), reinforcing confidence in the algorithms’ robustness. To enhance the interpretability of model predictions and verify their physical consistency, SHAP analysis was employed to quantify the contributions of each input variable to the predicted scour depth (y_s). The results clearly indicate that bridge pier width (b_n) is by far the most influential factor, accounting for 70.6% of the total SHAP importance, followed by approach flow velocity (V_o) at 9.8% and approach depth (y_o) at 7.5%. Event duration (T) gains relevance in cases of prolonged exposure, while sediment gradation (σ_g), critical velocity (V_c), and median sediment size (D₅₀) exert smaller, yet still interpretable, effects on scour outcomes. Moreover, the present study developed an interactive practical tool for bridge pier–scour prediction, allowing practitioners to directly use the trained models without requiring coding expertise.

The framework enforces physics-informed updates to critical velocity and event duration, maintaining physical realism and preventing model drift under hydraulic extremes. Applied to the Knik River bridge piers, the study categorized extreme conditions into four scenario envelopes that capture rare, peak, and sustained flood events to guide flood-resilient design and risk management. The combined worst-case (WC-Flow + VcT) typically sets the upper design bound, WC-Flow governs peak-driven risk, WC-VcT addresses long-duration vulnerabilities, and Q99 provides a realistic rare-event benchmark. Additionally, the framework validates that missing input data can be effectively handled using realistic imputation without compromising modeling integrity. These envelopes support risk-based triage and practical action in extreme flood events. Where peak flow dominates, rapid-trigger monitoring and velocity-reducing measures (e.g., flow deflectors, local armoring) should be prioritized. Where duration governs scour, duration-resistant countermeasures (e.g., toe protection, deeper cutoffs, improved embedment) become more effective. The framework, therefore, offers a direct path from data and models to asset-level decisions: screen piers with the envelope, identify the governing mechanism (intensity vs. duration), and select fit-for-purpose measures. Because the models are interpretable, stakeholders can audit why a site is flagged (e.g., large b_n and high V_o) and trace the effect of each input on predicted scour depth. This systematic, physics-consistent approach supports flood-resilient design decisions, maintenance prioritization, retrofit planning, emergency response, and clear risk communication.

Data availability statement

The interactive app tool, together with all trained model files and the Python scripts used for SHAP analysis, is available at the provided link (https://huggingface.co/spaces/Adilkhan01/Scour).

Author contributions

AK: Formal Analysis, Data curation, Methodology, Conceptualization, Visualization, Software, Writing – original draft, Investigation. DI: Conceptualization, Writing – review and editing, Supervision, Resources, Project administration, Validation.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Aas, K., Jullum, M., and Løland, A. (2021). Explaining individual predictions when features are dependent: more accurate approximations to shapley values. Artif. Intell. 298, 103502. doi:10.1016/j.artint.2021.103502