Machine learning application in modelling marine and coastal phenomena: a critical review

Pourzangbar, Ali; Jalali, Mahdi; Brocchini, Maurizio

doi:10.3389/fenve.2023.1235557

REVIEW article

Front. Environ. Eng., 11 September 2023

Sec. Environmental Impact Assessment

Volume 2 - 2023 | https://doi.org/10.3389/fenve.2023.1235557

Ali Pourzangbar¹,²*

Mahdi Jalali³

Maurizio Brocchini²

¹Institute for Water and River Basin Management—Hydraulic Engineering and Water Resources Management, Karlsruher Institut für Technologie (KIT), Karlsruhe, Germany
²Department of Civil and Building Engineering and Architecture, Università Politecnica delle Marche, Ancona, Italy
³Department of Civil Engineering, Tehran University, Tehran, Iran

This study provides an extensive review of over 200 journal papers on the use of Machine Learning (ML) algorithms to promote sustainable management of marine and coastal environments. The research covers various facets of ML algorithms, including data preprocessing and handling, modeling algorithms for distinct phenomena, model evaluation, and use of dynamic and integrated models. Because ML modeling relies largely on experience or trial and error, examining previous applications in marine and coastal modeling is beneficial. The performance of different ML methods used to predict wave heights was analyzed to ascertain which method performed best on various datasets. The analysis of these papers revealed that properly developed ML methods can be applied successfully to many aspects of marine and coastal engineering. Areas of application include data collection and analysis, pollutant and sediment transport, image processing and deep learning, and identification of potential regions for aquaculture and wave energy activities. Additionally, ML methods aid in structural design and optimization and in the prediction and classification of oceanographic parameters. However, despite their potential advantages, dynamic and integrated ML models remain underutilized in marine projects. This research provides insights into ML’s application and invites future investigations to exploit ML’s untapped potential in marine and coastal sustainability.

1 Introduction

Coastal areas are of vital significance because they support biodiversity, economic activity, cultural heritage, climate regulation, food security, recreational opportunities, and strategic interests (Neumann et al., 2017). Ensuring their sustainability, however, is a challenge that requires addressing various factors, including climate change adaptation, beach protection, and water quality management. One approach to ensuring the sustainability of coastal areas involves conducting a thorough examination of each contributing factor by employing data analysis and suitable methods. Effective data analysis and augmentation are, therefore, essential for informed decision-making and sustainable management of coastal areas.

The amount of data related to coastal systems has dramatically increased recently (Goldstein et al., 2019). This data, which often covers large areas and spans long periods of time, is now available in high resolution and can be accessed quickly. This has led to more opportunities for research on the sustainability of activities evolving in coastal areas. However, handling large and complex datasets, as well as identifying their patterns and trends, is not a convenient task. Despite their widespread use and mathematical rigor, conventional statistical techniques, including descriptive statistics (Emmanouil et al., 2020), inferential statistics (Agarwal and Manuel, 2008), regression analysis (Davidson et al., 1996; Hall et al., 2002), correlation analysis (Szmytkiewicz et al., 2000; Kroon et al., 2008; Ruiz de Alegría-Arzaburu et al., 2010), Analysis of Variance (ANOVA) (Martins et al., 2010), and Principal Component Analysis (PCA) (Hua et al., 2007; Miller and Dean, 2007), have limitations when processing large and complex data sets, and can present challenges in terms of interpretability. This has prompted researchers to explore alternative, more sophisticated approaches such as ML, which enables researchers to draw insights from data in a more efficient, accurate and automated way.

ML is a rapidly growing field that has the potential to make significant contributions to the sustainable use and management of marine and coastal environments. It does so by helping to better understand and predict the impacts of human activities and natural phenomena on coastal ecosystems and to identify potential threats. The link between using ML to simulate coastal and marine events and sustainability revolves around creating models and taking action. ML uses large amounts of data to create simulations for different scenarios, such as wave propagation or water quality management. These simulations help to fine-tune interventions, such as improving wave energy converters or changing shipping routes to avoid pollution. Moreover, they can guide efforts to adapt to and mitigate the effects of environmental changes, such as coastal erosion caused by rising sea levels. In essence, ML offers crucial insights that contribute to improved, sustainable care of coastal and marine environments.

Typically, the primary input of ML algorithms consists of a data set in various forms such as numeric, image, DEMs collected by Lidar (light detection and ranging), video, and geographic information systems (GIS) data, which are mapped and visualized using GIS. The main output of ML algorithms in coastal engineering can vary depending on the specific application and dataset being used, which includes prediction [e.g., coastal flooding risk (Park and Lee, 2020), storm surge (Sajjad et al., 2020), wave height (Dogan et al., 2021), sediment transport (Pourzangbar et al., 2017b; 2017c; 2017a) and beach erosion (Beuzen et al., 2019)], image processing using satellite imagery data (Agrafiotis et al., 2019) or drone footage (Provost et al., 2020), pattern recognition [e.g., patterns of sediment transport (Liu et al., 2021)], placement optimisation (Cuadra et al., 2016; Sarkar et al., 2016; Neshat et al., 2019), optimization by identifying the most efficient and cost-effective solutions for protecting the coast from erosion and flooding, monitoring (e.g., using sensor data to detect erosion or changes in water quality), anomaly detection (e.g., unusual changes in water quality), and decision making (Lazuardi et al., 2021) by providing decision support to coastal managers and engineers. However, the applicability of ML approaches in coastal engineering is influenced by various factors such as data quality, computational resources, the complexity of the coastal system, and the choice of appropriate algorithms.

Several ML methods have been used to study the sustainable use of coastal areas, including: Artificial Neural Networks (ANNs) used for predictions such as water quality (Chen and Ma, 2010), river classification based on the water quality index (Wong et al., 2021), wave height (Rao and Mandal, 2005; Günaydin, 2008) and beach erosion (Hashemi et al., 2010) and tidal prediction; Decision Trees (DTs) used for classifying the dominating environmental factors; Random Forests (RFs) used for regression and classification tasks, such as predicting the effect of human activities on the coastal environment and water quality index modelling (Sakaa et al., 2022); Support Vector Machines (SVMs) used for solving classification and regression problems, such as identifying the most vulnerable areas in coastal zones; K-Nearest Neighbors (KNN) used for clustering and classification tasks, such as grouping coastal regions based on their sustainability indicators; Ensemble Methods used for improving the accuracy of predictions and classifications, such as predicting the impact of climate change on sustainability of coastal activities, among others.

ML has been widely used in numerous research studies, but a knowledge gap remains regarding the selection of parameters, choice of predictive models (be they dynamic or static), domain adaptation, and use of integrated models for analyzing complex systems and evaluating the effects of multiple factors. In relation to data treatment, many existing works have relied on simple heuristic methods or rules of thumb; however, more rigorous mathematical and metaheuristic methods exist for data preprocessing and parameter identification, and these are highlighted in this paper. Choosing the correct model can be challenging, and there is no definitive method for identifying the most suitable ML model for a given problem. In general, the ML approach used to solve a specific issue is selected through a process of trial and error. However, comparing how models perform under different conditions can aid in selecting the most suitable one for a specific issue. To the best of the authors’ knowledge, no single paper offers comprehensive information about the data preprocessing and preparation phase. This paper provides an extensive review of various methodologies employed in coastal engineering to handle datasets. The main focus of this paper is to understand how ML models contribute to the sustainable use and management of marine and coastal environments, rather than the technical intricacies of their setup. The primary goal is to provide a critical review of literature that utilized ML approaches to manage marine phenomena. This review sheds some light on how to prepare parameters and datasets for input into the ML model, the pros and cons of various models, the suitability of ML methods for certain conditions, and their shortcomings and deficiencies.

Although numerous papers have discussed modeling coastal phenomena using experimental, numerical, and mathematical methodologies, the focus of the current paper is exclusively on literature that implemented ML techniques for modeling coastal and marine events. The selected literature spans a broad range of topics from data preprocessing and parameter considerations to different kinds of ML models used for various purposes. Due to the large number of published papers, our review concentrated on articles in reputable international journals from publishers such as Elsevier, Springer, IWA, Taylor and Francis, Wiley, and ASCE, among others. The papers were retrieved through online searches using relevant keywords. Among the journals, Coastal Engineering (Elsevier), with 18 papers, and Ocean Engineering (Elsevier), with 17 papers, contributed the most papers in this area. The majority of the sources are fairly recent, predominantly from the past 10 years. Nevertheless, this paper includes some older references that established the groundwork for newer methods. Roughly, fewer than 5% of the literature we reviewed was published before 2000, about 14% between 2000 and 2010, 22% between 2010 and 2015, and over 60% in the last 10 years.

While ML has been implemented in numerous studies, knowledge gaps exist in areas such as parameter selection, choice of models for making predictions (dynamic or static), domain adaptation, and the use of integrated models for modeling complex systems. The emphasis of the paper is on the contribution of ML models to the sustainable use and management of the marine and coastal environment, rather than on the technical details of their configuration.

The paper is structured as follows: Section 2 discusses the key components of data analysis and preprocessing, including data collection and preparation for the modeling process. Section 3 focuses on studies that have applied AI to coastal engineering for sustainable outcomes. The paper also evaluates the accuracy and robustness of the different models in Section 4. Finally, the paper summarizes all the information presented and concludes with a list of references.

2 Data preparation (preprocessing)

Data preparation involves transforming raw data into a format that can be used by ML algorithms for extracting insights or predicting outcomes. This process is vital in ML as it considerably affects the performance of the model (Kelleher et al., 2015). In the event of missing or invalid data, the algorithm either cannot process it or yields less precise, possibly erroneous results. The procedure starts with the acquisition of raw data (refer to Section 2.2), followed by data integration, which entails consolidating data from various sources into a unified dataset. This is succeeded by data cleansing to rectify missing values and outliers (refer to Section 2.3), and then by selecting the most pertinent features from the input parameters (feature selection or dimensionality reduction; see Section 2.4). Subsequently, feature engineering is undertaken, which involves generating new variables from existing parameters using dimensional analysis (DA; see Section 2.5). Lastly, data transformation is carried out, which involves altering the scale or distribution of variables, such as through data normalization (see Section 2.6). Figure 1 depicts the multiple phases required for data preprocessing and the methods linked with each step. The upcoming sections provide a detailed explanation of these methods.

FIGURE 1. Data preprocessing steps and their corresponding necessary tasks that must be accomplished.

2.1 Marine data types

In coastal engineering, data can come in different forms (Huang et al., 2015) and can be classified into different types based on their identity, format, and structure. Some examples of coastal data types include:

(1) Numeric data (Timmermans et al., 2020), which includes measurements of various physical parameters such as water level, wave height, current velocity, and sediment concentration. Such data are typically collected using instruments such as tide gauges, wave gauges, current meters, and sediment samplers. Examples include time-series data such as ocean temperature records, sea level measurements, and storm surge data, each represented by a sequence of observations or measurements taken at regular intervals over time.

(2) Image data (Vos et al., 2019; Turner et al., 2021), which includes aerial and satellite imagery, as well as ground-based photographs. These data can be used to study coastal morphology, vegetation, and land use patterns.

(3) Point Cloud data (Gomez, 2022), represented by a set of 3D points that can be used to create 3D models of coastal terrain and structures. Point cloud data is often collected using light detection and ranging (LiDAR) systems and can be used to create high-resolution digital elevation models (DEMs) of coastal topography.

(4) Video data (Smit et al., 2007; Kim et al., 2020; Kim and Kim, 2020), which includes footage captured by cameras. These data can be used to observe coastal dynamics and to measure the beach profile, the shoreline position, and wave breaking patterns.

(5) Text data (Brown et al., 2021), represented by written or spoken words, can be analyzed using natural language processing (NLP) techniques. Examples of text data in coastal engineering include social media posts, news articles, and scientific publications.

The following are the most well-known methods for collecting the data mentioned above: field observations, remote sensing measurements, experimental studies, numerical and mathematical models. Both the availability of equipment and the objective of the study influence the selection of the data collection medium (Prata et al., 2019).

2.2 Marine data resources

Data collection within the realm of marine sciences principally relies on three distinct methods: in-situ observations, remote sensing techniques, and the use of mathematical and numerical models, as outlined by Verwega et al. (2021). In-situ data collection encompasses ship-based measurements, the deployment of moorings, gliders, autonomous underwater vehicles, drifters and floats, the use of sea-floor optic cables, and laboratory analyses. Field observations remain essential for the collection of real-world data on coastal processes, such as wave heights and tidal levels. In-situ instruments are highly accurate when properly maintained, but they may provide data at low temporal frequency when large areas must be covered. They offer insights into historical climate trends that are not available from remote sensing and are less affected by atmospheric conditions. These observations serve to validate numerical models that simulate coastal processes and predict the behavior of the coastal system, including wave patterns, tidal currents, and shoreline evolution.

Remote sensing involves acquiring data on coastal topography, bathymetry, and other significant parameters through satellite and airborne platforms. Remote sensing technologies are divided into three categories: satellite, ground-based, and drones (Elsayed et al., 2021). The data thus collected enable the generation of high-resolution coastal environmental maps. Although satellites are powerful tools, they face limitations in obtaining high-resolution regional-scale imagery. Clouds can hinder data capture, and high-resolution imagery can be challenging to interpret (Elsayed et al., 2021). A combination of satellite- and ground-based remote sensing and drones could be effective in future marine engineering evaluations. Economically, combining these tools may be comparable to in-situ techniques in terms of overall cost. Such technology could enable rapid, high-resolution water condition assessments and enhance our understanding of water resource processes.

Mathematical and numerical models generate data by simulating real-life systems or processes using mathematical equations and algorithms (Xie and Arkin, 1996). They provide the capability to extend observational data, even to the point of simulating future climate scenarios (Eyring et al., 2016). Nonetheless, it is crucial to understand that these models only approximate real-world scenarios and can encompass spatial and temporal scales that exceed the scope of observational data (Matthes et al., 2020). The outputs from these models are typically available on a unique grid, contingent on the specific simulation. For instance, climate models customarily provide a four-dimensional space-time grid. Consequently, the comparison of model outputs with measurements invariably necessitates interpolation or data aggregation. Table 1 provides a detailed summary of the advantages and disadvantages associated with these diverse data collection methodologies.

TABLE 1. Detailed information of the various data collection methods in coastal engineering.

2.3 Data cleaning: outlier detection

Several factors can influence the quality of observational data. These include inaccuracies in the instruments, malfunctions of the equipment, disruptions from external sources, mistakes during data conversion, communication mishaps, and significant unforeseen errors (Yu et al., 2022). Such anomalies can pose major threats to operational functionality, downstream operations, system resilience, and cleaner production (Ba-Alawi et al., 2021). Therefore, these should be detected promptly and their data rectified to ensure more realistic measurements.

Anomaly detection methods are generally categorized into several types (see Figure 2): Statistical Methods, which use the properties of the underlying data distribution to identify anomalies (Chandola et al., 2009); Distance-based Methods, which calculate the distance between data points and identify the outliers based on a certain distance threshold (Ramaswamy et al., 2000); Density-based Methods, which estimate the density of data points and identify outliers as those points that reside in low-density regions (Ester et al., 1996); Machine Learning-based Methods, which employ supervised, unsupervised, or semi-supervised ML algorithms to detect outliers (Pimentel et al., 2014); and Ensemble Methods, which combine multiple outlier detection algorithms to improve the overall performance (Zimek et al., 2012). The choice of method, or combination of methods for better results, depends on the nature of the data and the specific problem being addressed.

FIGURE 2. Various outlier detection approaches.

Mahmoodi and Ghassemi (2018) used outlier detection algorithms to improve wave height predictions, while Oehmcke et al. (2015) demonstrated the effectiveness of ML for identifying significant events in marine long-term data. Daranda and Dzemyda (2020) developed a method combining the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm and k-nearest neighbors analysis for detecting marine traffic anomalies. These studies highlight the potential of leveraging advanced algorithms and ML in marine data analysis and decision-making. This section aims to provide a survey of contemporary outlier detection techniques, comparing their motivations, advantages, and disadvantages. Outliers can significantly impact the results, which makes addressing or eliminating them before analysis and model development crucial.

Considering the learning algorithm, three main methodologies exist for outlier detection (Hodge and Austin, 2004): 1) unsupervised approach, which uses a learning technique to identify outliers without prior knowledge of the data. The data is treated as a static distribution, and the most distant points are flagged as potential outliers; 2) supervised classification method, which requires pre-labeled data. It allows for online classification, where the classifier continuously learns the model and classifies new data as normal or abnormal, and finally 3) semi-supervised recognition technique, which only learns the normal class, using pre-classified data. It can distinguish new data as normal or novel based on its proximity to the boundary of normality. The choice of an outlier detection method depends on the data type, the number of vectors and attributes, speed and accuracy requirements, and the ability to accurately identify outliers. The key factors in choosing a method are selecting an algorithm that can handle the data and defining a suitable neighborhood for the outlier.
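As a minimal illustration of the statistical and distance-based categories described above, the following Python sketch flags a spurious spike in a short series. The data, the function names, and the thresholds (`k=1.5` for the interquartile-range rule, a 1.0 m distance limit for the nearest-neighbour rule) are assumptions chosen for illustration; they are not taken from any of the reviewed studies.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Statistical rule: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [i for i, v in enumerate(values) if v < low or v > high]

def knn_distance_outliers(values, k=2, threshold=1.0):
    """Distance-based rule: flag points whose distance to their
    k-th nearest neighbour exceeds the given threshold."""
    flagged = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        if dists[k - 1] > threshold:
            flagged.append(i)
    return flagged

# Hypothetical significant wave heights (m) containing one spurious spike
wave_heights = [1.2, 1.3, 1.1, 1.4, 1.2, 9.8, 1.3, 1.5, 1.2, 1.1]

print("IQR rule flags indices:", iqr_outliers(wave_heights))
print("k-NN distance rule flags indices:", knn_distance_outliers(wave_heights))
```

Both rules are unsupervised in the sense used above: they need no labeled examples, only a threshold that must be tuned to the data at hand.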

2.4 Dimensionality reduction

Incorporating irrelevant parameters can result in intricate models that are considerably harder to interpret and execute than models developed using only the most crucial parameters (Pourzangbar, 2012). For this reason, the focus is placed on building ML models using the most crucial parameters. These parameters are not only essential for the model’s output but are also uncorrelated with the other input parameters. Several methods are available to derive the most important dimensions (parameters) of the input space, including min/max autocorrelation factor analysis (MAFA), dynamic factor analysis (DFA), Least Absolute Shrinkage and Selection Operator (LASSO), Independent Component Analysis (ICA), the multicollinearity test, and PCA. Table 2 summarizes some well-known dimensionality reduction approaches used in marine engineering. The latter two methods are explained below.

TABLE 2. Some well-known dimensionality reduction approaches and their example references.

2.4.1 Multicollinearity

Multicollinearity is a common issue that can arise in regression analysis when two or more predictor variables in a model are highly correlated with each other. This can cause problems in the analysis, such as unstable and unreliable coefficient estimates. There are several methods to detect multicollinearity in a regression model. Here are a few commonly used tests:

• Correlation matrix: A correlation matrix can be used to identify the degree of correlation between each pair of predictor variables. High correlation coefficients (e.g., greater than 0.7 or 0.8) may indicate multicollinearity.

• Variance Inflation Factor (VIF): The VIF quantifies how much the variance of an estimated regression coefficient is inflated by multicollinearity. Suppose there are three input parameters, $x_{1}$, $x_{2}$, and $x_{3}$, and the goal is to compute the VIF for $x_{1}$. To accomplish this, we predict $x_{1}$ using linear regression on $x_{2}$ and $x_{3}$. Next, we determine the coefficient of determination $R^{2}$ of this regression (the squared correlation between the predicted and actual values of $x_{1}$), which gives the VIF through $\mathrm{VIF} = 1 / (1 - R^{2})$. VIF values exceeding 5 or 10 are often used as a benchmark for identifying problematic variables.

If the VIF values for the independent variables are high, it indicates that multicollinearity is impacting the regression model. This issue might need to be resolved, possibly by removing one of the correlated variables, combining them, or applying methods such as ridge regression, or principal component analysis.

• Condition number: The condition number is a measure of the overall multicollinearity in the model and is calculated as the square root of the ratio of the largest to smallest eigenvalue of the correlation matrix. Condition numbers greater than 30 may indicate problematic multicollinearity.

• Eigenvalues: Eigenvalues of the correlation matrix can also be used to detect multicollinearity. Eigenvalues close to zero indicate that some predictors are nearly linear combinations of others, and hence a high level of multicollinearity.

• Tolerance (TOL) is another measure that can be used to detect multicollinearity in a regression model. It is the reciprocal of the VIF (variance inflation factor) and measures the proportion of the variance in a predictor variable that is not explained by the other predictor variables in the model. If the Tolerance value for a variable is close to 1, it suggests that there is no multicollinearity between that variable and the other predictor variables in the model. On the other hand, if the Tolerance value is close to 0, it indicates a high degree of multicollinearity between that variable and the other predictor variables in the model. In general, Tolerance values of less than 0.1 or 0.2 are indicative of problematic multicollinearity.

It is important to note that none of these tests can definitively prove the presence of multicollinearity, but rather provide evidence that it may be present in a model. Therefore, it is important to use multiple tests and to interpret the results in the context of the specific research question and data being analyzed.
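The VIF, Tolerance, and condition number diagnostics listed above can be sketched in a few lines of NumPy. The synthetic predictors below are hypothetical: `x2` is deliberately built as a near copy of `x1`, while `x3` is independent. The function names are illustrative and do not come from the reviewed papers.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept term
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r_squared = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)

def condition_number(X):
    """Square root of the largest-to-smallest eigenvalue ratio
    of the predictor correlation matrix."""
    eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))  # ascending
    return float(np.sqrt(eigenvalues[-1] / eigenvalues[0]))

# Synthetic predictors (hypothetical): x2 almost duplicates x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, j) for j in range(X.shape[1])]
tolerances = [1.0 / v for v in vifs]                 # Tolerance = 1 / VIF

print("VIF:", np.round(vifs, 2))
print("Tolerance:", np.round(tolerances, 3))
print("Condition number:", round(condition_number(X), 2))
```

In this sketch the two nearly collinear predictors should show large VIF values and small Tolerance values, while the independent predictor stays close to 1 on both measures.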

2.4.2 Principal component analysis

PCA can be utilized for dimensionality reduction (Pearson, 1901). PCA reduces the dimensions of a dataset while preserving as much of its variability as possible, which makes the data easier to interpret. To achieve this, PCA maps the data onto a new coordinate system of uncorrelated variables (the principal components), ordered so that the leading components capture the maximum variance. Components carrying little variance can then be discarded with minimal information loss. The method as initially proposed was limited to up to three parameters; Hotelling (1933) later described methods for computing multivariate PCA.

In the mathematical description, it is assumed that the input environment contains $n$ parameters and $m$ measurements for each parameter. Hence, the input matrix $X$ has $n \times m$ components. The input environment can be transformed into a feature environment whose dimensions are uncorrelated with each other. Accordingly, the feature environment can be represented by an $n \times m$ matrix, i.e., $Y_{n \times m}$ . The transformation can be done using a whitening or sphering transformation matrix ( $Q_{n \times n}$ ) as follows:

$Y = Q^{T} X$ (1)

where $Y_{n \times m} = \begin{pmatrix} Y_{1} \\ Y_{2} \\ \vdots \\ Y_{n} \end{pmatrix}$ , $X_{n \times m} = \begin{pmatrix} X_{1} \\ X_{2} \\ \vdots \\ X_{n} \end{pmatrix}$ , and $Q_{n \times n}^{T} = \begin{pmatrix} q_{11} & q_{12} & \dots & q_{1n} \\ q_{21} & q_{22} & \dots & q_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ q_{n1} & q_{n2} & \dots & q_{nn} \end{pmatrix}$

The primary goal of PCA is to identify the components of the transformation matrix in such a way that the new variables exhibit maximum discrepancy (represented by variance). With some mathematical manipulation, the following equation for the transformation matrix can be derived:

$R Q = Q Ω$ (2)

where $R$ is the covariance matrix of the input environment ( $X$ ), $Ω$ is a diagonal matrix whose diagonal entries are the eigenvalues ( $λ$ ) of the matrix $R$ , and $Q$ is the matrix whose columns are the corresponding eigenvectors of $R$ .
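The procedure can be sketched with NumPy as follows, assuming the layout used above ($n$ parameters in rows, $m$ measurements in columns). The synthetic data and the function name `pca` are illustrative assumptions, and the data are centered before the transformation of Eq. 1 is applied, which is a common convention rather than something stated in the text.

```python
import numpy as np

def pca(X):
    """PCA by eigen-decomposition of the covariance matrix.

    X has shape (n, m): n parameters (rows), m measurements (columns).
    Returns the transformed data Y = Q^T X (after centering), the
    eigenvalues in descending order, and the eigenvector matrix Q.
    """
    Xc = X - X.mean(axis=1, keepdims=True)    # centre each parameter
    R = np.cov(Xc)                            # n x n covariance matrix
    eigvals, Q = np.linalg.eigh(R)            # columns of Q are eigenvectors of R
    order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
    eigvals, Q = eigvals[order], Q[:, order]
    Y = Q.T @ Xc                              # transformation of Eq. 1
    return Y, eigvals, Q

# Hypothetical input: two strongly related parameters and one independent one
rng = np.random.default_rng(1)
base = rng.normal(size=(1, 500))
X = np.vstack([
    base + 0.05 * rng.normal(size=(1, 500)),
    2.0 * base + 0.05 * rng.normal(size=(1, 500)),
    rng.normal(size=(1, 500)),
])

Y, eigvals, Q = pca(X)
explained = eigvals / eigvals.sum()
print("Explained variance ratio:", np.round(explained, 3))
```

Retaining only the leading rows of `Y` (those with the largest eigenvalues) gives the reduced representation; how many to keep is a judgment call, often based on the cumulative explained variance.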

2.5 Dimensional analysis

Although numerous methods exist for DA, the majority of studies employ the Buckingham π Theorem to render the parameters dimensionless. Table 3 summarizes some of the studies that used DA before feeding data to their ML models.

TABLE 3. Comparative overview of various studies utilizing DA and their derived dimensionless parameters.
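To make the idea concrete, the short Python sketch below checks whether candidate parameter groups are dimensionless by summing their dimension exponents. The variable set (wave height $H$, wave period $T$, water depth $d$, gravitational acceleration $g$) is a hypothetical example, not one taken from Table 3. With four variables and two base dimensions (length and time), the Buckingham π Theorem predicts two independent dimensionless groups, for example $H/(gT^{2})$ and $d/(gT^{2})$.

```python
from fractions import Fraction

# Dimension exponents (length, time) of hypothetical wave variables
DIMENSIONS = {
    "H": (1, 0),    # wave height [L]
    "T": (0, 1),    # wave period [T]
    "d": (1, 0),    # water depth [L]
    "g": (1, -2),   # gravitational acceleration [L T^-2]
}

def group_dimensions(exponents):
    """Net (length, time) exponents of a product of powers,
    e.g. {"H": 1, "g": -1, "T": -2} stands for H / (g T^2)."""
    length = sum(Fraction(p) * DIMENSIONS[v][0] for v, p in exponents.items())
    time = sum(Fraction(p) * DIMENSIONS[v][1] for v, p in exponents.items())
    return length, time

def is_dimensionless(exponents):
    """A group is dimensionless when every net exponent is zero."""
    return group_dimensions(exponents) == (0, 0)

candidates = {
    "H/(g T^2)": {"H": 1, "g": -1, "T": -2},
    "d/(g T^2)": {"d": 1, "g": -1, "T": -2},
    "H/d": {"H": 1, "d": -1},
    "H/T": {"H": 1, "T": -1},
}

for name, exponents in candidates.items():
    print(name, "is dimensionless:", is_dimensionless(exponents))
```

Note that $H/d$ is also dimensionless but is not independent of the first two groups, since it equals their ratio.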

2.6 Normalization

Normalizing data helps to ensure comparability by transforming it into a common scale. This removes the impact of unit differences, avoids bias in statistical analyses, and allows for accurate and meaningful results, especially when comparing data from different sources. Normalization plays a crucial role in efficient machine and deep learning by ensuring that large numerical inputs are processed effectively (Van Komen et al., 2022). The choice of normalization method depends on the specific requirements of the data and the problem being solved. Some widely used methods for data normalization are summarized in Table 4.

TABLE 4. Well-known normalization techniques used in ML modeling.

In Table 4, the transformed data, referred to as $x_{N}$ , is obtained by normalizing the original data ( $x_{i}$ ) to a new range. The original data is contained within a vector, denoted as $X$ , and its minimum and maximum values are represented as $\min_{X}$ and $\max_{X}$ , respectively. The chosen minimum and maximum values for the transformed range are ${New}_{\min}$ and ${New}_{\max}$ , which are typically set to zero and one, respectively. $μ_{X}$ is the mean of the data, and $σ_{X}$ is the standard deviation of the data.

Min-Max normalization rescales a feature linearly to a specific range, usually between 0 and 1, so that variables become comparable. With this range, the minimum value of $X$ is mapped to 0, the maximum value to 1, and intermediate values to corresponding values between 0 and 1. To avoid feeding zero values into the model, an alternative is to use a narrower target range, such as 0.1 to 0.9. Z-score normalization, also known as standardization, rescales the data to have a mean of 0 and a standard deviation of 1. It centers the data around the mean and expresses each value in units of standard deviation, allowing for easier comparison of values; it does not, however, change the shape of the distribution. It is commonly used in various fields, such as statistics, ML, and data analysis. Sigmoid normalization uses a sigmoid function to transform the data, proving useful in instances where the data distribution is asymmetrical. The sigmoid function maps any input value to a value between 0 and 1, and it is commonly used in ML and ANN models to represent a probability or to rescale data. Additionally, the sigmoid function is differentiable, which makes it useful in optimization problems and backpropagation in neural networks.

In coastal phenomena, the relationship between inputs and outputs is typically nonlinear, but certain models, such as the M5 model tree, which fits linear regression models at its leaves, have a limited ability to represent nonlinearity. To address this limitation, M5 models have been implemented using a logarithmic form for both inputs and outputs (i.e., the natural logarithm of inputs and outputs). This logarithmic form is more accurate than a linear formulation because it better captures the nonlinear nature of the contributing parameters (Pourzangbar et al., 2017a; Afsarian et al., 2018). Log scaling entails transforming data points through the application of a logarithmic function. The logarithm compresses large values and spreads out small ones, helping to make right-skewed data more symmetrical and manageable for analysis. The choice of logarithm base (10, 2, or the natural logarithm) depends on the needs of the data and the analysis to be performed. Despite its advantages, normalization may result in a loss of interpretability, increased sensitivity to outliers (as seen in techniques like min-max scaling and z-score), loss of information, dependence on the entire dataset, impacts on categorical features, and varying sensitivity across algorithms.
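The four transformations discussed in this section can be sketched in plain Python as below. The formulas are the standard textbook forms and are assumed, not confirmed, to match those in Table 4. In particular, applying the sigmoid to z-scores and using the population standard deviation are illustrative choices, and the sample values are hypothetical.

```python
import math
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linear rescaling of values to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Standardization to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def sigmoid(values):
    """Sigmoid normalization, applied here to the z-scores (an assumption)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in z_score(values)]

def log_scale(values):
    """Natural-log scaling; requires strictly positive inputs."""
    return [math.log(v) for v in values]

# Hypothetical right-skewed sample, e.g. wave heights in metres
sample = [0.5, 1.0, 1.5, 2.0, 6.0]

print("min-max [0, 1]:    ", [round(v, 3) for v in min_max(sample)])
print("min-max [0.1, 0.9]:", [round(v, 3) for v in min_max(sample, 0.1, 0.9)])
print("z-score:           ", [round(v, 3) for v in z_score(sample)])
print("sigmoid:           ", [round(v, 3) for v in sigmoid(sample)])
print("log:               ", [round(v, 3) for v in log_scale(sample)])
```

Whichever method is chosen, the scaling parameters (minimum, maximum, mean, standard deviation) would normally be computed on the training data only and then reused for validation and test data, so that no information leaks between the sets.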

3 AI learning algorithms and their application in marine/coastal engineering

3.1 Supervised-based ML methods

Supervised ML is a powerful approach that requires labeled data for model training. Its versatility permits its usage across a variety of applications, such as image and speech recognition, natural language processing, and predictive analytics. Common algorithms used in supervised learning encompass linear regression, logistic regression (LR), decision trees, random forests, support vector machines, and neural networks. A key advantage of supervised learning is its capacity to generate precise predictions for novel and unseen data (Jiang et al., 2020). However, it also has certain drawbacks, including the requirement for labeled data, the dependence on the quality and quantity of the training data, and the potential for overfitting. Despite these challenges, supervised learning is seen as an essential method in ML and data science, often achieving high accuracy at a lower computational cost than physical models. Notwithstanding the inherent complexity of marine processes, supervised ML models have demonstrated benefits in understanding coastal phenomena, thereby finding extensive application in coastal engineering to drive innovative models and solve intricate problems (as summarized in Table 5). Supervised ML models have been employed to predict wave parameters like significant wave height and period, wave reflection and transmission coefficients (van Gent et al., 2007; Gandomi et al., 2020; Kuntoji et al., 2020), tide levels (Lee, 2004), ocean currents and wind fields (James et al., 2018; Shamshirband et al., 2020), wind characteristics under future climate change scenarios (Yeganeh-Bakhtiary et al., 2022), flood inundation using a Gaussian process model (Donnelly et al., 2022), and breakwater stability number and wave overtopping discharge, among others. Various ML models, such as ANN and SVM, can be employed for these predictions. ML models have also found application in morphological and morphodynamic predictions, including profile elevation, area, and length, based on parameters like wind speed, direction, wave height, and beach angle (Hashemi et al., 2010).

TABLE 5. Various ML approaches utilized in coastal studies, along with their associated models and methods.

3.2 Unsupervised-based ML methods

Unsupervised learning is a form of ML that functions without predefined labels or target outcomes (Bishop and Nasrabadi, 2006). Its main purpose is to independently discover patterns, structures, and relationships in data. Common applications include clustering, anomaly detection, and dimensionality reduction. Clustering groups similar data points, anomaly detection highlights unusual patterns (as detailed in Section 2.3), and dimensionality reduction reduces the number of features while preserving essential information (as seen in Section 2.4). Algorithms like k-means clustering, hierarchical clustering, PCA, and autoencoders are frequently used in unsupervised learning to identify patterns in data. While unsupervised learning can pose challenges due to the lack of a distinct optimization goal, it still holds a vital position in ML, contributing to advancements in fields such as computer vision, natural language processing, and recommendation systems. In the context of coastal engineering, k-means clustering can be used to group data like the maximum oceanic wind around centroid values, with each centroid updated as the average of the previously chosen values and the currently assigned data (Baboo and Tajudin, 2013). PCA can be employed in coastal engineering to examine correlation matrices (Roseman et al., 2005) and pinpoint major changes in beach profiles and sand grain distributions (Tsujimoto et al., 2012). Moreover, PCA and hierarchical clustering can help characterize coastal plane shape and hydrodynamics. For instance, the form of arc-shaped coasts, largely influenced by geological structure, can be divided by clustering into four broad categories that reflect actual conditions (Scott et al., 2011). By identifying key data components, PCA can aid in elucidating the underlying patterns and structures of the data.
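The k-means idea mentioned above can be sketched for scalar data as follows. This is a minimal, assumption-laden illustration: the function name, the deterministic initialisation from the sorted data, and the wind speed values are all hypothetical and are not taken from the cited studies.

```python
def k_means_1d(values, k, n_iter=50):
    """Lloyd's algorithm on scalar data; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic initialisation: pick k points spread over the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    def assign(cents):
        # Each value is assigned to the index of its nearest centroid.
        return [min(range(k), key=lambda j: abs(v - cents[j])) for v in values]

    for _ in range(n_iter):
        labels = assign(centroids)
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(centroids)


# Hypothetical maximum wind speeds (m/s) grouped into three regimes.
wind = [3.1, 2.9, 3.4, 12.0, 11.5, 12.4, 25.0, 24.2]
centroids, labels = k_means_1d(wind, k=3)
```

For multivariate data the absolute difference would be replaced by a distance such as the Euclidean norm, and features would usually be normalized first so that no single variable dominates the distance.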

3.3 Reinforcement-based ML methods

Reinforcement learning (RL) is a type of ML where a program, known as an agent, learns to perform tasks by getting feedback from its environment in the form of rewards or penalties (Rengarajan et al., 2022). The agent makes a series of decisions in a changing environment, aiming to learn the optimal way (or policy) to maximize rewards over time. This process is typically structured as a Markov Decision Process (MDP), encompassing states, actions, transition functions, and reward functions. There are two main types of reinforcement learning algorithms: model-based and model-free. Model-based RL is like making a map to understand the surroundings. Model-free RL, on the other hand, does not make a map; it figures out what to do based on where it is at the moment. So, model-based RL is more about planning ahead, while model-free RL is more about learning on the go (Plaat et al., 2023). Model-free methods, like Q-learning, do not need a model of the environment and estimate the expected total of future rewards for each possible action at each state using the Bellman equation. Q-learning has been used successfully in many different tasks, which is why it is one of the most commonly used model-free RL algorithms. In coastal engineering, RL can be used to develop control policies to reduce the risk of flooding (Bowes et al., 2021). Deep reinforcement learning (DRL), an advanced form of RL, can be used to control devices that convert wave energy, and has been found to work better than traditional control methods (Anderlini et al., 2020). DRL can also adjust itself to changes in system dynamics, allowing for control even when faults occur. Moreover, RL has been used to maximize the electricity produced by wave energy converters (Zou et al., 2022). In addition, a type of RL called multiagent reinforcement learning can simulate the social and economic effects of sea level rise. This can be a useful tool for planning scenarios, analyzing costs and benefits, and optimizing strategies to adapt to changes (Shuvo et al., 2022).
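To make the Q-learning update concrete, the sketch below trains a tabular agent on a deliberately tiny, hypothetical environment: five states on a line, with a reward of 1 for reaching the rightmost state. The environment, the hyperparameters, and all names are assumptions chosen for illustration and bear no relation to the flood control or wave energy studies cited above. The update applied is Q(s, a) ← Q(s, a) + α [r + γ max Q(s′, ·) − Q(s, a)].

```python
import random

N_STATES = 5        # states 0..4 on a line; state 4 is the terminal goal
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy environment: reward 1 only when the goal is reached."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(200):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q[state])          # exploit, breaking ties at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            nxt, reward, done = step(state, action)
            # Bellman target: immediate reward plus discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q_table = q_learning()
# Greedy policy for the non-terminal states.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy policy selects "move right" in every non-terminal state. Deep RL replaces the table `q_table` with a neural network so that continuous or high-dimensional states can be handled.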

Table 5 summarizes the various ML approaches and their corresponding model types. Each model type utilizes a unique set of algorithms. For instance, in the case of classification tasks, ANNs or SVMs may be utilized. The final column of the table lists the research studies that applied each learning approach to the investigation of a specific coastal process or event.

3.4 AI contribution to the sustainability of marine environments

Predictive models, such as statistical, numerical, or ML models, play a vital role in marine and coastal engineering to safeguard structures from natural forces. Statistical models use past data to forecast future conditions, while numerical models simulate the event using mathematical equations and formulas. ML models, using artificial intelligence (AI), learn from past data for prediction purposes. Each model has its unique approach and is chosen based on data availability and specific project needs.

Various ML techniques have been implemented in the study of coastal and marine environments. Figure 3, sourced from Scopus, provides a visual representation of the percentage of published papers that used different ML methods since the year 2000. Upon reviewing this figure, it is evident that Principal Component Regression (PCR), Linear Model (LM), Regression Tree (RT), and ANN are the most frequently employed ML algorithms for analyzing coastal and marine phenomena. However, certain ML techniques, such as General Regression Neural Networks (GRNN), M5 model tree, Bayesian Model Averaging (BMA), Generalized Boosted Regression (GBM), and Extreme Gradient Boosting (XGBoost), have been applied less frequently in the investigation of coastal and marine events.

FIGURE 3. Percentage of different ML methods application in coastal and marine environments in terms of the published papers indexed in Scopus.

Figure 4 shows the application trend of different ML approaches for coastal and marine phenomena. Previously, techniques such as Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and RF were sparingly employed in diverse studies. However, there has been a significant increase in their use over the past 5 years, demonstrating a growing reliance on these methods in recent research.

FIGURE 4. The trend of various ML algorithms in coastal and marine applications.

Figure 5 illustrates the trend of various ML algorithms since 2008 in coastal and marine applications. Methods that are not frequently used are colored in red (these less common approaches are not reported in Figure 4).

FIGURE 5. Annual publication trends of papers implementing ML methods for predicting coastal and marine phenomena, as extracted from Scopus.

3.4.1 Prediction of oceanographic and morphologic parameters

Researchers use ML algorithms and soft computing techniques to predict oceanographic and morphological parameters, as shown in Figure 6. These methods include ANNs, SVMs, Support Vector Regression (SVR), Fuzzy Logic (FL), DTs, and evolutionary algorithms such as Genetic Programming (GP), among others. Predictive models are widely used in oceanography and coastal management. Their accuracy critically depends on several factors, including the dataset used for training, the type and configuration of the ML model, tuning parameters, termination condition, and input and output parameters. Which algorithm, with carefully adjusted parameters, proves most valuable therefore depends on the problem being addressed.

FIGURE 6. Oceanographic parameters predicted by ML algorithms.

ANNs, SVRs, the M5 decision tree algorithm, and Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) models, are used to predict wave heights, as per studies by Duong et al. (2023) and Rizianiza and Aisjah (2015). These ML techniques have shown reliable wave prediction capabilities, maintaining accuracy up to 72 h ahead (Jain and Deo, 2008). The use of intact structural data for predicting significant wave heights has been explored, with emphasis on the critical role of data quality in training ANNs for wave height predictions (Ciortan and Rusu, 2018; Demetriou et al., 2021). ANNs have also been implemented to estimate wave breaking heights considering various factors like seabed slope, water depth, and deep-sea wavelength (Duong et al., 2023). In the field of marine energy forecasting, researchers have used multi-class classification methods with ordinal classifiers, such as SVOREX and SVORIM, yielding precise results (Fernández et al., 2015). RNNs, especially LSTM models, have been employed to predict motion responses in irregular wave patterns (Kagemoto, 2020). Table 6 provides a summary of ten highly cited papers focused on predicting significant wave height using ML algorithms. The majority of these studies used meteorological data and past wave height as input parameters. The results demonstrate that LSTM neural networks, ANN, kernel-based predictors like SVM and SVR, as well as decision trees, are capable of accurately predicting wave height.
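Because most of the studies above use past wave heights as inputs, a common preparatory step is to turn a time series into lagged input windows and future targets. The helper below is a minimal sketch of that step; its name, parameters, and the sample series are hypothetical, and the reviewed papers may have constructed their inputs differently.

```python
def make_lagged_samples(series, n_lags, horizon=1):
    """Build supervised samples from a time series.

    Each input is a window of n_lags consecutive values; the target is the
    value observed `horizon` steps after the end of that window.
    """
    inputs, targets = [], []
    for start in range(len(series) - n_lags - horizon + 1):
        end = start + n_lags
        inputs.append(series[start:end])
        targets.append(series[end + horizon - 1])
    return inputs, targets


# Hypothetical hourly significant wave heights in metres.
hs = [1.0, 1.2, 1.5, 1.4, 1.1, 0.9]
X, y = make_lagged_samples(hs, n_lags=3, horizon=1)
# X == [[1.0, 1.2, 1.5], [1.2, 1.5, 1.4], [1.5, 1.4, 1.1]]
# y == [1.4, 1.1, 0.9]
```

Meteorological variables such as wind speed can be appended to each window in the same way, and a larger `horizon` yields the multi-hour lead times discussed above.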

TABLE 6. Details of the selected reviewed papers, where the ML methods were used to predict the wave height.

To enhance understanding of the effectiveness of various ML methods in predicting wave height, visual representations of the correlation coefficient and Root Mean Square Error (RMSE) values for different ML techniques applied across multiple data sets have been created (Figure 7). To achieve this, we carefully selected studies that used several ML methods for wave height predictions, ensuring each study used a consistent dataset. This allowed for a visual representation of the performance of these ML techniques with specific datasets. By comparing the overall performance of these ML models across various datasets, certain conclusions can be drawn.

• ANN and SVR algorithms are commonly used in predicting wave height.

• The number of neurons in the hidden layers of ANNs only slightly influences the precision of the model.

• Integrated algorithms, like ICEEMDAN-ELM, exhibit superior performance in terms of accuracy and error indices compared to other ML methods.

• There has been a significant increase in the adoption of ML algorithms, especially integrated algorithms, in recent years (see Figure 8).
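For reference, the two scores used in this comparison, the correlation coefficient (CC) and the RMSE, can be computed as in the sketch below. The Pearson form of the CC is assumed here, and the function names and sample values are illustrative only.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def correlation_coefficient(observed, predicted):
    """Pearson correlation coefficient between observations and predictions."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_o * var_p)


# Hypothetical observed and predicted wave heights in metres.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
error = rmse(observed, predicted)
cc = correlation_coefficient(observed, predicted)
```

A CC close to 1 together with a small RMSE indicates good agreement. The two scores are complementary, since a model with a constant bias can still reach a high CC.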

FIGURE 7. Comparison of CC (upper panel) and RMSE (lower panel) across different ML approaches and datasets. Each color symbolizes a unique study. Results are extracted from Makarynskyy et al. (2005); Mahjoobi and Etemad-Shahidi (2008); Mahjoobi and Adeli Mosabbeb (2009); Cornejo-Bueno et al. (2016); Berbić et al. (2017); Akbarifard and Radmanesh (2018); Kumar et al. (2018); Nikoo et al. (2018); Ali and Prasad (2019); Shamshirband et al. (2020); Kaloop et al. (2020). The abbreviations are: Online sequential ELM (OSELM), Extreme Learning Machine (ELM), Improved Complete Ensemble Empirical Mode Decomposition method with Adaptive Noise (ICEEMDAN), Online Sequential (OS), RF, Grouping Genetic Algorithm (GGA).

FIGURE 8. Wave height prediction using different ML algorithms during 2005–2023.

The M5 decision tree algorithm, ANNs, and gradient boosting decision trees serve as robust tools for predicting wave overtopping discharge on coastal infrastructure such as breakwaters. For wave runup, the M5 decision tree algorithm exhibits promising predictive capabilities, taking into account laboratory data and multiple parameters (Abolfathi et al., 2016). ANNs are also used to predict wave reflection and transmission coefficients (Zanuttigh et al., 2016; Formentin et al., 2017). Gradient boosting decision trees, a more recent ML technique, have been reported to improve the accuracy of predicting average wave overtopping discharges by nearly threefold in comparison to traditional neural networks (den Bieman et al., 2020). Kernel-based approaches, such as Gaussian Process Regression (GPR) and SVR, have also been utilized in predicting wave overtopping, with GPR showing superior performance over ANNs and empirical formulas (Hosseinzadeh et al., 2021).

The measurement of Sea Surface Temperature (SST) is vital for understanding the global climate. It significantly contributes to climate modeling, weather forecasting, and studies on marine ecosystems. Accurately predicting SST can aid in mitigating the environmental harm resulting from rising water temperatures due to human-induced climate change. This prediction not only benefits marine ecosystems but also helps preserve coastal economies and the broader coastal environment (Choi et al., 2023). LSTM neural networks have proven effective in forecasting SST, showing enhanced performance when the right amount of input data is used (Xu et al., 2020). Multivariate LSTM models, which take into account factors such as wind speed and sea-level air pressure alongside SST, have demonstrated superior results compared to univariate models that only factor in SST (Balogun and Adebisi, 2021). ML models have also been studied for spatio-temporal time series prediction, highlighting the importance of spatial data. Among these, the LSTM model emerged as the most efficient, showing a 25% improvement in forecasting performance (based on RMSE) when spatial information was incorporated (Kartal, 2023). Research indicates that LSTMs, whether using single or multiple variables, surpass other ML models in predicting SST (Xu et al., 2020; Kartal, 2023).

Moreover, accurate predictions of coastal sediment transport are crucial for managing coastal erosion and development, with researchers traditionally estimating sediment transport using experimental methods. Artificial intelligence-based methods can potentially improve decision-making for managing coastal erosion and development (Bakhtyar et al., 2008; Kabiri-Samani et al., 2011), provided that valid input data and appropriate activation functions are selected (Pourzangbar, 2012; Yeganeh-bakhtiary et al., 2012). Artificial intelligence and ML methods, such as Adaptive Network Based Fuzzy Inference Systems (ANFIS), Fuzzy Inference System (FIS), and ANNs, have been employed to model sediment transport alongside empirical formulas such as CERC (Coastal Engineering Research Center), Walton-Bruno (WB), and Van Rijn (VR), with ANFIS showing higher accuracy and reliability for estimating longshore sediment transport rates (LSTR) (Bakhtyar et al., 2008; Hashemi et al., 2010). SVR has also been employed, demonstrating superiority over neural networks when the dataset is small or the relationships are linear or non-linear but with a clear margin (Dezvareh and Shafaghat, 2020). Deep learning models, like ANNs, have been developed to address the shortcomings of numerical models in analyzing simultaneous sand and sediment transport (Kim and Aoki, 2021).

3.4.2 Classification models

Classification involves categorizing items or data into groups based on their features, and is crucial in fields such as statistics, ML and data analysis. The goal is to create models that predict the class of new items by identifying patterns in their features. SVM was introduced in the 1990s, RF in the early 2000s, and LR has roots going back to the 19th century. These algorithms are capable of executing simple tasks such as recognition and classification (Lou et al., 2021). In addition to these algorithms, a variety of other classification algorithms, including the naive Bayes classifier, DTs, and K-Nearest Neighbors, have been utilized in remote sensing and in situ data analysis to enhance the understanding and monitoring of the environment. Table 7 summarizes the most well-known classification models used in coastal and marine engineering. These algorithms have proven effective in unraveling complex environmental data and facilitating informed decision-making (Tsiakos and Chalkias, 2023). The most widely used classification methods are:

• SVM (Cortes and Vapnik, 1995): focuses on training samples near the optimal class boundary, aiming to maximize the margin between support vectors. Fundamentally, it is a binary classifier; multi-class problems are handled by applying the classifier to every class combination, which affects the processing time.

• Regression Tree (RT) (Goldstein et al., 2019): breaks down prediction tasks into binary splits, forming a tree structure. This tool excels at classification tasks and enables an understanding of the influence of input variables. However, RTs may not be as effective for continuous variables and are prone to overfitting if not properly pruned. Accuracy can be boosted by combining small sequential RT models, giving more weight to poorly predicted data.

• Decision Trees (Pal and Mather, 2003): easy to understand, DTs recursively split data. They can use categorical data and perform classification quickly. However, DTs may suffer from overfitting and non-optimal solutions, which can be addressed through pruning.

• RF (Breiman, 2001): an ensemble classifier using multiple DTs to overcome their limitations. Each tree uses a random subset of training data and features, resulting in a more accurate ensemble. RF classifiers are known for their speed, resistance to overfitting, and ability to handle multicollinearity. They can also assess the importance of variables, although they may be sensitive to certain sampling strategies (Belgiu and Drăgu, 2016).

• K-Nearest Neighbor (K-NN) classifier (Altman, 1992): the K-NN classifier is distinct from other classifiers because it does not create a model during the training phase. Instead, every unclassified sample is directly compared with the original training data.

• Naive Bayes classifier: a classification algorithm that is based on Bayes’ theorem and assumes that the presence or absence of one feature is independent of the presence or absence of other features. It learns the probability distribution of features and corresponding labels from a training dataset and uses it to classify new examples. This algorithm is widely used in applications that have many features and large datasets, such as text classification, sentiment analysis, and spam filtering. The Naive Bayes classifier is computationally efficient and can handle high-dimensional data well.
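As a concrete illustration of one of the classifiers listed above, the following sketch implements a nearest-neighbour classifier that stores the training data and classifies a new sample by majority vote among its k closest training points. The function name, the two features, and the "sand"/"mud" labels are hypothetical and serve only to show the mechanism.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # No model is fitted: distances to all stored samples are computed at query time.
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical seabed samples described by two normalized features.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
train_y = ["sand", "sand", "sand", "mud", "mud", "mud"]

label_a = knn_predict(train_x, train_y, (1.1, 1.0))  # near the "sand" samples
label_b = knn_predict(train_x, train_y, (5.0, 5.0))  # near the "mud" samples
```

Because every prediction scans the full training set, the cost grows with the amount of stored data, and features should be on comparable scales before distances are computed.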

TABLE 7. Overview of highly-cited literature studies (extracted from Scopus) on classification models in coastal and marine phenomena.

The KNN classifier has been used in various marine-related projects. For the design of marine hydrokinetic turbines, KNN was used to identify and categorize the severity of the rotor blade pitch imbalance encountered by marine current turbines. This approach was found useful for fault detection and severity classification (Freeman et al., 2021). In ocean surface current forecasting, KNN was used as an alternative method (Jirakittayakorn et al., 2017). The KNN algorithm proved capable of forecasting surface currents up to 24 h in advance. When compared with other prediction techniques such as ARIMA, exponential smoothing, and LSTM, the KNN model had the highest accuracy. KNN was one of the six ML classifiers used to generate precise geographic estimates of seabed substrate and seabed habitat mapping (Diesing and Stephens, 2015; Leon et al., 2020). The accuracy of the predictions was evaluated using ground-truth sample data segmented into classes of seabed substrate. In coastal hazard projection, KNN was used to project hazards using several Representative Concentration Pathway climate change scenarios, regional climate models, and sea level rise ratios (Park and Lee, 2020). In seafloor classification, KNN was used along with ANN to classify the structure of the seafloor and to pinpoint potential anthropogenic effects on delicate benthic assemblages (Gauci et al., 2016). Finally, in sea-land segmentation, KNN was used to produce a pixel-level segmentation of the scene based on the Doppler bandwidth of the returns vector in maritime surveillance radars (Shui et al., 2020).

The Naive Bayes classifier is a machine learning algorithm commonly utilized in various applications to enhance model accuracy. A prominent application of the Naive Bayes classifier involves predicting water quality classes utilizing seven popular Water Quality Index (WQI) models (Uddin et al., 2023). There is some confusion about the proper classification of water quality due to differing techniques used in current WQI models. To address this, Naive Bayes was compared with other ML classifiers, namely SVM, Random Forest, K-Nearest Neighbor, and Gradient Boosting, in order to determine the best classifier for evaluating water quality. Another application of the Naive Bayes classifier is in detecting small-scale assemblages of drifting vegetation and beach cast on Germany’s Baltic coast (Uhl et al., 2022). To obtain the best classification results, the classifier was used as part of an ensemble of five classifiers, which also included RF, CART, SVM, and stochastic gradient boosting classifiers. Naive Bayes has also been applied to tropical cyclone prediction based on multi-model fusion across the Indian coastal region (Varalakshmi et al., 2021). In these applications, the Naive Bayes classifier was reported to be effective in improving the accuracy of the models, particularly in predicting the quality of coastal water and detecting small-scale assemblages of drifting vegetation and beach cast. Its versatility across domains makes it a popular choice for improving model accuracy.

For coastline extraction from satellite images, three well-known groups of methods have been implemented: image processing techniques, unsupervised classifiers, and supervised classifiers. Shenbagaraj et al. (2014) employed visual interpretation and ISODATA (Iterative Self-Organizing Data Analysis Technique) classification techniques to extract shorelines from Landsat Thematic Mapper (TM) and Enhanced Thematic Mapper Plus (ETM+) sensor images, toposheets, and Google Earth images spanning a 60-year period from 1953 to 2013 between Kolachel and Kayalpattanam. This approach effectively identified the areas of coastline transgression and regression in the study area. Supervised classifiers such as Maximum Likelihood (Rokni et al., 2015), SVM, ANN, and EL (Çelik and Gazioğlu, 2022), RF (Bayram et al., 2017), Minimum-Distance-to-Means, and Mahalanobis distance (Sekovski et al., 2014) have also been employed to classify and detect the coastline position based on satellite images. As depicted in Figure 9, the average median distance of all shorelines from the reference suggests that the shorelines produced by the ISODATA and Mahalanobis methods demonstrate the best alignment, with a discrepancy of 2.2 m, thereby being closer to the reference than those of other methods. Conversely, the Parallelepiped and Maximum Likelihood methods resulted in shorelines with the highest average median distance from the reference shoreline, measuring 5.1 m and 5.6 m respectively.

FIGURE 9. Comparison of different classifiers’ performance in detecting shoreline position given the reference shoreline position at different sectors of the Ravenna coastal area (Sekovski et al., 2014). Sector 1: Bellocchio Channel to Reno River; sector 2: Reno River to Destra Reno channel; sector 3: Destra Reno channel to Lamone River; sector 4: Lamone River to Porto Corsini; sector 5: Marina di Ravenna to Fiumi Uniti River; sector 6: Fiumi Uniti River to Bevano stream; sector 7: Bevano stream to Savio River.

4 Summary and conclusion

This study provides a comprehensive review of machine learning applications for modelling the marine and coastal environments, covering the workflow from data preprocessing to the application of different models. The review indicates that appropriately implemented and optimized ML methods can significantly contribute to marine and coastal sustainability by providing accurate and robust models for the prediction of wave height, oceanographic parameters, and sediment transport, as well as for image processing and the optimization of coastal and marine structure design.

The main insights from this review are:

1. Dependence on data quality: the efficacy of ML models heavily relies on factors such as the quality of datasets, the type and configuration of the ML model, and tuning parameters. This reemphasizes the importance of sound data science practices in applying ML.

2. Exploitation of data: this paper underlines the importance of data preprocessing, including data cleaning, dimensionality reduction, and normalization in machine learning models. This emphasizes the pivotal role of quality data in the effectiveness of ML applications in modelling phenomena such as wave patterns, coastal erosion, and sediment transport in marine and coastal environments.

3. Diverse machine learning approaches: this paper examined three primary types of ML, namely supervised, unsupervised and reinforcement learning, and their respective applications in marine and coastal science. Supervised learning, using algorithms such as decision trees and neural networks, leverages labeled data to predict parameters like wave height and wind speed, and to make morphodynamic predictions. Unsupervised learning, on the other hand, independently discovers patterns and relationships in data for tasks like clustering and anomaly detection, and has been employed to classify wind values and examine beach profiles. Reinforcement learning, operating on a reward or penalty system, plays a vital role in devising control policies and planning for future scenarios in areas like flood risk reduction and wave energy conversion. Various ML methods such as PCR, LM, RT, and ANN are instrumental in facilitating these applications.

4. Classification algorithms: classification algorithms such as Kernel- and Tree-based models play crucial roles in environmental data interpretation and decision-making. SVM is known for its binary classification capabilities, while RT and DT provide swift classification and a better understanding of input variables. RF offers robustness against overfitting and efficiently manages multicollinearity. The KNN classifier performs well in comparing unclassified samples with training data. Naive Bayes, using Bayes’ theorem, efficiently processes and analyzes high-dimensional data and is often used in predicting water quality and tropical cyclone trajectories.

5. Application of ML: from forecasting oceanographic and morphologic parameters to estimating longshore sediment transport rates, the use of ML significantly enhances the capacity for prediction and understanding of marine and coastal environments. ANNs and SVR are frequently used for wave height predictions. Their accuracy and reliability help in crucial areas such as managing coastal erosion and development. The prediction of SST using ML, specifically LSTM neural networks, has shown great promise. Accurate SST prediction can contribute significantly to climate modeling, weather forecasting and the preservation of marine ecosystems. ANFIS has shown accuracy and reliability in estimating longshore sediment transport rates, which is essential for managing coastal erosion and development.

6. The growing role of new techniques: the rising prominence of deep neural networks, convolutional neural networks, and random forests is indicative of the evolution of the field, and the increasing complexity of the problems being addressed. These advanced techniques often deliver superior performance and can manage more complex and high-dimensional datasets. Integrated algorithms such as ICEEMDAN-ELM exhibit superior performance. The adoption of ML algorithms has seen a significant increase in recent years.

4.1 Recommendations for future research endeavours

• Developing hybrid models: the employment of combined and hybrid models has exhibited significant success, notably in addressing multifaceted issues. Eslaminezhad et al. (2022) advanced the efficiency of tree-structured machine learning models in determining the crucial parameters for forecasting flood susceptibility and constructing flood susceptibility maps, through the incorporation of the BPSO algorithm.

• Developing physical-based machine learning: machine learning models often do not adequately consider the actual physical elements of the problem. Consequently, integrating physical-based machine learning approaches is recommended for further consideration.

• Implementing domain adaptation techniques: to address the regional restrictions inherent in existing models, it might be prudent to consider the application of domain adaptation techniques.

• Evaluating models’ uncertainty: it is essential to acknowledge that inherent uncertainty is a fundamental aspect of any model. Thus, it is proposed that the models’ uncertainty be consistently documented, and appropriate methodologies be utilized to alleviate it.

• Development of appropriate scaling techniques: by developing appropriate scaling techniques, one ensures that all features contribute equally to the final prediction, thereby improving the performance of the machine learning model.

Author contributions

AP: Supervision, Compilation and Integration of Data, Data Curation, Software, Validation, Visualization, Writing—Review and Editing. MJ: Literature Search, Information Provision, Writing—Review and Editing. MB: Supervision, Writing—Review and Editing, Funding Acquisition, Project Administration. All authors contributed to the article and approved the submitted version.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abolfathi, S., Yeganeh-Bakhtiary, A., Hamze-Ziabari, S. M., and Borzooei, S. (2016). Wave runup prediction using M5′ model tree algorithm. Ocean. Eng. 112, 76–81. doi:10.1016/J.OCEANENG.2015.12.016
