Improved prediction of extreme ENSO events using an arti ﬁ cial neural network with weighted loss functions

The El Niño – Southern Oscillation (ENSO) causes a wide array of abnormal climates and extreme events, including severe droughts and ﬂ oods, which have a major impact on humanity. With the development of arti ﬁ cial neural network techniques, various attempts are being made to predict ENSO more accurately. However, there are still limitations in accurately predicting ENSO beyond 6 months, especially for abnormal years with less frequent but greater impact, such as strong El Niño or La Niña, mainly due to insuf ﬁ cient and imbalanced training data. Here, we propose a new weighted loss function to improve ENSO prediction for abnormal years, in which the original (vanilla) loss function is multiplied by the weight function that relatively reduces the weight of high-frequency normal events. The new method applied to recurrent neural networks shows signi ﬁ cant improvement in ENSO predictions for all lead times from 1 month to 12 months compared to using the vanilla loss function; in particular, the longer the prediction lead time, the greater the prediction improvement. This method can be applied to a variety of other extreme weather and climate events of low frequency but high impact.


Introduction
ENSO refers to the irregular periodic air-sea interaction in the tropical eastern Pacific Ocean, which is manifested in the El Niño-La Niña transition in the ocean and the Southern Oscillation in the atmosphere (Kug et al., 2009;Yeh et al., 2009;Timmermann et al., 2018;Wang, 2018).ENSO is one of the most important climatic phenomena on Earth (Wang, 1995;An and Jin, 2004;Cal et al., 2019), often causing global temperature and precipitation extremes, affecting Earth's ecosystems and human societies.Skillful ENSO prediction offers decision makers an opportunity to take into account the anticipated climate anomalies, potentially reducing the societal and economic impacts by this natural phenomenon and assisting in the management of natural resources and the environment (Han and Kug, 2012;Tang et al., 2018).
Over the past decades, the research about ENSO prediction has attracted broad attention and has resulted in significant improvements (Barnston et al., 2012;Han and Kug, 2012).In particularly, various attempts have been made to improve ENSO predictions using artificial neural network techniques (Dijkstra et al., 2019), such as convolutional neural networks (CNNs) (Ham et al., 2019), temporal convolutional networks (TCNs) (Yan et al., 2020), graph neural networks (GNNs) (Cachay et al., 2021), recurrent neural networks (RNNs) (Hassanibesheli et al., 2022), long short-term memory (LSTM) (Xiaoqun et al., 2020), and convolutional long short-term memory (ConvLSTM) (Gupta et al., 2019).General artificial neural network models use large volumes of input and output data to find nonlinear relationships that minimize errors in which there is no distinction between normal and abnormal events (Elman, 1990;Gers et al., 2000;Graves, 2012;Cho et al., 2014;Zhang et al., 2019;Yang et al., 2018;He et al., 2019).Thus, these models tend to be optimized for data pertaining to highoccurrence normal events and do not sufficiently consider infrequent abnormal events ultimately reducing the prediction accuracy for abnormal years.Figure 1 shows an example of the difference in ENSO prediction performance depending on where the focus is placed.
From a disaster prevention perspective, it is more important to better predict extreme ENSO years (i.e., El Niño and La Niña) than normal years (|Niño3.4|<0.5°C).However, few studies have investigated models' predictive performance focusing on abnormal ENSO years (|Niño3.4|> 0.5°C).Here, we propose a novel weighted loss function to improve prediction accuracy for abnormal years, which are much less frequent than normal years, but more important from the perspective of disasters.The method was applied to the LSTM neural network to predict Niño3.4.Predictive performance was evaluated for forecast lead times ranging from 1 to 12 months and compared with the results obtained using vanilla loss functions.
Two examples of the difference in the 6-month lead prediction performance of Niño3.4 (A) The case of high overall average predictive performance but relatively low prediction of abnormal years (y >0.5).(B) Low overall average predictive performance but relatively high prediction of abnormal years.Individual RMSEs and correlation skills (r), including the results for abnormal years only (y > 0.5, in parentheses), are shown at the top.Kim et al. 10.3389/fmars.2023.1309609Frontiers in Marine Science frontiersin.org 2 Methods

Neural network model
This study used the LSTM neural network, which has the advantage of long-term memory.LSTMs are specifically designed to address the challenge of long-term dependencies in time series data.They are capable of remembering information over extended periods, which is crucial for capturing patterns over extended time frames (Goodfellow et al., 2016).Traditional Recurrent Neural Networks (RNNs) often face the vanishing gradient problem, where gradients become too small for effective learning in deep networks.LSTMs mitigate this problem through their gating mechanisms.LSTMs are also highly flexible and can model the complexities of various time series data (Goodfellow et al., 2016), making them suitable for the diverse and dynamic nature of climate data.
For model training, time series of nine indices (Table 1) were used, in which the input data are monthly mean values for the 12 months prior to the base time, while the output data are the Niño3.4index for 1-12 months after the base time.The model hyperparameters are shown in Table 2.In this study, Adam's method (Kingma and Ba, 2014) was used as an optimizer, but L1 regularization, known as Lasso (Tibshirani, 1996), and L2 regularization, known as ridge regression (Hoerl and Kennard, 1970), which can prevent overfitting, were not used.This is because the use of L1 and L2 regularization can complicate the evaluation of the contribution of the loss function when interpreting the results, as the regularization effects can be mixed in (Hastie et al., 2004;Bishop, 2006).Instead of using regularization, we designed the neural network structure as simple as possible (i.e., the hidden size was set to 12) to prevent overfitting because the more complex the network configuration, the more it depends on the training data.
To improve prediction performance for abnormal ENSO years and ensure reliability, the optimized neural network model itself should first be constructed.Since the performance of artificial neural networks greatly depends on the quality and quantity of training data, the greatest possible volume of high-quality data was obtained, although it is more difficult to obtain sufficient data for training compared to other fields.If a model is designed as a complex network and has numerous parameters in the absence of sufficient data, performance improvement is limited due to overfitting.In general, when overfitting occurs, performance tends to be good during training but decreases significantly during testing.In this study, a neural network model optimized for small volumes of learning data was designed to prevent overfitting.RMSE and Pearson's r were used to evaluate and compare the results of normal and abnormal years.
For the period January 1900-July 2021, for which nine input variables overlapped, the training period spanned 105 years, from January 1900 to December 2004, and the validation period covered about 18 years, from January 2005 to July 2022 (Figure 2).The last 12 years of data were reserved for the verification of the 12-month lead time prediction.Artificial intelligence models are generally developed by dividing the data into training, validation, and testing periods to increase the reliability of the results.However, the lack of long-term data for ENSO predictions makes it difficult to obtain data for testing.Moreover, since the main purpose of this study was to examine whether the prediction of abnormal ENSO years could be improved using a weighted loss function, we focused only examining the effect of the weighted loss function on training and validation.The validation period was set to about 18 years after 2005 to ensure that more than two strong El Niño and La Niña events were included.All input and output data were normalized using maximum and minimum normalization (MinMaxScaler) to make the dimension and size equal.Specifically, to reflect the characteristics of Niño3.4,the input data were normalized between 0 and 1, and the output was normalized between −1 and 1.
Overfitting is a fundamental problem of machine learning that interferes with predictability.In general, overfitting occurs due to the presence of noise, the complexity of the neural network, and especially the limited size of training data.Ying (2019) suggested various methods, such as early stopping, network reduction, data expansion (data augmentation), and regularization, to prevent overfitting.Other methods have also been proposed, including dropout and regularization, small batches for training, low learning rates, simpler models (smaller capacity), and transfer learning (Koidan, 2019).This study used dropout, small batches, low learning rates, simpler models, and early stopping to prevent overfitting.The early stopping point was chosen to be when the validation loss was at its lowest value after the training loss had dropped sufficiently.Figure 3 is an example of training loss and validation loss plotted together.The relatively large total number of epochs was chosen so that many experiments would have enough

Data
Monthly climate indices were downloaded from the official National Oceanic and Atmospheric Administration website (https://psl.noaa.gov/gcos_wgsp/Timeseries/;accessed Oct. 2022).Niño3.4 is the anomaly obtained by subtracting the climatological mean of the period 1981-2010 from the regional SST mean of 5°N-5°S and 170°W-120°W (Figure 4).Niño3.4 was used as a predictand, whereas the other indices (Table 1) were used as predictors.To find potential predictors, various climate indices that are related to ENSO, were selected based on literature research and during model training, the correlations between Niño3.4 and all potential ENSO-related indices were investigated.Finally, nine optimized predictors were then selected as input variables.Table 1 provides references explaining the relevance of each index to ENSO.Since the collected data pertained to different periods, only data for the period in which all variables overlapped were used.

Experimental methods and evaluation
Time series forecasting can be divided into univariate and multivariate forecasting, as well as one-step and multistep forecasting.Univariate forecasting uses one variable X for predictand Y, whereas multivariate forecasting uses multiple variables (X s ).One-step (many-to-one) forecasting predicts a simple future point (Y t+n ), while multistep (many-to-many) The loss curves of training and validation for the weighted L1 experiment (blue: training, red: validation).We analyzed the results at around 1100 epochs, which is the best value for validation after the training loss had dropped sufficiently.simultaneously.This study used multivariate forecasting with nine input variables, allowing the use of as many observations and variables as possible.Multistep forecasting makes it possible to simultaneously make predictions for 1 to 12 months and maintain consistency between the prediction results for each lead time.Onestep forecasting, which independently predicts each month, may perform well for specific individual forecasts but may not perform consistently across the entire lead times (Makridakis et al., 1998;Armstrong, 2001;Box et al., 2015).Figure 5 shows the schematic diagram illustrating input/output data and period for training and validation.
To consistently compare the experimental results, a random seed value in the neural network was fixed and tested (Madhyastha and Jain, 2019).We also did not use ensemble method because it combines multiple models to improve prediction performance, making interpretation more difficult compared to a single model (Hastie et al., 2004).This was a necessary step in the twin experiments conducted to investigate the performance of the loss functions.Table 2 shows the experimental configuration for the investigation of the Niño3.4predictive performance using various weighted loss functions.A total of 4,000 epochs were trained for each experiment, and periodic verification was performed to select the best predictive performance.The procedure was similar to an early stopping method.
To evaluate the model's accuracy and error simultaneously, the RMSE and Pearson's r were compared between the predicted values and the Niño3.4index.Both the RMSE and Pearson's r had values between 0 and 1.The closer Pearson's r is to 1, the higher the accuracy, and the closer the RMSE is to 0, the smaller the error.To investigate the performance of only abnormal years, the result of | Niño3.4|> 0.5°C was additionally presented.

Design of weighted loss functions
In genal, three loss functions are used to calculate the loss of neural network models: L1 (mean absolute error; L(y, f (x)) = (y − f (x)) j j ), L2 (mean squared error; L(y, f (x)) = (y − f (x)) 2 ), and Huber (1964) (L d (y, f (x)) = 0:5(y − f (x)) 2 for y − f (x) j j≤ d and d ( y − f (x) j j− 0:5d , otherwise) where L is the loss function, y is the predictand, f (x) is the predicted value, and d is the boundary value separating quadratic and linear.L1 is less affected by outliers than the other two functions, and L2 is favorable for finding optimal values.Huber was developed to take advantage of L1 and L2 and compensate for their shortcomings based on the d value and shows satisfactory learning ability in most neural networks because of its simplicity and good performance.In this study, these vanilla loss functions are multiplied by two (linear and quadratic) weight functions to weight loss value according to the level of the predictand.The right side of Figure 6 shows the two types of weight functions as an example-linear function (y) and quadratic function (y 2 ), in which slopes of the functions are defined as 1, 5, and 10.This indicates that the weight increases linearly (or as a quadratic function) up to ±1.0, ± 0.2, and ±0.1, respectively.After that, the weight is fixed to 1 to be equal to the vanilla loss function.In this way, each loss value can be weighted according to the level of the predictand.This method can be expressed as follows: where W(y) are the weight functions and are expressed as follows: Linear (W1): otherwise : ( Quadratic (W2): otherwise : ( Since the loss value is combined for all years and the model is trained to minimize this loss, reducing the weight of normal years using these equations allows for a greater focus on abnormal or extreme ENSO events.The adjusted weighting system ensures that the model does not underestimate significant abnormal events, improving prediction accuracy especially for extreme events.The optimal slope of the weighted functions was determined according to the distribution of the predictand (Niño3.4).In this study, six weighted functions, linear (W1) and quadratic (W2) functions with slopes of 1, 5, and 10, were tested based on the data distribution of Niño3.4 (Figure 7).For the period January 1900-July 2021, the training period spanned 105 years, from January 1900 to December 2004, and the validation period covered about 18 years, from January 2005 to July 2022.

ENSO prediction results
A total of 18 experiments (Table 3) were performed by combining the six weight functions and the three vanilla loss functions (L1, L2, and Huber).Predictive performance was evaluated for lead times ranging from 1 to 12 months.In each case, the results obtained using the vanilla L1, L2, and Huber loss functions were compared to those obtained using the weight functions.The weight functions that performed best for the three loss functions are shown in Table 4.
Figure 8 shows a comparison of the Niño3.4predictions for abnormal years (y>0.5°C)over lead times ranging from 1 to 12 months using the vanilla loss function and the six weighted loss functions.For L1, most of weighted functions outperformed vanilla function on average in abnormal years (Figures 8A, B; Table 4).In particular, the performance improvement was greater for longer lead predictions.For the L1 loss function, a quadratic weight function with a slope of 10 (10W2; purple line) performed best, followed by 5W2 (Table 4).For the L2 loss functions, all weight functions had lower errors and higher correlations than vanilla L2 (Figures 8C, D; Table 4).For L2, a quadratic weight function with a slope of 1 (1W2; dark yellow line) showed the best performance The first number of the code represents the slope).(Here, y is Niño3.4(predictand), W1 & W2 are the type of weight function, and L1, L2, & Hb are the type of loss function.Here, the results of vanilla L1, L2, and Huber loss functions are compared with those of using six weight functions consisting of linear (W1) and quadradic (W2) functions with three slopes (1, 5, and 10).The values in parentheses are results for abnormal years only (y > 0.5).Bold values represent the best results for the three loss functions for normal and abnormal years, respectively.
with a 10-20% improvement over vanilla L2 in the 6-12-month predictions of abnormal years.However, the performance improvement of the best (optimal) weighted loss function (i.e., 1W2) was smaller than that of L1.This is because vanilla L2 is known to have significantly better performance compared to L1 (also shown in our results; compare the black lines in Figures 8A-D), making further improvement by the weighted loss function difficult.For the Huber loss functions, the results were overall similar to L2, except that the best performing model was a linear weighted function with a slope of 1 (1W1, green line Figures 8E, F; Table 4).In particular, 1W1 showed the largest improvement over vanilla Huber functions in the 6-9-month lead predictions (Figures 8E, F).Analysis of extremely abnormal years with the Niño3.4index anomalies above 1.0 °C (y>1.0) was also consistent with results from abnormal years with a threshold of 0.5°C (Figure 9).That is, the weighted loss function performs better than the vanilla loss function, and the best weighted loss functions for L1 and L2 were 10W2 and 1W2, respectively.ENSO predictive performance over six-month lead times is a widely used metric when evaluating models.Previous models have shown remarkable performance for up to 6-month lead times (e.g., Ham et al., 2019;Yan et al., 2020).In this study, the best model (10W2L1) for 6-month lead predictions had an RMSE of 0.61 and a correlation skill of 0.71 for all years, which is comparable to the performance of previous state-of-the-art studies (Figure 10).The high performance is mainly the result of a significant improvement (i.e., from 0.8 to 0.70 for RMSE and 0.71 to 0.79 for correlation skill) over the vanilla loss function in abnormal years.In particular, the  in 2007-2008 and 2010-2011.For the latest period, however, the performance was relatively poor (Figure 10).This could be due to changes in climate behavior or the evolution of factors influencing ENSO in recent years (Cai et al., 2023), and possibly because the model did not take these into account.

Discussion
The optimal slope of the weight functions was determined depending on the combination of the distribution of the y data and the function used.Therefore, the selection of the slope also requires an optimization process similar to the tuning of hyperparameters.In this study, we conducted more experiments than those presented here, but only the results of three discrete slopes (1,5,10) are shown here.This is because we believe that the impact of using a weighted loss function in this study can be sufficiently explained through the three samples.The experimental results show that the smaller the slope, the smaller the error in most cases of abnormal years (Figure 11).This may be because the frequency of the y data generally decreases almost linearly or quadratic with y (Figure 7).A closer look reveals that slope 1 generally shows the best performance except quadratic-L1 (Figure 11).When the slope was further reduced to 0.5, the performance was similar to that of slope 1 (not shown).In the case of quadratic-L1, slope 5 or 10 showed the best performance.The distribution of the y data shows that the bulk (81%) of the data were in the range of −1 to 1 and the ranges of slopes 5 and 10 correspond to where the data is concentrated (Figure 7).In some cases, therefore, the greatest benefit can be achieved by reducing the weights only in this high frequency range.These results emphasize again that the optimal type of weight function and slope should be determined according to the data distribution.
In this study, only simple linear and quadratic functions were tested for sensitivity, but it is possible to develop and use various weighting functions (including known probability distribution functions) that better express the distribution of data.The current method is similar to data augmentation and batch sampling in that it solves the data imbalance problem, but it has the advantage of using all the data without artificially reducing or increasing the sample and using various weighting functions to focus on and optimize the desired target.
Observational data necessary for the development of ENSO prediction models using artificial neural networks are insufficient.To solve this problem, Ham et al. (2019) devised a transfer learning method using the results of a three-dimensional global circulation model.Initially, we also constructed an LSTM model using additional long-term climate model data, and tested their performance.However, the uses of these data did not significantly improve the model's predictive accuracy.This may be because the numerical model itself has limitations in realistically simulating ENSO and this can negatively affect learning ability in current LSTM experiments.Subsurface ocean information, known as an early origin of ENSO variations, was also not used as a potential predictor in this study, as there is no long-term subsurface data with high accuracy, and the LSTM's test results did not show a significant improvement in performance.In fact, we have tested subsurface variables extracted from the three-dimensional ocean data of Simple Ocean Data Assimilation (SODA), such as mixed layer depth and thermocline slope, as potential predictors, but did not achieve any significant performance improvement.This may be because that subsurface ocean data has lower accuracy and shorter duration compared to SST data, due to the difficulty in obtaining observational data, which may have contributed to the lack of performance improvement.
While many recent studies have focused on improving ENSO forecasting at long lead times (>1.5 years 18 months) (Luo et al., 2008;Doi et al., 2016;Ham et al., 2019;Patil et al., 2023), this study aims to improve accuracy and reliability for extreme or abnormal ENSO events using a new weighted loss function.We believe that the analysis results presented here, based on a lead time of up to one year, can support the claims and conclusions of this study.However, it is meaningful to investigate whether the current method accurately predicts extreme/abnormal ENSO events even with such long lead times.This remains a task for future research.
State-of-the-art ENSO prediction models can be developed using better methods and data, but it is important that this study can contribute to further improving the performance of these artificial intelligence models.In particular, the method based on the weighted loss functions presented herein can be effectively used when the observational data for training are insufficient and when the number of extreme events is small.Given that most extreme events are infrequent, the proposed method can be used to predict not only extreme ENSO but also various other extreme weather and climate events such as tropical cyclone, floods, drought, and heat waves.This study does not explain which predictors contribute the most to the improvement in accuracy for the abnormal years since AI models are composed of non-linear relationships between predictors, which makes them difficult to interpret.In the case of CNN, heatmaps are often used as a basis for judging the results, but

Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

FIGURE 2
FIGURE 2 Niño3.4time series used for training and validation.The extra data were used for the verification of the 12-month prediction.

FIGURE 5
FIGURE 5Schematic diagram illustrating input/output data and period for training and validation.

FIGURE 6
FIGURE 6Schematic diagram explaining how to apply a weight function to a vanilla loss function.Here, linear (W1) and quadratic (W2) weight functions with slopes of 1, 5, and 10 are applied to the vanilla L1, L2, and Huber loss functions.The source code of the weighted loss function is provided in the Supplementary Information.

FIGURE 7
FIGURE 7Histogram showing the distribution of Niño3.4 along with linear (W1) and quadratic (W2) weight functions with slopes of 1, 5, and 10.The bulk (81%) of the data were in the range of -1 to 1.
FIGURE 10 6-month lead predictions for Niño3.4 using the vanilla L1 loss function (blue line) and an optimal weighted loss function (10W2L1, red line).The observed Niño3.4 is also shown (black line).Individual RMSEs and correlation skills for the validation period (2005-2022), including the results for abnormal years only (y > 0.5, in parentheses), are shown at the top.An additional time series from 2000 to 2004, the last part of the training period, is shown here to ensure that there is no overfitting.
FIGURE 11 RMSEs and correlation skills of linear (A-D) and quadratic (E-H) weight function according to the slope.The performance of the weight function generally improves as the slope decreases, but the quadratic function for L1 performs best at slopes of 5 and 10.The results for abnormal years only (y > 0.5) are shown in (B, D, F, H).

TABLE 1
Indices used as input/output data for ENSO predictions.
IKim et al.  10.3389/fmars.2023.1309609Frontiers in Marine Science frontiersin.orgtime to learn about finding the best epoch for each experiment.For this experiment, we compared the prediction results at about 1100 epochs, the best value of validation.

TABLE 2
Hyperparameters of the neural network model used in this study.

TABLE 3
Experimental configuration of the weighted loss functions to investigate Niño3.4 prediction performance.