Spatial-temporal transformer network for multi-year ENSO prediction

The El Niño-Southern Oscillation (ENSO) is a quasi-periodic climate type that occurs near the equatorial Pacific Ocean. Extreme periods of this climate type can cause terrible weather and climate anomalies on a global scale. Therefore, it is critical to accurately, quickly, and effectively predict the occurrence of ENSO events. Most existing research methods rely on the powerful data-fitting capability of deep learning which does not fully consider the spatio-temporal evolution of ENSO and its quasi-periodic character, resulting in neural networks with complex structures but a poor prediction. Moreover, due to the large magnitude of ocean climate variability over long intervals, they also ignored nearby prediction results when predicting the Niño 3.4 index for the next month, which led to large errors. To solve these problem, we propose a spatio-temporal transformer network to model the inherent characteristics of the sea surface temperature anomaly map and heat content anomaly map along with the changes in space and time by designing an effective attention mechanism, and innovatively incorporate temporal index into the feature learning procedure to model the influence of seasonal variation on the prediction of the ENSO phenomenon. More importantly, to better conduct long-term prediction, we propose an effective recurrent prediction strategy using previous prediction as prior knowledge to enhance the reliability of long-term prediction. Extensive experimental results show that our model can provide an 18-month valid ENSO prediction, which validates the effectiveness of our method.


Introduction
The EI Nino-Southern Oscillation (ENSO) is one of the recurring interannual variability of ocean-atmosphere interactions phenomenon over the tropical Pacific Ocean and contains three phases (onset, mature and decay) with respect to the changes of sea surface temperature(SST). When the SST are higher than normal in the central and eastern equatorial Pacific Ocean, it is called El Nino, and when it is lower than normal, it is called La Nina Larkin and Harrison (2002). With wind and SST oscillations, the ENSO has wide influences, for example, the global atmospheric circulation Alexander et al. (2002), crop production Solow et al. (1998), environmental and socioeconomic (McPhaden et al. (2006)), ecology and economy Reyes-Gomez et al. (2013). Therefore, accurate prediction of ENSO occurrence can guide us to take preventive measures and effectively reduce the impact of natural disasters on human society. However, due to the predictability barrier and chaos of climate variability Mu et al. (2019) ENSO prediction remains an extremely challenging task.
In recent years, there are several related indicators to reveal ENSO underlying complex climate change, such as Nino3.4 index and the SST index Yan et al. (2020). All of them utilize the historical SST or Heat Content (HC, Vertical mean ocean temperature above 300 m) to predict whether the ENSO event will happen in the future. Among these indicators, the Nino3.4 index is frequently employed to evaluate phenomenon of ENSO, which calculates mean SST anomaly(SSTA) maps of three consecutive months in an area of 5°N-5°S and 170°W-120°W Ham et al. (2019). The existing ENSO prediction methods can roughly classified into numerical prediction methods (NWP), traditional statistical methods and deep learning methods Ye et al. (2021b). The NWP methods usually adopt the mathematical physics and integrating governing partial differential equations to predict future Nino3.4 index Bauer et al. (2015). Specifically, Zebiak et al. Zebiak and Cane (1987) proposed the first coupled atmosphere-ocean model for forcasting the ENSO phenomenon, and subsequently various models like Intermediate Coupled Model (ICM), Hybrid Coupled Model (HCM) and Coupled General Circulation Model (CGCM), have been proposed to obtain 6-12 months of reliable predictions He et al. (2019).For example, Zhang et al. Zhang and Gao (2016) developed an ICM for enso prediction focusing on thermocline effect on the SST, which reasonably captures the overall warming and cooling trends from 2014-2016. Subsequently, Barnston et al. Barnston et al. (2019) validated the ENSO prediction skill in the North American Multi-Model Ensemble(NMME) and found that NMME can effectively improve the ENSO prediction skill. Johnson et al. Johnson et al. (2019) used the European Centre for Medium-Range Weather Forecasts(ECMWF) to predict ENSO and found that ECMWF has powerful advantages in ENSO prediction, especially in the difficult-to-predict northern spring and summer season. Ren et al. Ren et al. (2019) developed a statistical model to examine the East Pacific (EP) type and Central Pacific (CP) type predictability, and the results showed that ENSO predictability is mainly derived from changes in the upper ocean heat content and surface zonal wind stress in the equatorial Pacific. However, due to weather prediction is highly dependent on initial and boundary conditions, as well as a large variety of physical quantities, which hinder the application of NWP in long-term prediction Ludescher et al. (2021). Furthermore, with the horizontal resolution increasing, the numerical models will lead to an explosion of time costs and computational resources Mu et al. (2019); Ye et al. (2021b). Traditional statistical methods summarized and analyze the shallow patterns in historical data of ENSO, and then, realize the prediction of future ENSO Yan et al. (2020). Concretely, Petrova et al. Petrova et al. (2017) decomposed the time series into dynamic components and captured the dynamic evolution of ENSO to obtain efficient predictions. Subsequently, PETROVA et al. Petrova et al. (2020) added a stochastic periodic component associated with the ENSO time scale, which further improved the prediction. Wang et al. Wang et al. (2020) proposed a nonparametric statistical approach based on simulation prediction to address the limitation of long-term prediction for statistical methods raised by highly non-linear and chaotic dynamics. Rosmiati et al. Rosmiati et al. (2021) proposed the auto regressive ensemble moving average (ARIMA) model to predict the Niño3.4 Index and found that ARIMA was very effective in predicting ENSO events. However, ENSO is non-linear oceanatmosphere phenomenon over time, traditional statistical methods can not well capture the complex patterns and knowledge to effectively predict the ENSO phenomenon Yan et al. (2020).
As deep learning techniques have developed, researchers have began to design neural networks for predicting weather elements (e.g., rainfall), which can well mine complex and intrinsic correlations, such as artificial neural networks (ANN) Feng et al.  Feng et al. Feng et al. (2016) propose two methods to predict the existence of ENSO, and the time evolution of ENSO scalar features, which provided a new prediction direction for predicting the occurrence for ENSO events. Broni-Bedaiko et al. Broni-Bedaiko et al. (2019) used the LSTM networks for multi-step advance prediction of ENSO events, which complemented the previous models and predicted the ENSO phenomenon 6, 9, and 12 months in advance. Mu et al. Mu et al. (2019) defined ENSO prediction as a spatio-temporal series prediction issue and used a mixture of ConvLSTM and rolling mechanism to predict the outcome over a longer range of events. The GNN was first used in Cachay et al. (2020) for seasonal prediction, it predict the result in a longer lead time. Zhao et al. Zhao et al. (2022) designed an endto-end network, named Spatio-Temporal Semantic Network (STSNet), it provided a multiscale receptive domains across spatial and temporal dimensions. The significant breakthrough work is the CNN-based model designed by Ham et al. Ham et al. (2019), which is proficient in predicting ENSO incidents for as long as 1.5 years, significantly higher than most existing methods. Subsequently, Ye et al. Ye et al. (2021b) adapted the different sizes of the convolutional kernel to capture the different scale information and further improved the accuracy than Ham et al. (2019). Patil et al. Patil et al. (2021) trained CNN models using accurate data with the all season correlation skill greater than 0.45 at lead time of 23 months. Another major breakthrough is the combination of the POP analysis procedure with the CNN-LSTM algorithm by Zhou and Zhang (2022), which explores hybrid modeling by combining physical process analysis methods with neural network and proves its effectiveness. In addition, deep learning in the field of spatio-temporal prediction is now well developed, Li et al. Li et al. (2022) developed an adversarial learning method fully considering the spatial and temporal characteristics of the input data to produce accurate wind field estimates, and Lv et al. Lv et al. (2022) proposed a new generative adversarial network model to simulate the spatial and temporal distribution of pedestrians to generate more reasonable future trajectories, which provides new ideas for ENSO prediction.
Although certain advances have been made in ENSO-related studies, there are still quite limited predictions due to the following reasons: (1) The ENSO phenomenon contains prominent spatiotemporal characteristic, and even if the temperatures of two stations with long time intervals and far apart locations, they may still have complex interactions with different implications for future ENSO prediction. The traditional CNN convolution kernel suffers from the problem of local receptive field, for example, to obtain the SST anomaly relationship between the North Pacific and South Atlantic, it is necessary to stack the deep layers to obtain these two areas, but the amount of information decays as the number of layers increases Ye et al. (2021a). The transformer-based methods explored the attention mechanism to capture the global receptive field. However, these methods mainly model the spatial information, resulting in confusing spatio-temporal features Nie et al. (2022). (2) Due to the variable rate signal and high frequency noise in atmosphere-ocean system, it is a challenge for predicting long-time ENSO in advance. The previous close calendar months have significant effect on the next month prediction, while those with longer intervals have low effect. Existing methods ignore the nearby prediction results when they mine the spatial-temporal patterns in the next time, resulting large errors due to the large magnitude of ocean climate variability over long intervals. (3) The ENSO phenomenon has an obvious statistical characteristic of annual cycle Zhou and Zhang (2022), and how to effectively use this interannual characteristic to capture the correlation between historical and predicted data is the key to improve the prediction of the future trend change in atmosphereocean system.
To solve the above limitations, we designed a novel Spatialtemporal Transformer Network for Multi-year ENSO prediction, which is named STTN. First, as the ENSO phenomenon has large-scale and long-term dependencies across both spatial and temporal dimensions, we employed a multi-head spatialtemporal network to adaptively model the variations along with the changes in space and time, which can effectively captures the global and successive characteristics of climate change. Second, we designed an effective recurrent prediction strategy to utilize the previous predictions as prior knowledge for long-term prediction by a single model. To mitigate the negative influence of false predictions, we encoded the contextual information of successive predictions by temporal convolution operation to fully exploit the historical contextual time series. Third, we integrated the month information into the procedures of SSTA and HC anomaly (HCA) maps feature encoding and predictions, which guides the model to better capture the seasonality and periodicity of the ENSO phenomenon.
The main contributions from our work are summarized below: • We proposed a novel spatial-temporal transformer network to model the variations of SSTA and HCA along with the changes in space and time, which can adaptively captures the inherent characteristics of climatic oscillation. • We introduced an effective recurrent prediction strategy to treat previous predictions as prior knowledge for long-term predictions and utilize the context of predictions to mitigate the error accumulation during recurrent prediction. • We integrated the temporal index as position embedding into the feature learning procedure to facilitate mining the influence of seasonal variation on predicting ENSO. • The extensive experiments indicated that our single model outperforms the state-of-the-art methods with multiple ensemble models, which demonstrates the effectiveness of our method at dynamic prediction.

Data processing
The ENSO prediction has been defined as a spatio-temporal prediction issue, where the objective is to use the ENSO historical data x t-T+1,…, x t-1, x t to predict the Niño3.4 indexes for the next l months. This process is formulated as: where F denotes the deep learning model, l denotes the lead month, T denotes the length of historical input data. The illustration of our proposed network is illustrated in Figure 1. The time unit of ENSO historical input data contains T consecutive months, denoted as x ssta ϵ R T×H×W and x hca ϵ R T×H×W for SSTA and HCA, respectively. T, H, and W indicate time, height, and width for the input data, respectively. To model the spatial and temporal correlation with a global perspective, we adopt the transformer structure as the backbone of our method. To meet the requirement of transformer structure, we first reshape the SSTA and HCA 2D data into a sequence of flattened 2D patches. Taking x ssta as an example, each grid map is divided into N patches with same size: x 0 ssta ∈ R TÂNÂp 1 Âp 2 , N=H×W/(p 1 ×p 2 ). The p 1 and p 2 is the size of each patch, then each patch is converted into a onedimensional vector with p 1 ×p 2 dimension. Then, we adopt a linear layer to project these vectors into D dimension. Finally, the features of the SSTA or HCA can be represented as f ssta ∈R T×N×D and f hca ∈R T×N×D .

Spatial-temporal position encoding
Due to the complex historical input data with periodic characteristics, we need to assign the position indexes for each patch to let the network know the location and order of each patch, so that the model can explore the correlations among different locations or at different times. To encode the temporal information, we adopt different sine and cosine functions Vaswani et al. (2017), which are periodic and can explore the temporal characteristic of abnormal temperature. Take f ssta as an example: where i is the time step of the input sequence or the calendar month in the period of C, and j is the index of dimension, PO∈R T×D . For the location of each patch within space, we learn spatial positional embedding E∈ N×D . Finally, the spatio-temporal position is added to the feature f ssta and to obtain the embedding vector z (0) ssta .
where Norm is the LayerNorm operator, and the embedding vector z (0) hca of HCA can also be obtained by the above process. In addition, the calendar month information and the time step of the input sequence also contributed to the recurrent prediction strategy which will be presented later.

Spatial-temporal attention module
To better model the spatial and temporal characteristics of ENSO, we adopt a multi-head attention to encode the variability. Without losing generality, we take SSTA data as the input. The encoder structure is shown in Figure 2A, which consists of spatial and temporal attention, multi-layer perceptron, and residual connection to obtain the feature representation. To capture the temporal dynamics, we first use the self-attention mechanism in the time dimension. For example, in the case of temporal attention, exclusively using keys from the same patches but different frames as the query, the query, key, and value vectors in the m-th Encoder block can be computed from the feature vector z (m−1) ∈R N×T×D as follows.
where t = 1,…,T, and Norm is the LayerNorm operation, a = 1,…,A is the index of attention heads, and A is the sum of attention heads, the dimension of the attention head is given as are the parameters for the projection layers. The weights of temporal patches are obtained by a dot product calculation as follows.
where s is the softmax activation function and a (m,a) t ∈ R TÂT is the temporal attention layer m in terms of a-th head. The patch representations are calculated by these weights.
Then, these vectors from all the attention heads are concatenated and projected: where W t is the parameter of the linear layer and [] indicates concatenation operation. Further, to capture the spatial dynamics, we use the spatial attention immediately after the temporal attention. The spatial attention calculates the weights in the spatial dimension, exclusively using keys from the same frame as the query. When implementing the spatial attention, we can exchange the spatial and temporal dimensions of z m t ∈ R NÂTÂD , then the query, key, and value vectors can be computed from the feature vector z m t ∈ R TÂNÂD as follows: Then, the weight of each space patch also can be computed by the dot product calculation: where a (m,a) s ∈ R NÂN and s is the softmax activation function. The encoding of the spatial attention at layer m can be similarly obtained by Eq. 6 Finally, we can also obtain the output z (m) s of spatial attention as follows: where W p is the parameter of the linear layer and [] indicates the concatenation operation. After using the temporal and spatial attentions, we use the residual connection and multilayer perceptron (MLP) to ensure the stability of the gradient and mine the spatio-temporal features.
After encoding SSTA and HCA data, we get the spatio-temporal features of SSTA and HCA respectively, and in order to perform joint prediction, we concatenate the features of SSTA and HCA to get the feature Z∈R (2×T×N)×D .

Recurrent prediction strategy
In order to use previous predictions as prior knowledge for long-term prediction, we introduced an effective recurrent prediction strategy (RPS). Specifically, we first utilized the selfattention,cross-attention blocks, MLP layer, and residual connection to construct the decoder of the spatial-temporal characteristics. The structure of the decoder is depicted in Figure 2B. Then, the temporal convolutional block with onedimensional convolution was adopted to encode the prediction context, which can help reduce the error accumulation in the recurrent prediction process. Finally, the fully connected layer maps the feature vector into the Niño3.4 index to optimize the whole network. It is worth noting that these operations are used in each step of the recurrent prediction. Since the Niño3.4 index is calculated by SSTA, we averaged the features of the SSTA to generate the start character CLS Vaswani et al. (2017). When predicting the Niño3.4 index for the l-th lead month, the complete calculation is as follows. First, the output of the decoder before the (l-1) th month is concatenated with CLS to generate the input e 0 ∈R l×D , which is used for the decoder query. Meanwhile, the time sequence position encoding and calendar month information in the period of C are added to the output of the decoder before the concatenation, and then e 0 is input to the decoder to predict the where e m-1 is output of the m-1 layer decoder block, SA is selfattention. To prevent future information leaks, we use the mask [31] to ensure that the l-th lead month feature can only depend on known outputs smaller than the l feature location in e m-1 . CA is crossattention, and its query/key/value can be computed by e' (m) /Z/Z. e m is the output of decoder for the l-th lead month, then we can get the l-th lead month Niño3.4 index after through a fully connected layer. Moreover, in order to use previous predictions as prior knowledge for long-term projection, we concatenate e m l into the input features Z of the CA, l is an index of e m , and use a one-dimensional convolution with k convolution kernels to mitigate the error accumulation.  Giese and Ray (2011), and Global Ocean Data Assimilation System (GODAS) Behringer and Xue (2004). These datasets contain the anomaly maps of SST and HC from 180°W-180°E and 55°S-60°N, the spatial resolution of each map is 5°x 5°. The goal of these datasets is to predict the Niño3.4 indexes in the next consecutive months. The details of the data are shown in Table 2. The training dataset includes simulated data from the CMIP5 Taylor et al. (2012) in the period from 1861 to 2004, the validation dataset includes the reanalysis data from the SODA Giese and Ray (2011) in the period from 1871 to 1973, and the test dataset includes the reanalysis data from the GODAS Behringer and Xue (2004) in the period from 1982 to 2017. All methods utilize the same data for training, validation and evaluation. In addition, following the existing work Zhou and Zhang (2022), we also validated our proposed method in Coupled Model Intercomparison Project Phase 6 (CMIP6 Eyring et al. (2016)), SODA, and GODAS. These datasets contain the anomaly maps of SST and HC from 175°W-175°E and 50°S-50°N, the spatial resolution of each map is 5°x 5°, and the details of the data are shown in Table 3. It is worth noting that the dataset in Table 3 was used only for comparison with Zhou and Zhang (2022).

Evaluation metrics
To fairly evaluate the performances of the proposed method and competing methods, we adopted Temporal Anomaly Correlation Coefficient Skill (Corr) and Root Mean Square Error (RMSE) between the predictions and observations with different leading months l, as used in Ham et al. (2019). Corr is a measure of linear correlation between predicted and observed values, and RMSE is the standard deviation of the residuals, which is a standard measure of prediction error between predicted and observed values. In addition to the above metrics for evaluating the performance of ENSO prediction, we also calculated the Mean Absolute Error (MAE) to evaluate the average absolute values. The formulations of Corr, RMSE, and MAE are as follows: where P is the predicted value, Y is the observed value, P m,l is the mean of P, Y m is the mean of Y, m is the calendar month, ranging from 1 to 12. s and e are the start and end years of the data, respectively.

Implementation details
Our approach was implemented on the Pytorch framework, and all experiments were performed on an NVIDIA RTX3090ti with 24 GB of memory. We adopted the strategy of Adaptive moment estimation (Adam) to optimize the network learning. Following the Noam Optimizer Vaswani et al. (2017), we adjusted the learning rate during training. In order to clearly understand the experimental setup, we list the main hyperparameter symbols, descriptions, and the values being set in Table 4, the B 1 , B 2 , p 0 1 , p 0 2 , p 1 , p 2 are set to 160, 80, 8, 12, 10, 14, respectively. The number of layers M of Encoder and Decoder is fixed to 6, the value for attention head A is fixed to 6, and D 1 and D 2 are set to 384 and 768. The convolution kernel of the temporal convolutional network is k=4. The dropout rate d is set to 0.1. The pos in the input sequence of the Encoder is set to 0, 1, 2 and it is set to 3,…26 in the Decoder. The ENSO cycle C is set to 2. For the reproducibility of the experiments, the seeds of CPU and GPU are both 5 when we initialize the parameters, and the GPU seed is 0 when the model is training.

Comparisons with state-of-the-arts
We compare our method with several representative methods, including numerical prediction and deep learning methods, respectively. The numerical weather prediction contains Scale Inter-action Experiment-Frontier(SINTEX-F) Luo et al. (2008) and the North American MultiModal Ensemble (NMME) Kirtman et al. (2014)

Numerical prediction vs deep learning
All deep learning methods (e.g. CNN, MS-CNN and Transformer, etc.) outperform the numerical prediction methods (e.g. SINTEX-F and NMME). The main reason is that the numerical prediction methods design mathematical models of the atmosphere and ocean to mine complex variations with complex calculation processes, while the data-driven deep model can automatically explore the variant characteristics of the EI Niño-Southern Oscillation.

CNN-based method vs transformer-based method
The ACorr of single CNN model is above 0.5 for a lead of 13 month prediction Ye et al. (2021b), while the ACorr of multi-scale CNN model is above 0.5 for a lead of 15 month prediction Ye et al. (2021b), which demonstrates that different scales of convolutional kernel sizes utilize multiple receptive fields to better obtain the region correlations. Moreover, the transformer-based methods (e.g. Transformer and ENSOTR) adopt the attention mechanism to conduct spatial interactions and easily obtain global correlations between different regions and outperform the CNN-based methods.

Transformer-based mehtod vs ours
Our proposed method dramatically outperforms the state-ofthe-art methods. Specifically, our method without using ensemble multiple models outperforms the ensemble model ENSOTR for all predicted lead months, especially for 3-10 lead months. Comparing to Transformer and ENSOTR, our method not only designs the attention mechanism across both spatial and temporal dimensions but also incorporates the knowledge of prediction and influence of seasonal variation into the learning procedure, which better facilitates the EI Niño prediction. Figure 3B shows the Corr of the Niño3.4 index variation for each calendar month. The figure shows that our model (right) predicts more months with a Corr of the Niño3.4 index higher than 0.5. In particular, when the target season is May-June-July (MJJ), the SINTEX-F only contains 4 months Ham et al. (2019), the MS-CNN contains 10 months Ye et al. (2021b), and the CNN ensemble model (left) contains 11 months with a correlation coefficient skill higher than 0.5. Our method has 15 months for which the correlation coefficient skill is up to 0.5, which shows that our method can effectively mitigate the drifts of SST and HT due to the springtime equatorial Pacific trade winds. In summary, the ACorr of the Niño3.4 index of our model outperforms all competing methods and can skillfully predict the EI Niño3.4 index over 18 months.

Symbol
Description Value

Comparison on the CMIP6 dataset
We also compare our method with POP-Net Zhou and Zhang (2022), which is currently the best performing method trained on the CMIP6 dataset. The results are shown in Figure 4. The ACorr of POP-Net model is above 0.5 for a lead of 17 month prediction, while the ACorr of our model is above 0.5 for a lead of 18 month prediction. In general, the ENSO prediction skill of our model is better relative to POP-Net, especially when the lead month is in the range of 12-24. The main reason is that the STTN model can use previous predictions as a priori knowledge for future predictions, which can provide reliable long-term forecasts. Table 5 compares the number of parameters and time cost for the training and testing of the CNN model Ham et al. (2019) and our model. Since the CNN model uses integrated learning, the total number of models is 11040 (23 leadmonths, 12 target months, 4 network settings, and 10 training sessions per model). The number of parameters in the four network settings is 0.12M, 0.18M, 0.21M, and 0.32M, respectively, and the total number of parameters is 2290.8M, which is much larger than our model. In addition, the training and testing time of our model is much lower than that of the CNN model, because STTN only uses the single model instead of the integrated model. The Niño 3.4 index for the next 23 lead months is available in a single run using the STTN model, which indicates that our model can predict the occurrence of El Niño in a more timely and rapid manner.

Ablation study
In order to verify the importance of our different modules, we performed ablation experiments for each module. To keep the experiment fair, we use the same experimental setting during training as well as testing, including data partition and network hyperparameters. We remove the proposed module from the final network model STTN to demonstrate the effectiveness of using the monthly index of period, the previous prediction as prior knowledge, and TCN, respectively. W/O X indicates the removal of the X module. Figure 5 shows the ACorr,  RMSR, and MAE when the monthly index of period (w/o p), previous prediction as prior knowledge (w/o F-T N), and TCN (w/o TCN) are removed, respectively. In addition, We also compared the effectiveness of spatio-temporal attention and input data of different lengths.

W/o P
The overall performance of the STTN model decreased after removing the monthly index of period, which indicates that although the neural network can capture the correlation between

W/o F-TCN
After removing the previous predictions as prior knowledge, the ACorr between the predicted and observed Niño3.4 index decreased sharply, especially in the long-term prediction, which indicates that the model does not predict the trend of evolution of El Niño over the next 23 months well when considering the input data alone. As shown in Figures 4B, C, where the MAE and RMSE increase after removing the previous prediction, it indicates that the previous predictions can compensate over long intervals and provide reliable long-term predictions.

W/o TCN
With the removal of the TCN module, we observed a low degradation in the performance of the model, which indicates that the cycle and future features are very important information. Compared to STTN, the model relies more on thepredicted Niño3.4 index series after lead month 12, which suggests that the temporal semantics are significant in the later stage for Niño3.4 index prediction.

Effectiveness of spatio-temporal attention
We compared the performance of the models using spatiotemporal attention and without using spatio-temporal attention. Figure 6A-C plots the ACorr, RMSR, and MAE of the prediction results. We first observed that the model with spatio-temporal attention performs better than the model without spatio-temporal attention. The spatio-temporal attention semantically learns more separable features and effectively reduces the spatio-temporal chaos,  The best results are in bold.
allowing the model to better fit the ENSO phenomenon. As can be seen from Table 6, these modules all favor ENSO prediction, and removing any of the modules would harm the performance.

Compare input data of different lengths
We compared the performance of different lengths of input data. Figure 7A-C plots the ACorr, RMSR, and MAE of the predicted results. We observed that the best performance is achieved when the input data length is 3 in lead months 1-8, better performance is achieved when the input data length is 6 or 9 in lead months 8-15, and relatively better performance is achieved when the input data length is 12 in lead months 15-23, so we can conclude that: (1) the early prediction may simply require the SSTA and HCA data that are close in time to the predicted month, and the earlier month may cause noise in the input data; and (2) longerterm predictions require longer inputs, which we speculate may be due to the longer inputs containing more physical laws of ENSO as a result of the westward shift within the ocean.

Case study
To clearly show the difference between the observed and predicted results from 1982 to 2017, we visualized the Niño3.4 index on the GODAS dataset for 1, 3, 6, 9, 12, and 15 lead months ahead, as shown in Figure 8. From the results, we found that the Niño3.4 indexes at 1-, 3-, 6-, and 9-lead months are accurately predicted and obtain a correlation coefficient skill of 0.97, 0.91, 0.82, and 0.74, respectively. When the lead month increased, the correlation coefficient skill decreased due to the absence of evidence for a long time series and the complex climate variation. Nonetheless, the correlation coefficient is 0.61 and over 0.5 when predicting the index for 15 lead months, which verifies the effectiveness of our method to predict the multi-year ENSO trend.
To explore the seasonal impacts, we show the predicted Niño3.4 index of averaging the December-January-February(DJF) season of 1, 6, 12 and 18 lead months in Figure 9. It can be observed that our method successfully predicts the amplitude of the Niño3.4 index at 6 lead months in advance. Even when we increase the lead time up to 18 months, the trend of our predicted results still fits the curve well when a strong El Niño or La Niña occurs. Moreover, we visualize the predicted results of a typical Super El Niño during (A) 1982-1983, and (C) 2015-2016 as well as a Super La Niña during (B) 1988-1989 in Figure 10. The predictions are the continuous outputs of our method from 1 to 23 lead months, and we can see that our model can successfully predict the evolution of these strong EI Niño phenomena and the results are consistent with the observed results even for longer lead times.
As both the SSTA and HCA influence the ENSO phenomenon, we visualize the contributions of these two factors in Figure 11. This figure shows that when we input three consecutive months during

Conclusion
In this paper, we propose a novel spatial-temporal transformer network for multi-year ENSO prediction. Motivated by the attention mechanism, we designed a spatial-temporal attention mechanism to model the contributions of different ocean locations with change over time. For long-term prediction, this article proposes utilizing the accurate previous prediction as prior knowledge and fusing the seasonal variation during the encoding of the temporal information to facilitate the ENSO prediction. Moreover, we use a single model instead of a multi-model architecture to reduce computational resources, which is more convenient for predicting ENSO with different lead times. Extensive experiments using the model on the Coupled Model Intercomparison Project phase 5 (CMIP5) and the Coupled Model Intercomparison Project phase 6 (CMIP6) have shown that our method can provide a more accurate prediction over the existing methods, which verifies the effectiveness of the spatial-temporal attention mechanism, the prior knowledge of previous prediction and the temporal index for modeling the seasonal variation. In the future, we will add more variables and fully explore the relationship among their sea-air interactions to facilitate the reliability of multiyear ENSO prediction.

Data availability statement
Publicly available datasets were analyzed in this study. This data can be found here: https://doi.org/10.5281/zenodo.3244463.

Author contributions
DS contributed to conceptualization and editing, and supervision. XS data processing, modeling, and writing of the original draft. WLi performed methodology and writing review, A-AL performed validation and review. TR and WLiu, performed validation and investigation. ZS optimized the model framework. All authors contributed to the article and approved the submitted version.

Funding
This work was supported in part by the National Key Research and Development Program of China (2021YFF0704000) and the National Natural Science Foundation of China (U22A2068).