Probabilistic Forecast of the Extended Range Heatwave Over Eastern China

Several probabilistic forecast methods for heatwave (HW) in extended-range scales over China are constructed using four models (ECMWF, CMA, UKMO, and NCEP) from the Subseasonal-to-Seasonal (S2S) database. The methods include four single-model ensembles (SME; ECMWF, CMA, UKMO, and NCEP), multi-model ensemble (MME), and Bayesian model averaging (BMA). The construction and verification of reforecasts are implemented by a defined heat wave index (HWI) which is not only able to reflect the actual occurrence of heatwaves, but also to facilitate forecast and verification. The performance is measured by traditional verification method at each grid point of the 105°E to 132°E; 20°N to 45°N domain for the July, August, and September (JAS) of 1999–2010. For deterministic evaluations of HWI forecast, BMA shows a better pattern correlation coefficient than SME and MME and comparable equitable threat score (ETS) with ECMWF and MME. The good performance of ECMWF and MME take advantage of setting the percentile thresholds for forecasting HW. For the probabilistic forecast, the Brier score of BMA is comparable (superior) to that of MME and ECMWF at short (long) lead-time. BMA also demonstrates an improvement on the reliability of probabilistic forecast, indicating that BMA method is a useful tool for an extended-range forecast of HW. Meanwhile, in the real-time extended-range probabilistic forecast, the beginning date, end date, and probability of HW event can be predicted by the HWI probabilistic forecast of BMA.


INTRODUCTION
The heat wave (HW) is one of the extreme events around the world. It causes widespread destruction of infrastructure and human activity, economic damage, and loss of life. It is well known that the frequency and intensity of HWs are remarkably increased in the past years around various parts of the world due to climate change. Typical HW events of recent years are central Europe in 2003 (Fink et al., 2004;Fouillet et al., 2006;Schär et al., 2004;Trigo et al., 2005), over Russia in 2010 (Barriopedro et al., 2011;Dole et al., 2011), and over Europe in 2018 (Schiermeier, 2018). The 2010 Russian HW was the strongest recorded over the past 30 years (Hoag, 2014). It caused about 55,000 deaths and numerous wildfires, the worst drought in Russia in nearly 40 years, and the loss of at least millions of hectares of crops. Moreover, in the past summer (late June 2021), the Pacific North-West HW crossing the US and Canada has been called a roughly 1-in-1,000-years event (www.climate.gov 2021: https://www.climate.gov/news-features/event-tracker/preliminary-analysis-concludes-pacificnorthwest-heat-wave-was-1000-year). In East China, the temperature has increased significantly in the past 50 years, and the HW is one of the major disastrous weathers in this region (Shi et al., 2008).
In summer 2013, this region experienced its worst HW on record for the past 113 years (Xia et al., 2016). The time needed to prepare for a persistent extreme event is often beyond the skillful prediction timescales of a few days that are currently available (White and Coauthors, 2017). Demands are growing rapidly in the operational prediction and applications communities for forecasts that fill the gap between medium-range weather and long-range or seasonal forecasts, and the extended-range forecast. However, it has been less well addressed despite the considerable socio-economic value that could be derived from such forecasts.
The establishment of an extensive database (Vitart et al., 2017), containing sub-seasonal (up to 60 days) forecasts by the WMO S2S research demonstration project makes the forecasting of HW in this time range possible. However, the accuracy of the extended range forecast is limited due to the chaotic characteristics of the atmosphere and uncertainties associated with initial conditions and models (Thompson, 1957;Lorenz, 1963Lorenz, , 1969Smagorinsky, 1969). Probabilistic forecasts are becoming an inevitable method of solving this problem (Carrol and Maloney, 2004). Probabilistic forecasts aim to predict the uncertainty of a quantity or event of interest in the form of full predictive probability distributions (Gneiting and Katzfuss, 2014) rather than single-valued or point forecasts. Information about the uncertainty of a forecast can provide decision-makers with a range of possible outcomes and the amount of confidence associated with a particular event (Krzysztofowicz, 2001), which is valuable for deciding if, when, and how many precautionary measures should be taken. Besides, the use of probabilistic forecasts can realize more economic value than control forecasts for most potential users according to cost-loss ratios (Zhu et al., 2002). Especially at and beyond 120 h lead-time, all users are better off using the ensemble system than the control forecasts. Therefore, we focus this paper on the generation and validation of probabilistic forecasts for heatwaves in the extended range.
Several methods are currently used to construct probabilistic forecasts from the ensemble forecast. The probability of a single model ensemble (SME) is derived by computing the ratio of ensemble members with events that occur to all members. For this probabilistic forecast to be accurate and reliable, it is necessary to enlarge ensembles, which represent a range of possible evolutions of the system given the uncertainties (Richardson, 2000). In addition, modeling uncertainties can be taken into account by combining ensemble forecasts from several models and forming a multimodel ensemble (MME; Krishnamurti, 1999). Even still, raw ensemble (SME and MME) forecasts do not capture the full range of forecast scenarios and bear uncertainties that grow larger as lead time increases (Hamill and Colucci 1997;Raftery et al., 2005;Stauffer et al., 2017). Bayesian model averaging (BMA; Raftery et al., 2005;Sloughter et al., 2007Sloughter et al., , 2010Fraley et al., 2010) is one of the stateof-the-art approaches developed for ensemble-based probabilistic precipitation forecasts. The BMA predictive probability density function (PDF) is a weighted average of PDFs centered on the individual bias-corrected forecasts, where the weights are equal to posterior probabilities of the models generating BMA model and reflect the models' relative contributions to predictive skill over the training period (Raftery et al., 2005). It was originally applied to the prediction of temperature and sea level pressure, and those PDFs were approximately normal distribution, yielding wellcalibrated and sharp PDFs. Many analyses demonstrated that the BMA method performed superior to raw ensemble forecasts (Sloughter et al., 2007;Schmeits and Kok 2010;Liu and Xie, 2014). Subsequently, the method was employed in more studies (Casanova and Ahrens 2009;Erickson et al., 2012;Liu and Xie, 2014) for the short-and medium-range forecasts with TIGGE (The THORPEX Interactive Grand Global Ensemble) dataset. However, it is unclear whether BMA is fit for the probabilistic forecast of heatwave and whether it performs better than raw ensembles in the S2S time scale. In this framework, we seek to answer this question by constructing probabilistic reforecasts of SME, MME, and BMA for heatwave and evaluating these reforecasts skills, specifically probabilistic skill, based on a S2S dataset.
Since a heatwave is an event with consecutive hot days in one region and daily maximum surface temperatures is not enough to represent it, it is necessary to develop an index based on daily maximum surface temperatures that not only reflects the actual occurrence of heatwaves but also facilitates probabilistic forecast and verification. The heatwave has been widely identified by an extreme heat factor (EHF) index based sliding 3-days window of temperature (Nairn et al., 2009). But this index does not apply to this study because it is difficult for this discontinuous variable to construct proper PDFs. Another type of heatwave is defined as one pentad mean surface maximum air temperatures exceeding the local 95th percentiles during the control period of 1960-1990(Zhu and Li, 2017. The defects of this definition, which does not consider the continuity of the heatwave, also make it unavailable to this study. Fischer and Schär (2010) defined a heatwave as a spell of at least six consecutive days with maximum temperatures exceeding the local 95th percentile over a control period. This heatwave is based on the synoptic temperatures, which will not engage any extended-range skill. How to synthesize the aforementioned points and then develop a skillful HWI in the extended range is another major topic in this study.
The metrics applied to verify the hindcasts, and the data used are described in Section 2. The definitions of HWI and heatwave are described in Section 3, which also shows the construction of a probabilistic reforecast for HWI. The forecasting skills for the different reforecasts are compared in Section 4. Section 5 summarizes the results and discusses the real-time probabilistic prediction of heatwaves based on HWI.

Data
The S2S database (Vitart et al., 2017) collects forecasts/reforecasts (or hindcasts) from the subseasonal forecasting systems of 11 different centers. The individual systems, including the China Meteorological Administration (CMA), the European Center for Medium-range Weather Forecast (ECMWF), the National Centers for Environmental Prediction (NCEP), and the United Kingdom Met Office (UKMO) are selected as SME.
The configuration details of the four systems are summarized in Table 1. The daily maximum temperature has been extracted from the four different reforecast systems for the common 12years (1999-2010) summertime (July to September). Given that all atmospheric models have different native horizontal and vertical resolutions, the S2S data are extracted on a common 1.5 × 1.5°grid on the China domain (105°E to 132°E; 20°N to 45°N). The constructions of MME and BMA require the same initial dates for all four systems. The different reforecast frequencies shown in Table 1 could be a challenge to build.
Here we chose to overcome this drawback by selecting the calendar dates of the ECMWF reforecast as the referred starting dates, with others corresponding to these dates. It is easy for the daily CMA and NCEP to complete this correspondence. For UKMO, the closest date (early or later) is selected to construct MME and BMA, which means that there may exist different forecasting valid in MME and BMA. Nevertheless, the operation will not create an issue with the extended range forecast.
The observed daily maximum temperatures of 2,248 stations in China are provided from the Chinese Meteorological Information Center. The station data are interpolated onto the common 1.5°× 1.5°grid to match up the forecasts in the S2S database by natural neighbor interpolation method and used to BMA train and verification for reforecasts over 1999-2010.

Mean Absolute Error and Pattern Correlation Coefficient
The mean absolute error (MAE) and pattern correlation coefficient (PCC) are defined by where o i and f i are the observed and forecasting value, f and o are the average ones, N is the number of forecast/observe pairs.

Equitable Threat Score
The ETS represents the deterministic forecasting skill of heatwaves and is specified as follows: Variables in the formula are defined in Table 2. ETS takes a random chance [R(a)] away to account for true forecast skill. Larger ETS values represent higher forecasting skills, while ETS less than 0 donates no skill in the forecast.
The probability of detection (POD), false-alarm rate (FAR), and miss rate (MSR) are also employed to assess the deterministic forecast as follows:

Brier Score
The Brier score has been widely used in the assessment of probabilistic forecasts (Ferro 2007). It is essentially the meansquared error of the probability forecasts, considering that the observation O i 1 if the event occurs and O i 0, if the event does not occur. The score averages the squared differences between pairs of forecast probabilities and the subsequent binary observations, Where n is the number of forecast/event pairs, i denotes a numbering of the n and P i is the forecasting probability for heatwave. The Brier score is negatively oriented and has the range 0 ≤ BS ≤ 1. On the fly: every set of re-forecasts are produced to calibrate real-time ensemble forecasts of the following week using the latest version of IFS. The ensemble re-forecasts consist of ensemble starting the same day and month as a real-time forecast, but covering the past years.

Brier Skill Score
The Brier skill score (BSS) is: Where BS ref is the reference probabilistic forecast, commonly probability of event occurrence from climatology, and BS p is the perfect forecast, BS p 0. Ideally, the climatological probabilities would be determined from independent data, but commonly they are calculated from the sample observed data. In the conventional method of calculation, an average climatology p c is: Considering the differences between the climatological event frequencies for different months, the sample p c is determined by each month so as to BS ref . Then the BSS in each month is calculated and the final BSS score is the weighted average of monthly BSS. Hamill and Colucci (1997) for more details.

Continuous Ranked Probability Score
The CRPS measures the difference between the predicted and occurred cumulative distributions is the well-known Heaviside function:

HWI and Heat Waves
In this study, we define HWI as a sliding 5-days window and 9points equal weight average of the daily surface maximum temperatures. For each grid point, a heatwave is a period with HWI continuously greater than a certain percentile threshold. The percentile is referred to the ratio of daily surface maximum air temperatures below 35℃ to all samples for each grid separately, shown in Figure 1A. The thresholds of HWI ( Figure 1B) can be calculated from the percentile in Figure 1A. Why do we set the percentile related to 35°C but not a fixed 90th percentile or 95th percentile? It is due to the discomfort humans feel from environmental temperatures higher than 35°C, and a fixed percentile may relate to lower temperatures in colder regions. The HWI is developed from the points of definitions of heatwave in previous studies (Nairn et al., 2009;Fischer and Schär, 2010;Perkins eta al., 2012;Zhu and Li., 2017) and for facilitation of probabilistic forecast and verification. First, the 5days running average referred to Nairn et al. (2009), has filtered the synoptic and small-scale noise that it can represent the largescale and long-lasting weather event to enhance extended-range forecast capability (Buizza and Leutbecher, 2015). The spatial running average is implemented to match the time running average coordinately representing equivalence scales. Second, the running average of daily temperature may also match normal distributions that HWI may be fitted to use BMA. A previous study (Raftery et al., 2005) has proved that predictive PDFs of BMA were much better calibrated than the raw ensemble. Whether the PDF of HWI fits normal distributions will be examined in the following section. Third, the daily HWI will be facilitated for verification when using the traditional verification method mentioned above.
The observed HWI is mainly defined to consider the key properties addressed through its extreme and persistence in terms of the period and surrounding area. The extreme property of the Frontiers in Earth Science | www.frontiersin.org January 2022 | Volume 9 | Article 810579 HWI is seen from the value of percentile ( Figure 1A) ranging from 80 to 92. The average persistent days of the observed heat wave identified by HWI being greater than 5 days ( Figure 2A) indicates persistent property.
To further examine whether heatwaves determined by HWI are objective, we compare them with the heatwaves in Lin et al. (2021). Based on observation data from weather stations, they objectively define the heat waves in four regions of China (the four boxes in Figure 1A) by considering the certain percentage of stations with a maximum temperature greater than 35°C, the coincidence degree of stations comparing with the previous day, and the persistent days (See Lin et all., 2020 for more details). For comparison, we reconstruct HWI by four regional averages instead of the spatial running average. The corresponding thresholds are calculated by the four regional averages of percentiles in Figure 1A, which are 0.82, 0.80, 0.93, and 0.97 respectively. Basing the reconstructed HWI and threshold, we obtained the four regional heatwaves of our study. We found that 90 percent of these events are consistent with the heat waves in Lin et al. (2021). Figure 2B shows the period of these two types of heatwave that occurred from 2000 to 2010 in South China. The result indicates great consistency, verified the rationality of our definition for HWI.

Single-and Multi-Model Ensemble Probabilistic Forecast of HWI
For short-and medium-range forecasts the extreme threshold are generally defined from observations or reanalysis and are often replaced by fixed boundaries that have societal implications. However, numerical models drift with increasing lead-time very quickly toward their own climate, which can be very different from observations. Therefore, for extended-range forecasts, the preferred option is to define the percentile thresholds from reforecasts (Vitart and Robertson, 2018). The reforecasts of HWI for SME and MME are obtained by sliding 5-days window and 9-points running average on the ensembles of SME and MME respectively. Then based on the ensemble means of reforecast and the percentile shown in Figure 1A, the thresholds of different ensembles and different forecast validations ( Figure 3) are calculated. Finally, the probabilistic forecast of heatwave for each SME and MME will be obtained by calculating the percentage of members with HWI greater than the corresponding threshold. Figure 3 shows that the value of the threshold varies among different ensembles and different forecast validations. The threshold of each ensemble shows a decreasing trend with the forecast timeliness, indicating that the numerical models quickly drift toward their own climatology. So, it is necessary to set a threshold along with forecast time. In terms of overall intensity and location, the threshold of the United Kingdom is closest to the observation, indicating that the HWI climatological distribution of the United Kingdom matches the observation best. This view can be visually proved by the largest correlation of the United Kingdom threshold with the observational threshold (digital in the lower right corner of each panel in Figure 3). The results implied that the United Kingdom may perform a higher forecast skill in HWI forecast. Although the threshold of CMA is the largest among SMEs, MME and its magnitude are closest with the threshold of observation ( Figure 1B), the center of the large values has shifted to the north.
The definition of HWI and the forecasting thresholds for SMEs and MME make the extended-range forecast of heatwave events possible. As shown in Figure 4, the ETS of ECMWF for HWI (green line) reaches up to 0.3 in the first 10days forecast and maintain greater 0.2 in the 10-30-days forecast, while the ETS of daily maximum air temperature for 35°C (blue line) is less than 0.2 during the 1-30-days forecast. This contrast shows that the daily maximum temperature and fixed thresholds are not fit for extendedrange prediction of HW events. It is necessary to define the HWI and set its forecasting thresholds for SMEs and MME. Besides, the better forecasting skill of the HWI (green line) than the 5-days running average of daily maximum temperature (red line) proves the necessity of spatial average in the definition of HWI. Frontiers in Earth Science | www.frontiersin.org January 2022 | Volume 9 | Article 810579

Bayesian Model Averaging Probabilistic Forecast of HWI
The BMA model used in this study followed Raftery et al. (2005) and is only briefly described here. The BMA method generates a total PDF of HWI amount by weighted averaging PDFs estimated from bias-corrected forecasts of individual ensemble members, which is defined as p y f 1 , ..., f K , y T K k 1 w k p k y f k , y T where y is the HWI quantity, f is the forecast of a particular ensemble member, k is the index of the ensemble member, K is the total number of the ensemble members, p k [y|(f k , y T ) is the conditional PDF of y given that f k is the best among the ensemble, and w k is the nonnegative posterior probability of the kth ensemble member being the best among all members, they add up to 1. The poor POD, MSR, FAR, and ETS of raw ensemble members for heatwave events (dash line in Figure 5) shows the bias-corrected requirement for each ensemble before doing BMA. In our study, linear regression is applied to correct bias for each SMEs. The parameters are obtained by training with some observation/forecast pairs. It is indicated that bias correction has significantly improved the forecasting skill of ECMWF (solid line in Figure 5). The bias-corrected forecasts of the individual ensemble will contribute to the forecasting capability of BMA.
The BMA was developed initially for quantities whose PDFs can be approximated by normal distributions, such as temperature and sea level pressure. Whether normal distribution fit the PDF of HWI is necessary to be examined. FIGURE 4 | The ETS of ECMWF for daily maximum air temperature greater than 35°C (blue line), 5-days running average of daily maximum temperature (red line), and HWI (green line) greater than their relative thresholds.
Frontiers in Earth Science | www.frontiersin.org January 2022 | Volume 9 | Article 810579 As we are mainly concerned with the probabilistic forecast of heatwaves, only the distributions of observed HWI conditional on forecasting heatwaves are shown in Figure 6 for different lead times and ensembles. It is shown that the normal distribution fits HWI for all of them. Especially, the distribution of EC is better approximated by normal shape than other SMEs as its value varies more smoothly. On the basis of the normal distribution of HWI, the following steps can finish the training of BMA. The conditional PDF by a normal distribution is centered at a linear function of the forecast, a k + b k f k , so that p k [y|(f k , y T ) in 1) is a normal PDF with mean a k + b k f k and standard deviation σ, expressed as y|(f k , y T ) ∼ N(a k + b k f k , σ 2 ). As a result, a k , b k , w k and σ are the model parameters required to estimate on the basis of a training dataset consisting of ensemble forecasts and verifying observations. The a k and b k can be estimated by simple linear regression for the training dataset. The σ and w k will be estimated from the training data by maximum likelihood (Fisher 1922), which can be operated by the expectationmaximization (EM) algorithm (Dempster et al., 1977;McLachlan and Krishnan 1977;Raftery et al., 2005). At last, the trained parameters substituting into 1) will obtain the BMA model, which can be used to get the probabilistic forecast by a certain forecast value f k .
In our implementation, the training set consists of a sliding window of forecasts and observations for the previous m samples, where m 7,12, 17, 22.27. The certain number of samples when the verification metrics tend to be stable is taken as the optimal sliding training number. Here we adopt the mean absolute error (MAE) and continuous rank probabilistic score (CRPS) as the verification metrics. As shown in Figure 7, the short-term forecasts (blue lines and green lines in Figure 7) require about 12 training samples to have the best MAE and CRPS scores. But for longterm forecasts (yellow lines and red lines in Figure 7), 20 samples obtains the best probabilistic forecast skill with optimal CRPS score ( Figure 7B). Since this article focuses on the probabilistic forecasting of the extended period, 20 sliding samples are selected for BMA model training. Figure 7C shows the average weights of each ensemble trained from BMA when using 20 sliding samples. BMA automatically adjusts the weights for each ensemble according to their performance during the training period so that BMA may engage a better forecasting skill than SMEs and MME.
The definition of the threshold for HWI of BMA is similar to SMS and MMS, which is based on the percentile from reforecasts (Figure not shown). According to the PDF and Frontiers in Earth Science | www.frontiersin.org January 2022 | Volume 9 | Article 810579 8 threshold for HWI of BMA, a probabilistic forecast of BMA for heatwave is obtained.

VERIFICATION FOR HEATWAVES
Seeking the continuous days with HWI greater than extreme percentile thresholds respectively can identify the observed and forecast heat waves. But it is not easy to quantitatively verify the forecasting heatwaves against observed heatwaves. Some of the investigations only verify the forecast skill of heatwave qualitatively on the case studies (Hudson et al., 2016;Mandal et al., 2019). To solve this problem, the daily forecasting HWI is verified against daily-observed HWI using the traditional verification method. It implied that the verification of heatwaves is executed by decomposing these events into individual days. These deterministic and probabilistic reforecasts of SME, MME, and BMA are verified for the 1999-2010 JAS period at each grid point of the domain.

Deterministic Verification for Heatwaves
The PCC, POD, MSR, FAR, and ETS are used to evaluate each deterministic HWI reforecasts that is the ensemble mean of their members for SMS and EMS. The PCC shown in Figure 8 indicates that the HWI of BMA possesses the best pattern correlation to observed HWI with a coefficient up to 0.92. Though the PCC of BMA decreases with lead time, its value is still over 0.8 in the extended range. For SMS and EMS, the UK displays the best pattern correlation with observation, which is consistent with the rank of the spatial correspondence between the threshold of ensembles ( Figure 3) and observation ( Figure 1B). The POD, MSR, FAR, and ETS scores for heatwave forecasts are presented in Figure 9. ETS skills of BMA, ECMWF, and MME are the best among all reforecasts and are comparable across themself for all lead times. The POD, MSR, and FAR show consistent skill with ETS. The CMA performs the worst among all forecasts for all scores. Given this, we have tried to improve the BMA without CMA, based on ECMWF, UK,   Frontiers in Earth Science | www.frontiersin.org January 2022 | Volume 9 | Article 810579 10 and NCEP. The performance of BMA based on three models is almost the same as BMA based on four models. Therefore, BMA based on four models is adopted in this paper.

Probabilistic Verification for Heatwaves
The probabilistic performance for heatwave of BMA is compared to the MMEs and SME for the JAS start dates of the common period 1999-2010. The BS, reliability, and BSS are employed to verify the performance. Since our study focused on the dichotomous extreme events and CRPS measures the overall probabilistic performance, it is not employed here. BS is used in the assessment of probabilistic forecasts of events exceeding extreme thresholds, which are set separately for observation and forecast. Figure 10 shows the BS score at each lead day for heatwave forecast of BMA, SMSs, and MMS. As it shows, the performance of BMA is comparable (superior) to that of MME and ECMWF at small (large) lead times. The result suggests that the BMA model, based HWI, is fit for the probabilistic forecast of heatwaves in the S2S time scale.
As pointed out by Murphy (1973), the algebraic decomposition of the Brier score is expressed as three FIGURE 11 | The reliability diagram for heatwave for MME (blue) and BMA (red) at lead days 10 (A), 15 (B), 20 (C), 30 (D) respectively. The forecast frequency in each probability bin is represented below on the bar plots for each forecast with the same color. The number inside each figure is the reliability which is the integral area from the diagonal line to forecast reliability line with weighting from the sample frequency.
Frontiers in Earth Science | www.frontiersin.org January 2022 | Volume 9 | Article 810579 11 terms, reliability, resolution, and uncertainty. According to the research of Weisheimer and Palmer (2014) for the case of seasonal forecasts, reliability is the most important aspect to determine how good a forecast is. Reforecasts for every point of the domain are pooled together so as to provide robust estimates. The reforecasting probabilities are categorized into 12 bins, 0.0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0. For a given forecast probabilistic bin, reliability is the correspondence between the forecasting and the observed probability of a heatwave. The comparison of reliability diagrams is only made for BMA and MME owing to the best BS score of them among all reforecasts. Blue and red lines in the upper part of Figure 11 show this comparison between MME and BMA at lead days 10, 15, 20, 25. The forecast frequency in each probability bin is represented below the bar plots in Figure 11. For small probabilistic bins (less than 0.2) in all lead days, MME and BMA points are nearly located at the diagonal line, showing perfect reliability. However, BMA points are closer to the diagonal line at large probability bins (greater equal than 0.2), showing greater confidence for the occurrence of a heatwave. The overall performance of reliability is intuitively seen from the weighted area between the dashed and solid lines, which are the numbers in the bottom right corner of Figure 11. The good reliability of BMA is due to the much better calibrated predictive probability density function. The shapes of the bars represent the sharpness of probabilistic forecasts, which is another important aspect of probabilistic verification. On lead day 10, the ECMWF is sharper than BMA with more forecasting frequency in the lowest probability bin. But this superiority decreases with increasing lead-time, histograms of both forecast probabilities indicate high sharpness. The result implies that the better performance of BMA for BS mainly takes advantage of reliability, not sharpness.

DISCUSSION AND SUMMARY
This study proposes a heatwave index (HWI) in the sub-seasonal time scale for observation and forecast. We have examined the qualification of HWI definition from available observations, which indicated that a newly defined index is able to represent heatwaves that actually occurred. On this basis, several methods are constructed for the probabilistic forecasting of HWI. We have evaluated the performance of these methods, including deterministic and probabilistic forecasts. The result shows that the probabilistic forecast performance of BMA is the best even though the deterministic performance of BMA is comparable to the MME and ECMWF. It means that the BMA model has demonstrated its value for heatwave forecast in the extendedrange prediction.
The outcome of the BMA model is the daily probability of HWI in the sub-seasonal forecast. In the real-time forecast, decision-makers are more concerned about the period of a heatwave, including start date, end date, persistent days, and the probability of the whole event. This information provides decision-makers with the amount of confidence associated with heatwaves, which is valuable for deciding if, when, and how many precautionary measures should be taken. Figure 12 shows an example of transformation from the daily probability of HWI to probability of heatwave in reforecast initiated from June 19, 2010 to July 3, 2010 at grid 24°N, 113°E. In this region, there existed a significant heatwave from July 1, 2010 to July 19, 2010 shown in the below part of Figure 12. The forecasting period of a heatwave should be the days with a probability greater than 50 percent. The forecasting probability of the whole heatwave should be the average probability during this period. Taking the forecast initiated from June 19, 2010 as an example, the forecast heatwave occurs during July 3, 2010 to Frontiers in Earth Science | www.frontiersin.org January 2022 | Volume 9 | Article 810579 July 8, 2010 with the probability of heatwave equal to 82. The BMA model has successfully predicted this heatwave 12 days in advance with much higher confidence, which is confirmed by the observations. Two important points should be taken when using the HWI to predict a heatwave. First, the actual start date and end date of the heatwave may deviate by one or 2 days whatever for forecast or observation since the 5-days running average is used in the definition of HWI. Second, the heatwave defined on each point is not restrained to this point. It represents a region centered on this point with 4.5 × 4.5°. Meanwhile, similar steps could generate the quantitative thresholds of the HWI for other regions if it is requested from stakeholders. The conclusions from this study could be summarized as follows: 1). The HWI could be modified to simulate reality through observations and raw (re)forecast from the worldwide forecast system; 2). A multi-model ensemble (equal weight or poor man ensemble) could improve the forecast skills partially from comparing individual model forecasts; 3) The BMA process has demonstrated the best forecast skills around all forecasts in terms of deterministic (ensemble mean) and probabilistic forecasts (forecast reliability); 4) The implementation of HWI could present the spatial details and day-to-day hot extreme for an extended range to help to give earlier warning and/or make decisions further in advance.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.