Varying performance of eight evapotranspiration products with aridity and vegetation greenness across the globe

The wide application of the evapotranspiration (ET) products has deepened our understanding of the water, energy and carbon cycles, driving increased interest in regional and global assessments of their performance. However, evaluating ET products at a global scale with varying levels of dryness and vegetation greenness poses challenges due to a relative lack of reference data and potential water imbalance. Here, we evaluated the performance of eight state-of-the-art ET products derived from remote sensing, Land Surface Models, and machine learning methods. Specifically, we assessed their ability to capture ET magnitude, variability, and trend, using 1,381 global watershed water balance ET as a baseline. Furthermore, we created aridity and vegetation categories to investigate performance differences among products under varying environmental conditions. Our results demonstrate that the spatial and temporal performances of the ET products were strongly affected by aridity and vegetation greenness. The poorer performances, such as underestimation of interannual variability and misjudged trend, tend to occur in abundant humidity and vegetation. Our findings emphasize the significance of considering aridity and vegetation greenness into ET product generation, especially in the context of ongoing global warming and greening. Which hopefully will contribute to the directional optimizations and effective applications of ET simulations.


Introduction
Terrestrial evapotranspiration (ET), as a pivotal element of such land-atmosphere interaction processes as what happens to water, carbon, and energy cycle, is constituted of soil evaporation, vegetation transpiration and water surface evaporation (Gao et al., 2016;Tramontana et al., 2016;Zhang et al., 2016;Liu et al., 2021). On the land surface, 60% of precipitation (Pre) is returned to the atmosphere through ET, consuming half of the solar energy reaching at the surface (Pan et al., 2020). Consequently, ET draws significant interest from hydrology to climate disciplines. Researchers aim to understand the allocation of energy and water at the land and its feedbacks , identify dominant control factors of ET variation across regions (She et al., 2017;Zhang et al., 2021), and investigate the impact of ET on the hydrological cycle under climate change (Gu et al., 2020;Weerasinghe et al., 2020). Hence, accurate estimation of ET is crucial for various scientific communities such as hydrology, ecology, climatology, and agriculture.
There is no denying that existing ET products have considerable potential to facilitate the estimation of hydrological and energetic components and their inherent hydroclimatic variability. For instance, global ET estimates at arbitrary spatial and temporal scales can be compiled by the conventional flux formula (or Land Surface Model (LSM)) and the remote sensing observations about surface temperature, soil moisture and vegetation cover ratio Miao and Wang, 2020). Recently, the boom in machine learning methods has also facilitated the acquisition of global ET datasets (Jimenez et al., 2011;Alemohammad et al., 2017;Jung et al., 2017), such as model tree, random forest, or artificial neural networks combining observed flux data as inputs. However, these products simultaneously involve some uncertainties derived from the model structural flaws, input-datasets errors (e.g., meteorological forcing, land surface, and parameters related to vegetation), model-parameter errors and scale-scaling issues (Badgley et al., 2015;Michel et al., 2016;. However, the existing terrestrial ET products widely vary in performance and even oppose long-term trends, indicating the nonnegligible uncertainties. For instance, it has been reported that while potential evapotranspiration (PET) trends have declined over the last 50 years, ET has shown an increasing trend according to the evapotranspiration paradox (Mao et al., 2015;Zhang et al., 2015;. However, Jung et al. (2010) added that the increase in global terrestrial ET has ceased or even reversed from 1998 to 2008. Therefore, a comprehensive evaluation of ET products is a prerequisite for model optimizations and global climate-change research, especially, on a regional scale.
ET measurements from the Eddy Current Covariance (EC) site have become typical reference data to validate ET products at the point scale. Nevertheless, EC systems generally suffer from energy imbalance, which resulting in ET measurement errors. And the mismatch in spatial scale between EC observations and ET estimates (points and grid cells) is another limitation. Furthermore, EC sites sparsely spread over spaces, which challenges the evaluation of ET products on a regional scale (Pan et al., 2020;Xie et al., 2022). An alternative approach is terrestrial water balance method, i.e., ET calculated from the terrestrial water balance (observed Pre minus the sum of runoff (Q) and total water storage change (TWSC)) is applied as the truth value to validate ET products at the basin scale . Over the last 2 decades, considerable attention has focused on the regional scale (US, African, and Qinghai-Tibetan Plateau), while less on the global scale. For example, Vinukollu et al. (2011) conducted a global evaluation on the ET estimates derived from three process-based models (Surface Energy Balance System (Su, 2002), Penman-Monteith-Mu algorithm (Penman, 1948;Mu et al., 2007), and Priestly-Taylor-Fisher of Jet Propulsion Laboratory algorithm (Priestley and Taylor, 1972;Fisher et al., 2008) based on 26 basins worldwide, and suggested a root mean squared difference (RMSD) of 118-194 mm/yr and a deviation of −132 to 53 mm/yr between the water balance ET and the estimated annual ET.
However, the total water storage change (TWSC) at the annual scale has often been disregarded in previous studies (Pre directly minus Q), yet the water budget is unbalanced due to human abstraction, glacial snowmelt, and other activities affecting water storage Zhong et al., 2020). For example, Zeng et al. (2012) found that the TWSC cannot be ignored in estimating ET at an annual scale, especially in regions with relatively low ET values. As a result, the annual reference ET must take into account TWSC. Although the Gravity Recovery and Climate Experiment (GRACE) satellite launched in 2002 offers a promising future to the TWSC, the limited GRACE satellite data right now makes it problematic to assess ET products with longtime data, especially the pre-2002 data, and subsequently difficult to explore ET trends. Recently, the reconstruction of GRACE facilitates the evaluation on long-time series of ET products. More importantly, the deficiencies exist in the simultaneous evaluation of ET products at the global scale with various levels of dryness and vegetation greenness. The ET variation in different regions is closely related to local conditions: ET is limited by water in dry conditions and by energy in wet conditions; ET is higher in highly-vegetated areas and lower in sparsely-vegetated areas. We postulate that these conditions may affect the performance of ET products. For example, Majozi et al. (2017) assessed the accuracy and precision of four ET products in two South African ecoregions and showed that no one ET product performed best in both zones; Ershadi et al. (2014) found that the performance of European and North American ET models varied among zones and the models with relatively high accuracy differed across zones; Kim et al. (2012) concluded that the Moderate Resolution Imaging Spectrometer (MODIS) MOD16 ET for Asian woodland cover was more accurate than for other biomes. Consequently, there is a need to fully understand the simulation capacity of ET products under heterogeneous conditions (areas with different levels of aridity and vegetation greenness), which will be favorable to developing strategies for adapting to the climate change.
Here, the current study is not to compare the various models, but to investigate how the performances of eight global ET products vary with water and energy conditions or vegetation greenness over 1981-2010. In doing so, the globally distributed 1,381 basins were taken into consideration and segmented according to their aridity and vegetation conditions. Then, we illustrated differences in the performance of the products changing with aridity and vegetation based on the terrestrial water balance ET. The model performance of the ET product was evaluated using newly popular metric-Kling-Gupta efficiency (KGE), considering the magnitude, variability of the ET and its coefficient to the product. Additionally, the 1981-2010 period was selected, for the knowledge about this period is relatively lacking and the reconstructed the total water storage anomaly (TWSA) products are reliable before 2002. Finally, we discussed the potential reasons for our results. collected from 11 databases, as listed in Table 1 ( Holmes et al., 2013;Arsenault et al., 2016;Awange et al., 2019;Arsenault et al., 2020;Chagas et al., 2020;Coxon et al., 2020;Almagro et al., 2021;Fowler et al., 2021;Klingler et al., 2021). Considering that the Q dataset is derived from different sources; several criteria were implemented to control the dataset quality with reference to some well-established data processing methods (Beck et al., 2015). The details related to the criteria used in this study are as follows: 1 The final database retains a hydrological station only once, based on the latitude and longitude information of hydrological station; 2 If a station has missing data for more than 15% per day from 1981 to 2010, the station was removed; 3 The basin area controlled by the hydrological station must be able to cover two or more 0.5°grids.
Finally, 1,381 stations met these criteria. The global distribution of 1,381 hydrological stations is shown in Figure 1.

Precipitation and GRACE datasets
To reduce the uncertainties in precipitation data, we selected three global gridded precipitation datasets (GPCC, CPC-Unified, and CRU TS4.05) at 0.5°resolution based on precipitation gauges. GPCC precipitation dataset, was selected, for it is widely considered as the precipitation reference dataset (Becker et al., 2013). More importantly, CPC-Unified gauge-based analysis of global daily precipitation at 0.5°resolution (1979-present) was interpolated from the QC station reports, which incorporates the effects of topography (Chen et al., 2008).
Total water storage anomalies (TWSA), monitored by NASA's GRACE satellites via satellite gravimetry, are currently used for retrieving the exclusive data of TWSC in hydrological and climatic applications (Landerer and Swenson, 2012;Long et al., 2014;Jing et al., 2020a). Notably, the GRACE TWSA observation data only covers the period 2002-2017 (Jing et al., 2020b). Consequently, the two constructed TWSA datasets (i.e., GRACE-REC and GRID-CSR-GRACE-REC) were chosen, covering the data from 1981-2010 at a spatial resolution of 0.5°. GRACE-REC datasets were generated, using a statistical model trained with GRACE observations, consisting of six reconstructed TWSA datasets derived from two different GRACE observation products and three different meteorological forcing datasets (Humphrey and Gudmundsson, 2019). As for GRID-CSR-GRACE-REC,  reconstructed the GRACE observations by developing a methodological framework to compare three methods, including

ET products
Eight ET products using different methods were collected in this study: one remote sensing product (GLASS), two reanalysis products (ERA5-Land and MERRA-2), four LSM-based products (GLEAM-3.5a, E2O-En, PML and GLDAS2.0-Noah), and one machine learning-based product (MTE). The basic information of the ET products is shown in Table 2.
To estimate terrestrial ET, Global LAnd Surface Satellite (GLASS) ET products used the Bayesian model averaging (BMA) method to ensemble five process-based ET algorithms (Yao et al., 2014;Xie et al., 2022), i.e., MODIS ET product algorithm (Penman, 1948;Mu et al., 2007;Mu et al., 2011), revised remote-sensing-based Penman-Monteith ET algorithm (Yuan et al., 2010), Priestly-Taylor-Fisher of Jet Propulsion Laboratory ET algorithm (Fisher et al., 2008), modified Satellite-Based Priestley-Taylor ET algorithm (Yao et al., 2013), and semi-empirical Penman ET algorithm of the University of Maryland (Wang et al., 2010). It outperforms the five algorithms by using ground-based data of 2000-2009 collected from 240 EC gauges worldwide on all continents except for Antarctica. The ensemble algorithms, integrating multiple algorithms to generate the product, reduces the uncertainties of a single algorithm and ensures the accuracy of the product. The dataset used in this study is the product with the longest time series and the finest grid, spanning from 1982 to 2018 at a grid of 0.05°. ERA5-Land (Muñoz-Sabater et al., 2021), an enhanced global dataset for the land component of the fifth generation of European ReAnalysis (ERA5), was published by the European Centre for Medium-Range Weather Forecasts (ECMWF) in 2021. The core of ERA5-Land is the ECMWF surface model: the Carbon Hydrology-Tiled ECMWF Scheme for Surface Exchanges over Land (CHTESSEL). Four meteorological state fields (i.e., temperature, humidity, wind speed, and pressure at the surface) are available in the ERA5, from the lowest level of the model (level 137), which is 10 m above the surface. Surface fluxes involve downward shortwave, longwave radiation and total liquid, and solid precipitation. Compared with latent heat data from 65 EC gauges worldwide, ERA5-Land ET performs better than previous versions such as ERA5 and ERA-Interim (Albergel et al., 2018), benefiting from the enhancements on the ECMWF surface model. The dataset used in this study spans the period from 1979 to 2021 and has a grid of 0.1°.
The second Modern Era Retrospective-Analysis for Research and Applications (MERRA-2) reanalysis (Rodell et al., 2011), a widely used atmospheric reanalysis dataset, is provided by Global Modeling and Assimilation Office (GMAO) in NASA. It is produced by the upgraded Goddard Earth Observing System model Version 5 (GEOS-5), along with its associated data assimilation system (DAS) Version 5.12.4, which replaced the original MERRA and MERRA-Land reanalysis. MERRA-2 alleviates some of the deficiencies of the MERRA and MERRA-Land product, such as certain biases and imbalances in the water cycle as well as the false trends and jumps in precipitation associated with changes in the observing system. The dataset used in this study spans from 1980 to 2021 and has a grid of 0.58°× 0.625°.

Variables Products Methods Time span/Resolution Website
The Global Land Evaporation Amsterdam Model (GLEAM), a set of algorithms, including a potential evaporation module, stress module, and rainfall interception module, dedicates to estimating the terrestrial evaporation and root-zone soil moisture from the satellite data, which consists of soil evaporation, canopy transpiration, interception loss, snow sublimation, and openwater evaporation (Martens et al., 2017). Among these modules, the potential evaporation module uses the Priestley-Taylor equation, and the stress module is represented by the semiempirical relationship between vegetation optical depth and rootzone soil moisture. A vital feature of this product is that the Gash analytical model is used to estimate interception loss.
To develop the global water reanalysis on the multi-scale water resource assessment and related research projects, the EartH2Observe (E2O) project also used the reanalysis-based forcing data to drive ten models: five global hydrological models (GHMs), four Land Surface Models (LSMs) with extended hydrological scenarios, and one simple water balance model (WBM) (Schellekens et al., 2017). The forcing dataset is an adjustment of the ERA reanalysis dataset combining the terrestrial meteorological element observations and Climate Research Unit (CRU) datasets. The E2O-En product has proven to be an accurate reanalysis data and been widely used for the multiscale water resource applications (Schellekens et al., 2017). The generated data from ten models were arithmetically averaged to alleviate the potential errors and uncertainties of the individual model.
The GLDAS is a global assimilation and modeling system developed jointly by NASA, Goddard Space Flight Center (GSFC), and NOAA (Rodell et al., 2004;Rodell et al., 2011). The system provides the near real-time land-surface information from ground and satellite observations, by driving four LSMs. Here, the ET product derived from GLDAS2.0-Noah is adopted in our study.
The Model Tree Ensemble (MTE) product, a data-driven estimate (Jung et al., 2009), was compiled using a global monitoring network (the database of the FLUXNET), the meteorological and remote-sensing observations, and a machinelearning algorithm. Its forcing data include a harmonized the Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) product from three sensors (AVHRR (Tucker et al., 2005), SeaWiFS (Gobron et al., 2006), MERIS (Gobron et al., 2008), a remote-sensing-based global land-use, and products of climate variables based on observations. However, the lack of measurements makes it impossible to calculate ET in cold and dry deserts; this may result in a slight underestimation of global ET.

AI and LAI data
Another monthly Pre and potential ET dataset (1901-2020) was chosen from CRU TS4.05, to calculate the aridity index (AI): the ratio of Pre and potential ET. GLASS LAI product was compiled by AVHRR from 1981-2000 and by MODIS from 2001 to 2018 (Xiao et al., 2013). To generate continuous and smooth data, GLASS LAI used a temporal-spatial filtering algorithm to remove cloud contamination from the reflectance data. The vital component of this product is the algorithm to train a general regression neural networks (GRNNs), using fused LAI from MODIS and CYCLOPES products and reprocessed MODIS reflectance for each vegetation type on observation sites (Xu et al., 2018). The dataset spans the period from 1981 to 2018 and has a grid of 0.05°. Please note that all gridded datasets were aggregated to an annual temporal resolution and a spatial resolution of 0.5°. The spatial patterns of AI and LAI of global 1,381 basins are shown in Figure 1.

Water balance ET
Eight ET products were assessed, using water balance equations. The water-balance-based ET is often considered a reference for validating ET products on the annual scale. ET can be calculated based on precipitation (Pre), runoff (Q) and total water storage change (TWSC) in the basin, using the following equation: Due to high correlations with static gravity fields, GRACE does not provide the estimates of total continental water content. In this aspect, TWSA is defined as the residual water content at a given time, which is relative to the water content at a reference epoch. The reference storage corresponds to the average water storage during the early phases of the GRACE mission (Han et al., 2005;Yang et al., 2020). Hence, yearly TWSC is the difference between the December anomaly observation of the current year and that of the previous year, i.e., the yearly TWSC equation is as follows: where i and Dec denote the year (ranging from 1981-2010) and the December, respectively.

Evaluation metrics
Kling-Gupta efficiency (KGE) and its three components are used to further evaluate the eight ET products (Kling et al., 2012). KGE is an objective performance metric, which comprehensively combines the components of the key performance statistics (correlation, bias and variability). The KGE formulation is defined as follows: where R is Pearson's correlation coefficient, β is the bias (the ratio of the estimates and observation means), γ is the variability [the ratio of the coefficients of variation (CV)].
where μ and σ denote the mean and the standard deviation, respectively; e and o denote the estimate and the observation. Note that the ranges of KGE, R, β and γ with the optimum value of 1.0 are −∞-1.0, −1.0-1.0, −∞-+∞ and −∞-+∞, respectively. A comprehensive diagnosis was carried out on the performance of ET products in capturing ET characteristics at the temporal and spatial scale. Please note that the hit of ET trend directions for each product was also evaluated, using the ratio of truly captured ET trend directions including positive and negative trends. For example, TPR (FPR, TNR, FNR) denotes the ratio between the number of basins that the ET products truly (falsely, truly, falsely) identify the Frontiers in Environmental Science frontiersin.org observed positive (positive, negative, negative) ET trend as positive (negative, negative, positive) ET trend and the number of all basins. The sum of TPR, FPR, TNP, and FNR is equal to 100%. To systematically assess the spatial and temporal capture performance of the ET products, we assessed both the spatial dynamic of ET climatological value, temporal variability and trends, and the temporal dynamic of ET for each basin.

Aridity and vegetation categories
If global basins are diagnosed in overall terms, the information about the performance of ET products under given conditions will be lost. Meanwhile, it is important to examine how ET products vary with water and energy conditions or vegetation greenness, since the ET process is affected by the complex mechanisms of energy, water cycle and vegetation and the strong variability in both space and time. Therefore, the aridity and vegetation categories were created without considering their changes during the evaluation period, for the sake of simplicity. Specifically, the aridity index (AI) is characterized by the long -term climatic aridity condition of a region, for example, the higher AI value indicates the drier condition. The threshold of multiyear-average AI was set at 1.5, based on the conventional definition, i.e., basins with AI>1.5 are classified as the dry basins and those with AI≤1.5 are classified as the wet basins . As for vegetation, the LAI is widely applied as the proxy of vegetation greenness, with high values suggesting high greenness. Based on the LAI value for each basin at the evaluation period, the evaluation metrics were re-classified in three categories, i.e., LAI<1, 1 ≤ LAI<2 and LAI≥2, which were defined as the LAI-I, LAI-II, and LAI-III, respectively (Jimenez et al., 2011;McCabe et al., 2016), with regard to the intensity of greenness (from brown to green). Figure 2A shows the spatial pattern of the mean annual value of ET during 1981-2010. The high values (>1,000 mm) mainly existed in the Brazilian coast, the GulfofMexico and Atlantic coasts in America, the African coast, and the Oceania East coast. Specifically, the ET decreased from east (west) to west (east) across the North (South) America, and from southeast to northwest across China. By contrast, the spatial variability of ET CV was not line with the ET value: the high ET occurred in the Amazonian Plain and Brazilian plateau, whereas high ET CV occurred in South China ( Figure 2B). Additionally, the ET tended to increase in the Eurasia and Brazilian plateau, while decreasing in the Amazonian Plain ( Figure 2C). Overall, about 20% of basins showed the significant trends, and the significant increases were mainly in the Northwest China, Europe, and the midwest U.S, while the significant decreases were mainly in the Congo Basin and Amazonian Plain) ( Figure 2D). In conclusion, ET regarding magnitude, temporal variability and trend showed the high spatio-temporal heterogeneity.

Overall assessment of ET products
All ET products could reproduce the spatial distribution for climatological values of ET with high spatial R values ≥ 0.90 (Table 3). Among these products, the PML performed slightly better than the other products with the highest and R value of 0.96, though not with optimal β and γ. The β values for most ET products were consistently around the optimal value of 1.0, except for GLASS (1.27 for β) and MERRA-2 (1.22 for β), suggesting that the magnitudes of climatological values were well captured by most ET products. However, the spatial variabilities of ET tended to be Regarding the temporal variability, all ET products generally underestimated the CV (0.34≤ β ≤0.89), but evidently overestimated its spatial variability (1.21≤ γ ≤2.39). Moreover, the spatial distribution of ET CV were poorly captured be most ET products, with GLEAM-3.5a having the maximum R value of 0.25 among the eight ET products. Overall, the KGE values were mostly negative, ranging from −0.80 (MTE) to 0.04 (MERRA-2), indicating that most ET products had limited KGE-based ability to simulate ET temporal variability. In the view of the ET trend, the directions (i.e., upward and downward) could be hit by most products, with 59.29% for PML ≤ TPR + TNR≤65.66% for MERRA-2. However, the FPR, near to and even larger than the TNR, suggested that the negative trends would be misidentified as positive trends, especially for GLASS (39.68% versus 3.04%). The R values ranged from 0.09 (MTE) to 0.36 (MERRA-2) indicating that GLASS and MERRA-2 with values above 0.30 could capture the ET trends in space. Except for reanalysis products underestimating the ET trend, all others overestimated the ET trend, with β larger than 1.0. By contrast, all products underestimated the spatial variability of the ET trend, with −0.19 (MERRA-2)≤ γ ≤0.18 (GLEAM-3.5a).
All KGE values were negative, indicating that these poor overall performance of these ET products in capturing the ET trend. Figure 3 shows the metrics of β, γ, R, and KGE for 1,381 basins. The majority of ET products overestimated ET at more than 50% of basins, especially GLASS and MERRA-2 which overestimated ET at above 92% of basins ( Figure 3A). However, PML, GLDAS2.0-Noah and MTE underestimated the ET at more than 50% of basins. Spatially, the β values displayed evident spatial differences, with most of ET products greatly overestimated the ET in China, Europe, and North America. Considering the metric of γ ( Figure 3B), all ET products tended to underestimate the ET temporal variabilities at over 70% of basins. When γ <0.2, ET products, especially MTE and GLASS, underestimated the ET temporal variabilities at around 30% of basins worldwide. Additionally, the overestimates of ET temporal variabilities tended to be at American Midwest. About the spatial patterns of β and γ, it is worth noting that the higher ET magnitude estimates were accompanied by lower ET variability estimates, since the ratio of basins having β >1.0 outweighed the ratio of basins having γ <1.0 for most ET products except PML, GLDAS2.0-Noah, and MTE ( Figures 3A,B). Regarding temporal fluctuation ( Figure 3C), positive R values were observed for 59.30% (MERRA-2) to 84.50% (ERA5-Land) of basins, especially MERRA-2 with R > 0.6 at nearly 20% of basins, indicating that ET products had a broad R-based ability to simulate ET temporal fluctuation. High R values (around 0.8) mainly appeared in the Midwest United States, South Africa, Western Australia. However, the average R values for all ET products were slightly low, ranging from 0.06 for GLDAS2.0-Noah-0.24 for ERA5-Land. Based on KGE ( Figure 3D), negative KGE values were found in 47.65% (E2O-En) to 71.76% (MTE) of basins, with general negative basin-averaged KGE values (−0.14 (GLASS) to 0.03 (E2O-En)), indicating that all ET products had the limited overall performance for temporal scale.

Validation by aridity regimes
In terms of the climatological values of ET under dry and wet conditions (Figure 4), except GLASS under all conditions and MERRA-2 under wet condition, the ET products could reproduce the magnitudes of ET with 0.86 for PML≤ β ≤1.17 for ERA5-Land under dry condition and with 0.96 for MTE≤ β ≤1.07 for ERA5-Land under wet condition, which was consistent with the results presented in Section 3.1 (Table 3). In particular, most of the ET products underestimated the water balance ET above 1,200 mm (Figure 4), which mainly occurred in Amazonian Plain and Brazilian Plateau ( Figure 2). As for γ, most of the ET products could generally detect the spatial variability for the climatological values of ET under dry and wet conditions, corresponding to a range of 0.69 (ERA5-Land)≤ γ ≤1.32 (GLASS) and 0.72 (ERA5-Land)≤ γ ≤1.07 (GLASS), respectively. Broadly, the spatial variability estimates of ET under dry condition tended to be higher than those under wet condition (represented as γ dry > γ wet) except ERA5-Land and MTE. Regarding R, the ET products had a high R-based ability to simulate spatial distribution of ET with R > 0.8 under dry and   Frontiers in Environmental Science frontiersin.org 09 wet conditions. Meanwhile, the ET products could better represent the spatial distribution of climatological ET (except for GLASS, ERA5-Land, and MTE) under dry basins than wet basins (represented as R_dry > R_wet). As for KGE, ET products exhibited the high overall performance on climatological ET conditioned by aridity, especially generating the highest KGE values for GLDAS2.0-Noah (0.89) under dry condition and PML (0.94) under wet condition.
With β at~1.0, the magnitude of temporal variability of ET tended to be more easily simulated under dry condition, compared with wet condition ( Figure 5). As for γ, the spatial variability of ET temporal variability was generally overestimated by ET products under all aridity conditions, with 0.87 for GLDAS2.0-Noah≤ γ ≤1.47 for ERA5-Land under dry condition, and 0.79 for GLASS≤ γ ≤1.95 for PML under wet condition. As for R, the ET products could detect the spatial distribution of ET CV under dry basins, of which the highest R value was 0.76 for MERRA-2, followed by 0.69 for E2O-En. However, under wet condition, the ET products presented a contrasting performance, compared with dry condition, with R values ranging from −0.07 to 0.12. Considering KGE, similar to R, the ET products could not simulate the ET CV under wet condition, whereas, under dry condition, ERA5-land, MERRA-2, GLEAM3.5a, E2O-En and GLDAS2.0-Noah showed better overall performances, generating a KGE above 0.40.
Taking the ET trend into consideration (Figure 6), more than 50% of the total number of basins were located in the first and third quadrants, with 46.89% for ERA5-Land ≤ TPR + TNR≤78.00% for MERRA-2 under dry condition and 54.20% for GLASS ≤ TPR + TNR≤63.26% for ERA5-Land under wet condition. This indicates that most of the ET products can capture the ET trend directions. Despite that, it is worth noting that FPRs outweighed the TNRs under wet condition. This suggested that under the wet condition, these products tended to change the negative ET trends to the positive ET trends. Based on β, under wet condition, most of the ET products (except ERA5-Land and MERRA-2) tended to underestimate the magnitude of ET trend, with −19.63 for GLASS≤ β ≤-4.93 for GLEAM-3.5a. By contrast, under dry condition, the underestimations of the ET trend got relieved, with −2.06 for ERA5-Land≤ β ≤4.76 for GLASS, except that general underestimations still existed in dry condition. In addition, ET products underestimated the extreme ET trends over the wet basins (<−5 and >5 mm yr −1 ), which mainly occurred in the Amazonian Plain and Brazilian Plateau ( Figure 2). As for the spatial variability of ET trend, the γ values were around zero for all ET products under wet condition, ranging from −0.09 (GLEAM-3.5a) to 0.18 (ERA5-Land), while the γ values were more deviated from optimal value (1.0) for most ET products under dry condition, ranging from −23.21 for GLEAM-3.5a to 23.53 for E2O-En. Overall, all ET products exhibited limited R-based ability to simulate spatial distribution of the ET trend, with 0.09 for GLASS ≤ R ≤ 0.65 for MERRA-2 under dry condition and 0.05 for MTE ≤ R ≤ 0.33 for GLASS under wet condition. Furthermore, the overall performance for each ET product under all aridity conditions was poor with −23.25 (GLEAM-3.5a)≤KGE ≤ 0.44 (MERRA-2) under dry condition and −19.66 (GLASS)≤KGE ≤ −1.05 (ERA5-Land) under wet condition. Notably, the overall performances were generally worse for the latter. Temporally, as shown in Figure 7A, regarding β, except PML, GLDAS2.0-Noah, and MTE, the   Frontiers in Environmental Science frontiersin.org magnitudes of ET were overestimated at 51.33% for GLEAM-3.5a to 98.89% for GLASS of dry basins, and at 51.66% for GLEAM-3.5a to 96.46% for MERRA-2 of wet basins. The basin-averaged β values for the ET products (except GLASS and MERRA-2) were both near to 1.0 under dry and wet conditions ( Figure 7B). The basin-averaged γ values under dry condition were also close to 1.0 for most products, while under wet condition the values were overwhelmingly low, ranging from 0.17 for MTE to 0.72 for MERRA-2. As for R ( Figure 7C), more than 50% of basins exhibited a value over zero for most products under all conditions. Despite that, each ET product showed a higher R-based ability to simulate ET temporal fluctuation under dry condition than wet condition, with average R values ranging from 0.15 for GLDAS2.0-Noah to 0.46 for MERRA-2 under dry condition and −0.03 for MERRA-2 to 0.17 for ERA5-Land under wet condition. As for KGE ( Figure 7D)

Validation by vegetation conditions
From perspective of climatological ET, the magnitude and spatial variability of ET could be represented by most of the ET products across all vegetation conditions (Figure 8), with both β and γ around 1.0. However, most of the ET products (excluding GLASS) also underestimated the ET values above 1,200 mm under LAI-III condition, which mainly exist in Amazonian Plain and Brazilian Plateau (Figure 2). Concerning R, the capacity to simulate the spatial distribution of climatological ET increased first, and then decreased as vegetation became greener for most ET products except GLDAS2.0-Noah, In terms of KGE, most ET products show good KGE-based performance. In addition, GLASS, ERA5-Land, MERRA-2, E2O-En, PML, and MTE showed that the KGE-based performance was the best under LAI-II condition.
In terms of the ET CV (Figure 9), most ET products (except GLASS and MTE) reasonably estimated ET magnitude under LAI-I condition, with 0.86 for PML≤ β ≤1.34 for ERA5-Land. However, the β values were limited for other vegetation conditions, with 0.22 for MTE≤ β ≤0.62 for PML under LAI-II condition and 0.14 for PML≤ β ≤0.67 for MERRA-2 under LAI-III condition. The β values for the ET temporal variability decreased as the vegetation turned green for each ET product. And the γ values for the spatial variability of ET temporal variability tended to be overestimated under all vegetation conditions. For R, all the ET products (except PML) had the limited R-based ability to simulate the spatial distribution of ET temporal variability, with vegetation greening. For example, the R values under LAI-I, LAI-II, LAI-III conditions ranged from 0.24 to 0.69, 0.24 to 0.38, and −0.09 to 0.13, respectively. Similar trends were occurred to KGE, except that the overall performance of KGE was even worse than that of R-capacity.
In the view of ET trend (Figure 10), its condition is similar to the aridity condition. The ET products could hit the ET trend directions,

FIGURE 11
Box plots of evaluation metrics for ET products under LAI-I, LAI-II, and LAI-III conditions. (A-D) represent the KGE and its components R, β and γ, respectively. The blue, red and green represent LAI-I, LAI-II, and LAI-III conditions, respectively. The dashed lines represent the basin-averaged value.
Frontiers in Environmental Science frontiersin.org 13 with 50.73% for PML ≤ TPR + TNR79.88% for MERRA-2 under LAI-I condition, 52.58% for GLASS ≤ TPR + TNR≤66.39% for PML under LAI-II condition, and 52.09% for PML ≤ TPR + TNR≤66.85% for ERA5-Land under LAI-III condition. Additionally, FPRs outweighed the TNRs for ET products (except ERA5-Land and MERRA-2) under LAI-II and LAI-III conditions, for example, for GLDAS2.0-Noah, FPR versus TNR was 39.27% versus 8.82% under LAI-II condition, and 40.96% versus 7.32% under LAI-III condition, indicating that the ET products tended to misidentify the negative ET trends as positive ET trends. Based on β, except ERA5-Land and MERRA-2, the ET products tended to seriously underestimate the magnitudes of ET under LAI-III condition, with −8.21 for GLASS≤ β ≤-3.15 for MTE. And the values of β were much larger than 1.0 under LAI-II condition (excluding ERA5-Land and MERRA-2), suggesting that the overestimation occurred in LAI-II condition. As for γ, all the ET products underestimated the spatial variability of the trends (excluding MERRA-2, GLEAM-3.5a, E2O-En and MTE for LAI-I condition). As for R values, the ET products (except GLASS, MTE and PML) showed lower correlations with the greening of vegetation. Interestingly, the ET trends were remarkably overestimated by most products in LAI-II condition, and slightly underestimated under LAI-I and LAI-III conditions. As for KGE, most of the ET products had bad performance with negative values under all conditions. Especially under LAI-II and LAI-III conditions, they had almost no simulability.
Temporally, the basin-averaged β values were around the 1.0 for all vegetation conditions ( Figure 11A), though the temporal magnitudes of ET were either overestimated or underestimated by the ET products. Considering γ (Figure 11B), the basinaveraged values for all the ET products significantly decreased with the vegetation turning green, and were overestimated under LAI-I condition, but underestimated under the other vegetation conditions. It is worth noting that as vegetation was getting greener, the R-based ability for all the ET products was significantly constrained ( Figure 11C). Specifically, all the ET products consistently performed, and the average R value and the basin percentages of the R values over zero decreased with vegetation greening. Figure 11D clearly shows that, like R-based ability, the basin-averaged overall performances of all the ET products decreased, as the vegetation was getting greener, except that the KGE values were lower than R values.

Validation by dynamic aridity or vegetation conditions
In this study, the simulations of ET derived from the eight methods were evaluated by the water balance ET of global 1,381 basins under various water, energy, and vegetation conditions. Since water, energy, and vegetation are crucial for accurately simulating ET, the lack of sufficient their information, caused by the lack of ET algorithm, forcing data and calibration methods, affects the performance of ET simulation (Xu et al., 2019;Elnashar et al., 2021;Li et al., 2022;Yu et al., 2022). As is shown, the comprehensive performance of ET products (Figures 7, 11) and the capture of ET variance (Figures 5,9) regularly decrease, with the humidity and vegetation greenness increasing. These phenomena imply that the accuracy of the ET simulations may decrease, when the regional climate is wetting and the global vegetation is greening (Mankin et al., 2017;Lian et al., 2021;Zhang et al., 2022). Additionally, the ET products tend to misidentify the negative trends as the positive trends, especially under wet and LAI-III conditions, implying that the estimates of ET trends may be overestimated across the globe or in wet and LAI-III conditions (Figures 6, 10). These issues will be further discussed in the following.
In terms of the impact of water and energy denoted by AI, ET process in dry or wet regions can be conceptualized as a water-or energy-limited process, respectively: ET under dry conditions is water-limited, in that it is constrained by the soil moisture available for ET, while ET under wet conditions is energy limited, since there is sufficient soil moisture available for ET. Therefore, the maximum rate and temporal variations of ET proceeds are determined by atmospheric water demand (potential evapotranspiration) rather than soil moisture (Draper et al., 2018). All the ET products could better capture the mean annual value of all aridity conditions. However, the ET CV in wet basins tend to be more remarkably underestimated than in dry basins, by the ET products except GLASS and PML (Figures 5, 7). Indeed, wet zones have more active land-atmosphere coupling than dry zones, in that the inevitable ET algorithm errors or data forcing errors magnify the uncertainties under wet zones. For instance, Penman-Monteith method (GLDAS2.0-Noah, MERRA-2 and PML) is primarily driven by net radiation (R n ) under wet zones using a linearized approximate solution (Gao, 1988;Grignon, 1992;Leca et al., 2011), which is sensitive to low vapor pressure deficit (VPD) and may induce considerable problems in the extreme conditions (such as the water balance ET higher 1,200 mm ( Figure 4) and extreme ET trends ( Figure 6) and the soil evaporative term (Bai and Liu, 2018;Blatchford et al., 2020). More importantly, the presence or absence of ET products TWSC components in simulating ET under dry and wet areas cannot be ignored. However, most ET methods do not have an aquifer storage component, and LSMs lack a good representation of groundwater withdrawal for agricultural depletion, such as irrigation Zeng and Cai, 2018). Additionally, the errors in the ET estimates and differences among the ET products are also mainly dependent on various inputs .
The surface variables also control the ET process, especially vegetation Zheng et al., 2022). Similarly, the response of ET products to vegetation was investigated. Regarding the mean annual value, we found that the simulability of datasets first increased and then decreased, with the increase of vegetation density (Figure 8), in line with the Lu et al. (2021). In addition, we also confirmed that the comprehensive performance (KGE) of ET products decreases as the vegetation is getting greener (Figures 8,  10). The first reason for this is that whether ET algorithms take the LAI or vegetation dynamics into consideration. For example, GLEAM-3.5a model lacks vegetation-related information, though it considers the vegetation optical depth, which may result in lower accuracy in high vegetation regions (Martens et al., 2017;Xu et al., 2019;Qiu et al., 2022). Another reason is that the ET algorithm do not comprehensively consider the vegetation process in hydrology or energy cycle. MERRA-2 overestimates the interception loss fraction defined as the fraction of rainfall, i.e., rainfall intercepted by the canopy and reevaporating back into the atmosphere without infiltrating into the soil or causing surface runoff (Reichle et al., 2011;Bosilovich et al., 2017;Gelaro et al., 2017;Reichle et al., 2017;Hinkelman, 2019), which could explain why the MERRA-2 generally has the highest β under various LAI conditions among the eight ET products (Figures 8, 11, and Lv et al. (2020)). The last easily neglected issue is related to the model forcing data. One aspect of the issue is the forcing data errors. The accuracy of LAI dataset is impacted by the leaf shadowing (Mehrez et al., 1992), especially tall and dense vegetations. Besides, shaded leaves are not light-saturated, leading to diffuse sunlight conditions and then having a higher fraction of FAPAR (the fraction of photosynthetically active radiation absorbed by the canopy) (Jimenez et al., 2011;He et al., 2013;Xu et al., 2019). Another aspect of the issue is the forcing data settings, for instance, MTE ET product was generated from machine learning method by compiling the 253 globally distributed flux towers data and remote sensing data, including vegetation information (FAPAR). We speculated that the varying performance of MTE product with various LAI conditions was probably driven by data settings. For example, the vegetation was used to do split not regression, which results in inadequate vegetation information (Jung et al., 2010). Or ERA5-Land was used to generate land elements data including ET, by using a static monthly climatology of a fixed land use and leaf area index (LAI) (Muñoz-Sabater et al., 2021). And GLDAS2.0-Noah also uses a static land use, though with high spatial resolution (Rodell et al., 2004). Therefore, they ignored the change of land cover and cities, and lost more frequent LAI anomalies during the reanalysis period (Muñoz-Sabater et al., 2021).
The model calibration methods also have a significant impact on the performance of ET simulation. One problem concerning the methods is that the ET simulations are often calibrated with the mean annual value not the variance and trend of actual ET, though considering multiple calibration metrics. Another problem is that the data used for calibration are often EC site data that are not representative of the regional scale (Bai and Liu, 2018;Xu et al., 2019). In addition, as far as we know, the ET products except GLASS and MTE are accompanied by component data such as soil evaporation, vegetation evapotranspiration and water surface evaporation, but these data are not calibrated with sufficient actual measurements (Swanson, 1994;Brunel et al., 1997;Chen et al., 2014).

Uncertainties
The uncertainties in Pre and TWSA products are the largest source of uncertainties in assessing the ET products . According to the water balance budget, the assessment of globalscale ET products needs to rely on grid-scale Pre and TWSA products, although the global-scale observatory data is difficult to collect. As for the three Pre products selected in this study (GPCC, CPC-Unified, and CRU TS4.05), the uncertainties derive from the number of stations used, the time homogeneity and the quality control procedures (Trenberth et al., 2014;Sun et al., 2018). However, these products are interpolated from an unprecedented number of station data and are the most reliable precipitation products currently available . Regarding the TWSA data (GRACE-REC and GRID-CSR-GRACE-REC), the uncertainties arise mainly from the models used for the reconstruction (pre-2002) and the driving data (Gyawali et al., 2022). However, the correlation of GRACE-REC with yearly streamflow anomalies have median value of around 0.60 over 1981-2010 (Humphrey and Gudmundsson, 2019); the GRID-CSR-GRACE-REC has high correlation with Global Mean Sea level with R of 0.91 . We further investigated the uncertainties in water balance evapotranspiration defined as the CV of the six Pre-TWSC-Q combinations, and found that most of the basins with uncertainties of <0.10 and uncertainties above 0.10 were located mainly in the Midwest USA and Southwest China (rainfall gauges are more sparsely distributed in high mountain areas) and the Arctic (Figure 12). In addition, Q data may be affected by human harvesting of deep groundwater and inter-basin water transfers . However, TWSC can reasonably take into account the impact of human activities on Q. Moreover, in validating the model, only the terrestrial water balance (not the atmospheric water balance) is considered, and the measured evapotranspiration Frontiers in Environmental Science frontiersin.org values lack the cross-validation to further reduce the error with the true values (Li et al., 2019). The generalizability of our results to other regions of the world may be subject to additional uncertainty, as the basins included in this study do not cover the entire globe. However, it is important to note that the performance of evapotranspiration products varies with dryness and vegetation greenness, and it is necessary to ensure that all types of dryness and vegetation greenness are covered (Figure 1). To minimize errors caused by different spatial resolutions, all ET products were reinterpolated linearly to 0.5°before evaluation. Furthermore, our analysis is based on observed ET using the water balance method, which represents the average ET of watersheds controlled by hydrological stations, reducing the uncertainty caused by a single grid point to some extent. The scale effect on ET product performance related to aridity and vegetation greenness response needs further exploration in future research.

Conclusion
This study conducted a comprehensive assessment of terrestrial ET products to improve ET products. In this study, drawing on the data of water balance ET from 1981-2010 collected from 1,381 basins, we examined eight ET products: one remote sensing product (GLASS), two reanalysis products (ERA5-Land and MERRA-2), four LSM-based products (GLEAM-3.5a, E2O-En, PML and GLDAS2.0-Noah), and one machine learning-based product (MTE). Besides, to gain a deeper insight into the eight ET estimates under various conditions, the potential impact of aridity and vegetation greenness were taken in consideration. The evaluation results are summarized below: (1) In view of the performance at the global scale, the ET products had advantages in capturing the mean annual value of ET, with relatively high KGE values, among which the PML performed the best with 0.89 for KGE. Despite that, the ET products had limited KGE-ability to simulate the ET variability with highest KGE of 0.04 for MERRA-2 and the trend with highest KGE of −1.47 for GLEAM-3.5a. In addition, the ET products tended to underestimate the ET temporal variability and overestimate its spatial dynamics, while they tended to overestimate the ET trend and underestimate its spatial dynamics. It is worth noting that the ET products tended to misidentify the negative ET trend as positive trend.
(2) For each basin, the ET products always overestimated the ET values and underestimated the ET temporal variability at more than 50% of basins. And the ET products had a wide R-based ability to simulate the ET temporal fluctuation, for the ET products had positive R values at 59.30% (MERRA-2)-84.50% (ERA5-Land) of basins. The high R values mainly appeared in the Midwest United States, South Africa, Western Australia. However, all ET products showed the limited KGE-ability at the temporal scale. (3) As for different aridity regimes, the performances of ET products were completely opposite in dry and wet areas. Spatially, the ET products showed lower ability to capture the temporal variability and the trend of ET under wet condition than dry condition. And overall, the ET products tended to misidentify the negative ET trend as positive trend, which only existed in wet condition. Temporally, the overall performances of ET products were limited under wet condition, for the ET products performed the negative KGE values under wet condition, and the positive KGE values under dry condition at more than 60% of basins. (4) Considering the dynamic performances with varying vegetation, the spatial and temporal performances of ET products were strongly affected by vegetation greenness, which is similar to the situation with aridity regimes. Spatially, as vegetation became greener, the performance of simulated climatological ET increased first and then decreased, and gradually limited the ability to simulate the spatial distribution of ET temporal variability. Meanwhile, the ET products tended to misidentify the negative ET trend as positive trend under lush vegetation condition. Temporally, the basin-averaged overall performances of all the ET products decreased, as the vegetation was getting greener.
Overall, the performances of ET products were poor in wet or vegetated areas, suggesting that the accuracy of ET products may decline in the future when the climate becomes wetter and the vegetation becomes greener. Therefore, this work is hopefully to improve our understanding about the spatio-temporal performance of the ET products, and contribute to the directional optimizations and effective applications of ET products.

Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.