Statistical Analysis of the Weather Impact on Robusta Coffee Yield in Vietnam

Weather and climate strongly impact coffee; however, few studies have measured this impact on robusta coffee yield. This is because the yield record is not long enough, and/or the data are only available at a local farm level. A data-driven approach is developed here to 1) identify how sensitive Vietnamese robusta coffee is to weather on district and provincial levels, 2) during which key moments weather is most influential for yield, and 3) how long before harvest, yield could potentially be forecasted. Robusta coffee yield time series were available from 2000 to 2018 for the Central Highlands, where 40% of global robusta coffee is produced. Multiple linear regression has been used to assess the effect of weather on coffee yield, with regularization techniques such as PCA and leave-one-out to avoid over-fitting the regression models. The data suggest that robusta coffee in Vietnam is most sensitive to two key moments: a prolonged rainy season of the previous year favoring vegetative growth, thereby increasing the potential yield (i.e., number of fruiting nodes), while low rainfall during bean formation decreases yield. Depending on location, these moments could be used to forecast the yield anomaly with 3–6 months’ anticipation. The sensitivity of yield anomalies to weather varied substantially between provinces and even districts. In Dak Lak and some Lam Dong districts, weather explained up to 36% of the robusta coffee yield anomalies variation, while low sensitivities were identified in Dak Nong and Gia Lai districts. Our statistical model can be used as a seasonal forecasting tool for the management of coffee production. It can also be applied to climate change studies, i.e., using this statistical model in climate simulations to see the tendency of coffee in the following decades.


INTRODUCTION
Coffee is one of the most traded agricultural commodities in the world. It is a major cash crop and an important foreign exchange earner for many coffee producing countries. Robusta coffee (Coffea canephora Pierre ex A. Froehner) is the second most economically significant species of coffee plants (FAO, 2019; United States Department of Agriculture -Foreign Agricultural Service, 2019).
Several studies have indicated high sensitivity of coffee to weather (Craparo et al., 2015;Kath et al., 2020Kath et al., , 2021 and climate (Davis et al., 2012;Bunn et al., 2015b;Moat et al., 2017). The interactive effects of precipitation and temperature define where coffee can be grown as an economically profitable crop as well as year-to-year variability in coffee yield and quality. High sensitivity to changing precipitation patterns and increasing temperature have raised strong concerns regarding future viability of coffee to support many coffee farmers' livelihoods and meet future coffee demand in a climate change context. For example, climate change is expected to significantly reduce areas suitable for coffee growth (Bunn et al., 2015b;Imbach et al., 2017). Furthermore, for robusta coffee in Vietnam and Indonesia, every 1°C increase in mean minimum/maximum temperatures (above 16.2/24.1°C) during the growing season was suggested to lead to a 14 %-yield reduction (Kath et al., 2020). Another study indicates that precipitation-sensitivity is more important for robusta coffee than arabica coffee, the latter being more sensitive to temperature (Bunn et al., 2015b).
Although weather and climate play an important role in agriculture, few studies have addressed the weather impact on coffee yield, especially for robusta coffee. Long-term coffee yield time series for several contrasting environments are required for a robust statistical assessment; however, such data are mostly lacking. Aggregated data on different administrative levels are publicly available from several countries; yet again, most countries have a limited time record and/or report yields only for large spatial areas. Thus, the few studies that assessed weather impact on coffee yield have used yield datasets over some specific locations (e.g., few Brazilian municipalities or farm-level data) with limited time series (the maximum of 10 years) (de Oliveira Aparecido et al., 2017;Valeriano et al., 2018;Kath et al., 2020); while previous global studies made use of indirect coffee information such as the occurrence of coffee-production without yield data (Bunn et al., 2015a, Bunn et al., 2015b. This study focuses on robusta coffee of the Central Highlands of Vietnam, the biggest robusta coffee producing country, using 19 years of data. Previous studies analyzed differing weather/climate variables from different sources, such as data from surface meteorological stations, interpolated high-resolution climatology data or gridded time-series weather data from satellite and/or models. For example, de Oliveira Aparecido et al. (2017) used weather measurements obtained in automated meteorological stations to forecast coffee yield in Minas Gerais, Brazil. This strategy is only viable in areas with a high density of meteorological stations covering a substantial spatial extent, a long-enough time record, and good quality-check procedures. This is not available in most coffee growing areas. Many studies have also used gridded weather data as major inputs for regional yield forecasting (Zhang and Huang, 2012;Valeriano et al., 2018). Bioclimatic variables based on climatology from WorldClim (Hijmans et al., 2005) have widely been used with species distribution models (O'Donnell and Ignizio, 2012;Bunn et al., 2015a;Bunn et al., 2015b;Läderach et al., 2017;Gomes et al., 2020;Davis et al., 2021).
The approach to link weather to coffee yield can take several forms, from statistical approaches as outlined above to processbased models that intend to integrate the current understanding of plant growth and development (Rodríguez et al., 2011;Rahn et al., 2018;Vezy et al., 2020). The process-based approach is often exemplified at a local scale (e.g., a farm level). This study thus prefers a statistical approach to explore the weather variables affecting coffee production on a larger spatial scale (e.g., the district and/or provincial levels). The challenge with statistical models is the limited time-series yield data (less than 20 years of data). Because 1 year of data represents one sample, it is not easy to obtain enough samples to evaluate the weather information and build impact models. Moreover, once a statistical model is calibrated on historical data, it is particularly challenging to quantify the model performance in different regions or weather conditions, using years not included in the historical record. We aim to tackle this using information content techniques, dimension reduction techniques, multi-linear models, and generalization estimation techniques.
Our study focuses on the coffee yield of the leading robusta coffee-producing provinces in the Central Highlands of Vietnam on province and district levels. The objectives of this study are: 1) to propose a simple approach that identifies the most important weather predictors related to coffee yield; 2) to propose a mathematical framework to predict the weather-related variability of coffee yield; and 3) understand the weather response of Vietnamese robusta coffee.

Study Area
Vietnam is the second-largest coffee producer worldwide after Brazil and the leading robusta coffee producer, with more than 40% of global robusta coffee in 2019 (FAO, 2019; United States Department of Agriculture -Foreign Agricultural Service, 2019). All Vietnamese robusta coffee comes from the Central Highlands, which lies on a series of contiguous plateaus with an elevation ranging from 200 to 1,500 m, with an annual average temperature of about 22°C and an annual rainfall of 1800-2000 mm per year. Due to the long dry season of 5 months, coffee farms are irrigated (starting in January/ February up to the rainy season). The Central Highlands region includes five provinces: Kon Tum, Gia Lai, Dak Lak, Dak Nong, and Lam Dong; each province is divided into several districts. Figure 1 presents the study area in the Central Highland of Vietnam, including its elevation ( Figure 1A) and the coffee information (i.e., average planted area, yield, and production in Figures 1B-D, respectively). Here, we focus on high-intensity robusta coffee production districts, i.e., 20 districts with a coffee planting area higher than 10 thousand hectares ( Figure 1B). Figure 1D shows that Dak Lak and Lam Dong districts produce much more coffee than other districts. From 2000 to 2018, Dak Lak produced on average 35.36% of Vietnam's total coffee (about 1,200 tons), followed by Lam Dong with 27.33%, Dak Nong with 14.82%, and Gia Lai with 13.27%. The coffee harvest season occurs from October to January.

Yield Data
The provincial and district level coffee statistics were obtained from the General Statistics Office of Vietnam (GSOV) for a period of 19 years, from 2000 to 2018 ( Figure 2A).
Long-term trend identification-To analyze the yield data, the long-term trend was first determined. This long-term trend in coffee yield often describes the changes in agricultural practices (e.g., agro-management and irrigation practices) or indirect factors affecting these practices, such as changes in coffee price (Mathieu and Aires, 2016;Miao et al., 2016). Consequently, suppressing this trend from the yield time series allows removing the impact of non-weather related factors.
Anomaly calculation-Analysis of weather impact on coffee yield was based on the absolute yield-anomalies with respect to the previous long-term trend. The coffee yield for year t is defined as y(t), and the long-term trend value is represented by y(t). Both y(t) and y(t) are in 10 3 kg per hectare (10 3 kg·ha −1 ). The absolute yield-anomaly a(t) is defined as the percentage variation around the long-term trend: If a(t) > 0, then the yield in year t is higher than average, and vice versa.
Trend identification-Yield steadily increases from 2000 to 2018 ( Figure 2A). Therefore, a simple linear function was used to define the robusta coffee trend: where y 0 is the yield in the year 2000, and α is the annual rate of increment. In Figure 2B, the robusta coffee yield anomalies from 2000 to 2011 show more variability, while fluctuations are lower in the second half period (i.e., 2012-2018). This might be due to the coffee price drop at the beginning of the 21st century, leading to less investments in agronomic management and therefore being more sensitive to weather fluctuations (Miao et al., 2016). In contrast, in the second period, robusta coffee yield became more stable and weather-independent, most likely due to improving agricultural practices.

Phenological Processes and Relationship With Weather
As we aim to study the weather impact on robusta coffee yield, it is necessary to understand the phenological processes and how they are affected by weather. In general, a coffee plant takes roughly 3 years to develop, from seed germination to first flowering and then fruit production (Arcila-Pulgarín et al.,  Frontiers in Environmental Science | www.frontiersin.org June 2022 | Volume 10 | Article 820916 4 2002; Wintgens, 2004). It is productive for about 30 years (Wintgens, 2004) but can be more than 50 years, or even up to 80 years with good management. The phenological cycle of coffee trees varies depending on various factors such as the species, variety, weather, and agriculture practices. Figure 3 exemplifies broad phenological stages for robusta coffee in the Central Highlands of Vietnam. Mature coffee trees undergo several plant growth and developmental stages before harvesting: • Flower bud formation and growth: During this period, the development of a serial bud into a flower bud is largely controlled by plant hormones activated by the change in day length and/or a drop in temperature (Descroix and Snoeck, 2004) (Majerowicz and Sondahl, 2005). At the end of the growing phase, these buds enter into a dormant phase until flowering is stimulated. • Flowering: The dormancy of the flower buds is usually broken by sudden relief of water stress (e.g., the first rains of the wet season, irrigation) and/or a drastic fall in temperature. Water is necessary for this process, but heavy rain could cause an acute form of floral atrophy. The flowers are often pollinated within 24-48 h. • Fruit development, including pinhead (a very young fruit) grain expansion, and grain (bean) formation: After pollination, the very young fruits (e.g., pinheads) remain dormant with low water requirements. From the second to third month after blossoming, the fruits start to swell, which increases the water requirements. Thus, this process usually happens during the onset of the rainy season. • Maturation (fruit ripening): Between the sixth and the eighth months after fertilization, the fruit reaches maturity, which occurs during the beginning and/or the middle of the dry season since adequate temperature and sunlight promote the ripening of coffee fruits.
The time required from flowering until fruit ripening varies according to various factors (e.g., variety, climate condition, management). In general, it is about 9-11 months for robusta coffee.

Weather Data
Monthly-mean data-The monthly-mean of different weather variables, including total precipitation and 2 m temperature data among others, were collected for the period 2000-2018, from the ERA5-Land (covering land areas of the ERA5 re-analysis of the European Center for Medium-Range Weather Forecasts (ECMWF) (Hersbach et al., 2018)), with a spatial resolution of 0.1°× 0.1°(about 10 km × 10 km). The data are available from https://cds.climate.copernicus.eu/cdsapp#!/home. The data is projected from its original 0.1°× 0.1°regular grid into the resolution of the selected administrative levels (e.g., provincial and district levels) to match the resolution of coffee yields, presented in Figure 4. Another alternative could be analyzed by, for instance, agroclimatic zones. However, there is a need to gather coffee yield by corresponding zones in this case. This step can mix the coffee information or require yield data at a higher resolution. As a result, our study will work at the same resolution levels provided by the coffee yield information.
Weather variables-In the literature, statistical approaches often consider precipitation, air temperature (minimum, mean, and maximum), solar radiation, and/or other variables derived from these basic variables (de Oliveira Aparecido et al., 2017;Valeriano et al., 2018;Kath et al., 2020Kath et al., , 2021. This study selects monthly total precipitation (P) and average temperature (T) as explanatory variables for robusta coffee yield. Other variables such as solar radiation, maximum and minimum temperature were also assessed, but they were excluded because they either show low correlations to the yield anomalies or are highly correlated to the precipitation and temperature variables. The weather data will be evaluated for n = 19 months: from the beginning of the bud development to the peak of the harvest season H 0 (as shown in Figure 3). The value n varies from 1 to 24 (shown in Figure 5), depending on the coffee phenology in the study locations. Here, we use n = 19, resulting in 38 potential inputs (i.e., 19 realizations (chronological data) for P and T series), for modelling Vietnamese robusta coffee yield.
Similar to the yield data, we considered the trend in weather anomalies, but the trend of about 2 decades is not significant compared to the inter-annual variations (Mathieu and Aires, 2016). Therefore, the long-term trend is omitted, and relative anomalies will be computed from the long-term average. This average value is calculated for each of the n months before H 0 . Furthermore, a 3-months moving average is applied to reduce the variability at the monthly scale. This variability would introduce instabilities caused by the short time series of the dataset (Dinh and Aires, 2022).

Weather-To-Yield Impact Model
There are many regression models that could be considered for measuring the impact of weather anomalies on coffee yield. As regression models are purely data-driven, a learning database is required to empirically describe the relationships between the inputs and the desired coffee yield. Depending on the available data, models can be chosen from domains such as traditional statistics, Bayesian statistics, machine learning, or artificial intelligence. Yield modeling has been based on linear regression (Kouadio et al., 2018), random forest (Kouadio et al., 2018), neural networks (Mathieu and Aires, 2018), or mixed-effect models (Mathieu and Aires, 2016).
Ideally, the complexity of the model is adjusted to the complexity of the relationship under consideration, i.e., the effect of weather on coffee phenology. Thus, a complex regression model would be necessary to model all the processes involved in plant growth and development (described in Section 2.1.3). However, in practice, we are often limited to more simple statistical models due to the limited number of samples (Dinh and Aires, 2022).
As a result, a simple multiple linear regression method is used here to model the relationship between the observations of coffee yield anomalies a(t) and the weather input anomalies X i . This model is formulated as: where n is the number of input variables. The unknown coefficients α i (i = 1, 2, . . ., n) are the model parameters that need to be optimized based on the dataset of samples β = {(s 1 , s 2 , . . ., s m ); m = 1, 2, . . ., m}, where m is the number of samples (i.e., number of years multiplied by the number of locations when several locations are mixed in β). The samples are composed of one output and the n inputs: s = (a, X 1 , . . ., X n ). This model is simple, but it has the advantage of having a limited number of parameters, and these parameters α i can be easily interpreted because they represent the sensitivity of coffee yield a to the input X i . Figure 5 presents the datasets and methods used. In detail, the yield anomaly is induced from the yield for year t, i.e., yield at the peak of the harvest season H 0 . We then select n potential values from the 2-years time series before H 0 for each weather variable. As mentioned in Sect. 2.1.4, this study uses n = 19; thus, the inputs X i will be chosen from 38 potential inputs (i.e., 19 realizations for each variable P and T).  Other more sophisticated models (e.g., neural network models with different hyperparameters and number of layers) were tested, but this complexity was detrimental, as explained in more detail in the following.

Regularization of the Model
One major difficulty when performing a regression is that the model complexity can be too high compared to the description of the relationships in the database. When the model is too complex, it is said to be over-parameterized. This leads to over-training (or over-fitting), i.e., the model artificially fits the learning database very well, but it is unable to predict data not present in the learning database, meaning that the model is not reliable and cannot be used in practice.
In order to avoid over-training, regularization techniques need to be used. Firstly, we reduced the number of potential inputs by selecting only four important weather variables out of the 38 potential inputs, using correlation score.
Secondly, Principal Component Analysis (PCA) was applied over the four selected variables to compress this information into two inputs. By reducing the number of inputs, PCA reduces the number of parameters in the model, therefore regularizing the model structure and reducing the over-training problem. Furthermore, PCA enables the inclusion of more synthetic inputs, with more robust integrative information on robusta coffee. This preprocessing facilitates the work of the regression, so it helps to regularize the problem. We then compared the two approaches: the four selected weather variables and two PCA components to identify the more robust model that could better deal with over-training.
We assessed the Pearson's correlation coefficient (COR) and Root Mean Square Error (RMSE = 1 n n i 1 (a est (i) − a obs (i)) 2 ) between the estimated yield anomalies (a est ) and the observed yield anomalies (a obs ) as two diagnostics of the model quality. The COR is uniteless and RMSE is in (10 3 kg·ha −1 ).

Assessment of the Model Using the Leave-One-Out
Generally, the model is trained over a learning dataset, and its generalization ability is tested over a so-called generalization database. Since the overall database in this study only includes 19 years (i.e., 19 samples), splitting it into two such databases would be critical. However, with such a limited database, estimating the true generalization ability of the model and determining the optimal model parameters is a real challenge.
The leave-one-out method (Kogan et al., 2013;Zhao et al., 2018;Li et al., 2019) is used here. Given N available samples (i.e., 19 years), the model is calibrated N times, using N − 1 samples (leaving one sample out). The resulting model is tested on the sample not used in the learning. Since this is reproduced N times, we will obtain N generalization estimates for each sample. In the end, N estimated yield anomalies (a est ) will be compared with the real yield observations (a obs ). The same diagnostics COR (i.e., generalization score) and RMSE were considered to measure the model quality over these N samples.

Sensitivity to the Model Spatial Scale
Each region/district differs in terms of elevation, weather, and perhaps even agricultural practices. Thus, calibrating a localized model could be beneficial if the model can take into account these specificities. However, this is not always possible as sufficient yield data are often unavailable or the weather data is too coarse; therefore, it is of interest to evaluate the differences between scales. Also, when working at a large scale, we can compensate for all small region uncertainties, thus gain better results. Hence, working at each spatial scale has advantages and disadvantages, depending on the problems and applications. Therefore, it is necessary to study this scaling aspect and identify at which spatial scale it is more optimal to work on.
In this study, the coffee yield data are available at both provincial and district scales. Therefore, we analyzed our regression model at both scales to better understand the following questions: 1) at which scale can yields be better forecasted, 2) do we gain more when we go to higher spatial resolution. We first set up the two models at the provincial and district levels. However, to perform comparison between the models, we applied the provincial model to the district scale to compare it with the district model, and we also integrated the district model prediction at the provincial scale to compare these results to the provincial model.

Spatial Variability of Weather Data
Large spatial variations can be observed in the weather (including P and T) over the study districts in the Central Highlands ( Figure 4). Overall, Dak Lak and Dak Nong show higher temperatures and lower precipitation than Lam Dong and Gia Lai. The selected Lam Dong districts have the lowest annual mean temperature (2010-2018) and high precipitation. The weather in three study districts in Gia Lai varies greatly, especially for the annual mean temperature ranging from 22.5 to 26°C. Districts with high coffee yield (e.g., Lam Dong districts) are located in high elevation areas, high precipitation, and low/moderate temperature.

Weather-Yield Correlation Analysis
Dak Lak provincial level case study- Figure 6 presents the correlation between the coffee yield anomalies and the weather variables (P and T variables) at the Dak Lak provincial level. (Results of other provinces (i.e., Gia Lai, Dak Nong, Lam Dong). are presented in Supplementary Figure S1). An opposite relation is observed between precipitation and temperature. During the fruit maturation period in Dak Lak from H −2 to H −1 , robusta coffee trees benefit from higher temperature rather than increased water availability, as shown by the positive correlations between temperature and absolute yield anomalies, while the precipitation variables show negative correlations. Additionally, precipitation and temperature differ in their correlation signs at other phenological stages (i.e., bud development, flowering, and fruit development). The end of the rainy season of the previous year (H −13 ) and the peak of the rainy season of the year of harvest (H −4 ) show the highest correlations with yield anomaly, both are positively correlated with precipitation (0.50 and 0.55, respectively) and negatively correlated with temperature (-0.45 and -0.40, respectively).
Variable selection for all selected districts of the Central Highlands- Figure 7 shows the 2 months with the highest correlations between weather variables and yield anomalies, e.g., H −5 and H −13 for Ea H'leo district. In other words, the four selected weather variables (mentioned in Section 2.2.2) are the precipitation and temperature of these 2 months. In addition, these selected months vary from district to district which could be due to the climatic variation among districts. Overall, the most important month (i.e., shown in Figure 7A) ranges from the sixth to the third month before the harvest (H −6 to H −3 ), corresponding to the end of fruit development stage (the bean formation). In details, in the areas at higher elevations (i.e., 600-1,000 m, as shown in Figure 1A), the weather at 6 months ahead of the harvest is the most important variable, while at lower elevations (i.e., 200-600 m), the coffee yield anomalies are impacted more by the precipitation and temperature at H −3 . The second most important month  (i.e., shown in Figure 7B) is related to the vegetative growth and bud development stages, i.e., H −17 to H −12 .

Regression Models
Dak Lak district-level case study- Table 1 shows the learning and generalization scores, or the correlations and RMSE between observed and estimated yield anomalies over the Dak Lak districts, for the two input configurations: the four selected weather variables and the two PCA components.
The correlation for the model using four selected variables is much higher over the training dataset than over the generalization (Table 1). This indicates that over-training is occurring when using too many inputs. For instance, in Buon Ma Thuot, the correlation between a est and a obs reduces approximately by 0.4 and the RMSE by 0.07 × 10 3 kg·ha −1 . On the other hand, the learning and generalization scores are more similar and the RMSE difference is much smaller when using PCA, which means that over-training is reduced. Besides, we obtained better generalization scores when using two PCA components instead of the four selected variables. A more realistic generalization correlation/score (0.55) with a lower RMSE (0.17 × 10 3 kg·ha −1 ) is obtained for robusta coffee in Buon Ma Thuot and the average correlation of 0.49 for the whole province of Dak Lak. It confirms that PCA is a good regularization technique, which will be used in the subsequent models. The generalization correlation ranges from about 0.4 to 0.6, which shows considerable differences among districts. Different environmental conditions can explain these differences, not only between but also within the provinces (Kath et al., 2020).
All selected districts in the Central Highlands- Figure 8A shows the final generalization score, i.e., the correlation  between observed and estimated yield anomalies for generalization over the selected districts in the Central Highlands. The explained variance (in %), which equals the square of the correlation coefficient, is also indicated ( Figure 8B). Figure 8 shows that the model works best over the Dak Lak districts, followed by Lam Dong, while lower generalization scores are found for Kon Tum and Dak Nong districts. Weather explains up to 36% of the variation in robusta coffee yield anomalies, approximately 25% in Dak Lak province on average, and about 16% for the whole Central Highlands. These values could appear to be low, but it should be kept in mind that the objective here is not to forecast the whole coffee yield anomalies, but only their weather-sensitive part. It is possible that weather/climate explains more of the yield variability, if a larger sample were available allowing for a more complex model (i.e., additional predictors that might capture other weather sensitive phenological stages).

Yield Estimation
District level case study in Dak Lak- Figure 9 presents the observed and predicted coffee yield anomalies, as well as the corresponding coffee yield (in 10 3 kg·ha −1 ) time series and its uncertainties in one district of Dak Lak (i.e., Cu M'gar).
As mentioned above, we obtain a correlation of 0.6 between the observation of yield anomalies a obs and its estimation a est in Cu M'gar. Therefore, the model explains 36% (= 0.6 2 ) of the variation in coffee yield anomalies in this district. The correlation between y obs and y est (0.72) is certainly higher than the yield anomalies correlations due to the use of the trend that is known in our case without uncertainty (Figure 9). The prediction seems to work better in the 2000-2011 period compared to the 2012-2018 period.
Assessment of model quality-Besides the yield anomalies, it is also interesting to compare the estimated coffee yield (y est ) and the observed yield (y obs ). The yield forecasting for year t is estimated as y est (t) y(t) + a est (t), where y(t) is the yield trend (Section 2.1.2) and a est (t) is the forecasted yield-anomalies.
In addition to assessing model quality, we compared two cases: the average state given by the long-term trend (i.e., no weather information) and our weather impact model estimation. The average state is the most straightforward guess/prediction without weather information, resulting in a typical year with no weather anomaly. We compared this simple guess to our model estimation to see how much we gain by using weather information. Figure 10 presents the results of the yield anomalies (in 10 3 kg·ha −1 ) and production (in 10 6 kg) over the selected districts in 2011 of the three cases: 1) the true observation, 2) the average state with no weather information, and 3) our weather impact model. As indicated in panel (A2), the true observation of the total production over the selected districts is 935 × 10 6 kg in 2011. For this year, our weather impact model shows a similar estimation compared to the true observation (i.e., 926 × 10 6 kg). Although our model seems to overestimate one Dak Lak district (with an absolute difference of about 16 × 10 6 kg), the integrated prediction is closer to the true observation. Figure 11 shows the generalization score of different models when applying to different spatial scales (i.e., province and district). Firstly, considering the comparison at the provincial These scores indicate that the integration of the district model is more interesting than the provincial model. Similarly, we consider the two cases at the district scale: applying the provincial model to the district scale (A2), using the district model (B2). Although some Lam Dong districts show high generalization scores (0.6 or 0.7), the average score over the whole study area is very low (0.25) in (A2), i.e., the lowest one among four tested cases. Meanwhile, when using the district model, the overall score is much higher (0.39). In short, as expected, for robusta coffee in the Central Highlands, it is more advantageous to use the model calibrated at the district scale as it is able to take advantage of the local specificities, even if more ambitious task.

DISCUSSION
Our study explored how the relationship between phenological processes of the coffee plant (i.e., bud formation and growth, flowering, fruit development, and maturation) and weather affects yield at district and provincial scale in Vietnam. Each phenological stage requires specific weather characteristics for optimal yield. Finally, due to the perennial nature of the coffee plant, climatic effects in the previous year can affect the preceding year's yield. This understanding is essential in the study of weather patterns that affect coffee yield.
Due to the limited availability of coffee yield data, a simple regression model was used to relate weather with robusta coffee yield over the Vietnamese coffee area. The analysis indicates that precipitation and temperature during 1) vegetative growth of the previous year and 2) bean formation are most influential in determining coffee yield anomalies. The exact timing of these variables varies among coffee districts due to the difference in elevation, climate, and/or agricultural practices. Overall, the model can forecast the coffee yield anomalies from three to 6 months before the harvest, 6 months for the districts located at a higher elevation with cooler temperature and therefore slower development rates. However, the accuracy of such a yield forecast and its value for decision-making would need to be tested in more detail and depend on seasonal climate forecasts for the different districts of interest.
Identifying an appropriate statistical model for a relatively complex crop like coffee is not easy, particularly as there are limited observations (only about 20 years of coffee yield). This explains why using regularization strategies such as favoring simple models (e.g., linear regression) over complex ones and applying good quality assessment diagnostics (e.g., PCA, leaveone-out) are required. In particular, we used the leave-one-out method to avoid potential over-training when a limited number of samples is available.
Our findings indicated that, with about 20 years of data, weather could explain up to 36% (from 16 to 25% on average) of the variation in robusta coffee yield anomalies in the Central Highlands. This value is significant as it means that weather can affect up to one quarter of the farmer revenue. Some Lam Dong districts can show higher weather sensitivity. Other coffee districts in Gia Lai or Dak Nong are not so sensitive to weather, which could partially be due to the adequacy of these regions to the coffee production and/or higher stability of the weather in these regions.
Our results for robusta coffee in Dak Lak agree with Kath et al. (2020), who showed that rainfall has a positive effect, while temperature has a negative effect during the growing season (Mar-Sept). One advantage of our approach is that we can point out the month with the highest correlation, while (Kath et al., 2020) only looked at two seasons (flowering and fruit growth). As phenology dynamically responds to prevailing weather conditions, the key moments that affect yield vary in time and space. For instance, Figure 6 shows a positive correlation score from H −9 to H −3 (i.e., the growing season), with the highest score at H −5 for robusta coffee in Dak Lak. However, the different districts have different key moments where weather most affects yield. In our example of Dak Lak, we found a low correlation score of precipitation at H −11 /H −10 (i.e., the beginning of the flowering season) (Figure 6), while Kath et al. (2020) concluded that rainfall has a negative effect on robusta coffee yields during the flowering season (Jan-Mar). It is unclear whether this difference is due to the different spatial scales of used weather data or due to the underlying yield data set.
It also showed that Vietnamese coffee farmers have improved their agricultural practices considerably in the last 10 years, leading to coffee yield becoming less weather sensitive. In addition, by applying our model at different spatial scales, while each scale has its advantages and disadvantages, our model for robusta coffee in the Central Highlands shows more interesting results over the district level.
This study highlights the importance of selecting spatially dynamic weather variables for optimal modelling. This needs to be atomized to apply this model approach at larger scale. Therefore, a practical automatic selection procedure needs to be developed. Secondly, given the relatively high weather sensitivity already detectable with simple linear regression using a small yield database, there is a need to consolidate high quality yield data from all coffee growing regions to enable the development of more complex data driven approaches and elucidate key weather sensitive coffee phenological stages and related yield forecasting skill. We expect to obtain better forecasting scores in regions where coffee is rainfed and thus is more sensitive to the weather. This work can be extended following different approaches. Firstly, other input variables could be analyzed. For example, other agro-climatic indicators, characterizing plant-climate interactions for global agriculture, are available from 1951 to 2099 at the ECMWF (Hersbach et al., 2018). Some variables could be relevant for coffee, such as (very) heavy precipitation days, the maximum number of consecutive dry days, frost days, etc.). The addition of input variables should not come at the cost of over-training; however, these variables could be interesting for further studies, for instance, when applying over other regions where more yield data are available. Secondly, we could study cluster analysis, which helps identify coffee groups with the same environmental characteristics. We can then define the regions suitable for coffee production and analyze their long-term evolution due to