- 1U.S. Geological Survey, MD-DE-DC Water Science Center, Catonsville, MD, United States
- 2U.S. Geological Survey Water Resources Mission Area, Earth System Processes Division, Catonsville, MD, United States
- 3U.S. Geological Survey Water Resources Mission Area, Integrated Information Dissemination Division, Reston, VA, United States
- 4U.S. Geological Survey Water Resources Mission Area, Integrated Information Dissemination Division, Madison, WI, United States
- 5U.S. Geological Survey, WY-MT Water Science Center, Cheyenne, WY, United States
- 6U.S. Geological Survey Water Resources Mission Area, Integrated Modeling and Prediction Division, Bristol, VT, United States
- 7U.S. Geological Survey Water Resources Mission Area, Integrated Information Dissemination Division, Los Angeles, CA, United States
- 8U.S. Geological Survey, WY-MT Water Science Center, Helena, MT, United States
- 9U.S. Geological Survey, Oregon Water Science Center, Portland, OR, United States
- 10U.S. Geological Survey, OK-TX Water Science Center, Austin, TX, United States
- 11U.S. Geological Survey Water Resources Mission Area, Integrated Information Dissemination Division, San Francisco, CA, United States
Forecasts of streamflow drought, when streamflow declines below typical levels, are notably less available than for floods or meteorological drought, despite widespread impacts. We apply machine learning (ML) models to forecast streamflow drought 1–13 weeks ahead at 3,219 streamgages across the conterminous United States. We applied two ML methods (Long short-term memory neural networks; Light Gradient-Boosting Machine) and two benchmark models (persistence; Autoregressive Integrated Moving Average) to predict weekly streamflow percentiles with independent models for each forecast horizon. ML models outperformed benchmarks in predicting continuous streamflow percentiles below 30%. ML models generally performed worse than persistence models for discrete classification (moderate, severe, extreme) but exceeded the benchmark models for drought onset/termination. Performance was better for less intense droughts and shorter horizons, with predictive power for 1–4 weeks for severe droughts (10% threshold). This work highlights challenges and opportunities to advance hydrological drought forecasting and supports a new experimental forecasting tool.
Highlights
• Machine learning models used to forecast weekly streamflow percentiles for 1–13 weeks
• Many models underperformed compared to a benchmark forecasting current conditions
• Models trained to focus on streamflow percentiles <30 best for drought forecasting
• Neural network models outperform benchmarks in predicting drought-related percentiles
• ML models outperform benchmarks in predicting drought termination for 1–4 weeks
Introduction
Drought is a complex phenomenon that poses significant challenges to water resource management across the United States. Drought encompasses various types—meteorological, agricultural, hydrological—with distinct definitions and impacts by sector (Wilhite and Glantz, 1985; American Meteorological Society, 1997; Heim, 2002) necessitating a comprehensive understanding for effective management. Meteorological drought refers to a prolonged period of below-average precipitation, whereas agricultural drought impacts crop production due to insufficient soil moisture. Hydrological drought is defined as a lack of water in the hydrological system, manifesting itself in abnormally low streamflow in rivers and abnormally low levels in lakes, reservoirs, and groundwater (Van Loon, 2015).
Hydrological drought has widespread and recurring impacts on industrial water supply, municipal water supply, hydropower, thermoelectric power, river navigation, irrigation, water quality, and aquatic organisms (Wlostowski et al., 2022). Hydrological drought duration and severity have increased in the southern and western United States during recent decades (Dudley et al., 2020; Hammond et al., 2022), and drought events are projected to be more impactful and widespread by the end of the 21st century given continued changes to precipitation and evapotranspiration dynamics (Cook et al., 2020). Five especially impactful hydrological drought events lasting longer than 3 years and covering more than 50% of the area of the conterminous United States (CONUS) have occurred from 1901 to 2020 (McCabe et al., 2023).
Despite the impacts of hydrological drought for many sectors, there is a notable gap in the provision of accurate and timely information regarding both existing and forecasted hydrological drought conditions, especially when compared to the number of tools available to assess and forecast meteorological and agricultural drought. The importance of addressing this gap has been underscored by federal partners and programs such as the National Oceanic and Atmospheric Administration's National Integrated Drought Information System and multiple Department of the Interior bureaus, including the U.S. Geological Survey—alongside various stakeholder groups such as agricultural organizations, energy utilities, municipal and regional planners, and the general public (Skumanich et al., 2024). Collectively, these parties stress the need for improved monitoring and predictive capabilities related to drought conditions because the implications of drought span economic, social, and ecological dimensions.
Developing a hydrological drought assessment and prediction tool could improve the ability to coordinate management decisions and prepare for potential impacts. Unlike existing precipitation forecasting tools, forecasting streamflow drought requires accounting for storage (snow and groundwater), human modifications (diversions and reservoirs), and complex terrestrial processes with incomplete data for each of these categories. Physically based models are often developed with the goal of representing peak streamflows and/or long-term water budgets, and accuracies of modeled flows often decrease during severe streamflow droughts (Simeone et al., 2024). While there have been improvements in modeling droughts over the past four decades, effective ways to translate and communicate drought information to decision makers and users could still be further developed (Mishra and Singh, 2011).
The U.S. Geological Survey Water Mission Area Drought Program is working to advance early warning capacity for hydrological drought occurrence, onset, and termination at multiple intensity levels using machine learning (ML) models. For the remainder of this paper, we will focus on streamflow drought, a subset of hydrological drought focused on drought in streams and rivers. Specifically, we focus on streamflow drought defined as observations of streamflow that fall below a given threshold for that streamgage location and time of year. With the growing potential for ML models for hydrological applications (Shen, 2018), we investigated ML models for streamflow drought forecasting and compared their performances to simpler persistence and ARIMA models. Given that not all processes are adequately represented in available process models, particularly for periods of drought, we use ML models with the goal to emulate these processes well enough in the internal model states to produce a tool that can provide actionable forecast information.
The goal of this paper is to provide documentation supporting a new operational tool for sub-seasonal to seasonal (S2S) streamflow drought forecasting in the CONUS. This tool was prototyped to complement existing water forecasting tools primarily focused on flooding in the next 10 days such as (1) NOAA National Water Model forecasts (https://water.noaa.gov/assets/styles/public/images/wrn-national-water-model.pdf); (2) National Weather Service (NWS) Hydrological Ensemble Forecast Service (HEFS; https://www.weather.gov/dmx/hefs_info) or seasonal water supply NWS River forecast centers water resources forecasts (https://www.cbrfc.noaa.gov/us/about.html); and (3) National Resource Conservation Service (NRCS) water supply forecasts (https://www.nrcs.usda.gov/resources/data-and-reports/water-supply-forecast-predefined-reports).
The objectives of this paper are to (1) apply ML models to determine feasibility of forecasting drought occurrence, onset, and termination at multiple intensity levels for 1–13 weeks (~1–90 days) in advance, (2) incorporate data and methods to attempt to improve forecast performance in areas with heavily regulated streamflow including areas below dams, (3) document an experimental operational drought assessment and forecast tool that includes forecasts and forecast uncertainty and sets a baseline of performance that can be improved in future work, and (4) identify the maximum number of weeks ahead that the best-performing ML model can reliably forecast drought properties at gaged locations in CONUS. To our knowledge, this is the first study to evaluate ML drought forecasts across > 3,000 gages in CONUS and, particularly, to provide an operational prototype spanning multiple drought intensities and horizons. The model addresses a gap in the availability of information on present and future streamflow drought conditions using a novel approach that predicts departures from typical seasonal conditions.
Background
The complexity of defining drought is a critical factor impacting the ability to manage it effectively. Traditionally, drought has been categorized into several types based on its characteristics and the sectors it affects. Depending on the combination of water use type, management constraints, and location, different ways of defining hydrological drought may be more useful than others (Sarailidis et al., 2019; Skumanich et al., 2024). The varying definitions and thresholds for these types of droughts can lead to confusion and inconsistencies in monitoring and response strategies (Heim et al., 2023; Sutanto and Van Lanen, 2021). The time scale (daily, weekly, monthly) and approach used (seasonally varying threshold vs. fixed thresholds) to identify drought can lead to substantially different quantification of drought. For example, prior work has shown that daily threshold methods identify 25%−50% more drought events than monthly methods and monthly analyses show longer average drought durations (Sutanto and Van Lanen, 2021). Variables like streamflow, soil moisture, and precipitation are often converted to percentiles or standardized indices to identify drought because this provides a way to compare data across different locations and time periods, allowing for consistent drought classification and intensity assessment. The U.S. Drought Monitor (Svoboda et al., 2002) and many state drought plans use thresholds in streamflow percentiles and groundwater percentiles to categorize areas as being in moderate drought (below the 20th percentile), severe drought (below the 10th percentile), and extreme drought (below the 5th percentile), though there is considerable variation in the thresholds and indicators that individual states use to classify periods of drought.
Overview of drought prediction
Predicting droughts (and floods) is challenging for a number of reasons including incomplete process understanding and representation, relatively short observation records compared to the return periods of extreme events, non-stationarity in processes controlling extremes, and incomplete data on human-water interactions (Brunner et al., 2021). Despite these challenges, developments in recent years have led to advances in the ability to improve the accuracy and lead time of meteorological and hydrological drought forecasts, and have suggested future prospects for additional improvements including data assimilation and ML (Fung et al., 2020; Hao et al., 2018). Additionally, Sutanto et al. (2020) found that hydrological drought forecasts outperform meteorological forecasts, motivating a shift in focus toward more relevant forecasting for impacted sectors. Collectively, these studies underscore the need for ongoing research and collaboration to enhance prediction accuracy and mitigate the impacts of extreme weather events.
Prior drought prediction efforts using physically based models
Process-based hydrologic models that simulate the underlying physical processes of the hydrological cycle (e.g., precipitation, evapotranspiration, runoff, and groundwater flow) can be used for drought-specific prediction. National-scale hydrologic models, such as the National Water Model (Cosgrove et al., 2024) and the National Hydrological Model (Regan et al., 2019), have been systematically evaluated for their ability to simulate streamflow droughts across thousands of U.S. Geological Survey (USGS) gages (Simeone et al., 2024). These models can be used to classify drought and non-drought periods and quantify drought severity, duration, and intensity. The National Water Model generally most accurately simulates drought timing while the National Hydrological Model most accurately estimates drought magnitude, and both models estimate drought more accurately in wetter regions. Despite advancements in process model development, challenges remain in simulating the most severe drought events, especially in drier regions, and in capturing the complexity of surface-subsurface interactions (Husic et al., 2025; Johnson et al., 2023; Towler et al., 2023). Ongoing research aims to improve model physical consistency, data assimilation, and integration with ML techniques for better forecasting.
Statistical and simple machine learning models
Statistical models have been widely used for streamflow and drought prediction. Autoregressive Integrated Moving Average (ARIMA) models are frequently used for streamflow and water supply prediction due to their ability to model and forecast time series data with trends and seasonality (Montanari et al., 1997). In streamflow forecasting, ARIMA models have demonstrated reliable performance, often outperforming simpler models, especially for monthly and annual predictions (Modarres, 2007; Sabzi et al., 2017). For hydrological drought prediction, ARIMA models are typically used to forecast streamflow indices such as the Streamflow Drought Index (SDI), providing short-term outlooks on drought conditions (Modarres, 2007; Myronidis et al., 2018). However, while ARIMA models are effective for stationary or near-stationary time series, their accuracy can be limited by the non-stationarity and volatility inherent in streamflow data. To address this, hybrid approaches that combine ARIMA with decomposition techniques or volatility models (such as a Generalized Autoregressive Conditional Heteroskedasticity, or GARCH, model) have been developed, significantly improving prediction accuracy for both high and low flows and better capturing structural breaks and regime changes in streamflow records (Wang et al., 2018, 2023; Ji et al., 2025; Khazaeiathar and Schmalz, 2025). Recent studies have also shown that wavelet-ARIMA models can outperform traditional ARIMA in drought forecasting by better handling multi-scale variability (Rezaiy and Shabri, 2023). Despite the rise of machine learning (ML) and more complex hybrid models, ARIMA remains a valuable tool for streamflow and drought prediction, particularly when data are limited or when a transparent, interpretable model is needed (Kontopoulou et al., 2023). 
In an alternate approach, Austin (2021) used maximum likelihood logistic regression (MLLR) models to forecast the probability of monthly hydrological droughts in streams and rivers across the northeastern United States. These MLLR models use winter streamflow measurements (October–February) to estimate the likelihood of drought conditions in the following summer months (July–September), enabling predictions 5–11 months in advance with up to 97% accuracy. Logistic regression has also been shown to perform competitively with more complex models in recent drought persistence studies (Hussain et al., 2025).
Machine learning for streamflow and streamflow extreme prediction
Recent studies have increasingly focused on leveraging more complex ML techniques including deep learning to enhance daily streamflow prediction, demonstrating significant advancements in both accuracy and applicability. The Long Short-Term Memory (LSTM) neural network approach has been demonstrated to increase accuracies of rainfall-runoff models compared to the SAC-SMA + Snow-17 process-based combination commonly used for streamflow prediction, highlighting the potential of neural networks for hydrological forecasting (Kratzert et al., 2018). Other studies, such as those by Arsenault et al. (2022), Cho and Kim (2022), and Kratzert et al. (2019) highlighted the increased accuracy of LSTMs in continuous streamflow prediction in ungaged basins, further emphasizing the potential of ML in hydrology. LSTM models can effectively represent various types of dammed basins, with smaller dams modeled implicitly and large degree-of-regulation reservoirs explicitly, as long as dammed basins are present in the training dataset (Ouyang et al., 2021).
Using multiple approaches and ML architectures can improve understanding of different aspects of a prediction problem (De la Fuente et al., 2023). While LSTM models are generally more accurate for daily streamflow prediction, ensemble tree models (e.g., random forest models, boosted regression tree models) can produce accurate and unbiased spatial hydrological predictions, offering flexibility and informative maps compared to alternative statistical techniques (Hengl et al., 2018). Both neural network and tree-based models are universal function approximators (Watt et al., 2020) but operate through different means (i.e., neural networks—high dimensional, nonlinear time series; tree-based—highly branched decision making). These models may offer complementary strengths to enable ensembling or model selection (e.g., using different models in different regions or for different aspects of drought prediction).
Several studies have specifically focused on the prediction of streamflow extremes, including floods and droughts, demonstrating the potential for using ML or ML-hybrid models (e.g., Cho and Kim, 2022) to increase accuracies of extreme predictions. Based on model diagnostics, LSTM models predict streamflow with higher accuracy compared to the National Water Model largely because of the channel routing scheme (Frame et al., 2022). Tounsi et al. (2022) demonstrate how hybrid models that combine ML with traditional techniques can increase accuracies of drought predictions by accounting for complex interactions within hydrological systems. Hybrid models combining process-based hydrologic models and ML algorithms increase accuracies of streamflow simulation in diverse catchments across the CONUS, especially where process-based models do not accurately simulate streamflow (Konapala et al., 2020).
In one of the first ML-focused studies on streamflow drought prediction, Hamshaw et al. (2023) utilized LSTMs for regional streamflow drought forecasting up to 2 weeks ahead in the Colorado River Basin, showcasing the effectiveness of these models in capturing complex hydrological patterns and setting a benchmark for their skill in short-term forecasting for gaged and ungaged locations. Frame et al. (2022) showed that deep learning models, such as LSTMs and mass-conserving LSTM variants, can predict extreme rainfall-runoff events more accurately than conceptual and process-based models. Additionally, Eng and Wolock (2022) evaluated various ML methods across the CONUS, confirming their potential to predict low flows more accurately than traditional hydrological models at the annual scale.
Hybrid approaches that combine the process understanding of conceptual or climate models with the predictive power of ML—such as Physically Guided Deep Learning (PGDL) or LSTM-climate model hybrids—have been shown to more accurately simulate the timing and magnitude of extreme events compared to standalone ML and purely process-based models while maintaining physical plausibility in outputs (Bhasme et al., 2022; Vo et al., 2023). These hybrid models can reduce biases and uncertainty and better detect drought or flood occurrences, especially at longer lead times. However, the effectiveness of such constraints depends on the quality of the process-based model, the nature of the extremes, and the specific implementation; in some cases, adding process- or physics-guidance can even reduce model accuracy (Hoedt et al., 2021; Krishnapriyan et al., 2021). As ML methods continue to evolve, their integration into drought prediction frameworks is expected to enhance the understanding and forecasting of drought impacts, ultimately supporting better resource management in increasingly variable climates.
Methods
In this section, we provide details on the site selection criteria (Section 3.1), datasets used and preparation to model inputs (Section 3.2), model setup and versions (Section 3.3), and model evaluation (Section 3.4). In brief, both tree-based and neural network models were trained and evaluated using streamflow and explanatory variable data from the period 2000–2020 for the CONUS (Figure 1). We focused on the 2000–2020 period based on the availability of long-term meteorology reforecast data, which were limiting compared to the longer records available for observed streamflow. These ML models were compared to two benchmark models: (1) a simple persistence model (predicting no change between most recent observation and forecast period) and (2) an ARIMA model. Separate regression models were built to predict weekly streamflow percentiles for each lead time (1, 2, 4, 9, and 13 weeks). Models were evaluated for their performance in matching the correct drought intensity category (moderate drought <20th percentile, severe drought <10th percentile, extreme drought <5th percentile) and for their ability to correctly forecast the timing of onset and termination of drought events, with a focus on 1, 2, 3, and 4 weeks ahead. Model performance is defined as a measure of how accurately a model's outputs match observations.
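The persistence benchmark carries the most recent observed percentile forward unchanged to the forecast date. A minimal sketch of this baseline (illustrative only; the function name and array layout are our own, not the operational code):

```python
import numpy as np

def persistence_forecast(percentiles: np.ndarray, horizon_weeks: int) -> np.ndarray:
    """Persistence baseline: the forecast for week t is the observed weekly
    streamflow percentile from `horizon_weeks` earlier (no change assumed)."""
    forecast = np.full(percentiles.shape, np.nan)
    forecast[horizon_weeks:] = percentiles[:-horizon_weeks]
    return forecast

# Example: a 2-week-ahead persistence forecast for one gage
obs = np.array([0.45, 0.30, 0.18, 0.09, 0.12, 0.25])
fc = persistence_forecast(obs, horizon_weeks=2)
# fc -> [nan, nan, 0.45, 0.30, 0.18, 0.09]
```

Because persistence requires no fitting, it provides a strict lower bar that any trained model must beat to demonstrate skill.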
Figure 1. Conterminous United States (CONUS) streamgages with long-term complete streamflow record used for training streamflow drought forecasting models overlayed on Hydrologic Unit Code (HUC) 2 region boundaries from the Watershed Boundary Dataset (Luukkonen et al., 2024).
Site selection
We selected 3,219 USGS streamgages based on two criteria: (a) streamflow time series were required to include at least 95% of days in each climate year (April 1 to March 31) and (b) streamgages were required to have at least 8 of 10 complete climate years for decades from 1981–2020 (e.g., 2000–2009 and 2010–2019) following the methods in Simeone (2022). Of these sites, 31% were dam-impacted, 31% were ice-impacted, 21% were non-perennial, and 14% were snow-dominated. Not all sites fit one of these categories; the remaining sites are rain-dominated, not ice-impacted, perennial, and without dam influence.
Data and preparation of model inputs
We obtained daily streamflow data from 1981 to 2020 from the USGS National Water Information System (NWIS; U.S. Geological Survey, 2025) using the R package dataRetrieval (Hirsch and DeCicco, 2015; R Core Team, 2024, version 4.4.1). We then converted 7-day average daily observed streamflow to percentile values to identify drought via consistent thresholds across streamgages. We computed de-seasonalized streamflow percentiles, hereafter variable percentiles, with the unbiased Weibull plotting position (e.g., Laaha et al., 2017) using a variable threshold for each day of the year using only the values for a 30-day window surrounding that day from all years of record. The 30-day window was selected to include more seasonally relevant data and to provide a fuller empirical distribution to rank against, which produces a smoother, more continuous percentile time series. We implemented a modified version of the combined threshold level and continuous dry period methods (Simeone et al., 2024; Van Huijgevoort et al., 2012) to handle the zero-flow measurements (<0.00028 cubic meters per second; <0.01 cubic feet per second). This method breaks ties between zero-flow days for percentile rankings based on the number of preceding zero-flow days, where days with more preceding zero-flow days received lower percentile rankings. Figure 2 shows streamflow percentile time series during drought for selected sites spanning highly regulated, intermittent, snow-dominated, and rain-dominated endmembers within our dataset. We note that our definition of drought describes the departure of streamflow from typical values for each week of the year at each site, not necessarily the lowest flows observed during the entire period. By predicting variable streamflow percentiles, we are predicting a deseasonalized time series that allows for the identification of wetter than normal or drier than normal conditions any time of the year.
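To make the variable-threshold computation concrete, the sketch below ranks each 7-day average flow against a pool drawn from a 30-day day-of-year window across all years, using the unbiased Weibull plotting position. It is a simplified stand-in for the published workflow: the zero-flow tie-breaking step and leap-year details are omitted, and the function name is our own.

```python
import numpy as np
import pandas as pd

def variable_percentiles(q7: pd.Series, window_days: int = 30) -> pd.Series:
    """Compute de-seasonalized ("variable") streamflow percentiles.

    For each day of year, the reference distribution pools the 7-day average
    flows within a window centered on that day across all years of record;
    each observation is then ranked against its day's pool with the unbiased
    Weibull plotting position r / (n + 1).
    """
    doy = q7.index.dayofyear.to_numpy()
    half = window_days // 2
    out = pd.Series(np.nan, index=q7.index)
    values = q7.to_numpy()
    for d in np.unique(doy):
        # circular day-of-year distance so the window wraps the year boundary
        dist = np.minimum((doy - d) % 365, (d - doy) % 365)
        pool = values[dist <= half]
        for t in q7.index[doy == d]:
            rank = np.sum(pool <= q7[t])      # Weibull rank within the pool
            out[t] = rank / (len(pool) + 1)
    return out
```

Because each observation is a member of its own pool, the resulting percentiles fall strictly between 0 and 1, matching the unbiased plotting-position property.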
Figure 2. Example streamflow drought occurrence for sites with different streamflow regimes: snow-dominated, USGS 13340000 Clearwater River at Orofino, Idaho; intermittent streamflow, USGS 08408500, Delaware River near Red Bluff, New Mexico; highly regulated, USGS 09050700 Blue River below Dillon, Colorado; rain-dominated, USGS 03528000 Clinch River above Tazewell, Tennessee (U.S. Geological Survey, 2025). Black lines show the 7-day average daily streamflow for the example year, the red line shows the moderate drought threshold represented by the 20th percentile variable (deseasonalized) percentile, orange fill shows periods of streamflow drought, and gray shading shows the 25th to 75th interquartile range of the 7-day average daily streamflow for climate years (April 1—March 31) 1981–2020. The brown dashed line represents the 20th percentile fixed drought threshold, which does not account for seasonality like the variable drought percentile does.
While streamflow percentiles are continuous quantities, droughts are fundamentally events classified by thresholds. We set percentile thresholds of 5%, 10%, and 20% for drought identification, where the 10% flow equates to the flow value that is exceeded 90% of the time. We do not perform pooling (either pre-modeling or post-hoc) in our modeling analysis. The 20%, 10%, and 5% thresholds approximately correspond to the U.S. Drought Monitor's D1 (moderate), D2 (severe), and D3 (extreme) drought classifications, respectively.
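The mapping from continuous percentile to drought class can then be applied as a post-processing step; the helper below is a hypothetical illustration using the study's thresholds, not part of the released tool:

```python
def drought_category(percentile: float) -> str:
    """Classify a variable streamflow percentile (0-1 scale) using the
    5%/10%/20% thresholds, which approximate U.S. Drought Monitor D3/D2/D1."""
    if percentile < 0.05:
        return "extreme"   # ~D3
    if percentile < 0.10:
        return "severe"    # ~D2
    if percentile < 0.20:
        return "moderate"  # ~D1
    return "no drought"
```

Note that the categories are nested: every "extreme" week is also below the severe and moderate thresholds, so evaluation at a given intensity level counts all weeks below that threshold.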
To develop models to predict streamflow percentiles, we prepared watershed average time series of several gridded datasets and watershed average values of static watershed properties including land cover, topography, human landscape and water regulation. For gridded meteorological variables, we used gridMET (Abatzoglou, 2013). For land surface model output on soil moisture, we used NLDAS2 (Mitchell et al., 2004) and for snow water equivalent we used Broxton et al. (2019). Climate teleconnections including the Pacific-North American Pattern (PNA) and El Niño-Southern Oscillation (ENSO) were obtained from https://www.cpc.ncep.noaa.gov/data/teledoc/telecontents.shtml. We obtained forecast meteorology from the Global Ensemble Forecast System (GEFS; Zhou et al., 2017) for 1 to 10 days and from the North American Multimodel Ensemble (NMME; Kirtman et al., 2014) for 1–3 months, while forecast streamflow was obtained from the Global Flood Awareness System (GLOFAS; Alfieri et al., 2013) for 4–9 weeks. Finally, reservoir inflow, storage, and release were obtained for more than 500 sites with long-term records from ResOpsUS (Steyaert et al., 2022). Refer to Supplementary Table S1 for details on the time series datasets used and Supplementary Table S2 for a list of all static watershed attributes, and for a full list of the rolling average time series variables used in developing our models, please refer to the data dictionary provided with model inputs in the accompanying data release (Hammond, 2025).
To summarize briefly, daily values from the sources above were aggregated to weekly scale, with precipitation and potential evapotranspiration summed and temperature, SWE, soil moisture, and SPEI averaged. Variables with monthly values were repeated for subsequent weeks until a new monthly value was available. Once weekly values were calculated, a number of rolling means (ranging from 30 to 365 days) and percentile transformations were performed to create additional model inputs. During model development, feature selection was performed independently for each ML architecture to maximize performance (e.g., including weather forecasts) and reduce complexity (e.g., removing unnecessary watershed attributes). This selection was primarily manual and operated at the data-source level: one variable was retained from each pair of highly correlated variables (e.g., snow water equivalent was retained over remotely sensed snow-covered area), with the operational availability and latency of new data as the major considerations, both essential for operational forecasting problems. Beyond this screening, we did not rely on a priori feature selection; instead, the LSTM and LightGBM architectures internally identified and weighted the most informative variables and optimal variable combinations for each forecasting horizon and drought intensity during training, enabling a data-driven approach to feature importance and model specificity as in Hamshaw et al. (2023) and Dadkhah et al. (2025).
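A condensed sketch of the weekly aggregation and rolling-mean features for a few representative variables (the function, column names, and the specific 4- and 52-week windows shown here are illustrative; the actual pipeline covers many more variables and windows):

```python
import pandas as pd

def weekly_features(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily inputs to weekly values (sums for fluxes such as
    precipitation, means for states such as temperature and SWE), then add
    rolling means as additional model inputs."""
    weekly = pd.DataFrame({
        "precip": daily["precip"].resample("W").sum(),   # summed flux
        "tmean": daily["tmean"].resample("W").mean(),    # averaged state
        "swe": daily["swe"].resample("W").mean(),
    })
    for col in list(weekly.columns):
        # ~30-day (4-week) and ~365-day (52-week) rolling means
        weekly[f"{col}_roll4w"] = weekly[col].rolling(4, min_periods=1).mean()
        weekly[f"{col}_roll52w"] = weekly[col].rolling(52, min_periods=1).mean()
    return weekly
```

The rolling windows give the tree-based models (which lack internal memory) access to antecedent conditions that the LSTM instead receives as raw input sequences.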
Modeling approaches
Using a common set of streamflow data and explanatory variables, we applied the two ML methods and two benchmark model approaches to make weekly forecasts of the streamflow percentile for 1, 2, 4, 9, and 13 weeks ahead. We decided to create independent models forecasting each week horizon rather than creating a model that forecasts multiple weeks at a time. This decision was based on (a) prototype models showing poorer performance when forecasting multiple weeks, (b) a motivation for flexibility in model training and greater performance rather than forecast consistency and fewer models, and (c) broader success of this approach in ML forecasting (Makridakis et al., 2022a,b). We elected to directly predict the target streamflow percentile, which has typical annual seasonal patterns removed. While this was a more difficult modeling task than predicting streamflow, an earlier modeling effort (Hamshaw et al., 2023) found lower model performance when first predicting streamflow and then converting to streamflow percentile as a post-processing step. We also decided to approach drought forecasting as a regression problem—predicting the numeric value of the streamflow percentile—rather than to predict drought classes directly because initial experimentation showed improved drought class prediction when postprocessing predictions of continuous streamflow percentiles. While our percentiles do not explicitly account for long-term trends in streamflow, we account for the potential influence of monotonic trends over the 2000–2020 period in the design of our training and testing splits. We train our models on a central period from October 1st, 2002 to September 30th, 2018, leaving the first part of the record (October 1st, 2000 to September 30th, 2002) and the last part of the record (October 1st, 2018–March 30th, 2020) for model testing.
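The temporal split described above can be expressed as boolean masks over a weekly date index (dates taken from the text; the function name is ours):

```python
import pandas as pd

def split_train_test(index: pd.DatetimeIndex):
    """Temporal split: a central training period, with the earliest and
    latest parts of the record held out for testing, so that any monotonic
    trend over 2000-2020 is sampled at both ends by the test set."""
    train_mask = (index >= "2002-10-01") & (index <= "2018-09-30")
    test_mask = ((index >= "2000-10-01") & (index < "2002-10-01")) | \
                ((index > "2018-09-30") & (index <= "2020-03-30"))
    return train_mask, test_mask
```

Holding out both ends of the record, rather than only the most recent years, is what lets the evaluation probe sensitivity to long-term trends.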
Long short-term memory neural networks
LSTM models are a popular form of recurrent neural networks trained on time series data (Hochreiter and Schmidhuber, 1997). The mathematics behind LSTMs are well documented in a plethora of studies—for one of the more prominent examples, refer to Kratzert et al. (2018). Conceptually, LSTMs learn to preserve older information that is deemed relevant to the present and to forget older information when current data represents noteworthy updates. The LSTM prediction function consists of multiple rounds of nonlinear transformations which distill input data into high dimensional hidden vectors that are optimized for relevance to the model output (here, streamflow percentiles).
Because of the temporal awareness of the LSTM, we did not use input variables that were manually lagged or averaged through time. Instead, we provided LSTMs with sequences of antecedent data. A final affine transformation reduced the LSTM hidden vector into an output vector of size three. The three elements of the output vector correspond to a deterministic prediction and the lower and upper bound of the 90% prediction interval (refer to the Uncertainty Quantification section). We trained one model for the entire CONUS for each forecast horizon independently.
A validation set spanning September 9, 2013 to June 29, 2015 was used to tune LSTM hyperparameters (e.g., dropout rate, hidden dimension size, and early stopping epochs). This period was sufficiently representative of hydrologic drought, in that gage-days below the 10th percentile occurred approximately 10% of the time. “LSTM-all” models were trained to predict all streamflow percentiles, while “LSTM <30” models used a training set limited to streamflow percentiles below 30% (“low percentiles”). Refer to the Uncertainty Quantification subsection for the loss function.
LSTM hyperparameters were optimized using the random search method, which efficiently samples the range of plausible hyperparameters while avoiding the uninformative grid duplication that occurs when sampling hyperparameters with low effect (Bergstra and Bengio, 2012). The final hyperparameters used in this study were a sequence length of 32 weeks (sampled from 13–104 weeks), a dropout rate of 0.0 (sampled from 0 to 1), and a hidden dimension size of 82 (sampled from 32 to 1024). These hyperparameters were applied to all LSTM models. Each LSTM model was trained with an early stopping patience of 5 epochs and a maximum of 1,000 epochs. Different forecast horizons and training sets resulted in different final epochs; this value varied between 2 and 29, with nearer horizons and the LSTM <30 configuration training for more epochs. Training was conducted using Python 3.11 and PyTorch 2.5 (Ansel et al., 2024).
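The random search procedure can be sketched as follows; the sampling ranges mirror those stated above, while the function and variable names are illustrative rather than taken from the study code:

```python
import random

# Ranges from the text: sequence length 13-104 weeks, dropout rate 0-1,
# hidden dimension size 32-1024. Each trial draws independently from each range.
def sample_hyperparameters(rng):
    return {
        "sequence_length": rng.randint(13, 104),
        "dropout_rate": rng.uniform(0.0, 1.0),
        "hidden_dim": rng.randint(32, 1024),
    }

rng = random.Random(0)
trials = [sample_hyperparameters(rng) for _ in range(50)]
```

Because random search samples each dimension independently, a hyperparameter with little effect does not force redundant evaluations the way a full grid would.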
Light gradient boosted models
Decision trees and ensembles of these models (e.g., random forests; gradient boosted decision trees or “GBDTs”) are also ubiquitous in hydrologic ML modeling (e.g., Eng and Wolock, 2022; Goodling et al., 2024; Pham et al., 2021; Ransom et al., 2022; Tokranov et al., 2024). Individual trees learn a series of decision thresholds based on the provided input variables which minimize error for predicting the intended output. Gradient boosting is the process of sequentially learning an ensemble of decision trees where subsequent trees (i+1) are optimized on the residual errors of the previous tree (i). Here, we use LightGBM, an implementation of GBDTs which provides exceptional run times (Ke et al., 2017) in addition to being highly competitive in both deterministic prediction (Makridakis et al., 2022a) and uncertainty quantification (Makridakis et al., 2022b).
By default, decision tree methods have no awareness of time. To remedy this for time series data, it is common to train the model on rolling antecedent summaries of temporal data that provide context and recent memory (Pham et al., 2021). Here, we provided the model with the rolling average values of temporal variables at three antecedent horizons (4, 13, and 52 weeks) along with the most recent observed conditions. Time series predictor variables were provided both untransformed and percentile-transformed using the same approach as the streamflow percentiles (the Weibull plotting position).
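A minimal sketch of this feature construction, assuming a simple list of weekly values (function names are illustrative, not from the study code):

```python
def rolling_means(series, windows=(4, 13, 52)):
    """Rolling average of the most recent `w` values ending at each week."""
    feats = {}
    for w in windows:
        feats[f"mean_{w}wk"] = [
            sum(series[max(0, t - w + 1): t + 1]) / len(series[max(0, t - w + 1): t + 1])
            for t in range(len(series))
        ]
    feats["latest"] = list(series)  # most recent observed condition
    return feats

def weibull_percentile(values):
    """Percentile transform via the Weibull plotting position, rank / (n + 1)."""
    n = len(values)
    return [100.0 * sum(x <= v for x in values) / (n + 1) for v in values]
```

Each rolling window supplies the tree ensemble with a fixed-size summary of antecedent conditions, substituting for the temporal memory an LSTM carries internally.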
We fit one GBDT for each forecast horizon and each prediction target for the entire CONUS. Similar to the LSTMs, we trained for a deterministic prediction and the lower and upper bounds of the 90% prediction interval; unlike the LSTM approach described above, the three outputs from the GBDTs are produced independently by separate models.
These models were developed using the R programming language (R Core Team, 2024) and the “lightgbm” package version 4.5.0 (Shi et al., 2020). The parameters controlling the LightGBM models were evaluated through manual grid-search adjustment to balance computational speed and performance. Experimentation showed relatively low sensitivity of testing period performance to hyperparameter adjustment. Parameter values were left at their defaults except for the following: num_iterations (set to 1,000 after sampling 50, 100, 200, 1,000, and 2,000), num_leaves (63 after sampling 31 and 63), min_data_in_leaf (100 after sampling 20, 100, 200, and 2,000), max_depth (7 after sampling 7 and no max depth), bagging_fraction (0.1 after sampling 0.01, 0.1, 0.5, 0.9, and 1), bagging_freq (10 after sampling 1, 5, 10, and 100), and max_bin (127 after sampling 127 and 255). Predictions were made for quantiles 0.05, 0.5, and 0.95 using the built-in objective = “quantile” and metric = “quantile” inputs and adjusting the parameter alpha accordingly. For reproducibility, the data preparation and modeling workflow in the R language was developed into a pipeline using the targets framework (Landau, 2021; R Core Team, 2024). “LightGBM-all” models were trained to predict all streamflow percentiles, while “LightGBM <30” models were only trained on data for which the observed streamflow percentile was below 30%.
Uncertainty quantification
Understanding forecast uncertainty is crucial for making informed decisions: all forecasts carry a degree of imprecision, and acknowledging that uncertainty helps users weigh potential outcomes and make more effective choices. We incorporated uncertainty quantification into our ML modeling through quantile regression and the pinball loss function (Bassett and Koenker, 1978).
We produced three outputs per forecast horizon: the median point estimate (q = 0.50; ŷ2) and the bounds of the 90% prediction interval (q = 0.05 and 0.95; ŷ1 and ŷ3, respectively). For each forecast horizon, we trained three independent GBDTs and one multi-output LSTM; these differed due to differences in the flexibility of the underlying software. We optimized each GBDT using the simple pinball loss function for the specified quantile (q):

$$\mathcal{L}_q(y, \hat{y}) = \max\left(q\,(y - \hat{y}),\ (q - 1)\,(y - \hat{y})\right)$$

where y is the observed streamflow percentile and ŷ is the corresponding prediction.
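The pinball loss can be computed in a few lines (a sketch for a single prediction, not the study's training code):

```python
def pinball_loss(y, y_hat, q):
    """Pinball (quantile) loss for one observation y and prediction y_hat at quantile q."""
    diff = y - y_hat
    # For a low quantile (small q), the loss is asymmetric: an observation
    # falling below the predicted bound is penalized far more than one above it.
    return max(q * diff, (q - 1) * diff)
```

For the lower bound (q = 0.05), over-prediction is penalized heavily: an observation of 30 against a predicted bound of 50 incurs roughly nineteen times the loss of an observation of 70 against the same bound. Minimizing this loss therefore pushes ŷ toward the q-th quantile of the conditional distribution.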
We optimized the LSTMs using a multi-term loss function for all three quantiles and additional terms to penalize unrealistic crossing of quantiles:
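One plausible form of this loss, consistent with the description above (the penalty weight λ and the exact penalty terms are our notation, not given in the text), is:

```latex
\mathcal{L} = \sum_{i=1}^{3} \mathcal{L}_{q_i}\!\left(y, \hat{y}_i\right)
  + \lambda \left[ \max\!\left(0,\; \hat{y}_1 - \hat{y}_2\right)
  + \max\!\left(0,\; \hat{y}_2 - \hat{y}_3\right) \right]
```

with q1 = 0.05, q2 = 0.50, and q3 = 0.95; the bracketed terms are nonzero only when the predicted quantiles cross.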
Through this custom loss function and multi-output prediction, we aimed to improve the consistency of the LSTM forecasts by reducing the degree of independence between forecast outputs.
Persistence model
Time series often display temporal autocorrelation, where the value at the previous time step is informative of the value at the next. As such, a persistence model is a commonly used baseline against which to evaluate forecast performance (Makridakis et al., 2020, 2022a; Zwart et al., 2023a,b). Here, we contextualized ML forecast performance against a persistence model of the last observed streamflow percentile. In the case of a 4-week forecast horizon, the persistence model forecast for January 31, 2000, would be the observed streamflow percentile from January 3, 2000. We expected this to be a strong baseline for long droughts, but this baseline has no ability to predict the onset or termination of drought, which is a primary motivation for this work. Additionally, we do not attempt to quantify uncertainty with this deterministic baseline model.
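A persistence forecast is trivial to implement (a sketch assuming evenly spaced weekly percentiles; the function name is ours):

```python
def persistence_forecast(observed, horizon):
    """Forecast at week t is the observation from `horizon` weeks earlier.

    The first `horizon` weeks have no forecast (None) because no
    observation that far back exists in the record.
    """
    return [None] * horizon + list(observed[:-horizon])
```

For the 4-week example in the text, the forecast issued for January 31, 2000 is simply the percentile observed on January 3, 2000.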
ARIMA models
We fit gage-level ARIMA models to weekly drought time series using the R-based auto.arima() method (Hyndman et al., 2009). First, we fit models using the training period data. While fitting a model to the training period, auto.arima() automatically selects the optimal ARIMA model parameters (p, d, q) by minimizing the corrected Akaike Information Criterion (AICc), with constraints (p < 5, d < 2, q < 5). These parameters were saved along with the fitted coefficients and used for the step-forward forecasting models during the test period. For each station, a rolling window approach was used in which the fitted model was provided with antecedent data (i.e., the prior p weeks leading up to the current time step) to forecast with the forecast() function (Hyndman et al., 2009) for every horizon up to h, the maximum horizon. In this way, we generated multi-horizon forecast time series for the full test period. Stations with fewer than h observations during the test period were omitted from ARIMA forecasting and the overall model comparison.
Post-processing
While the models predicted the numeric value of the historical streamflow percentile, predictions were not initially limited to the common domain of percentiles (between 0 and 100%). We addressed this by truncating the predictions to this domain in post-modeling steps. We did this because early user feedback indicated that corresponding volumetric streamflow predictions in cubic feet per second (ft³/s) were of interest, but there is no credible way to translate a −5% streamflow percentile into ft³/s using the historical record.
We post-processed LSTM predictions to ensure that the median prediction was equal to or within the bounds of the 90% prediction interval using a simple clamp function that adjusted the median. The custom loss function (refer to section 3.3.3) mostly eliminated this problem, but prior to clamping, we still identified low-magnitude discrepancies (e.g., a median of 20.1% and a prediction interval upper bound of 20.0%). Thus, while prediction targets representing different points in the forecasted distribution could no longer illogically cross each other, those distinct prediction targets could ultimately be equal in rare cases. If the median were equal to one of the prediction interval bounds, the mathematical interpretation would be that our methods predicted a 45% chance of exactly that streamflow percentile.
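Both post-processing adjustments reduce to one-line clamps (a sketch; function names are ours):

```python
def truncate_percentile(p):
    """Truncate any prediction to the valid percentile domain [0, 100]."""
    return min(max(p, 0.0), 100.0)

def clamp_median(lower, median, upper):
    """Force the median prediction into [lower, upper] of the 90% interval."""
    return min(max(median, lower), upper)
```

With these, the low-magnitude discrepancy described above (median 20.1%, upper bound 20.0%) resolves to a median equal to the bound rather than exceeding it.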
We explored empirical distribution matching (Belitz and Stackelberg, 2021) to bias-correct models which failed to predict droughts at an appropriate rate; for example, a model could fail to deviate sufficiently from the mean of 50% streamflow percentile. This method used the training set to determine what remapping of predictions would result in the correct distribution of observed streamflow percentiles. For example, we could determine that a value ≤ 10% should be predicted as often as our model was originally predicting values ≤ 20%, so a prediction of 20% should generally be corrected to a prediction of 10%; this was learned over a grid of all streamflow percentiles. Ultimately, we only found this to be beneficial for the ARIMA models. As a result, only ARIMA results use this bias-correction method.
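A minimal sketch of this remapping in the style of quantile mapping, assuming sorted training-set predictions and observations (the function name and indexing convention are ours):

```python
import bisect
import math

def distribution_match(x, train_preds_sorted, train_obs_sorted):
    """Remap a prediction so corrected predictions follow the observed distribution.

    The prediction's empirical non-exceedance probability under the model's own
    training-set predictions is looked up, then mapped onto the value with the
    same probability in the observed distribution.
    """
    n_pred = len(train_preds_sorted)
    n_obs = len(train_obs_sorted)
    prob = bisect.bisect_right(train_preds_sorted, x) / n_pred
    idx = max(math.ceil(prob * n_obs) - 1, 0)
    return train_obs_sorted[idx]
```

With a training set in which the model predicted ≤ 20% as often as observations fell ≤ 10%, a raw prediction of 20 is corrected down to 10, matching the example in the text.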
Model evaluation
We modeled streamflow drought by predicting streamflow percentiles and converting those percentiles to binary drought classifications using specified thresholds. Through early and repeated experimentation, we found this approach to be more performant than predicting drought classes directly, and this approach provides the opportunity to forecast continuous streamflow at gaged locations with long and complete records of streamflow. We evaluated the models using all three thresholds (i.e., 20%, 10%, and 5%) but focus specifically on the 10% or “severe” threshold in several figures for simplicity.
Due to the nature of droughts as extreme events and the low percentile thresholds, a simple classification accuracy measure provides poor information regarding forecast skill. If a model never predicted drought, it would have accuracies between 80% and 95% for different drought thresholds; these high accuracies would be entirely driven by true negatives (TNs; correctly predicting no drought when no drought occurs).
Cohen's kappa is a more sophisticated measure of classification accuracy which only reports a value above 0 if the model's agreement with observations is higher than a random allocation of those forecast values (Cohen, 1960). In the case of never predicting drought, all predictions are identical, so a random allocation of predictions is the same and has the same skill. Therefore, never predicting drought yields a Cohen's kappa value of 0. This same rationale and result would apply to always predicting drought. Additionally, if we had a model that correctly learned that 10% droughts occur only 10% of the time but forecasted without correlation to real-world droughts, then due to that lack of correlation, it would also have identical skill to a random allocation. Therefore, just learning the correct but uncontextualized rate of drought also yields a Cohen's kappa value of 0.
Cohen's kappa is used synonymously in the field of meteorology under the name “Heidke Skill Score” (Hyvärinen, 2014) and has been used by other benchmarking studies to measure the performance of drought prediction by process-based hydrologic models (Simeone et al., 2024). It can be mathematically expressed as:

$$\kappa = \frac{2\,(TP \times TN - FN \times FP)}{(TP + FP)(FP + TN) + (TP + FN)(FN + TN)}$$
Where true positives (TPs) are correct predictions of droughts, false positives (FPs) are incorrect predictions of drought, and false negatives (FNs) are incorrect predictions of non-drought. Other metrics, such as balanced accuracy and F1 score, are similarly concerned with measuring performance during class imbalance. However, we found Cohen's Kappa to provide the most conservative estimates of performance for our range of TPs (0 to 10%) and TNs (0 to 90%)—refer to Supplementary Figure S2 for visualization. Therefore, Cohen's Kappa provides several desired properties for evaluating drought predictions.
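Expressed over the confusion-matrix counts, and demonstrating the never-predict-drought case discussed above (a sketch):

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa from binary confusion-matrix counts."""
    numerator = 2 * (tp * tn - fn * fp)
    denominator = (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    return numerator / denominator if denominator else 0.0

# A model that never predicts drought over 1,000 weeks at a 10% threshold:
# all 100 droughts are missed (FN) and all 900 non-droughts are correct (TN).
never_predicts = cohens_kappa(tp=0, fp=0, fn=100, tn=900)
```

The numerator is zero whenever predictions carry no information beyond their base rate, so the 90% raw accuracy of the never-predict model collapses to a kappa of zero.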
The simplest application of Cohen's Kappa is to identify the model's skill in predicting a binary “drought” or “not-drought” for each forecast horizon and severity threshold. For clarity, we henceforth refer to this as “overall Kappa.” This application can be used for identifying the horizon at which the model is skillful and for model intercomparison. However, the skill of models to predict when a drought will start and end is a greater motivation for this study. To accomplish this evaluation, we bring together the multiple forecast values and evaluate the first crossing of the threshold (Figure 3). We evaluate the skill of the models to identify the first onset or termination of the drought at horizons of 1–4 weeks, again using a binary metric (“drought onset” or “no drought onset”) and using Cohen's Kappa to quantify onset and termination skill. For clarity, we henceforth refer to this as “onset Kappa” or “termination Kappa.”
Figure 3. Schematic indicating how forecasts at multiple horizons are aggregated to derive drought onset performance. A similar method is used for deriving drought termination performance.
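The aggregation illustrated in Figure 3 can be sketched as a first-crossing search over horizons (the threshold default and function name are illustrative):

```python
def first_crossing(percentiles, threshold=10.0, entering=True):
    """First horizon (1-indexed) at which the forecast series crosses into
    drought (entering=True) or out of drought (entering=False); None if the
    series never crosses within the forecast window."""
    for horizon, p in enumerate(percentiles, start=1):
        in_drought = p <= threshold
        if in_drought == entering:
            return horizon
    return None
```

Onset Kappa then compares the binary event "first crossing forecast within the cutoff horizon" against the same event in the observed series; termination Kappa does the same with entering=False.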
Cohen's Kappa can be used as a model metric for model intercomparison but can be difficult to conceptualize in physical terms. We therefore also report sensitivity and specificity, two commonly applied performance statistics, defined as:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$
Where true positives (TPs), false positives (FPs), false negatives (FNs), and true negatives (TN) represent the outcomes of drought (positive) and non-drought (negative) predictions. Sensitivity is therefore the proportion of all observed drought events that were correctly forecast, while specificity is the proportion of non-drought events that were correctly forecast. An ideal forecasting system will have high sensitivity and specificity.
While our primary evaluation metrics represent the performance of binary drought classes, we also report a metric of the models' ability to describe the numeric percentile value. Due to its usage within multi-component model evaluations, we selected the Kling-Gupta efficiency (KGE) metric (Gupta et al., 2009). This metric incorporates Pearson correlation, bias, and variability between simulated and observed values. We used a version of the metric with a modification that ensures the bias and variability ratios are not cross-correlated (Kling et al., 2012). This goodness-of-fit metric can range from negative infinity to 1, with values closer to 1 indicating better performance. Using the mean of the observed series as the prediction returns a value of −0.41; we use this value as a performance benchmark following Knoben et al. (2019). While typically applied to the full range of the data, we are primarily interested in model performance at low percentiles. We therefore also report KGE for only data where observed streamflow is below the 30th percentile.
Importantly, our model evaluation was indifferent to modeling that only recreated typical seasonality. Because we predict percentiles rather than absolute streamflow, a perfectly seasonal prediction is always 50%. KGE assigns this mean prediction a far-from-one value of −0.41. Likewise, Cohen's Kappa was concerned with classification around the 5, 10, and 20% thresholds, so a constant prediction of 50% (i.e., typical seasonality for that gage) would result in a metric of zero. Similarly, Cohen's Kappa would report a metric of zero if the model recreated deseasonalized streamflow percentile conditions only down to a value above the threshold (for example, 21%).
We performed a first-order examination of potential controls on model performance by relating our overall Kappa metric to several descriptors of each gage site. We selected the following controls for this examination: watershed drainage area; the ratio of annual maximum snow water equivalent to total precipitation (a measure of snow influence); the number of days of mean streamflow that could be stored by reservoirs in the watershed (a measure of reservoir influence); average drought duration over the historical period; the average fraction of time each winter that ice affects streamflow records at the gage (a measure of ice influence); annual average precipitation; the 30-year average number of consecutive days with measurable precipitation; average maximum monthly days of measurable precipitation; and minimum monthly days of measurable precipitation. We correlated performance with each descriptor using the rank-based correlation coefficient (Kendall's Tau).
To quantify model uncertainty and reliability, we evaluated 90% prediction intervals by considering what proportion of observed streamflow percentiles they contained (“capture”) and their average width. Ideally, capture is 90%. Excessively low capture (e.g., 70%) is indicative of unjustified certainty (“overconfidence”) and excessively high capture (e.g., 100%) is indicative of unnecessary width (“underconfidence”). Ideally, width approaches 0 for highly accurate forecasts, but width must necessarily expand where error or modeling difficulty is higher.
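Capture and width reduce to simple counts (a sketch; list inputs assumed aligned by week):

```python
def pi90_metrics(lower, upper, observed):
    """Proportion of observations inside the interval, and mean interval width."""
    n = len(observed)
    capture = sum(lo <= ob <= up for lo, up, ob in zip(lower, upper, observed)) / n
    mean_width = sum(up - lo for lo, up in zip(lower, upper)) / n
    return capture, mean_width
```

Comparing capture against the nominal 0.90 distinguishes overconfident intervals (capture well below 0.90) from underconfident ones (capture near 1.00), while mean width measures how much of that capture is bought with sheer interval size.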
As with other performance metrics, we analyzed uncertainty quantification measures against potential explanatory information—most notably, we later report on how uncertainty quantification varies by observed streamflow percentile and HUC2 region. Model output and the modeling code supporting the experiments in this paper are provided in McShane et al. (2025).
To provide some interpretation of the otherwise opaque models, we computed feature importance for the best LightGBM and LSTM models. The LightGBM software (Shi et al., 2020) permits automatic computation of “gain” feature importance, which quantifies the training set error reduction achieved as a result of the decision splits using a given input. For LSTM models, we computed a measure of permutation feature importance (refer to Molnar, 2025) and reported the percent increase in test set error when a variable is randomly shuffled while all other variables are held constant. Because these results came from different software and described fundamentally different models, their methods and resulting units are not identical, but both quantify which variables the models rely on to generate more accurate predictions (relative to an untrained or disturbed scenario, respectively).
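Permutation importance for any fitted model reduces to re-scoring with one input column shuffled at a time (a sketch; the toy predict function and error metric stand in for the trained LSTM and its test-set error):

```python
import random

def permutation_importance(predict, features, targets, error_fn, seed=0):
    """Percent increase in error when each feature column is shuffled alone."""
    rng = random.Random(seed)
    base_error = error_fn(predict(features), targets)
    importance = {}
    for name in features:
        shuffled = dict(features)
        column = list(features[name])
        rng.shuffle(column)          # destroy this feature's pairing with targets
        shuffled[name] = column
        error = error_fn(predict(shuffled), targets)
        importance[name] = 100.0 * (error - base_error) / base_error
    return importance

# Toy model that only uses feature "a"; mean absolute error as the metric.
def toy_predict(features):
    return [v for v in features["a"]]

def mae(pred, obs):
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)

features = {"a": list(range(100)), "b": list(range(100))}
targets = [v + 1.0 for v in features["a"]]  # base MAE is exactly 1
importance = permutation_importance(toy_predict, features, targets, mae)
```

As expected, shuffling the unused feature "b" leaves error unchanged (0% increase), while shuffling "a" degrades the toy model substantially, mirroring how the metric separates influential from ignored inputs.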
Results
We evaluated two ML model architectures and reference statistical models at multiple scales and with multiple metrics to identify the architecture that most accurately forecasts streamflow drought onset and termination and to convey the expected performance. Model performance is defined as a measure of how accurately a model's outputs match observations. First, we report overall forecast performance metrics, which independently evaluate predicted values against observed values for each horizon at all weekly timesteps (Section 4.1). Next, we report performance metrics derived from aggregating the independent weekly forecasts into streamflow drought events (Section 4.2). We then report performance metrics on the uncertainty estimates made by each model (Section 4.3). Finally, to support our discussion of the utility of the models, we provide feature importance measures (Section 4.4), examples of the forecast time series (Section 4.5), and initial experimentation with including reservoir data where available to improve models in heavily regulated areas (Section 4.6).
Overall performance metrics
First, we examined overall model performance using the distribution of the Cohen's Kappa metric at individual gages within CONUS (Figure 4). The overall performance across all models is worst for extreme (5th percentile) and best for the moderate (20th percentile) droughts. Greater overall performance is also observed for shorter forecast horizons, as might be expected. The persistence model is a strong model, with generally informative (median Cohen's Kappa value greater than zero) forecasts out to at least 9 weeks even for the most extreme percentiles. The best overall performing ML model is the LSTM <30, with informative forecasts out to 4 weeks at the 10th percentile threshold. However, its median performance exceeds the persistence model only at the 1 and 2-week horizon for the 10th percentile threshold. Both configurations of the LightGBM model have similar overall performance to the ARIMA model and, at the 10th and 5th percentile thresholds, match or exceed the performance of the LSTM-all. Overall Cohen's Kappa performance was found to weakly correlate with historic drought duration, intensity, and severity; places with longer and more severe droughts had better overall performance across all models, including the persistence model (Supplementary Figure S1). Other measures characterizing the impact of reservoirs, drainage area, snowiness, and in-stream ice impacts had near zero correlation with the patterns of overall Cohen's Kappa performance metric.
Figure 4. Distribution of the overall Cohen's Kappa performance statistic for the models evaluated in this study. Classification performance reported at the 5th, 10th, and 20th percentile thresholds for each forecast horizon. Boxplots are ordered left to right in same order as legend top-to-bottom.
Similar patterns to the overall Cohen's Kappa performance were apparent for the KGE goodness-of-fit statistic (Figure 5). While the LSTM <30 model did not perform well for all percentiles (as expected), it was the best performer below the 30th percentile. The LSTM <30 model alone was more informative than a long-term mean at forecast horizons greater than 2 weeks. A similar effect was observed for the LightGBM <30 model, though to a lesser extent. Below the 30th percentile, the LightGBM <30 model had similar performance to the persistence model and ARIMA model.
Figure 5. National distribution of the Kling-Gupta Efficiency metric to quantify the performance of the predicted forecasts with observed percentiles. The performance is reported for all percentiles and for only observed percentiles below 30. Some extreme outliers are present within the <30 percentiles KGE values at sites whose percentiles rarely fell below this level; outliers are not shown in this figure. The horizontal dotted line is −0.41, the performance of a mean value (Knoben et al., 2019). Boxplots are ordered left to right in same order as legend top-to-bottom.
National distributions can obscure regional patterns of model performance; to investigate further, we examined regional model performance. For this examination, we focused on the 10th percentile threshold overall Kappa metric and displayed the median, 25th, and 75th percentiles at individual gages within each HUC2 region (Figure 6). The persistence model generally has the best overall performance in most regions and at longer horizons. However, the LSTM <30 model performance is greater at the 1-week horizon for eastern regions and for the Pacific Northwest. All models, including the persistence model, have lower performance within central parts of the country (Upper and Lower Mississippi, Tennessee, Ohio, Great Lakes). The ML models have greater performance at long horizons (>4 weeks) in the dry and mountainous southwestern United States than on the wetter east coast.
Figure 6. Median (point) and 25th–75th quantile range (lines) overall Cohen's Kappa for each model for all gages within each region. The 10th percentile threshold was used for this figure. Sub-panels are generally arranged geographically. Region boundaries shown in Figure 1.
While the overall Kappa metric is an effective distillation of overall classification performance, we also examined the proportion of forecasts within drought and the component parts of classification performance (true positives, false positives, true negatives, and false negatives). For this evaluation, we focused on the 10th percentile threshold (Figure 7). During the testing period, approximately 11 percent of all observations were below the 10th percentile threshold. These observations can result in either a true positive or a false negative; sensitivity indicates the proportion of correct results. Both the LightGBM and LSTM models had greater sensitivity at short horizons than at long horizons, with the best performance from the LSTM <30 model. While the persistence model had greater sensitivity at long forecast horizons than the ML models, it also had the highest number of false positive predictions (lowest specificity). All models except the persistence model predicted fewer drought events at long horizons, indicating a tendency for models at long horizons to predict higher percentiles (Figure 8). Excluding the persistence model, the model least affected by this tendency was the LSTM <30 model.
Figure 7. Classification error components for all gages within CONUS showing the relative proportion of true positives, false positives, and false negatives. Numeric values of sensitivity and specificity are shown. Components for the 10th percentile threshold are shown. True negative is the most common result and is the remainder of the values (adding up to 100%) for each model.
Figure 8. Proportion of forecasted streamflow to be below the 10th percentile threshold for each model as a function of the forecast horizon. The dashed red line indicates the 10th percentile nominal rate, though the true percent of drought occurrences in the model testing period shown is approximately 11%.
Onset and termination performance metrics
Droughts are events that have a beginning and an end. Our duration metrics quantify the ability of multiple independent forecasts to describe the presence and timing of drought onset and termination. For this analysis, we use the ARIMA model as a benchmark because the persistence model has no ability to represent onset and termination. Not all gages experienced multiweek droughts during the testing period, so for this examination, we pool all events nationally or within each region to calculate performance statistics. Nationally, drought onset sensitivity, or the proportion of droughts that we correctly forecasted would appear at some point in the 13-week window, was low; the best model was the LSTM <30 with a sensitivity of about 22% (Figure 9). Within that 22% of correctly forecasted drought onsets, the best onset performance was in distinguishing whether the drought would begin in less than or greater than 1 week. Again, the LSTM <30 model performed best, with a Cohen's Kappa value of about 0.41. All models had a high onset specificity, indicating they correctly forecasted non-drought periods that lasted the full 13-week forecast horizon; the lowest-performing model was the LSTM <30 with 96% onset specificity. Most models had a high termination sensitivity, which is the proportion of droughts that we correctly forecasted would end at some point in the 13-week window. This demonstrates a tendency to predict lower proportions of drought at further horizons. The LightGBM-all model best distinguished whether droughts terminated in less than 1 week or greater than 2 weeks. Models had a lower drought termination specificity, which is the proportion of droughts that we correctly forecasted would last greater than 13 weeks. The persistence and ARIMA baseline models had the best performance, likely reflecting prolonged steady conditions resulting in >13-week droughts. The LSTM <30 model had the second-highest drought termination specificity.
Figure 9. National model performance for predicting the presence and timing of drought onset and termination. The first column displays the sensitivity (proportion of observed events correctly forecast to occur at any horizon) and specificity (proportion of non-events correctly predicted not to occur) for onset and termination. The second column displays the models' ability to correctly identify onset or termination at different horizon cutoffs. The second column only represents the proportion of the data shown in the “sensitivity” panels of the first column because performance can only be computed where onset/termination is both predicted and observed within the forecast horizon.
We also examined regional patterns in drought termination performance because most drought termination events were correctly identified (unlike drought onset events). In general, ML model performance exceeds the ARIMA model at representing termination within almost all regions out to 4 weeks (Figure 10). Drought termination Cohen's Kappa for the best-performing model in most regions was between 0.2 and 0.4. At the 1-week duration threshold, the LightGBM-all model had the best performance in 12 of the 18 regions. Both LightGBM model configurations had similar performance for most regions and durations. At longer durations in five eastern and northern regions (1–2, 4–5, and 17), the LSTM <30 model configuration had greater termination performance. Most regions had greater termination Kappa performance at short durations than at long durations, though for regions in the southwest (13–15), several models performed consistently from 1 to 4 weeks. Contrary to other regions, in the Souris Red Rainy region, Cohen's Kappa values increased through time. Regions 1 and 2 demonstrated elevated termination performance for weeks 3 and 4 using the LSTM <30 model, whereas other regions showed consistent declines in performance with forecast horizon.
Figure 10. Each panel shows the regional termination Cohen's Kappa for droughts lasting up to 4 weeks produced by each model. The right panel indicates the greatest value across all three tile plots. Panels are generally arranged geographically; region boundaries shown in an earlier figure.
Uncertainty quantification metrics
Thus far, we have focused on the performance of the median point estimates produced by our models. We also trained the ML models to produce the bounds of the 90% prediction interval (PI90). We evaluated these PI90 by calculating what proportion of observations they captured (ideally 0.90) and how wide the intervals were.
We found that PI90 capture was not homogeneous across streamflow percentiles (Figure 11). For example, we found that streamflow percentiles close to the training set median were overcaptured (e.g., 1.00) while relatively extreme streamflow percentiles were undercaptured (e.g., ≤ 0.50). The models trained on all streamflow percentiles provided capture closer to 0.90 for more streamflow percentiles, but we found that these models captured drought occurrences too infrequently (e.g., most capture values were between 0.50 and 0.80). Meanwhile, LSTM <30 and LightGBM <30 displayed a more even distribution of capture below and above 0.90 for drought conditions. We found that the LSTM <30 was the only model that provided near-ideal capture out to 9 weeks for severe droughts (5–10% streamflow percentiles).
Figure 11. 90% prediction interval capture by forecast horizon (x-axis) and streamflow percentile bins (y-axis) for all four ML models. A horizontal black line separates streamflow percentiles that belong to our drought definitions. The diverging color mapping is centered on 0.90, the ideal capture proportion for the 90% prediction intervals. The color spacing above 90% is smaller than the color spacing below 90% (e.g., 0.50–0.70 capture is one color and 0.99–1.00 capture is one color); this helps distinguish excessively high capture (1.00) from ideal capture (0.90).
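The two PI90 evaluation metrics used here, empirical capture proportion and mean interval width, are straightforward to compute; a minimal sketch with illustrative values (not data from the study):

```python
def pi90_metrics(obs, lower, upper):
    """Empirical capture proportion and mean width of 90% prediction intervals.

    obs, lower, upper: equal-length sequences of observed streamflow
    percentiles and the predicted 5th/95th-quantile bounds.
    Ideal capture is 0.90 for a well-calibrated 90% interval.
    """
    n = len(obs)
    captured = sum(1 for o, lo, hi in zip(obs, lower, upper) if lo <= o <= hi)
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return captured / n, mean_width

# Illustrative values only; to reproduce the binned evaluation described
# above, group observations by streamflow-percentile bin (or by HUC2
# region) before calling pi90_metrics on each group.
obs   = [8.0, 25.0, 50.0, 3.0]
lower = [2.0, 10.0, 30.0, 5.0]
upper = [20.0, 40.0, 70.0, 15.0]
capture, mean_width = pi90_metrics(obs, lower, upper)
```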
When focusing on low streamflow percentiles (i.e., ≤ 30%), we also found that PI90 capture was not homogeneous across regions (Figure 12). Across models, we saw lower capture for eastern HUC2 regions (i.e., New England, Mid Atlantic, South Atlantic-Gulf, Tennessee, Ohio, and Great Lakes). Additionally, all models except for the LSTM <30 provided lower-than-ideal capture for the Pacific Northwest region across most forecast horizons, and models trained on all streamflow percentiles demonstrated poor capture in the Souris-Red-Rainy region across all forecast horizons. The LSTM <30 (and the LightGBM <30 to a lesser extent) displayed near-ideal capture for the western regions of the country that are most known for drought (e.g., 80% capture out to 9 weeks in the Upper and Lower Colorado, Rio Grande, and Great Basin regions).
Figure 12. 90% prediction interval capture for streamflow percentiles below 30% by forecast horizon (x-axis) and HUC2 region (y-axis) for all four ML models. The diverging color mapping is centered on 0.90, the ideal capture proportion for the 90% prediction intervals. The color spacing above 90% is smaller than the color spacing below 90% (e.g., 0.50–0.70 capture is one color and 0.99–1.00 capture is one color); this helps distinguish excessively high capture (1.00) from ideal capture (0.90).
We found that PI90 width was fairly constant across streamflow percentiles for models trained on all streamflow percentiles, while LightGBM <30 (and LSTM <30, to a lesser extent) produced wider PI90 for higher, out-of-sample streamflow percentiles (Figure 13). Across models, we found that PI90 width increased at later forecast horizons; LightGBM-all, LSTM-all, and LightGBM <30 increased from average widths of 41–54 percentile points at the 1-week horizon to 69–79 at the 13-week horizon. Compared to the other models, PI90 width for the LSTM <30 was lower and less sensitive to forecast horizon, increasing only from 21 to 23 percentile points.
Figure 13. 90% prediction interval width by forecast horizon (x-axis) and streamflow percentile bins (y-axis) for all four ML models. A horizontal black line separates streamflow percentiles that belong to our drought definitions. The continuous color mapping ranges from 17 to 80 with color bins of different sizes (e.g., 17–20 vs. 40–50) to display all models (which have different ranges and sensitivity) at once.
When focusing on low streamflow percentiles (i.e., ≤ 30%), we found that PI90 width displayed some heterogeneity by region (Figure 14). For all models except the LSTM <30, the eastern HUC2 regions (i.e., New England, Mid Atlantic, South Atlantic-Gulf, Tennessee, Ohio, and Great Lakes) again stood out negatively, with wider PI90 at earlier forecast horizons; in addition to the HUC2s noted for PI90 capture, we also observed this for the Lower Mississippi region. Beyond the 4-week horizon, the LSTM-all model produced the widest prediction intervals across all regions. These findings were either not applicable or much less applicable to the LSTM <30, which provided relatively constant prediction interval width across regions and forecast horizons.
Figure 14. 90% prediction interval width for streamflow percentiles below 30% by forecast horizon (x-axis) and HUC2 region (y-axis) for all four ML models. The continuous color mapping ranges from 17 to 80 with color bins of different sizes (e.g., 17–20 vs. 40–50) to display all models (which have different ranges and sensitivity) at once.
Feature importance
This work was designed to maximize ML predictive performance rather than to advance mechanistic understanding of drought. Consequently, our examination of the features contributing to each model was limited to the few most important variables, to document which datasets are necessary to support operational modeling. Both ML methods used in this study can accommodate highly correlated predictor variables, which are certainly present in our dataset; this complicates the interpretation of simple feature importance measures. As a first-order assessment, however, we display the top 5 most important features (Figure 15). For illustration, we present the two best-performing configurations: LSTM <30 and LightGBM-all. Both models rely heavily on antecedent streamflow percentiles, which is not surprising given the strong performance of the persistence model; this reliance is greater for shorter forecasts than for longer forecasts. The LightGBM-all model, which has antecedent rolling mean variables explicitly represented, uses longer rolling windows (365-day and 90-day rolling means) for the 9- and 13-week forecasts and shorter rolling windows (30-day rolling means) for the 2- and 4-week forecasts. The total percent importance represented by the top 5 predictor variables is lower for the LightGBM-all model, which has a greater number of predictor variables, and at further forecast horizons. Both models include the GEFS forecast dataset, which provides meteorological forecast data up to 10 days ahead, within the 5 most important variables at the 1- and 2-week streamflow percentile forecast horizons. The NMME forecast dataset, which provides meteorological forecast data up to 105 days ahead, is within the top 5 most important variables at the 4- and 9-week forecast horizons.
In relative terms, the LightGBM-all model is more reliant on the GEFS and NMME forecast products than the LSTM <30 model, which uses static basin characteristics to a greater degree in the 5 most important variables.
Figure 15. Bar charts displaying the variable importance of the top 5 most important variables for the LightGBM-all and LSTM <30 models developed for 5 forecast horizons. Variable importance is shown as a percent of the total. X-axis values vary among subpanels. Colors indicate category of predictor variable; y axis label indicates variable name. Variable definitions are available in the associated data release of dynamic model feature inputs (Hammond, 2025) and static model feature inputs (McShane et al., 2025). Supplementary Table S1 and S2 contain variables and their sources. Variable streamflow percentiles demonstrate departures from typical values for the time of the year, fixed streamflow percentiles indicate conditions relative to the entire period of record. Precipitation abbreviated to prec. Temperature abbreviated to temp. Percentile abbreviated to %ile.
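The percent-of-total ranking shown in Figure 15 can be sketched as below. The raw importance values (e.g., LightGBM gain) and feature names here are hypothetical, chosen only to illustrate the normalization:

```python
def top_features(importances, k=5):
    """Rank raw feature importances (e.g., LightGBM gain values) and
    express the top-k as a percent of the total importance."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, 100.0 * value / total) for name, value in ranked[:k]]

# Hypothetical raw importances for illustration only
raw = {
    "antecedent_percentile": 420.0,
    "rolling_mean_30d": 180.0,
    "gefs_precip_forecast": 95.0,
    "soil_moisture": 60.0,
    "drainage_area": 30.0,
    "basin_slope": 15.0,
}
top5 = top_features(raw, k=5)
```

Because the percentages are normalized by the total across all features, a model with many predictors (such as LightGBM-all) will naturally concentrate a smaller share of total importance in its top 5, consistent with the pattern described above.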
Forecast examples
In addition to quantitative metrics of model performance, we also examined the qualitative appearance of the forecasts, because this work supports an operational forecasting system intended to inform decision-making. We picked several endmember examples of individual droughts at specific sites to illustrate the expected behavior of a forecast system developed from the models (Figure 16). The endmembers were derived from the 50th quantile predictions and include (a) a majority of models missing onset/termination, (b) a majority of models correctly predicting onset/termination within a 1-week tolerance, and (c) a majority of models mistiming onset/termination by more than 30 days. The 5th and 95th quantile predictions often encompass the true observed value but are wide enough to consistently span the 10th percentile threshold, thus limiting their use in probabilistic onset/termination forecasts.
Figure 16. Example forecasts at six selected USGS gages (corresponding USGS station numbers included at the tops of the graphs, U.S. Geological Survey, 2025) within the dataset to illustrate endmember outcomes of the forecasts. The red vertical line indicates the date of the forecast, with the solid black line indicating antecedent streamflow percentiles and the dotted line indicating observed (future) percentiles. Colored dots are the median forecast value, with vertical bars showing the 5th and 95th percentile predictions. The horizontal dashed black line indicates the 10th percentile drought threshold. “Miss” is when the termination/onset is never predicted by most models but it does occur. “Incorrect” is when there is a large difference between the true and predicted timing of termination/onset.
Evaluating the utility of long-term reservoir observations for improving forecasts in highly regulated areas
We conducted experiments to evaluate the impact of incorporating long-term observations of reservoir storage and outflow into LSTM-all models in the Upper Colorado River Basin, as well as into the CONUS-scale LSTM-all, LSTM <30, LightGBM-all, and LightGBM <30 models. Our goal was to determine whether these additional features could enhance model performance, particularly at sites located directly downstream of reservoirs with heavily managed flows. To do so, we used the ResOpsUS version 2 dataset, which provides time series of reservoir storage and outflow (Steyaert et al., 2022), as supplementary input features for the ML models.
We initially focused our experiments on the Upper Colorado River Basin due to the high availability of ResOpsUS data and the presence of gaged locations that allowed us to evaluate model performance with and without the additional reservoir features. We also included the distance to the nearest upstream reservoir as an input feature, potentially enabling the models to learn how reservoir storage and outflow might affect downstream locations at varying distances. Our experiments from the Upper Colorado River Basin revealed increased accuracies in streamflow discharge predictions when reservoir storage and outflow were included as input features. However, this increase was primarily observed for sites located directly downstream of the reservoirs (Figure 17). The inclusion of reservoir information yielded mixed results for other areas within the Upper Colorado River Basin.
Figure 17. Example hindcast predictions of streamflow discharge using a Long Short-Term Memory (LSTM) model, with and without the inclusion of reservoir storage and outflow time series as input features (a). (b) illustrates the difference in Nash-Sutcliffe Efficiency (NSE; Nash and Sutcliffe, 1970) between the LSTM model that incorporates reservoir time series and the model that does not, plotted against the distance to the nearest upstream reservoir for gaged locations in the Upper Colorado River Basin. The purple diamond in the right panel indicates the gaged location corresponding to the time series displayed in the left panel.
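The with/without-reservoir comparison in Figure 17b reduces to a difference in NSE between two model runs at the same gage. A minimal sketch with hypothetical hindcast values:

```python
def nse(obs, sim):
    """Nash-Sutcliffe Efficiency (Nash and Sutcliffe, 1970):
    1 minus the ratio of model error variance to observed variance.
    NSE = 1 is a perfect fit; NSE < 0 is worse than the observed mean."""
    mean_obs = sum(obs) / len(obs)
    ss_err = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_err / ss_tot

# Hypothetical hindcasts at one gage, with and without reservoir features
obs = [10.0, 12.0, 8.0, 15.0, 9.0]
sim_no_res = [14.0, 9.0, 12.0, 10.0, 13.0]
sim_with_res = [10.5, 11.5, 8.5, 14.0, 9.5]

# Positive delta_nse means the reservoir features improved the hindcast
delta_nse = nse(obs, sim_with_res) - nse(obs, sim_no_res)
```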
....
We also integrated reservoir information from ResOpsUS into the CONUS-scale LSTM and LightGBM models, but we did not observe a substantial increase in overall performance across all gaged locations. While some sites showed improved predictive accuracy (e.g., Figure 17), only 528 of the 3,219 gaged locations in the CONUS model had upstream reservoirs included in the ResOpsUS database, and only a small subset of these were directly downstream. Consequently, we chose not to incorporate reservoir information into our operational models. We suspect that the limited improvement in performance for the CONUS models may be attributed to the scarcity of gaged locations directly downstream from ResOpsUS locations with strong reservoir influences. Nevertheless, the significant increases in performance observed at sites in the Upper Colorado River Basin indicate that further experimentation to explore the optimal integration of reservoir information in ML models could enhance streamflow predictions at these heavily managed sites.
Discussion
In this study, we applied ML models to determine the feasibility of forecasting streamflow drought occurrence, onset, and termination for 1–13 weeks in advance. We evaluated the models and then compared them to benchmarks to understand their accuracy and uncertainty when predicting each drought property for streamflow drought events at gaged locations in the CONUS. Our results show that hydrological drought remains difficult to predict and outperforming a simple persistence model can be difficult. However, ML models provide information that can be used in drought forecasting and elevate the baseline for continued model improvement (Section 5.1). We place our results in the context of prior streamflow drought prediction work (Section 5.2) before explaining the tradeoffs in selecting models for operational streamflow drought forecasting (Section 5.3). Finally, we discuss remaining challenges and opportunities for further work on hydrological drought forecasting (Section 5.4).
Forecasting streamflow drought onset and termination
Several overarching patterns emerged from the evaluation of two ML architectures and two benchmark approaches for making weekly forecasts of streamflow percentile and of streamflow drought onset/termination. First, drought occurrence evaluation using the overall Cohen's Kappa metric revealed that model performance generally decreases for extreme droughts (5th percentile) and increases for moderate droughts (20th percentile; Figure 4). Model performance also tended to decline with increasing forecast lead time, and across CONUS the persistence model had the highest overall Cohen's Kappa for all lead times and intensities except the 1-week severe and extreme intensities, where the LSTM <30 model showed slightly elevated performance. The ML models tested in this study tend to predict less drought at longer forecast horizons, which must be considered when using long-range forecasts for decision making.
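For reference, Cohen's Kappa for binary drought occurrence compares observed agreement with the agreement expected by chance from the class frequencies. A minimal sketch with illustrative weekly drought flags (not data from the study):

```python
def cohens_kappa(obs, pred):
    """Cohen's Kappa for binary drought occurrence (1 = in drought)."""
    n = len(obs)
    p_observed = sum(1 for o, p in zip(obs, pred) if o == p) / n
    # chance agreement from the marginal class frequencies
    p_obs1 = sum(obs) / n
    p_pred1 = sum(pred) / n
    p_chance = p_obs1 * p_pred1 + (1 - p_obs1) * (1 - p_pred1)
    if p_chance == 1.0:
        return 0.0  # degenerate case: agreement is fully explained by chance
    return (p_observed - p_chance) / (1 - p_chance)

# Illustrative weekly flags (1 = percentile below the 10% drought threshold)
observed  = [0, 0, 1, 1, 1, 0, 0, 1]
predicted = [0, 0, 1, 1, 0, 0, 0, 1]
kappa = cohens_kappa(observed, predicted)
```

Because Kappa discounts chance agreement, a model that never predicts drought scores near zero even when droughts are rare and raw accuracy is high; this is why it is the occurrence metric of choice here rather than simple accuracy.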
The persistence model, which simply predicts that current conditions will persist, can be difficult to improve upon in many cases because it captures the inherent slow change in drought conditions. Yet, the LSTM <30 model had the highest performance in forecasting weekly variable streamflow percentiles below 30% (the conditions when a location is approaching drought, in drought, or exiting drought) as indicated by the KGE <30 metric for all forecast periods (Figure 5). Thus, the LSTM <30 model provides the most accurate estimation of streamflow percentiles in a streamflow drought context.
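The persistence benchmark and the KGE <30 metric above can be sketched as follows. The exact filtering convention for KGE <30 (scoring only weeks whose observed percentile is below 30) is an assumption here, and the data values are illustrative:

```python
import math

def kge(obs, sim):
    """Kling-Gupta Efficiency: combines correlation (r), variability
    ratio (alpha), and bias ratio (beta); KGE = 1 is a perfect fit."""
    n = len(obs)
    mo, ms = sum(obs) / n, sum(sim) / n
    so = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    ss = math.sqrt(sum((s - ms) ** 2 for s in sim) / n)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / n
    r = cov / (so * ss)
    alpha, beta = ss / so, ms / mo
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Persistence forecast: the last observed percentile carried forward one week
obs_percentiles  = [45.0, 28.0, 22.0, 12.0, 8.0, 15.0, 35.0]
persistence_fcst = [50.0, 45.0, 28.0, 22.0, 12.0, 8.0, 15.0]

# KGE <30: score only weeks whose observed percentile is below 30
pairs = [(o, s) for o, s in zip(obs_percentiles, persistence_fcst) if o < 30.0]
kge_below_30 = kge([o for o, _ in pairs], [s for _, s in pairs])
```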
Regional analysis of the overall Cohen's Kappa metric at the 10th percentile threshold indicated that the persistence model typically performs best across most HUC2 regions, especially at longer horizons (Figure 6), whereas the LSTM <30 model provides superior short-horizon (1-week) forecasts in several eastern regions and the Pacific Northwest. These regional patterns highlight strong geographic controls on drought predictability: drought was generally more predictable in places with longer, steadier droughts, and harder to predict in wetter places with less seasonal precipitation. Model skill is consistently lowest in the central United States, including the Upper and Lower Mississippi, Ohio, Tennessee, and Great Lakes regions, where flow regimes are strongly influenced by lake storage, reservoir operations, urban runoff, and subsurface storage processes that are not fully represented in the current predictor set. In contrast, ML models exhibit comparatively higher performance at longer horizons (>4 weeks) in the dry and mountainous southwestern regions, where aridity and strong snow-related seasonality enhance the predictability of low-flow conditions. Together, these results suggest that drought forecasting skill is enhanced in regions with more persistent hydroclimatic signals (e.g., arid or snow-dominated basins) and reduced in regions where regulated or storage-dominated hydrology weakens the relationship between antecedent conditions and future flow. These patterns also indicate that the models may be lacking inputs needed to capture important drivers of hydrological response in these regions.
Spatial and process-related heterogeneity significantly affect the performance of streamflow drought forecasting models. Recent studies evaluating the performance of national-scale, process-based models point to lake storage and release, urban runoff, and subsurface storage quantification as key additions to improve model performance in the specified regions (Johnson et al., 2023; Simeone et al., 2024; Husic et al., 2025). Husic et al. (2025) demonstrated that spatial variability in catchment attributes, particularly soil water content, precipitation, and land use, drives large differences in model skill across the contiguous United States. Their interpretable machine learning analysis revealed that low soil water content, especially in arid regions, and anthropogenic features such as high road density and lake area were consistently associated with reduced model accuracy. Similarly, Simeone et al. (2024) found that both the National Water Model (NWM) and National Hydrologic Model (NHM) performed better in wetter eastern regions than in drier western regions, with the NWM more accurately simulating drought timing and the NHM better capturing drought magnitude. However, both models exhibited increased error during the most severe drought events and in regions with complex hydrologic processes, such as those influenced by groundwater or reservoir operations. These findings underscore the need for region-specific calibration, improved process representation, and multi-objective evaluation frameworks to address the challenges posed by spatial and process-related heterogeneity in drought prediction. Further work would be needed to evaluate whether providing ML models with approximations of these variables could improve streamflow drought predictions. Notably, the LSTM <30 model identified severe and extreme drought with higher accuracy than the ARIMA model.
An examination of classification performance components (e.g., true positives and false positives) highlights that while the persistence model has higher sensitivity at long horizons, it also produces more false positives. In contrast, the LightGBM-all and LSTM <30 models exhibit greater sensitivity at short horizons and produce fewer false positives, especially at longer forecast horizons, where they predict fewer drought events even though droughts still occur at a substantial rate.
The persistence model generally performs well but cannot predict a change in drought status. The LightGBM-all and LSTM <30 models generally predicted the onset and termination of drought events nationally more accurately than the ARIMA benchmark model and demonstrated skill within a 4-week forecast horizon. However, all models struggled to correctly forecast the onset of drought, with only 22% of national drought onset events identified by the best-performing model (LSTM <30). Although onset sensitivity values are modest (~20–25%), they are comparable to or improve upon other S2S hydrological forecast systems (Abdelkader et al., 2023; Johnson et al., 2023; Lesinger et al., 2024; Su et al., 2023; Towler et al., 2025) and can still provide actionable early-warning information when used alongside persistence and expert judgment.
Regionally, the LightGBM-all model most accurately predicted drought termination for 11 out of 18 regions at a 1-week duration threshold, while the LSTM <30 model configuration most accurately predicted drought termination in longer durations for specific eastern and northern regions, indicating variability in model effectiveness based on region and duration. The Cohen's Kappa values for regional termination predictions ranged between 0.2 and 0.4 for the best-performing models, suggesting a fair level of agreement in the models' abilities to accurately represent the timing of drought events across different regions and durations.
As an estimate of prediction uncertainty, the evaluation of the 90% prediction intervals (PI90) showed that capture rates varied across streamflow percentiles, with models overcapturing median streamflow percentiles while undercapturing extreme percentiles. Models trained on all streamflow percentiles more accurately simulated a broader range of streamflow values but struggled to accurately predict drought occurrences. The LSTM <30 model demonstrated near-ideal capture rates for severe droughts across all horizons, highlighting the importance of training models on relevant data ranges to improve prediction accuracy for low-streamflow conditions. The analysis revealed significant regional variability in capture rates, particularly in the eastern HUC2 regions and the Pacific Northwest, where models tended to underperform. While all models quantified uncertainty relatively well in the regions most affected by drought, such as the Upper and Lower Colorado and Rio Grande basins (i.e., mostly in the 0.80–0.90 capture range), the prediction intervals were fairly wide across CONUS (e.g., widths of 20–80 percentile points), indicating a need for improved modeling approaches.
This study prioritized enhancing ML predictive performance over advancing mechanistic understanding of drought, leading to a limited examination of the features contributing to each model. A simple analysis of the features with the largest contributions to model predictions showed a notable reliance on antecedent streamflow percentiles, particularly at shorter forecast horizons. Both the LSTM and LightGBM models incorporate meteorological forecast datasets, with the LightGBM-all model relying more on the GEFS and NMME forecast products at short forecast horizons, while the LSTM <30 model emphasized several static basin characteristics and soil moisture in its top features.
Improving upon existing models and setting a benchmark for future improvements
As discussed in the introduction to this paper, drought is a difficult phenomenon to predict. Guidance from prior hydrological drought prediction work (Sutanto et al., 2020; Sutanto and Van Lanen, 2021) and from a series of hydrological drought listening sessions (Skumanich et al., 2024) revealed that streamflow droughts identified using variable streamflow percentiles are generally of interest to a wider array of end users: because variable percentiles are deseasonalized, they can identify abnormally low flows during typically wet seasons, providing early warning of subsequent droughts. Comparison with prior streamflow prediction studies suggests that predicting departures from normal conditions is more difficult than predicting volumetric streamflow, and variable streamflow percentiles, in which the typical seasonal fluctuations of streamflow have been removed, have been shown to be particularly difficult for models to predict. Hamshaw et al. (2023) showed that the median KGE for a daily streamflow prediction model in the Colorado River Basin predicting 1 week ahead was 0.61, whereas models predicting streamflow percentiles retaining seasonality had a median KGE of 0.67 and models predicting deseasonalized streamflow percentiles had a median KGE of 0.43; however, those models did not incorporate forecasted meteorology inputs like those used in the ML models of this study. By comparison, the median KGE for the LSTM <30 one-week forecast of deseasonalized streamflow percentiles was 0.8 for CONUS.
Similarly, Simeone et al. (2024) showed that when daily streamflow from the National Water Model version 2.1 and the National Hydrologic Model was converted to streamflow percentiles, Cohen's Kappa values were typically lower when identifying droughts using variable drought thresholds and deseasonalized streamflow percentiles than when using fixed drought thresholds and streamflow percentiles that retain seasonality. In that easier, non-forecasting setting, the National Hydrologic Model and National Water Model had median Kappa values of 0.43 and 0.47 respectively for moderate drought, 0.34 and 0.37 for severe drought, and 0.24 and 0.26 for extreme drought. Most comparably, the median Kappa values for the LSTM <30 1-week forecast in this study were 0.60 for moderate drought, 0.65 for severe drought, and 0.60 for extreme drought, improving upon the predictions of existing national-scale models. In comparison to the regional patterns in streamflow drought performance from Simeone et al. (2024), which show sharply lower performance for western CONUS compared to eastern CONUS, the models developed in this paper performed more similarly across eastern and western regions.
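The distinction between fixed and variable (deseasonalized) percentiles discussed above can be illustrated with a toy record; the plotting-position convention (percent of values at or below the observation) is an assumption for this sketch:

```python
def percentile_rank(value, sample):
    """Percent of sample values at or below `value` (one of several
    common plotting-position conventions; the exact choice is an
    assumption here, not the convention used in the study)."""
    return 100.0 * sum(1 for s in sample if s <= value) / len(sample)

# Toy record: 10 years of weekly flows for two "weeks" of the year
record = {
    1: [50, 55, 48, 60, 52, 58, 47, 61, 53, 49],  # wet-season week
    2: [10, 12, 9, 15, 11, 14, 8, 16, 12, 10],    # dry-season week
}
all_flows = [q for flows in record.values() for q in flows]

flow_today, week_today = 20, 1
fixed_pct = percentile_rank(flow_today, all_flows)              # vs. whole record
variable_pct = percentile_rank(flow_today, record[week_today])  # vs. same week only
```

Here a flow of 20 sits at the 50th fixed percentile (unremarkable against the full record) but at the 0th variable percentile for the wet-season week, which is exactly the abnormally-low-flow-in-a-wet-season signal that makes deseasonalized percentiles useful for early warning, and also harder to predict.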
Selection of model for an experimental operational application
Given that the persistence model cannot forecast the onset of drought and can only predict a continuation of existing conditions, we only consider deploying the remaining models for operational forecasting. Using overall and event-focused model evaluations to guide model selection for operational use, we identified two models as suitable for operational forecasting.
The LSTM <30 model:
(1) Best predicts streamflow percentiles in and adjacent to drought periods based on KGEs
(2) Outperforms all models except for the persistence model in forecasting 1-, 2-, and 4-week severe drought based on the overall Cohen's Kappa
(3) Overall Cohen's Kappa indicates similar performance to other models for moderate droughts of 1, 2, and 4 weeks, and performance exceeding all but the persistence model for weeks 9 and 13
(4) Has near-ideal capture rates for severe droughts across all horizons
(5) Has the narrowest 90% prediction interval
(6) Has the best performance at predicting drought onset.
The LightGBM-all model:
(1) Best predicts drought termination timing up to 4 weeks.
(2) Slightly outperforms the LightGBM <30 model for overall Cohen's Kappa performance forecasts.
KGE values greater than −0.41 indicate that a model improves upon the mean flow benchmark (Knoben et al., 2019), and the LSTM <30 exceeds this threshold for all forecast periods, with the national interquartile range always above this threshold, but with a considerable drop-off in performance after 4 weeks. Landis and Koch (1977) provide guidelines for interpreting Cohen's Kappa as follows: 0–0.20 slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–1.00 almost perfect agreement. Thus, the LSTM <30 performance for predicting moderate and severe drought indicates substantial agreement for week 1 and moderate to fair agreement out to 4 weeks. The LSTM <30 model simulated drought onset more accurately than the other models, although onset performance was generally poor overall. The LSTM <30 model also predicted drought termination more accurately than other models in the northeastern U.S. However, both LightGBM models simulated drought termination more accurately than the LSTM <30 model in the western United States and at short forecast horizons. Given this assessment, the LSTM <30 model is deployed to provide forecasts of weekly streamflow percentiles (Figure 18). Alongside this model, we also use the LightGBM-all model to generate predictions that may be used side by side to assess the likelihood of future drought termination. Forecast graphics are provided in volumetric streamflow because path analysis and user testing revealed that web map users were most comfortable with, and had better context from, this display.
Figure 18. Web map view and forecast graphic view for an example site, USGS 04121500 Muskegon River at Evart, MI (U.S. Geological Survey, 2025), for forecasts made on September 7, 2025. The forecasting web map (Corson-Dosch et al., 2025; https://water.usgs.gov/vizlab/streamflow-drought-forecasts/) currently shows only predictions made using the LSTM <30 model, with the option to download predictions from both the LSTM <30 and LightGBM-all models.
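The Landis and Koch (1977) bands cited above map directly to agreement labels, which can be encoded as a simple lookup:

```python
def landis_koch(kappa):
    """Agreement label for a Cohen's Kappa value, following the
    Landis and Koch (1977) bands quoted in the text."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"  # values > 1 should not occur for valid Kappa

labels = {k: landis_koch(k) for k in (0.10, 0.35, 0.65)}
```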
Remaining challenges and opportunities for further work on hydrological drought forecasting
Challenges and limitations
Streamflow drought forecasting faces several significant challenges that complicate the accuracy and reliability of predictions. Our operational model has limited forecast skill and greatly increased uncertainty beyond 4 weeks, due to the diminishing importance of antecedent observed streamflow and the degradation of meteorological forecast accuracy (Troin et al., 2021). Our model also performs worse for extreme (5%) droughts than for severe (10%) or moderate (20%) droughts, a limitation for users interested in the most extreme events.
This effort made use of 40-year streamgage records, supported by scientifically calibrated instruments, to define streamflow percentiles. These strict inclusion criteria yield robust definitions and national consistency but lead to sparse spatial coverage. Streamflow data from shorter records and from non-USGS sources were not explored here. Including diverse streamflow records is promising, but it would likely greatly increase workflow complexity, and its limitations for streamflow drought forecasting remain unexplored (e.g., how much accuracy is lost when fewer training data are available, and what record length is too short to define reliable percentiles). The use of unconventional (Goodling et al., 2025), crowdsourced (Jaeger et al., 2019), and citizen science data (Peterson et al., 2024) is another opportunity not explored here. Crowdsourced data may be particularly challenging to incorporate because interpretations of drought can be specific to individual observers or to certain water uses, making such data difficult to harmonize across participants; Sharma et al. (2020) discussed this challenge for lake ice reporting. Ultimately, expansion of this work to ungaged catchments would greatly increase its impact, and developing ungaged-catchment models whose accuracy is estimated with the existing large, consistent dataset may be the most robust direction forward.
Additional data constraints beyond observed streamflow availability may inhibit our ability to generate accurate forecasts. For example, there is a notable lack of comprehensive subsurface storage data, particularly groundwater, across CONUS (Kampf et al., 2020), and subsurface storage data are crucial for understanding drought dynamics. Human modifications to the landscape, such as the management of reservoirs, diversions, and irrigation canals, further complicate the natural flow of streams and rivers (Carlisle et al., 2019), making it challenging to accurately model streamflow or departures from normal streamflow conditions. Reservoir storage and release time series are not available operationally at the national scale required for integration into our model; therefore, forecasts in heavily regulated basins are not expected to perform well, which may limit the models' utility to users. Moreover, the inherent difficulty in capturing sub-seasonal transitions from drought to flood or from flood to drought adds another layer of complexity (Barendrecht et al., 2024; Brunner et al., 2021; Götte and Brunner, 2024; Hammond, 2025) because these transitions can significantly alter streamflow patterns and exacerbate forecasting uncertainties. Models trained on historical data may underperform when making predictions under non-stationary climate or land-use regimes (Song et al., 2024); we expect that the risk of performance loss is minimized by the diversity of watersheds represented in our CONUS-scale training set, and periodic retraining could further reduce performance losses. However, more explicit treatment of nonstationarity, for example via transformer normalization (e.g., Liu et al., 2022; Hua et al., 2025), could improve transferability where nonstationarity is present. Together, these factors highlight the need for improved data collection and modeling approaches to improve streamflow drought forecasting capabilities.
This work was motivated by the need to create a public operational deployment of a drought forecasting ML model. Here we note the challenges of operationalizing the model and communicating its forecasts to the public, as also noted by other authors (e.g., De Burgh-Day and Leeuwenburg, 2023). Our model requires near-real-time ingest and processing of multiple independent datasets (see Supplementary Table S1), each of which is susceptible to downtime, quality issues, and data gaps that could influence our forecasts in unexpected ways. A robust operational data pipeline and forecast quality control will be necessary components of this planned product. This work prioritized maximizing the performance of our ML approaches over interpretability. As a result, our ML-based forecasts are more difficult to interrogate than simpler statistical or physics-based approaches, and stakeholder confidence in ML-based forecasts may take time to build when explanations are limited. However, our decision to embrace complexity in pursuit of predictability is not unique, and this approach is likely the fastest and most accurate path forward given the trajectory of data-driven approaches (refer to Nearing et al., 2020; Zhi et al., 2024).
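The kinds of feed-level checks such a pipeline would need can be sketched simply: staleness, internal gaps, and implausible values are the failure modes named above. The function below is purely illustrative (the operational pipeline's actual checks and thresholds are not described at this level of detail here):

```python
from datetime import datetime, timedelta, timezone

def qc_feed(timestamps, values, max_gap_days=2, max_staleness_days=3):
    """Minimal quality-control screen for one near-real-time input feed.
    Returns a list of issue strings; an empty list means the feed passes.
    Thresholds are illustrative assumptions."""
    now = datetime.now(timezone.utc)
    if not timestamps:
        return ["feed empty"]
    issues = []
    # Staleness: latest record too old to support this week's forecast
    if now - timestamps[-1] > timedelta(days=max_staleness_days):
        issues.append("feed stale")
    # Internal gaps that could silently bias model inputs
    for a, b in zip(timestamps, timestamps[1:]):
        if b - a > timedelta(days=max_gap_days):
            issues.append(f"gap of {(b - a).days} days")
    # Physically implausible values (e.g., negative flow, missing codes)
    if any(v is None or v < 0 for v in values):
        issues.append("implausible or missing values")
    return issues
```

In an operational setting, a nonempty issue list for any feed would flag the affected forecasts for review rather than publishing them silently.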
Despite the advances in drought forecasting capacity offered by our LSTM and LightGBM approaches, our modeling framework omits the most advanced DL methods that recent studies have shown can meaningfully improve streamflow forecasting skill. In particular, emerging hybrid and ensemble architectures—such as physics-informed deep learning, the latest DL architectures, and stacked or multi-stream models—have demonstrated improved ability to capture complex multi-scale dependencies and probe the limits of accurate predictability (e.g., Qian et al., 2025; Granata et al., 2024a,b; Granata and Di Nunno, 2024; Le et al., 2025; Modi et al., 2025; Slater et al., 2023). These approaches often incorporate established scientific knowledge or more complex DL formulations to better represent physical constraints, long-term memory, abrupt regime shifts, and nonlinear hydrologic responses—capabilities that may exceed those of the standalone LSTM or LightGBM models used here. Our decision to train independent single-horizon models and to rely on standard architectures therefore trades potential gains in accuracy and process adherence for an initial baseline characterized by computational efficiency and scalability. As a result, while our framework performs well across CONUS, it does not leverage the full suite of state-of-the-art techniques that may further improve drought-focused streamflow forecasting, particularly under strongly nonstationary or rapidly changing conditions.
Opportunities
These challenges also suggest several opportunities for enhancing streamflow drought forecasting, particularly through the integration of ever-improving meteorological forecasts (Gibson et al., 2020), including those generated by ML algorithms (Kaltenborn et al., 2023; Mouatadid et al., 2023; Nguyen et al., 2023; Yu et al., 2023). By training models on data from diverse locations outside the United States, we could improve the robustness of forecasting tools, because more training data typically increase the accuracy of ML streamflow prediction (Gauch et al., 2021). For longer-term predictions beyond 4 weeks, a shift toward forecasting hydrological drought conditions at a monthly timestep could provide more reliable insights. Expanding forecasting to ungaged locations would allow for a more comprehensive understanding of drought occurrence across regions. Additionally, generating retrospective predictions for ungaged areas could help connect historical drought to both human and ecosystem water availability. Ungaged areas lack observed antecedent streamflow, but regional estimates of antecedent streamflow from gaged locations or from other models could provide this input. Leveraging forecasts from physically based hydrologic models could enhance prediction accuracy by providing estimates of subsurface storage and baseflow. Estimating reservoir storage and release dynamics, as well as fine-scale multi-sector water use, at sites lacking long-term historical records could provide models with data to estimate the effects of human modifications on streamflow conditions.
In addition to streamflow drought predictions and forecasts, the models developed in this study could be adapted to the related problems of low-flow prediction and of forecasting streamflow percentiles that retain seasonality, rather than the deseasonalized percentiles used here for drought prediction. Hamshaw et al. (2023) showed that LSTMs were considerably stronger at predicting the occurrence of flows below static thresholds than below variable thresholds in the Colorado River Basin. Continued experimentation will be needed to evaluate whether developing separate models for different flow extremes (floods vs. droughts) yields the most accurate forecast for each, or whether a single model can accurately predict both extremes without sacrificing performance for either. This latter model would be advantageous in terms of operational simplicity, and perhaps also for more reliably capturing the transitions between extremes. Recent examples highlight opposite extremes occurring in short succession, with communities struggling to recover from the impacts of repeated hydrologic extremes (Barendrecht et al., 2024).
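The static-vs-variable threshold distinction discussed above can be made concrete: a static threshold is one percentile of the whole record, while a variable threshold recomputes the percentile per day of year, removing seasonality before flagging drought. The sketch below is illustrative only; window widths and percentile estimators differ across studies:

```python
import numpy as np

def drought_flags(daily_q, doy, variable_pct=10):
    """Flag drought days under a static threshold vs a variable
    (day-of-year) threshold. Returns two boolean arrays.

    daily_q : array of daily flows
    doy     : matching day-of-year (1-366) for each flow
    """
    daily_q = np.asarray(daily_q, dtype=float)
    doy = np.asarray(doy)
    # Static threshold: one percentile of the whole record
    static = daily_q < np.percentile(daily_q, variable_pct)
    # Variable threshold: percentile computed per day of year,
    # so "drought" means unusually low *for that time of year*
    variable = np.zeros_like(static)
    for d in np.unique(doy):
        sel = doy == d
        variable[sel] = daily_q[sel] < np.percentile(daily_q[sel], variable_pct)
    return static, variable
```

On a seasonal record the two flaggings diverge: the static threshold concentrates drought days in the low-flow season, whereas the variable threshold spreads anomalies across the year.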
Finally, recent advances in hydrological forecasting have highlighted the significant potential of ensemble and meta-learning methods to improve predictive accuracy, generalization, and uncertainty quantification. Ensemble learning approaches—such as boosting, bagging, and stacking—have been effectively applied to streamflow and temperature prediction, often combining diverse machine learning models like LSTM, XGBoost, and Random Forest to capture complex hydrological dynamics and reduce model bias (Tosan et al., 2025). These methods also enhance uncertainty quantification, as demonstrated in stream temperature forecasting across unmonitored basins (Willard and Varadharajan, 2025). Additionally, machine learning post-processing of ensemble forecasts has shown improved skill and reliability over traditional hydrometeorological models (Sharma et al., 2023). On the meta-learning front, recent work has introduced frameworks that automate model selection based on catchment-specific features, enabling more accurate river water level predictions in diverse environments. Transformer-based meta-learning models like MetaTrans-FSTSF have demonstrated strong performance in few-shot flood forecasting scenarios, addressing data scarcity and enhancing adaptability (Jiang et al., 2025). Together, these developments underscore the growing role of modeling strategies in advancing hydrological science.
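The stacking idea referenced above reduces, in its simplest form, to fitting a meta-learner on the base models' predictions. The sketch below uses a least-squares meta-learner over hypothetical base forecasts; real stacking would use out-of-fold base predictions and a held-out meta-training set, and `stack_forecasts` is an illustrative name, not a method from the cited works:

```python
import numpy as np

def stack_forecasts(base_preds, y, new_preds):
    """Combine base-model forecasts with a least-squares meta-learner.

    base_preds : (n, k) matrix of k base models' training predictions
    y          : (n,) observed values
    new_preds  : (m, k) base predictions to combine for new cases
    """
    # Meta-learner: linear weights (plus a bias term) fit on how well
    # each base model tracked the observations
    X = np.column_stack([base_preds, np.ones(len(base_preds))])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    Xn = np.column_stack([new_preds, np.ones(len(new_preds))])
    return Xn @ w
```

In practice the meta-learner is often itself a nonlinear model, but even this linear version shows the mechanism: base models that were systematically biased receive compensating weights.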
Conclusions
Given the need for information on current and forecast hydrological drought conditions, we developed an experimental ML tool to forecast streamflow drought occurrence, onset, and duration for more than 3,000 gaged locations across the conterminous United States. We tested two ML model architectures (LSTM and LightGBM), each in two configurations (trained on all percentiles or only on percentiles below 30), and compared their performance to two benchmark models (persistence and ARIMA). We found the LSTM <30 configuration to be the overall best-performing ML model for predicting drought occurrence and onset, with narrower and more accurate prediction intervals. However, this model did not outperform the persistence model for drought occurrence and, despite outperforming the benchmark models for drought onset, correctly predicted onset only 22% of the time. Both ML architectures tended to predict drought occurrence at low rates at longer forecast horizons, resulting in artificially short drought durations. Both ML models relied strongly on antecedent streamflow at shorter forecast horizons, with forecast meteorology playing a growing role at longer horizons.
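For readers unfamiliar with onset scoring, the statistic behind a result like "correctly predicted onset 22% of the time" can be illustrated as the fraction of observed drought onsets (non-drought to drought transitions) matched by a predicted onset in the same week. This is a simplified stand-in; the paper's matching rules and tolerance window may differ:

```python
def onset_hit_rate(obs, pred):
    """Fraction of observed drought onsets (False -> True transitions)
    matched by a predicted onset in the same week.

    obs, pred : equal-length sequences of weekly drought booleans
    """
    # Indices where an observed drought begins
    onsets = [i for i in range(1, len(obs)) if obs[i] and not obs[i - 1]]
    if not onsets:
        return float("nan")  # no observed onsets to score
    hits = sum(1 for i in onsets if pred[i] and not pred[i - 1])
    return hits / len(onsets)
```

Scoring transitions rather than weekly states is stricter than overall occurrence accuracy, which is one reason onset skill can lag occurrence skill.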
The models described here support a new product (https://water.usgs.gov/vizlab/streamflow-drought-forecasts) that provides previously unavailable streamflow drought forecasts to enable the public to better anticipate and prepare for hydrological drought impacts. By producing operationally relevant drought forecasts at more than 3,000 gaged sites, the modeling system fills a long-standing gap between meteorological drought outlooks and the hydrologic conditions that directly govern water supply, ecological stress, and low-flow hazards. The models' dependence on streamflow memory at short lead times and meteorological outlooks at longer horizons underscores the importance of both sustained in-situ observations and continued improvements in subseasonal climate prediction.
These advances are tempered by several limitations. Both ML models underpredicted drought occurrence at longer lead times, leading to shorter estimated drought durations, and drought onset remained difficult to forecast accurately. Skill in regulated basins is further constrained by the absence of high-resolution reservoir operations and water-use data, which limits the models' ability to represent human influences on low flows. Additionally, forecasting deseasonalized streamflow percentiles—while conceptually advantageous for detecting hydrologic anomalies—remains inherently challenging during periods of rapid hydrologic transition.
These limitations point to clear avenues for future research. Improvements in model objectives, training strategies, and post-processing could help address systematic underprediction. Integrating more advanced meteorological forecasts and incorporating physical information—such as soil moisture, groundwater storage, and baseflow estimates from process-based hydrologic models—may enhance performance, particularly at longer lead times. Extending the framework to ungaged basins through regionalization or hybrid ML–process modeling would substantially broaden its operational relevance. Looking forward, developing unified models capable of addressing both drought and flood conditions may further strengthen preparedness for the rapid hydroclimatic shifts expected under a warming climate.
Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary material.
Author contributions
JH: Supervision, Writing – review & editing, Investigation, Conceptualization, Writing – original draft, Funding acquisition, Project administration. PG: Writing – original draft, Writing – review & editing. JD: Writing – review & editing, Writing – original draft. HC-D: Writing – review & editing, Writing – original draft. AH: Writing – original draft, Writing – review & editing. SH: Writing – review & editing, Writing – original draft. RM: Writing – original draft, Writing – review & editing. JR: Writing – original draft, Writing – review & editing. RS: Writing – review & editing, Writing – original draft. CS: Writing – original draft, Writing – review & editing. ES: Writing – original draft, Writing – review & editing. LS: Writing – original draft, Writing – review & editing. DW: Writing – review & editing, Writing – original draft. MW: Writing – review & editing, Writing – original draft. KW: Writing – original draft, Writing – review & editing. JZ: Writing – original draft, Writing – review & editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This research was funded by the U.S. Geological Survey (USGS) Water Availability and Use Science Program as part of the Water Resources Mission Area Data-Driven Drought Prediction Project. Computing resources were provided by USGS Cloud Hosting Solutions.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frwa.2025.1709138/full#supplementary-material
References
Abatzoglou, J. T. (2013). Development of gridded surface meteorological data for ecological applications and modelling. Int. J. Climatol. 33, 121–131. doi: 10.1002/joc.3413
Abdelkader, M., Temimi, M., and Ouarda, T. B. (2023). Assessing the national water model's streamflow estimates using a multi-decade retrospective dataset across the contiguous United States. Water 15:2319. doi: 10.3390/w15132319
Alfieri, L., Burek, P., Dutra, E., Krzeminski, B., Muraro, D., Thielen, J., et al. (2013). GloFAS–global ensemble streamflow forecasting and flood early warning. Hydrol. Earth Syst. Sci. 17, 1161–1175. doi: 10.5194/hess-17-1161-2013
American Meteorological Society (1997). Policy statement. Bull. Am. Meteorol. Soc. 78, 847–852. doi: 10.1175/1520-0477-78.5.847
Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., et al. (2024). Pytorch 2: faster machine learning through dynamic python bytecode transformation and graph compilation. Proc. ACM Int. Conf. Archit. Support Program. Lang. Oper. Syst. 2, 929–947. doi: 10.1145/3620665.3640366
Arsenault, R., Martel, J. L., Brunet, F., Brissette, F., and Mai, J. (2023). Continuous streamflow prediction in ungauged basins: Long Short-Term Memory Neural Networks clearly outperform hydrological models. Hydrol. Earth Syst. Sci. 27, 139–158. doi: 10.5194/hess-27-139-2023
Austin, S. H. (2021). Forecasting Drought Probabilities for Streams in the Northeastern United States. U.S. Geological Survey Scientific Investigations Report 2021–5084. Reston, VA: U.S. Geological Survey.
Barendrecht, M. H., Matanó, A., Mendoza, H., Weesie, R., Rohse, M., Koehler, J., et al. (2024). Exploring drought-to-flood interactions and dynamics: a global case review. WIREs Water 11:e1726. doi: 10.1002/wat2.1726
Bassett, G., and Koenker, R. (1978). Regression quantiles. Econometrica 46, 33–50. doi: 10.2307/1913643
Belitz, K., and Stackelberg, P. E. (2021). Evaluation of six methods for correcting bias in estimates from ensemble tree machine learning regression models. Environ. Model. Softw. 139:105006. doi: 10.1016/j.envsoft.2021.105006
Bergstra, J., and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305.
Bhasme, P., Vagadiya, J., and Bhatia, U. (2022). Enhancing predictive skills in physically-consistent way: physics informed machine learning for hydrological processes. J. Hydrol. 615:128618. doi: 10.1016/j.jhydrol.2022.128618
Broxton, P., Zeng, X., and Dawson, N. (2019). Daily 4 km Gridded SWE and Snow Depth from Assimilated In-Situ and Modeled Data over the Conterminous US, Version 1. Boulder, CO: NASA National Snow and Ice Data Center Distributed Active Archive Center.
Brunner, M. I., Slater, L., Tallaksen, L. M., and Clark, M. (2021). Challenges in modeling and predicting floods and droughts: a review. WIREs Water 8:e1520. doi: 10.1002/wat2.1520
Carlisle, D. M., Wolock, D. M., Konrad, C. P., McCabe, G. J., Eng, K., Grantham, T. E., et al. (2019). Flow modification in the Nation's streams and rivers. U.S. Geological Survey Circular 75:1461. doi: 10.3133/cir1461
Cho, K., and Kim, Y. (2022). Improving streamflow prediction in the WRF-Hydro model with LSTM networks. J. Hydrol. 605:127297. doi: 10.1016/j.jhydrol.2021.127297
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46. doi: 10.1177/001316446002000104
Cook, B. I., Mankin, J. S., Marvel, K., Williams, A. P., Smerdon, J. E., and Anchukaitis, K. J. (2020). Twenty-first century drought projections in the CMIP6 forcing scenarios. Earth's Future 8:e2019EF001461. doi: 10.1029/2019EF001461
Corson-Dosch, H., Watkins, D., Ross, J., Bucknell, M., Simeone, C., Hammond, J., et al. (2025). Streamflow Drought Assessment and Forecasting Tool. U.S. Geological Survey software release. Reston, VA: U.S. Geological Survey.
Cosgrove, B., Gochis, D., Flowers, T., Dugger, A., Ogden, F., Graziano, T., et al. (2024). NOAA's National Water Model: advancing operational hydrology through continental-scale modeling. JAWRA J. Am. Water Resour. Assoc. 1–26. doi: 10.1111/1752-1688.13184
Dadkhah, A., Hamshaw, S. D., van der Heijden, R., and Rizzo, D. M. (2025). A spatiotemporal interrogation of hydrologic drought model performance for machine learning model interpretability. Water Resour. Res. 61:e2024WR039077. doi: 10.1029/2024WR039077
De Burgh-Day, C. O., and Leeuwenburg, T. (2023). Machine learning for numerical weather and climate modelling: a review. Geosci. Model Dev. 16, 6433–6477. doi: 10.5194/gmd-16-6433-2023
De la Fuente, L. A., Gupta, H. V., and Condon, L. E. (2023). Toward a multi-representational approach to prediction and understanding, in support of discovery in hydrology. Water Resour. Res. 59:e2021WR031548. doi: 10.1029/2021WR031548
Dudley, R. W., Hirsch, R. M., Archfield, S. A., Blum, A. G., and Renard, B. (2020). Low streamflow trends at human-impacted and reference basins in the United States. J. Hydrol. 580:124254. doi: 10.1016/j.jhydrol.2019.124254
Eng, K., and Wolock, D. M. (2022). Evaluation of Machine Learning Approaches for Predicting Streamflow Metrics Across the Conterminous United States. U.S. Geological Survey Scientific Investigations Report 2022–5058. Reston, VA: U.S. Geological Survey.
Frame, J. M., Kratzert, F., Klotz, D., Gauch, M., Shelev, G., Gilon, O., et al. (2022). Deep learning rainfall–runoff predictions of extreme events. Hydrol. Earth Syst. Sci. 26, 3377–3392. doi: 10.5194/hess-26-3377-2022
Fung, K. F., Huang, Y. F., Koo, C. H., and Soh, Y. W. (2020). Drought forecasting: a review of modelling approaches 2007–2017. J. Water Clim. Change 11, 771–799. doi: 10.2166/wcc.2019.236
Gauch, M., Mai, J., and Lin, J. (2021). The proper care and feeding of CAMELS: how limited training data affects streamflow prediction. Environ. Model. Softw. 135:104926. doi: 10.1016/j.envsoft.2020.104926
Gibson, P. B., Waliser, D. E., Goodman, A., DeFlorio, M. J., Monache, L. D., and Molod, A. (2020). Subseasonal-to-Seasonal hindcast skill assessment of ridging events related to drought over the western U.S. J. Geophys. Res. Atmos. 125. doi: 10.1029/2020JD033655
Goodling, P., Belitz, K., Stackelberg, P., and Fleming, B. (2024). A spatial machine learning model developed from noisy data requires multiscale performance evaluation: predicting depth to bedrock in the Delaware river basin, USA. Environ. Model. Softw. 179:106124. doi: 10.1016/j.envsoft.2024.106124
Goodling, P. J., Fair, J. H., Gupta, A., Walker, J. D., Dubreuil, T., Hayden, M., et al. (2025). Technical note: a low-cost approach to monitoring relative streamflow dynamics in small headwater streams using time lapse imagery and a deep learning model. Hydrol. Earth Syst. Sci. 29, 6445–6460. doi: 10.5194/hess-29-6445-2025
Götte, J., and Brunner, M. I. (2024). Hydrological drought-to-flood transitions across different hydroclimates in the United States. Water Resour. Res. 60:e2023WR036504. doi: 10.1029/2023WR036504
Granata, F., and Di Nunno, F. (2024). Forecasting short-and medium-term streamflow using stacked ensemble models and different meta-learners. Stoch. Environ. Res. Risk Assess. 38, 3481–3499. doi: 10.1007/s00477-024-02760-w
Granata, F., Di Nunno, F., and Pham, Q. B. (2024b). A novel additive regression model for streamflow forecasting in German rivers. Results Eng. 22:102104. doi: 10.1016/j.rineng.2024.102104
Granata, F., Zhu, S., and Di Nunno, F. (2024a). Advanced streamflow forecasting for Central European Rivers: the cutting-edge Kolmogorov-Arnold networks compared to Transformers. J. Hydrol. 645:132175. doi: 10.1016/j.jhydrol.2024.132175
Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F. (2009). Decomposition of the mean squared error and NSE performance criteria: implications for improving hydrological modelling. J. Hydrol. 377, 80–91. doi: 10.1016/j.jhydrol.2009.08.003
Hammond, J. C. (2025). Model Inputs for Machine Learning Models Forecasting Streamflow Drought Across the Conterminous United States. Reston, VA: U.S. Geological Survey data release.
Hammond, J. C., Simeone, C., Hecht, J. S., Hodgkins, G. A., Lombard, M., McCabe, G., et al. (2022). Going beyond low flows: streamflow drought deficit and duration illuminate distinct spatiotemporal drought patterns and trends in the US during the last century. Water Resour. Res. 58:e2022WR031930. doi: 10.1029/2022WR031930
Hamshaw, S. D., Goodling, P., Hafen, K., Hammond, J., McShane, R., Sando, R., et al. (2023). “Regional streamflow drought forecasting in the Colorado River Basin using Deep Neural Network models.” in SEDHYD (Reston, VA: U.S. Geological Survey).
Hao, Z., Singh, V. P., and Xia, Y. (2018). Seasonal drought prediction: advances, challenges, and future prospects. Rev. Geophys. 56, 108–141. doi: 10.1002/2016RG000549
Heim, R. R. (2002). A review of twentieth-century drought indices used in the United States. Bull. Am. Meteorol. Soc. 83, 1149–1166. doi: 10.1175/1520-0477-83.8.1149
Heim, R. R. Jr., Bathke, D., Bonsal, B., Cooper, E. W., Hadwen, T., Kodama, K., et al. (2023). A review of user perceptions of Drought indices and indicators used in the Diverse climates of North America. Atmosphere 14:1794. doi: 10.3390/atmos14121794
Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B. M., and Gräler, B. (2018). Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ 6:e5518. doi: 10.7717/peerj.5518
Hirsch, R. M., and DeCicco, L. (2015). User Guide to Exploration and Graphics for RivEr Trends (EGRET) and dataRetrieval: R Packages for Hydrologic Data. USGS Numbered Series. Reston, VA: U.S. Geological Survey.
Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780. doi: 10.1162/neco.1997.9.8.1735
Hoedt, P., Kratzert, F., Klotz, D., Halmich, C., Holzleitner, M., Nearing, G. S., et al. (2021). “MC-LSTM: Mass-Conserving LSTM,” in Proceedings of the 38th International Conference on Machine Learning, 139, 4275–4286. Available online at: https://proceedings.mlr.press/v139/hoedt21a.html (Accessed September 24, 2025).
Hua, Q., Wang, X., Zhang, F., and Dong, C.-R. (2025). “Handling non-stationarity with distribution shifts and data dependency in time series forecasting,” in Parallel and Distributed Computing, Applications and Technologies, Eds. Y. Li, Y. Zhang, and J. Xu (Singapore: Springer Nature Singapore), 97–106.
Husic, A., Hammond, J., Price, A. N., and Roundy, J. K. (2025). Interrogating process deficiencies in large-scale hydrologic models with interpretable machine learning. Hydrol. Earth Syst. Sci. 29, 4457–4472. doi: 10.5194/hess-29-4457-2025
Hussain, A., Niaz, R., Almazah, M. M. A., Al-Rezami, A. Y., Cheng, H., and Tariq, A. (2025). Utilizing logistic regression and random forest to model meteorological drought persistence across seasonal transitions. Earth Syst. Environ. doi: 10.1007/s41748-025-00682-3
Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., Kuroptev, K., et al. (2009). forecast: Forecasting Functions for Time Series and Linear Models. R Package.
Hyvärinen, O. (2014). A probabilistic derivation of Heidke skill score. Weather Forecast. 29, 177–181. doi: 10.1175/WAF-D-13-00103.1
Jaeger, K. L., Sando, R., McShane, R. R., Dunham, J. B., Hockman-Wert, D. P., Kaiser, K. E., et al. (2019). Probability of streamflow permanence model (PROSPER): a spatially continuous model of annual streamflow permanence throughout the Pacific Northwest. J. Hydrol. X 2:100005. doi: 10.1016/j.hydroa.2018.100005
Ji, S., Xu, B., Sun, Y., Mo, R., Wang, S., and Lu, P. (2025). Mixed vector autoregression and GARCH–Copula approach for long-term streamflow probabilistic forecasting. Stoch. Environ. Res. Risk Assess. 39, 1039–1057. doi: 10.1007/s00477-025-02906-4
Jiang, J., Chen, C., Lackinger, A., Li, H., Li, W., Pei, Q., et al. (2025). MetaTrans-FSTSF: a transformer-based meta-learning framework for few-shot time series forecasting in flood prediction. Remote Sens. 17:77. doi: 10.3390/rs17010077
Johnson, J. M., Fang, S., Sankarasubramanian, A., Rad, A. M., Kindl da Cunha, L., Jennings, K. S., et al. (2023). Comprehensive analysis of the NOAA National Water Model: a call for heterogeneous formulations and diagnostic model selection. J. Geophys. Res. Atmos. 128:e2023JD038534. doi: 10.1029/2023JD038534
Kaltenborn, J., Lange, C., Ramesh, V., Brouillard, P., Gurwicz, Y., Nagda, C., et al. (2023). Climateset: a large-scale climate model dataset for machine learning. Adv. Neural Inf. Process. Syst. 36, 21757–21792.
Kampf, S. K., Burges, S. J., Hammond, J. C., Bhaskar, A., Covino, T. P., Eurich, A., et al. (2020). The case for an open water balance: re-envisioning network design and data analysis for a complex, uncertain world. Water Resour. Res. 56:e2019WR026699. doi: 10.1029/2019WR026699
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., et al. (2017). “LightGBM: a highly efficient gradient boosting decision tree,” in Proceedings of the 31st International Conference on Neural Information Processing Systems (Red Hook, NY: Curran Associates Inc.), 3149–3157.
Khazaeiathar, M., and Schmalz, B. (2025). Addressing volatility and nonlinearity in discharge modeling: ARIMA-iGARCH for short-term hydrological time series simulation. Hydrology 12:197. doi: 10.3390/hydrology12080197
Kirtman, B. P., Min, D., Infanti, J. M., Kinter III, J. L., Paolino, D. A., Zhang, Q., et al. (2014). The North American multimodel ensemble: phase-1 seasonal-to-interannual prediction; phase-2 toward developing intraseasonal prediction. Bull. Am. Meteorol. Soc. 95, 585–601. doi: 10.1175/BAMS-D-12-00050.1
Kling, H., Fuchs, M., and Paulin, M. (2012). Runoff conditions in the upper Danube basin under an ensemble of climate change scenarios. J. Hydrol. 424–425, 264–277. doi: 10.1016/j.jhydrol.2012.01.011
Knoben, W. J. M., Freer, J. E., and Woods, R. A. (2019). Technical note: inherent benchmark or not? Comparing Nash–Sutcliffe and Kling–Gupta efficiency scores. Hydrol. Earth Syst. Sci. 23, 4323–4331. doi: 10.5194/hess-23-4323-2019
Konapala, G., Kao, S. C., Painter, S. L., and Lu, D. (2020). Machine learning assisted hybrid models can improve streamflow simulation in diverse catchments across the conterminous US. Environ. Res. Lett. 15:104022. doi: 10.1088/1748-9326/aba927
Kontopoulou, V. I., Panagopoulos, A.D., Kakkos, I., and Matsopoulos, G. K. (2023). A review of ARIMA vs. machine learning approaches for time series forecasting in data driven networks. Future Internet 15:255. doi: 10.3390/fi15080255
Kratzert, F., Klotz, D., Brenner, C., Schulz, K., and Herrnegger, M. (2018). Rainfall–runoff modelling using long short-term memory (LSTM) networks. Hydrol. Earth Syst. Sci. 22, 6005–6022. doi: 10.5194/hess-22-6005-2018
Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A. K., Hochreiter, S., and Nearing, G. S. (2019). Toward improved predictions in ungauged basins: exploiting the power of machine learning. Water Resour. Res. 55, 11344–11354. doi: 10.1029/2019WR026065
Krishnapriyan, A., Gholami, A., Zhe, S., Kirby, R., and Mahoney, M. W. (2021). Characterizing possible failure modes in physics-informed neural networks. Adv. Neural Inf. Process. Syst. 34, 26548–26560.
Laaha, G., Gauster, T., Tallaksen, L. M., Vidal, J. P., Stahl, K., Prudhomme, C., et al. (2017). The European 2015 drought from a hydrological perspective. Hydrol. Earth Syst. Sci. 21, 3001–3024. doi: 10.5194/hess-21-3001-2017
Landau, W. M. (2021). targets: Dynamic Function-Oriented 'Make'-Like Declarative Pipelines. R Package.
Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33, 159–174. doi: 10.2307/2529310
Le, X. H., Van Binh, D., and Lee, G. (2025). Performance and uncertainty analysis in deep learning frameworks for streamflow forecasting via Monte Carlo dropout technique. J. Hydrol. Reg. Stud. 61:102668. doi: 10.1016/j.ejrh.2025.102668
Lesinger, K., Tian, D., and Wang, H. (2024). Subseasonal forecast skill of evaporative demand, soil moisture, and flash drought onset from two dynamic models over the contiguous United States. J. Hydrometeorol. 25, 965–990. doi: 10.1175/JHM-D-23-0124.1
Liu, Y., Wu, H., Wang, J., and Long, M. (2022). “Non-stationary transformers: exploring the stationarity in time series forecasting,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (Red Hook, NY: Curran Associates Inc.).
Luukkonen, C. L., Alzraiee, A. H., Larsen, J. D., Martin, D., Herbert, D. M., Buchwald, C. A., et al. (2024). National Watershed Boundary (HUC12) Dataset for the Conterminous United States, Retrieved 10/26/2020. Reston, VA: U.S. Geological Survey data release.
Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2020). The M4 competition: 100,000 time series and 61 forecasting methods. Int. J. Forecast. 36, 54–74. doi: 10.1016/j.ijforecast.2019.04.014
Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022a). The M5 competition: background, organization, and implementation. Int. J. Forecast. 38, 1325–1336. doi: 10.1016/j.ijforecast.2021.07.007
Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022b). M5 accuracy competition: results, findings, and conclusions. Int. J. Forecast. 38, 1346–1364. doi: 10.1016/j.ijforecast.2021.11.013
McCabe, G. J., Wolock, D. M., Lombard, M., Dudley, R. W., Hammond, J. C., Hecht, J. S., et al. (2023). A hydrologic perspective of major US droughts. Int. J. Climatol. 43, 1234–1250. doi: 10.1002/joc.7904
McShane, R. R., Goodling, P. J., Diaz, J. A., Heldmyer, A. J., and Hammond, J. C. (2025). Model Outputs and Model Code for Machine Learning Models Forecasting Streamflow Drought Across the Conterminous United States. Reston, VA: U.S. Geological Survey.
Mishra, A. K., and Singh, V. P. (2011). Drought modeling–a review. J. Hydrol. 403, 157–175. doi: 10.1016/j.jhydrol.2011.03.049
Mitchell, K. E., Lohmann, D., Houser, P. R., Wood, E. F., Schaake, J. C., Robock, A., et al. (2004). The multi-institution North American Land Data Assimilation System (NLDAS): utilizing multiple GCIP products and partners in a continental distributed hydrological modeling system. J. Geophys. Res. Atmos. 109:D07S90. doi: 10.1029/2003JD003823
Modarres, R. (2007). Streamflow drought time series forecasting. Stoch. Environ. Res. Risk Assess. 21, 223–233. doi: 10.1007/s00477-006-0058-1
Modi, P., Jennings, K., Kasprzyk, J., Small, E., Wobus, C., and Livneh, B. (2025). Using deep learning in ensemble streamflow forecasting: exploring the predictive value of explicit snowpack information. J. Adv. Model. Earth Syst. 17:e2024MS004582. doi: 10.1029/2024MS004582
Molnar, C. (2025). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 3rd ed.
Montanari, A., Rosso, R., and Taqqu, M. S. (1997). Fractionally differenced ARIMA models applied to hydrologic time series: identification, estimation, and simulation. Water Resour. Res. 33, 1035–1044. doi: 10.1029/97WR00043
Mouatadid, S., Orenstein, P., Flaspohler, G., Oprescu, M., Cohen, J., Wang, F., et al. (2023). “Subseasonal climate USA: a dataset for subseasonal forecasting and benchmarking,” in NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems.
Myronidis, D., Ioannou, K., Fotakis, D., and Dörflinger, G. (2018). Streamflow and hydrological drought trend analysis and forecasting in Cyprus. Water Resour. Manag. 32, 1759–1776. doi: 10.1007/s11269-018-1902-z
Nash, J. E., and Sutcliffe, J. V. (1970). River flow forecasting through conceptual models. Part 1: a discussion of principles. J. Hydrol. 10, 282–290. doi: 10.1016/0022-1694(70)90255-6
Nearing, G. S., Kratzert, F., Sampson, A. D., Pelissier, C. S., Klotz, D., Frame, J. M., et al. (2020). What role does hydrological science play in the age of machine learning? Water Resour. Res. 57:e2020WR028091. doi: 10.1029/2020WR028091
Nguyen, T., Jewik, J., Bansal, H., Sharma, P., and Grover, A. (2023). Climatelearn: benchmarking machine learning for weather and climate modeling. Adv. Neural Inf. Process. Syst. 36, 75009–75025.
Ouyang, W., Lawson, K., Feng, D., Ye, L., Zhang, C., and Shen, C. (2021). Continental-scale streamflow modeling of basins with reservoirs: towards a coherent deep-learning-based strategy. J. Hydrol. 599:126455. doi: 10.1016/j.jhydrol.2021.126455
Peterson, D. A., Kampf, S. K., Puntenney-Desmond, K. C., Fairchild, M. P., Zipper, S., Hammond, J. C., et al. (2024). Predicting streamflow duration from crowd-sourced flow observations. Water Resour. Res. 60:e2023WR035093. doi: 10.1029/2023WR035093
Pham, L. T., Luo, L., and Finley, A. (2021). Evaluation of random forests for short-term daily streamflow forecasting in rainfall- and snowmelt-driven watersheds. Hydrol. Earth Syst. Sci. 25, 2997–3015. doi: 10.5194/hess-25-2997-2021
Qian, X., Wang, B., Chen, J., Fan, Y., Mo, R., Xu, C., et al. (2025). An explainable ensemble deep learning model for long-term streamflow forecasting under multiple uncertainties. J. Hydrol. 662:133968. doi: 10.1016/j.jhydrol.2025.133968
R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Available online at: https://www.r-project.org/ (Accessed September 24, 2025).
Ransom, K. M., Nolan, B. T., Stackelberg, P. E., Belitz, K., and Fram, M. S. (2022). Machine learning predictions of nitrate in groundwater used for drinking supply in the conterminous United States. Sci. Total Environ. 807:151065. doi: 10.1016/j.scitotenv.2021.151065
Regan, R. S., Juracek, K. E., Hay, L. E., Markstrom, S. L., Viger, R. J., Driscoll, J. M., et al. (2019). The U.S. Geological Survey National Hydrologic Model infrastructure: rationale, description, and application of a watershed-scale model for the conterminous United States. Environ. Model. Softw. 111, 192–203. doi: 10.1016/j.envsoft.2018.09.023
Rezaiy, R., and Shabri, A. (2023). Drought forecasting using W-ARIMA model with standardized precipitation index. J. Water Clim. Change 14, 3345–3367. doi: 10.2166/wcc.2023.431
Sabzi, H. Z., King, J. P., and Abudu, S. (2017). Developing an intelligent expert system for streamflow prediction, integrated in a dynamic decision support system for managing multiple reservoirs: a case study. Expert Syst. Appl. 83, 145–163. doi: 10.1016/j.eswa.2017.04.039
Sarailidis, G., Vasiliades, L., and Loukas, A. (2019). Analysis of streamflow droughts using fixed and variable thresholds. Hydrol. Processes 33, 414–431. doi: 10.1002/hyp.13336
Sharma, S., Ghimire, G. R., and Siddique, R. (2023). Machine learning for postprocessing ensemble streamflow forecasts. J. Hydroinformatics 25, 126–139. doi: 10.2166/hydro.2022.114
Sharma, S., Meyer, M. F., Culpepper, J., Yang, X., Hampton, S., Berger, S. A., et al. (2020). Integrating perspectives to understand lake ice dynamics in a changing world. J. Geophys. Res. Biogeosci. 125:e2020JG005799. doi: 10.1029/2020JG005799
Shen, C. (2018). A transdisciplinary review of deep learning research and its relevance for water resources scientists. Water Resour. Res. 54, 8558–8593. doi: 10.1029/2018WR022643
Shi, Y., Ke, G., Soukhavong, D., Lamb, J., Meng, Q., Finley, T., et al. (2020). lightgbm: Light Gradient Boosting Machine. R Package.
Simeone, C., Foks, S., Towler, E., Hodson, T., and Over, T. (2024). Evaluating hydrologic model performance for characterizing streamflow drought in the conterminous United States. Water 16:2996. doi: 10.3390/w16202996
Simeone, C. E. (2022). Streamflow Drought Metrics for Select United States Geological Survey Streamgages for Three Different Time Periods from 1921–2020. Reston, VA: U.S. Geological Survey Data Release.
Skumanich, M., Smith, E., Lisonbee, J., and Hammond, J. C. (2024). Drought Prediction and Water Availability: A Report on the 2022 USGS-NIDIS National Listening Session Series. Cooperator-led report. Available online at: https://www.drought.gov/documents/drought-prediction-and-water-availability-report-2022-usgs-nidis-national-listening (Accessed September 24, 2025).
Slater, L. J., Arnal, L., Boucher, M. A., Chang, A. Y. Y., Moulds, S., Murphy, C., et al. (2023). Hybrid forecasting: blending climate predictions with AI models. Hydrol. Earth Syst. Sci. 27, 1865–1889. doi: 10.5194/hess-27-1865-2023
Song, X., Deng, L., Wang, H., Zhang, Y., He, Y., and Cao, W. (2024). Deep learning-based time series forecasting. Artif. Intell. Rev. 58:23. doi: 10.1007/s10462-024-10989-8
Steyaert, J. C., Condon, L. E., Turner, S. W. D., and Voisin, N. (2022). ResOpsUS, a dataset of historical reservoir operations in the contiguous United States. Sci. Data 9:34. doi: 10.1038/s41597-022-01134-7
Su, L., Cao, Q., Shukla, S., Pan, M., and Lettenmaier, D. P. (2023). Evaluation of subseasonal drought forecast skill over the coastal western United States. J. Hydrometeorol. 24, 709–726. doi: 10.1175/JHM-D-22-0103.1
Sutanto, S. J., and Van Lanen, H. A. J. (2021). Streamflow drought: implication of drought definitions and its application for drought forecasting. Hydrol. Earth Syst. Sci. 25, 3991–4023. doi: 10.5194/hess-25-3991-2021
Sutanto, S. J., Wetterhall, F., and Van Lanen, H. A. (2020). Hydrological drought forecasts outperform meteorological drought forecasts. Environ. Res. Lett. 15:084010. doi: 10.1088/1748-9326/ab8b13
Svoboda, M., LeComte, D., Hayes, M., Heim, R., Gleason, K., Angel, J., et al. (2002). The drought monitor. Bull. Am. Meteorol. Soc. 83, 1181–1190. doi: 10.1175/1520-0477-83.8.1181
Tokranov, A. K., Ransom, K. M., Bexfield, L. M., Lindsey, B. D., Watson, E., Dupuy, D. I., et al. (2024). Predictions of groundwater PFAS occurrence at drinking water supply depths in the United States. Science 386, 748–755. doi: 10.1126/science.ado6638
Tosan, M., Nourani, V., Kisi, O., and Dastourani, M. (2025). Evolution of ensemble machine learning approaches in water resources management: a review. Earth Sci. Informatics 18:416. doi: 10.1007/s12145-025-01911-z
Tounsi, A., Temimi, M., and Gourley, J. J. (2022). On the use of machine learning to account for reservoir management rules and predict streamflow. Neural Comput. Appl. 34, 18917–18931. doi: 10.1007/s00521-022-07500-1
Towler, E., Foks, S. S., Dugger, A. L., Dickinson, J. E., Essaid, H. I., Gochis, D., et al. (2023). Benchmarking high-resolution hydrologic model performance of long-term retrospective streamflow simulations in the contiguous United States. Hydrol. Earth Syst. Sci. 27, 1809–1825. doi: 10.5194/hess-27-1809-2023
Towler, E., Stovern, D., Acharya, N., Abel, M. R., Currier, W. R., Bellier, J., et al. (2025). Implementing and evaluating National Water Model ensemble streamflow predictions using postprocessed precipitation forecasts. J. Hydrometeorol. 26, 385–399. doi: 10.1175/JHM-D-24-0111.1
Troin, M., Arsenault, R., Wood, A. W., Brissette, F., and Martel, J. L. (2021). Generating ensemble streamflow forecasts: a review of methods and approaches over the past 40 years. Water Resour. Res. 57:e2020WR028392. doi: 10.1029/2020WR028392
U.S. Geological Survey (2025). USGS water data for the Nation. U.S. Geological Survey National Water Information System database. Reston, VA.
Van Huijgevoort, M. H. J., Hazenberg, P., Van Lanen, H. A. J., and Uijlenhoet, R. (2012). A generic method for hydrological drought identification across different climate regions. Hydrol. Earth Syst. Sci. 16, 2437–2451. doi: 10.5194/hess-16-2437-2012
Van Loon, A. F. (2015). Hydrological drought explained. WIREs Water 2, 359–392. doi: 10.1002/wat2.1085
Vo, T. Q., Kim, S. H., Nguyen, D. H., and Bae, D. H. (2023). LSTM-CM: a hybrid approach for natural drought prediction based on deep learning and climate models. Stoch. Environ. Res. Risk Assess. 37, 2035–2051. doi: 10.1007/s00477-022-02378-w
Wang, H., Song, S., Zhang, G., and Ayantobo, O. O. (2023). Predicting daily streamflow with a novel multi-regime switching ARIMA-MS-GARCH model. J. Hydrol. Reg. Stud. 101374. doi: 10.1016/j.ejrh.2023.101374
Wang, Z.-Y., Qiu, J., and Li, F. (2018). Hybrid models combining EMD/EEMD and ARIMA for long-term streamflow forecasting. Water 10:853. doi: 10.3390/w10070853
Watt, J., Borhani, R., and Katsaggelos, A. K. (2020). Machine Learning Refined: Foundations, Algorithms, and Applications. Cambridge: Cambridge University Press.
Wilhite, D. A., and Glantz, M. H. (1985). Understanding the drought phenomenon: the role of definitions. Water Int. 10, 111–120. doi: 10.1080/02508068508686328
Willard, J. D., and Varadharajan, C. (2025). Machine learning ensembles can enhance hydrologic predictions and uncertainty quantification. J. Geophys. Res. Mach. Learn. Comput. 2:e2025JH000732. doi: 10.1029/2025JH000732
Wlostowski, A. N., Jennings, K. S., Bash, R. E., Burkhardt, J., Wobus, C. W., and Aggett, G. (2022). Dry landscapes and parched economies: a review of how drought impacts nonagricultural socioeconomic sectors in the US Intermountain West. WIREs Water 9:e1571. doi: 10.1002/wat2.1571
Yu, S., Hannah, W., Peng, L., Lin, J., Bhouri, M. A., Gupta, R., et al. (2023). ClimSim: a large multi-scale dataset for hybrid physics-ML climate emulation. Adv. Neural Inf. Process. Syst. 36, 22070–22084.
Zhi, W., Appling, A. P., Golden, H. E., Podgorski, J., and Li, L. (2024). Deep learning for water quality. Nat. Water 2, 228–241. doi: 10.1038/s44221-024-00202-z
Zhou, X., Zhu, Y., Hou, D., Luo, Y., Peng, J., and Wobus, R. (2017). Performance of the new NCEP Global Ensemble Forecast System in a parallel experiment. Weather Forecast. 32, 1989–2004. doi: 10.1175/WAF-D-17-0023.1
Zwart, J. A., Diaz, J., Hamshaw, S., Oliver, S., Ross, J. C., Sleckman, M., et al. (2023a). Evaluating deep learning architecture and data assimilation for improving water temperature forecasts at unmonitored locations. Front. Water 5:1184992. doi: 10.3389/frwa.2023.1184992
Zwart, J. A., Oliver, S. K., Watkins, W. D., Sadler, J. M., Appling, A. P., Corson-Dosch, H. R., et al. (2023b). Near-term forecasts of stream temperature using deep learning and data assimilation in support of management decisions. JAWRA J. Am. Water Resour. Assoc. 59, 317–337. doi: 10.1111/1752-1688.13093
Keywords: hydrological drought, streamflow drought, streamflow, forecasting, machine learning, uncertainty quantification
Citation: Hammond J, Goodling P, Diaz J, Corson-Dosch H, Heldmyer A, Hamshaw S, McShane R, Ross J, Sando R, Simeone C, Smith E, Staub L, Watkins D, Wieczorek M, Wnuk K and Zwart J (2026) Machine learning generated streamflow drought forecasts for the conterminous United States (CONUS): developing and evaluating an operational tool to enhance sub-seasonal to seasonal streamflow drought early warning for gaged locations. Front. Water 7:1709138. doi: 10.3389/frwa.2025.1709138
Received: 19 September 2025; Revised: 25 November 2025;
Accepted: 30 November 2025; Published: 07 January 2026.
Edited by: Francesco Granata, University of Cassino, Italy
Reviewed by: Fabio Di Nunno, University of Cassino, Italy; Georgia A. Papacharalampous, University of Padua, Italy; Ozgur Kisi, University of Lübeck, Germany
Copyright © 2026 Hammond, Goodling, Diaz, Corson-Dosch, Heldmyer, Hamshaw, McShane, Ross, Sando, Simeone, Smith, Staub, Watkins, Wieczorek, Wnuk and Zwart. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: John Hammond, jhammond@usgs.gov