Accuracy of US CDC COVID-19 forecasting models

Accurate predictive modeling of pandemics is essential for optimally distributing biomedical resources and setting policy. Dozens of case prediction models have been proposed but their accuracy over time and by model type remains unclear. In this study, we systematically analyze all US CDC COVID-19 forecasting models, by first categorizing them and then calculating their mean absolute percent error, both wave-wise and on the complete timeline. We compare their estimates to government-reported case numbers, one another, as well as two baseline models wherein case counts remain static or follow a simple linear trend. The comparison reveals that around two-thirds of models fail to outperform a simple static case baseline and one-third fail to outperform a simple linear trend forecast. A wave-by-wave comparison of models revealed that no overall modeling approach was superior to others, including ensemble models and errors in modeling have increased over time during the pandemic. This study raises concerns about hosting these models on official public platforms of health organizations including the US CDC which risks giving them an official imprimatur and when utilized to formulate policy. By offering a universal evaluation method for pandemic forecasting models, we expect this study to serve as the starting point for the development of more accurate models.


Introduction
The COVID-19 pandemic (1) resulted in at least 100 million confirmed cases and more than 1 million deaths in the United States alone.Worldwide, cases exceed 650 million, with at least 6.5 million deaths (2).The pandemic has affected every country and presented a major threat to global health.This has caused a critical need to study the transmission of emerging infectious diseases in order to make accurate case forecasts, especially during disease outbreaks.about the degree of precautions necessary for a given region at a particular time, which regions to avoid travel to, and the degree of risk in various activities like public gatherings.Likewise, models can be used to proactively prepare for severe surges in cases by allocating biomedical resources such as oxygen or personnel.Collecting and presenting these models gives public health officials, and organizations such as the UNITED STATES CENTERS FOR DISEASE CONTROL AND PREVENTION (US CDC) (3), a mechanism to disseminate these predictions to the public, but risks giving them an official imprimatur, suggesting that these models were either developed or endorsed by a government agency.

. Motivation
Since the start of the pandemic, dozens of case prediction models for the US have been designed using a variety of methods.Each of these models depends on data available about cases, derived from a heterogeneous system of reporting, which can vary by county and suffer from regional and temporal delays.For example, some counties may collect data over several days and make it public at once, which creates an illusion of a sudden burst of cases.In counties with less robust testing programs, the lack of data can limit modeling accuracy.These methods are not uniform or standardized between groups that perform data collection, resulting in unpredictable errors.Underlying biases in the data, such as under-reporting, can produce predictable errors in model quality, requiring models to be adjusted to predict future erroneous reporting rather than actual case numbers.Such underreporting has been identified by serology data (4,5).Moreover, there is no universally agreed upon system for assessing and comparing the accuracy of case prediction models.Often published models use different methods, which makes direct comparisons difficult.The CDC has taken in data of case prediction models in a standardized way which makes direct comparisons possible (3).In this study, we use Mean Absolute Percent Error (MAPE) compared to the true case numbers to compare models for normalization purposes.First, we consider which models have the most accurately predicted case counts with the least MAPE.Next, we divide these models into five broad sub-types based on approach, i.e., epidemiological (or compartment) models, machine learning approaches, ensemble approaches, hybrid approaches, and other approaches, and compare the overall error of models using these approaches.We also consider which exclusion criteria might produce ensemble models with the greatest accuracy and predictive power.

. Contributions
A few studies (6, 7) have compared COVID-19 case forecasting models.However, the present study is unique in several aspects.First, it is focused on prediction models of US cases and takes into consideration all CDC models that pass the set inclusion criterion.Second, since these models were uploaded in a standardized format they can be compared across several dimensions such as R 0 , peak timing error, percent error, and model architecture.Third, we seek to answer several unaddressed questions relevant to pandemic case modeling: • First, can we establish a metric to uniformly evaluate pandemic forecasting models?• Second, What are the top-performing models during the four COVID-19 waves in the US and how do these fare on the complete timeline?• Third, are there categories or classes of case prediction models that perform significantly better than others?• Fourth, how do model predictions fare with increased forecast horizons?• Lastly, how do models compare to two simple baselines?

Methodology
This work compares various US-CDC COVID-19 forecasting models by their quantitative aspects evaluating their performance in strictly numerical terms over various time segments.The US-CDC collects weekly forecasts for COVID cases in four different horizons: 1-, 2-, 3-, and 4-weeks, i.e., each week, the models make a forecast for new COVID cases in each of the four consecutive weeks from the date of the forecast.The forecast horizon is defined as the length of time into the future for which forecasts are to be prepared.In the present study, we focus on evaluating the performance of models for their 4-week ahead forecasts.

. Datasets
The data for the confirmed case counts are taken from the COVID-19 Data Repository (https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/ csse_covid_19_daily_reports_us), maintained by the Center for Systems Science and Engineering at the Johns Hopkins University.The data for the predicted case counts, of all the models, is obtained from the data repository for the COVID-19 Forecast Hub (https:// github.com/reichlab/covid19-forecast-hub),which is also the data source for the official US-CDC COVID-19 forecasting page.Both these datasets were pre-processed to remove models that have made predictions for <25% of the target dates covered by the respective time segment.For plotting Figure 1A, the PYPLOT module from the Matplotlib library in Python was used.

. Categorizing models
The models were categorized into five different categories-Epidemiological/ Compartmental (see Table 1), Machine Learning (see Table 2), Ensemble (see Table 3), Hybrid (see Table 4), and Others (see Table 5).This was based on keywords found in model names and going to each model description on their respective web pages and articles.The models which did not broadly fall into these categories were kept in "others".These models use very different methods to arrive at predictions.We comprehensively analyzed 51 models.The CDC also uses an .%, Wave : .%, Wave : .%, Wave : .%, Wave : .%).
ensemble model, and we looked at whether this was better than any individual model.For each model uploaded to the CDC website, MAPE was calculated and reported in this study, and the models were compared wave-wise as shown in various figures.
For each model, the model type was noted, as well as the month proposed.

. Wave definition
A thorough search reveals M.D., epidemiologists, and policymakers do share the same underlying principles of the term "wave".Popular media explain that "the word 'wave' implies a natural pattern of peaks and valleys".WHO stated in order to say one wave is ended, the virus has to be brought under control and cases have to fall substantially, then for a second wave to start, you need a sustained rise in infections.As part of their National Forecasts for COVID cases, the CDC has reported the results from a total of 54 different models at various instances of time during the pandemic.We define the waves, i.e., Wave-I: July, 6th 2020 to August 31st, 2020, Wave-II: September, 1st 2020 to February, 14th 2021, Wave-III: February, 15th 2021 to July, 26th 2021 and Wave-IV: July, 27th 2021 to January, 17th 2022, corresponding to each of the major waves in the US.

. Baselines
The performance of the models was evaluated against two simple baselines.Baseline-I is the "CovidHub-Baseline" (or CDC's baseline), i.e., the median prediction at all future horizons is the most recent observed incidence (i.e., the most recent day).Baseline-II is the linear predictor extrapolation using the slope of change in reported active cases between the 2 weeks preceding the date of the forecast.These baselines are included in the bar charts (shown in Figures 1B, 2).Within each of the waves of interest, only models that made a significant number of predictions are considered for comparison.We only consider models that have made predictions for at least 25% of the target dates covered by the respective time segment for all comparisons in this section.The MAPE was calculated on the 4-week forecast horizon.Figure 1B illustrates the performance of models across all the waves.Same procedure, as in Figure 2, was followed for plotting Figure 1B.

. Model comparison
The performance of all the models is compared, wave-wise and on the complete timeline based on the MAPE (or mean absolute percent error).MAPE is defined as the ratio of absolute percentage

CovidAnalytics-DELPHI (16)
Introduces new states to accommodate for unnoticed cases, as well as an explicit death state.A non-linear curve reflecting government reaction is used to adjust the infection rate.Also, a meta-analysis of 150 factors is used to determine key illness parameters, while epidemiological parameters are fitted to historical death counts & identified cases.The error refers to the difference between the confirmed case counts and the predicted case counts and is calculated as shown in Equation (1) below:

MIT
Here, n = number of data points, A t is the actual value and P t denoted the predicted value.
Figure 1C shows the on-parametric Kruskall-Wallis test results on the category-wise errors, achieved by the models overall as well as wave-wise.The mean of the MAPE values is calculated for each category in model type, for overall and each wave separately.The mean is calculated by adding the MAPE values of all the models in a category and dividing it by the number of models in that category for the corresponding wave.hen, a non-parametric Kruskall-Wallis test is performed to determine if there is a significant difference between the two groups.We used scipy in Python to perform the test.For cases where significant differences were found (for Wave-II), a Mann-Whitney test was further performed.We did not perform the popular ANOVA test since ANOVA conditions of normality (we used a Shapiro-Wilk test for normality on the residuals) and homogeneity of Variances (we used Levene's test for equality of variances) were not met for all waves.

Statistical analysis
The statistical significance of all figures was determined by the non-parametric Kruskall-Wallis test which is followed by a Mann-Whitney test for groups that had a significant difference.Bar plots having a p-value <0.05 (statistically significant), are joined using an asterisk.In Figure 3, box plots are made representing the MAPE over all the predictions of a certain model for the corresponding forecast horizon.A box plot displays the distribution of data based on a five-number summary ["minimum", first quartile (Q1), median, third quartile (Q3), and "maximum"].We used the boxplot function (of seaborn library) in Python to plot it.Seaborn is a Python data visualization library based on Matplotlib.

Results
When case counts predicted by the various US CDC COVID-19 case prediction models are overlaid real-world data, visually several features stand out as illustrated in Figure 1A.On aggregate, models tend to approach the correct peak during various waves of the pandemic.However, some models undershot, others overshot, and many lagged the leading edge of real-world data by several weeks.

. Complete timeline analysis
The MAPE values of all US CDC models were analyzed over the complete timeline and compared to two "Baselines", which represented either an assumption that case counts would remain the same as the previous week (Baseline-I) or a simple linear model following the previous week's case counts (Baseline-II) (see Figure 1B).Here, "IQVIA_ACOE-STAN, " "USACE-ERDC_SEIR, " "MSRA-DeepST, " and "USC-SI_kJalpha_RF" achieve the best performance with low MAPE ranging from 5 to 35%.In a comparison of overall performance, ensemble models performed significantly better than all other model types.However, the performance of ensemble models was not "significantly" better than the baseline models (no change or simple linear model) which performed better than both machine learning and epidemiological models overall (see Figure 1C).The peak predictions of all US-CDC models were plotted on the complete timeline and included in the study (see Figures 5-8).

. Wave-wise analysis of best performing models
The MAPE values of all US CDC models were also analyzed wave-wise (see Figure 2).During the first wave of the pandemic, "Columbia_UNC-SurvCon" achieved the lowest MAPE = 14%, closely followed by "USACE-ERDC_SEIR" (MAPE = 17%) and "CovidAnalytics-DELPHI" (MAPE = 25%).Here, only 4 models performed better than both baselines.Three of these were epidemiological models while one was a hybrid model.From the bar plot of MAPE values of models categorized based on model type (refer Figure 1C), it can be inferred that during the first wave, hybrid models performed the best and attained the lowest MAPE.This was followed by the epidemiological models and those based on machine learning.On the other hand, ensemble models had the largest MAPE during this wave and none of them surpassed the Baseline-I MAPE, i.e., 31%.
During the second wave, "IQVIA_ACOE-STAN" performed the best with a MAPE score of 5% (see Figure 2).In this wave, a total of 13 models performed better than both baselines, with MAPE ranging from 5 to 37.These included five ensemble models, four epidemiological models, two machine learning models, and two hybrid models.All ensemble models exceed Baseline-I performance (that had MAPE = 37%), with the exception of "UVA-Ensemble".The epidemiological models showed a staggered MAPE distribution.Followed by the hybrid and the models categorized as "other" model sub-types, these have the lowest average MAPE in wave-II.In contrast to wave-I, ensemble models provide the best forecasts in wave-II (see Figure 1C).Here, hybrid models are the worst-performing models.
During the third wave (see Figure 1C), ensemble models performed similarly to wave-I.Baseline models had a relatively elevated high MAPE with Baseline-I and II MAPE scores being 74 and 77%, respectively.In wave-III, "USC-SI_kJalpha" is the bestperformed model with MAPE= 32% (see Figure 2).Here, 32 models performed better than both baseline models.These included 12 LNQ-ens1 ( 26) Uses an ensemble of three models; two fit with LightGBM, and the third is a neural net.Ensemble weights are chosen each week manually based on performance in the previous week.

LockNQuay
Ensemble of three models.
Intervention effects are reflected in observable data and will continue in the future.

COVIDhub-baseline (27)
Baseline model for predictions.The most recent observed incidence is the median projection for all future horizons.From 1 week to next, the slope of the predicted medians for cumulative will be constant and equal to previously observed slope.
The model looks at week-to-week incidence variations to generate a median distribution.

− −
COVIDhub-trained_ensemble A weighted ensemble combination of all component model forecasts.

COVID-19 Forecast Hub
Ensemble − UVA-Ensemble (29) Combines models using Bayesian model averaging.Auto-regressive method with features including mobility, other county case counts time-series, an LSTM model with mobility data as an additional predictor, and PatchSim, an SEIR variant with interaction between counties modeled using commuter data and calibrated on new confirmed cases.

MIT Cassandra Ensemble of four models
Assumes that current interventions will remain in place indefinitely.

FDANIHASU-Sweight
Ensemble of submitted forecasts to COVID-19 Forecast Hub.The ensembles are formed by weighting the individual model forecasts with their past performances FDANIHASU Ensemble − compartment models, three machine learning models, four hybrid models, eight ensemble models, and five un-categorized models.
In the fourth wave of the pandemic, several models performed similarly between a MAPE of 28% and a baseline of 47% (Figure 2).Ensemble models performed the best whereas epidemiological models had the highest MAPE during this wave.Baseline-I and II MAPE scores were 47 and 48%, respectively.In wave-IV, "LANL-GrowthRate" is the best performed model with MAPE = 28% (Figure 2).In the fourth wave, 17 models performed better than both baseline models.These included six compartment models, seven ensemble models, one machine learning models, two hybrid models, and one uncategorized models.

. E ect of increase in forecast horizon
The MAPE of models in the US CDC database increased each week out from the time of prediction.Figure 3 depicts a strictly increasing rise in MAPE with the increase in the forecast horizon.
In other words, the accuracy of predictions declined the further out they were made.At 1 week from the time of prediction, the MAPE of models examined clustered just below 25% MAPE and declined to about 50% MAPE by 4 weeks.
The MAPE in each week was relatively bi-modal, with several models fitting within a roughly normal distribution and others having a higher MAPE.The distribution of the points indicates that the majority of models project similar predictions for smaller forecast horizons, while the predictions for larger horizons are more spread out.The utility of model interpretation would improve by excluding those models that fall more than one standard deviation (σ ) from the average MAPE of models.
Though we examined 54 models overall (reported 51 in the full timeline), many of these did not make predictions during the first wave of the pandemic when data was less available and therefore do not appear in the wave-wise MAPE calculation and subsequent analysis.The number of weeks for which each model provides predictions was highly variable, as seen in Figure 4.

MOBS-GLEAM_COVID (35)
A metapopulation method is used.The world is divided into geographical subpopulations, and human mobility between subpopulations is depicted on a network.This data layer on mobility identifies the number of persons traveling from between sub-populations.The mobility network is made up of many mobility processes, ranging from short-distance commuting to intercontinental travel.Superimposed on the globe population and mobility layers is an agent-based epidemic model that describes the infection and population dynamics.

MOBS Lab at Northeastern
Metapopulation age-structured SLIR Social distancing policies in place at the date of calibration are extended for the future weeks.

DDS-NBDS (36)
Jointly modeling daily deaths and cases using a negative binomial distribution based non-parametric Bayesian generalized linear dynamical system (NBDS).
Team DDS Bayesian hierarchical model Intervention effects are reflected in observable data will continue in future.

Discussion
Accurate modeling is critical in pandemics for a variety of reasons.Policy decisions need to be made by political entities that must follow procedures, sometimes requiring weeks for a proposed policy intervention to become law and still longer to be implemented.Likewise, public health entities such as hospitals, nurseries, and health centers need "lead time" to distribute resources such as staffing, beds, ventilators, and oxygen supplies.However, modeling is limited, especially by the availability of data, particularly in early outbreaks (46).Resources such as ventilators are often distributed heterogeneously (47), leading to a risk of unnecessary mortality.Similarly, the complete homogeneous distribution of resources like masks is generally sub-optimal and may also result in deaths (48).Likewise, it is important to develop means of assessing which modeling tools are most effective and trustworthy.For example, a case forecasting model that consistently makes predictions that fare worse than assuming that case counts will remain unchanged or that they will follow a simple linear model isn't likely to be useful in situations where modeling is critical.The use of these baselines allows for the exclusion of models that "fail" to predict case counts adequately.
Direct comparison of error based on the difference from realworld data potentially excludes important dimensions of model accuracy.For example, a model that accurately predicts the time course of disease cases (but underestimates cases by 20% at any given point) might have greater utility for making predictions about when precautions are necessary when compared to a model that predicts case numbers with only a 5% error but estimates peak cases 2 weeks late.To assess the degree of timing error, the peak of each model was compared to the true peak of cases that occurred within the model time window.MAPE i.e., the ratio of the error between the true case count and a model's prediction to the true case count, is a straightforward way of representing the quality of predictive models and comparing between model types.MAPE has several advantages over other metrics of measuring forecast error.First, since MAPE deals with negative residuals by taking an absolute value rather than a squared one, reported errors are proportional.Second, since it's a percent error, it is also conceptually straightforward to understand.Using absolute error as a measure might be misleading at times.For example, a model that predicts 201, 000 cases when 200, 000 cases occur has good accuracy but the same absolute error as a model that predicts 1, 100 cases when 100 occur.MAPE accounts for this by normalizing by the number of data points.
The average MAPE of successful models which we define as lower MAPE than either baseline model varied between methods depending on the wave they were measured in.In the first wave, epidemiological models had a mean MAPE of 31%, and machine learning models had a mean MAPE of 32%.In the second wave, these were 45 and 44%, respectively.In the third wave, these were 63 and 63%, and in the fourth wave, these were 92 and 44%, respectively.Therefore, we notice that the mean MAPE of models got worse with each wave.This is because each model type is susceptible to changing real-world conditions, such as the emergence of new variants with the potential to escape prior immunity, or a higher R 0 , new masking or lockdown mandates, the spread of conspiracy theories, or the development of vaccines which will decrease the number of individuals susceptible to infection.
In waves-2 and 3, the best performing model was a hybrid and machine learning model, respectively.In waves, 1 and 4, the best model was an epidemiological and an "other" model, .

CEID-Walk (44)
The model is based on a random walk with no drift.The variance in step size of random walk is estimated using the last few observations of a target time series.respectively (see Figure 2).In each wave, some examples of each model type were successful, and some were unsuccessful.After wave 1, ensemble and ML models on average had the lowest MAPE but no model category significantly outperformed baseline models (Figure 1C).These results suggest that no overall modeling technique is inherently superior for predicting future case counts.Compartmental (or epidemiological) models broadly use several "compartments" which individuals can move between such "susceptible, " "infectious, " or "recovered, " and use real-world data to arrive at estimates for the transition rate between these compartments.However, the accuracy of a compartment model depends heavily on accurate estimates of the R 0 in a population, a variable that changes over time, especially as new virus variants emerge.For example, the emergence of the Iota variant of SARS-CoV-2 (also known as lineage B.1.526)resulted in an unpredicted increase in case counts (49).Machine learning models train algorithms that would be difficult to develop by conventional means.These models "train" on real-world data sets and then make predictions based on that past data.Machine learning models are sensitive to the datasets they are trained on, and small or incomplete datasets produce unexpected results.Hybrid models make use of both compartment modeling and machine learning tools.On the other hand, ensemble models combine the results of multiple other models hoping that whatever errors exist in the other models will "average out" of the combined model.Ensemble models potentially offer an advantage over individual models in that by averaging the predictions of multiple models, flawed assumptions or errors in individual models may "average out" and result in a more accurate model.However, if multiple models share flawed assumptions or data, then averaging these models may simply compound these errors.An individual model may achieve a lower error than multiple flawed models in these cases.

University of
The "baseline" models had a MAPE of 48 and 54% over the entire course of the pandemic in the US.Most models did not perform better than these.Among those that did, we discuss the five with the best performance (in increasing order of their performance).First, "QJHong-Encounter" is a model by Qijung Hong, an Assistant professor at Arizona State University.This model uses an estimate of encounter density (how many potentially infectious encounters people are likely to have in a day) to predict changes in estimated R (reproduction number), and then uses that to estimate future daily new cases.The model uses machine learning.It had a MAPE of 38% over all waves.Second, "USC-SI_kJalpha_RF" is a hybrid model from the University of Southern California Data Science Lab.This model also uses a kind of hybrid approach, where additional parameters are modeled regionally for how different regions have reduced encounters and machine learning is used to estimate parameters (23).It had a MAPE of 35% over all waves.Third, "MSRA-DeepST" is a SIR hybrid model from Microsoft Research Lab-Asia that combines elements of SEIR models and machine learning.It had a MAPE of 34%.Next, "USACE-ERDC_SEIR" is a compartment model developed by the US Army Engineer Research and Development Center COVID-19 Modeling and Analysis Team.It adds to the classic SEIR model,

FIGURE
Predictions are most accurate closest to the time of prediction.The MAPE in predictions of all models for di erent forecast horizons is shown.The dots in each box plot represent the MAPE over all the predictions of a certain model for the corresponding forecast horizon.The y-axis is the MAE between the predicted case count and the reported case count.The x-axis is the forecast horizon.
adding additional compartments for unreported infections and isolated individuals.It used Bayesian estimates of prior probability based on subject matter experts to select initial parameters.It had a MAPE of 31%.Lastly, "IQVIA_ACOE_STAN" is a machine learning model from IQVIA-Analytics Center of Excellence and has the highest apparent performance on the overall timeframe.The calculated MAPE for this model was 5%.Notably, the MAPE of 5% is much lower than the MAPE of 31% of the next closest model.
Although MAPE is superior to other methods of comparing models, there are still some challenges.For example, "IQVIA_ACOE_STAN" appears to be the lowest model by far, but the only data available covers a relatively short time frame, and unfortunately discontinued its contribution to the CDC website after Wave 2. The time frame that it predicted also does not include any changes in case direction from upswings and downswings.This advantages the model compared to other models, which might cover time periods where case numbers peak or new variants emerge.This highlights a potential pitfall of examining this data: the models are not studying a uniform time window.
Data reporting is one of the sources of error affecting model accuracy.There is significant heterogeneity in the reporting of COVID-19 cases by state.This is caused by varying state laws, resources made available for testing, the degree of sequencing being done in each state, and other factors.Additionally, different states have heterogeneity in vaccination rate, population density, implementation of masking and lockdown, and other measures that may affect case count predictions.Therefore, the assumptions underlying different models and the degree to which this heterogeneity is taken into account may result in models having heterogeneous predictive power in different states.Although this same thinking could be extended to the county level, the case count reporting in each county is even more variable and makes comparisons difficult.
To make the comparison between models more even, we used multiple times segments to represent the various waves during the pandemic.Models that have made fewer predictions, particularly avoiding the "regions of interest" such as a fresh wave or a peak, would only be subjected to a less challenging evaluation than the models that covered most of the timeline.Further, the utility of models with only a few predictions within "regions of interest" is also questionable.Considering these aspects, we define four time segments corresponding to each of the major waves in the US.Within each of these waves of interest, we consider only models that made a significant number of predictions for the purpose of comparison.Such a compartmentalized comparison is now straightforward, as all models within a time segment can now have a common evaluation metric.
We focus on evaluating the model performance for their 4-week ahead forecasts.A larger forecast horizon provides a higher realworld utility in terms of policy-making or taking precautionary steps.We believe this to be a more accurate representation of a model's predictive abilities as opposed to smaller windows in which desirable results could be achieved by simply extrapolating the present trend.Therefore, due to the time delays associated with policy decisions and the movement of critical resources and people, the long-term accuracy of models is of critical importance.To take an extreme example, a forecasting model that only predicted 1 day in advance would have less utility than one that predicted ten days in advance.
The failure of roughly one-third of models in the CDC database to produce results superior to a simple linear model should raise concerns about hosting these models in a public venue.Without strict exclusion criteria, the public may not be aware that the are significant differences in the overall quality of these models.Each model type is subject to inherent weaknesses of the available data.The accuracy of compartment models is heavily dependent on the quality and quantity of reported data and also depends on a

FIGURE
Plot depicting frequency of week ahead predictions made by models.Here, models were ordered alphabetically on the y-axis.The x-axis represents the target dates for which the predictions were made.Dates range from July to Jan .
variable that might change with the emergence of new variants.Heterogeneous reporting of case counts, variable accuracy between states, and variable early access to testing resulted in limited data sets.Likewise, it seems that since training sets did not exist, machine learning models were unable to predict the Delta variant surge.Robust evidence-based exclusion criteria and performance-based weighting have the potential to improve the overall utility of future model aggregates and ensemble models.
Because the US CDC has a primary mission focused on the United States, the models included are focused on United States case counts.However, globally the assumptions necessary to produce an accurate model might differ due to differences in population density, vaccine availability, and even cultural beliefs about health.However, identifying the modeling approaches that work best in the United States provides a strong starting point for global modeling.Some of the differences in modeling will be FIGURE Plots depicting the peak predictions of US-CDC models-"BPagano-RtDriven," "CEID-Walk," "Covid Sim-Simulator," "CovidAnalytics-DELPHI," "Columbia_UNC-SurvCon," "COVIDhub-baseline," "COVIDhub-ensemble," "COVIDhub-_week_ensemble," "COVIDhub_CDC-ensemble," "CU-nochange," "COVIDhub-trained_ensemble," over complete timeline.
accounted for by different input data, which can be customized by country or different training sets in the case of machine learning models.
The ultimate measure of forecasting model quality is whether the model makes a prediction that is used fruitfully to make a real-world decision.Staffing decisions for hospitals can require a lead time of 2-4 weeks to prevent over-reliance on temporary workers, or shortages (50).Oxygen has become a scarce resource during the COVID-19 pandemic and also needs lead time (51).This has had real-world policy consequences as public officials have ordered oxygen imports well after they were needed to prevent shortages (52).Indeed for sufficient time to be available for public officials to enact new policies and for resources to be moved, a time frame of 8 weeks is preferable.The need for accurate predictions weeks in advance is confounded by the declining accuracy of models multiple weeks in advance, especially considering the rise time of new variant waves (Figure 3).During the most recent wave of infections, news media reported on the potential for the "Omicron" variant of SARS-CoV-2 to rapidly spread in November of 2021, however in the United States, an exponential rise was not apparent in case counts until December 14th, when daily new case counts approximated 100 k new cases per day, and by January 14th, 2022, new cases exceed 850, 000 new cases per day.The time when accurate predictive models are most useful is ahead of rapid rises in cases, something none of the models examined were able to predict, given the rise in SARS-CoV-2 cases that occurred during the pandemic.

Conclusion
Although forecasting models have gained immense attention recently, many challenges are faced in developing these to serve the needs of governments and organizations.We found that most models are not better than CDC baselines.The benchmark for resource allocation ahead of a wave still remains the identification and interpretation of new variants and strains by biologists.We propose a reassessment of the role of forecasting models in pandemic modeling.These models as currently
implemented can be used to predict the peak and decline of waves that have already been initiated and can provide value to decision-makers looking to allocate resources during an outbreak.Prediction of outbreaks beforehand however still requires "hands-on" identification of cases, sequencing, and data gathering.The lack of clean, structured, and accurate datasets also affects the performance of the case prediction models, making the estimation of patient mortality count and transmission rate significantly harder.More robust sources of data on true case numbers, variants, and immunity FIGURE depicting the peak predictions of various US-CDC models-"MUNI-ARIMA," "MUNI-VAR," "MSRA-DeepST," "OneQuietNight-ML," "prolix-euclidean," "OliverWyman-Navigator," "RobertWalraven-ESG," "SDSC_ISG-TrendModel," "QJHong-Encounter," "TTU-squider," "UCF-AEM," "SigSci-TS," "UCLA-SuEIR," "UMich-RidgeTfReg" and "UChicagoCHATTOPADHYAY-UnIT" over complete timeline.
would be useful to create more accurate models that the public and policymakers can use to make decisions.Future work can focus on framing a novel evaluation metric for measuring the time lag and direction of prediction in the short term.
FIGURE (A) Visual overlay of real case counts and predicted case counts across all waves examined.Actual case counts are shown in red, predicted counts are shown in gray with each trace representing a di erent CDC COVID-forecasting model on the sqrt plot.(B) MAPE values of US-CDC case prediction models on the complete timeline, i.e., Wave-I to IV.The y-axis is sorted descending from lowest error to highest.The color scheme represents the model category.(C) Bar graph showing the non-parametric Kruskall-Wallis test results.A Mann-Whitney test was further performed for groups with significant di erences.Category-wise error was achieved by the models both overall and wave-wise from wave-to wave-.Note that Hybrid models have a high MAPE, i.e., overall:.%, Wave : .%, Wave : .%, Wave : .%, Wave : .%).
pre-symptomatic incubation phase, employing a time-varying effective R0 to capture the temporal trend of transmission & change in response to a public health intervention.Uses permutation to quantify uncertainty.scenarios, each assuming various interventions & rates of compliance are implemented in the future.(i) Presents the weekly scenario believed to be most plausible given current observations & planned intervention policies.(ii) Current contact rates will remain unchanged in the future.Assumes relatively (iii) low transmission, (iv) moderate transmission, & (vpredictions normalized by the number of data points. is represented in observed data in 2 of 3 models, while the third assumes that interventions will change in the future.Caltech CS156 Ensemble of 14 ML models: (a) Feedforward Neural Network, (b) Quantile Neural Network, (c) LSTM, (d) Conditional LSTM, (e) Encoder-Decoder Conditional LSTM, (f) Autoregressive, (g) Sessional Autoregressive, (h) Decision Tree, (i) Gradient-Boosted Decision Tree, (j) K-NN, (k) Gaussian Process, (l) Bayesian epidemiological, (m) Two-group epidemiological, (n) Curve-fitting.Caltech Ensemble of 14 models − MIT-Cassandra (30) Based on the ensemble of predictions from four models, including (1) MDP feature representation, (2) KNN time-series, (3) Bi-LSTM time-series, (4) C-SEIRD epidemiological.
mechanistic, non-parametric forecasting model that forecasts time series as a sum of bell curves.The confidence intervals are calculated by applying a multiplicative log-Gaussian perturbation to the observed time series.

FIGURE
FIGURE MAPE values of US CDC Forecasting models in wave-I to IV. Models are sorted in descending order of MAPE.The color scheme represents the model category.Here "Baselines" are represented in red.
TABLE Machine learning US-CDC COVID-forecasting models-their description, features employed, methodology, and assumptions they make regarding public health interventions.Reproductive Number (R) and Encounter Density (D) relation in the past as a training set, (2) future D as input, and (3) ML/regression, the model predicts future R, and ultimately future daily new cases.forward RNN is used.The Seq2Seq algorithm trains the model to convert sequences from input to those in the output.The model inputs daily smoothed incident cases, deaths count, google mobility index, daily reproduction number, county demographic and health risk indices to model the baseline risk score.
TABLE Ensemble learning US-CDC COVID-forecasting models-their description, features employed, methodology, and assumptions they make regarding public health interventions.
TABLE Hybrid US-CDC COVID-forecasting models-their description, features employed, methodology, and assumptions they make regarding public health interventions.
TABLE Other US-CDC COVID-forecasting models-their description, features employed, methodology and assumptions they make regarding public health interventions.Presents two processes: first statistical model depicts how the number of COVID-19 infections varies over time while the second model correlates the number of infections with the reported data.The underlying numbers of susceptible and infected cases in the population at the preceding time step, scaled by the size of the state's initial susceptible population, are used to map the rise of new instances.These two components' weights are dynamically adjusted.Built on observations of macro-level societal and political responses to COVID-19 characterized only in terms of infections and deaths.Assumes that the actual net impact of policy actions undertaken, though not identical across time or geography, is predictable.Identifies acceptability ranges from observation data up to current time.