Estimation of Return Levels for Extreme Skew Surge Coastal Flooding Events in the Delaware and Chesapeake Bays for 1980–2019

Extreme storm surges can overwhelm many coastal flooding protection measures in place and cause severe damages to private communities, public infrastructure, and natural ecosystems. In the US Mid-Atlantic, a highly developed and commercially active region, coastal flooding is one of the most significant natural hazards and a year-round threat from both tropical and extra-tropical cyclones. Mean sea levels and high-tide flood frequency has increased significantly in recent years, and major storms are projected to increase into the foreseeable future. We estimate extreme surges using hourly water level data and harmonic analysis for 1980–2019 at 12 NOAA tide gauges in and around the Delaware and Chesapeake Bays. Return levels (RLs) are computed for 1.1, 3, 5, 10, 25, 50, and 100-year return periods using stationary extreme value analysis on detrended skew surges. Two traditional approaches are investigated, Block Maxima fit to General Extreme Value distribution and Points-Over-Threshold fit to Generalized Pareto distribution, although with two important enhancements. First, the GEV r-largest order statistics distribution is used; a modified version of the GEV distribution that allows for multiple maximum values per year. Second, a systematic procedure is used to select the optimum value for r (for the BM/GEVr approach) and the threshold (for the POT/GP approach) at each tide gauge separately. RLs have similar magnitudes and spatial patterns from both methods, with BM/GEVr resulting in generally larger 100-year and smaller 1.1-year RLs. Maximum values are found at the Lewes (Delaware Bay) and Sewells Point (Chesapeake Bay) tide gauges, both located in the southwest region of their respective bays. Minimum values are found toward the central bay regions. In the Delaware Bay, the POT/GP approach is consistent and results in narrower uncertainty bands whereas the results are mixed for the Chesapeake. Results from this study aim to increase reliability of projections of extreme water levels due to extreme storms and ultimately help in long-term planning of mitigation and implementation of adaptation measures.


INTRODUCTION
Coastal flooding poses the greatest threat to human life and is often the source of much of the damage resulting from the storm surge and waves of coastal weather systems (Blake and Gibney, 2011;Rappaport, 2014;Chippy and Jawahar, 2018;Weinkle et al., 2018). Relative sea-level rise (SLR) rates and high-tide flooding frequency and magnitude along the US East Coast have increased in recent decades and are expected to continue increasing into the near future (Sweet et al., 2017a(Sweet et al., , 2018Oppenheimer et al., 2019) with recent studies estimating mean sea levels are rising faster than predicted (Grinsted and Christensen, 2021). The US Mid-Atlantic coast is noted for especially high SLR rates (Sallenger et al., 2012;Kopp, 2013;Boon et al., 2018;Piecuch et al., 2018) and states and counties in this region view coastal flooding as one of their most severe and pervasive natural hazards to prepare for (Callahan et al., 2017;Boesch et al., 2018;Dupigny-Giroux et al., 2018). Increases in sea levels lead directly to higher frequencies of coastal flooding from high tides as well as minor and major coastal storms (Lin et al., 2016;Dahl et al., 2017;Garner et al., 2017;Rahmstorf, 2017;Sweet et al., 2017b;Muis et al., 2020;Taherkhani et al., 2020).
Many of the largest coastal flooding events along the US Mid-Atlantic coast are caused by tropical cyclones (TCs), most notably Hurricanes Isabel in 2003 and Sandy in 2012. For both the US Atlantic and Gulf Coasts, tropical cyclones are the costliest and most damaging weather and climate events (Smith, 2021). Under current global warming scenarios, atmospheric water vapor content and sea-surface temperatures (SSTs) in the North Atlantic Ocean are projected to increase, leading to an increase in the number of severe tropical cyclones, decreases in the forward translational speed, increases in wind speed, and increases in the rate of intensification, especially near the coasts (Kossin et al., 2017;Kossin, 2018;Knutson et al., 2019Knutson et al., , 2020Murakami et al., 2020;Yang et al., 2020;Wang and Toumi, 2021).
Although TCs may gather much of the attention, the threat of major coastal flooding in the region is year-round (Dupigny-Giroux et al., 2018). TCs can account for 40-60% of the top 10 flood events with higher relative percentages in the southern part of the region (Booth et al., 2016;Callahan et al., 2021b), however, the great majority of all coastal flood events, ∼85-90%, in the Mid-Atlantic come from non-tropical weather systems (Callahan et al., 2021b). East Coast winter storms, surface high pressure systems, extratropical cyclones (ETCs), and frontal systems regularly impact the region throughout the year (Hirsch et al., 2001;Thompson et al., 2013). ETCs in the Mid-Atlantic in the winter and spring are often dictated by the relative position of troughs in the westerly polar jet stream, directing low-pressure systems to travel northeastward up the coast over warmer waters, often intensifying into strong nor-Easter storms. The intensity and winds of ETCs, as well as associated beach erosion and other damages due to coastal flooding, are also projected to increase due to climate change, however projections of ETC storm tracks and landfalling TCs due to changing synoptic atmospheric patterns (i.e., "storminess") in the Mid-Atlantic is inconclusive (Hall et al., 2016;Mawdsley and Haigh, 2016;Dupigny-Giroux et al., 2018). Studies have found that US East Coast sea levels vary with synoptic oscillations (Colle et al., 2015;Wahl and Chambers, 2015;Sweet et al., 2020), leading, Rashid Md et al. (2019 to conclude that interannual and multi-decadal variability of extreme storm surge in the Mid-Atlantic was in a transition zone between more clear relationships found in the Northeast and Southeast portions of the US Atlantic Coast. Water levels in the Delaware and Chesapeake Bays, two of the largest estuaries in the US located in the Mid-Atlantic, have been well monitored for several decades by high-quality tidegauge networks, well-suited for climate studies (Holgate et al., 2013;Sweet et al., 2017a;NOAA NWLON, 2020;NOAA PORTS, 2020). This highly developed, economically critical region includes many commercial industries, vast amounts of public and private infrastructure, and provides important ecosystem services (Sanchez et al., 2012; Partnership for the Delaware Estuary (PDE), 2017; Chesapeake Bay Program, 2020). Impacts and costs associated with coastal flooding are highly dependent upon both the natural and social vulnerability, the amount of exposure, and adaptation measures in place (Hallegatte et al., 2013;Hinkel et al., 2014). Extreme coastal flooding can overwhelm protections in place and can have profound negative effects in this region, such as saltwater intrusion, loss of wetland forests and low-lying agricultural fields, beach erosion, damage to infrastructure from surge and waves, and flooding of roads and personal property putting human life at risk. Extreme events often include multiple hazards that compound the damage, leading to their net impact to be greater than the sum of its parts Moftakhari et al., 2017).
Estimating frequency and severity of extreme coastal flooding is difficult as, by definition, these events do not occur often. This lack of observational data makes it difficult to develop robust statistical or physical predictive models at the usual level of confidence although planning and design for extremes are essential to avoid the most severe consequences (Walton, 2000;Calafat and Marcos, 2020). Numerous hazard/risk assessments and flood insurance premiums rely on the FEMA 100-year (i.e., 1% annual chance) base flood elevations. However, many local decisions on infrastructure development, major capital investments, and adaptation planning require estimates of extreme flood levels at shorter-term return periods. Construction and maintenance of paved surfaces (10-20 years) and major roadways and bridges (50-70 years or more) for transportation as well as for wastewater treatment plants, residential development, dams/levees, beach replenishment, and wetland restoration are examples of projects in the region that require estimates of return level probabilities at time periods <100 years (DNREC, 2012;Johnson, 2013;Callahan et al., 2017;Delaware Emergency Management Agency (DEMA), 2018).
A common method to estimate the frequency of extremes (i.e., extreme value analysis, or EVA) is by assuming the largest values from an observational record can be modeled by a statistical distribution distinct from the parent distribution. Two families of extreme value distributions have been shown to model extreme values well: the generalized extreme value (GEV) distribution and the Generalized Pareto (GP) distribution (Coles, 2001). The GEV distribution can be fit to the set of maximum values of discrete, non-overlapping blocks within a time series, such as annual maximum values; this is termed the Block Maxima (BM) approach. Data points using this approach are evenly distributed over time, however, non-extreme data points from years with abnormally low values may be forced into the model fit, biasing the results. In contrast, the GP distribution can be fit to the upper tail of the parent distribution, i.e., the set of values that are greater than a pre-selected threshold; this is termed the Points-Over-Threshold (POT) approach. POT is a more natural interpretation of modeling extreme results although the data points may come in temporal clusters and selection of a threshold is subjective. Which approach is considered "better" is non-trivial and dependent upon the parent distribution of the data, time period, sample size, as well as the metric used to measure each model performance (Walton, 2000;Wong et al., 2020).
Numerous studies have performed EVA of total water levels (TWL) using a variety of methods along the US coastlines; a few recent examples can be found in Wahl et al. (2017), Kopp et al. (2019), Oppenheimer et al. (2019), Sweet et al. (2020), and Wong et al. (2020). TWL is an important measure of flooding, however, it is inherently influenced by location-specific tidal ranges and timing of the storm event relative to the phase of the tide whereas storm surge is generally more closely associated with the characteristics of the storm. EVA of storm surge has been performed along the US Atlantic coasts using both the BM/GEV (Grinsted et al., 2012;Sweet et al., 2014) and POT/GP (Bernier and Thompson, 2006;Tebaldi et al., 2012;U. S. Army Corps of Engineers, 2014;Booth et al., 2016;Hall et al., 2016) approaches, or comparing the two methods (Walton, 2000;Wong et al., 2020). EVA methods such as bootstrap simulations (U. S. Army Corps of Engineers, 2014; Garner et al., 2017) and global modeling (Muis et al., 2020) on storm surge have also been investigated.
The aforementioned studies defined storm surge as the maximum difference between TWL and predicted tide, often called the maximum non-tidal residual (NTR). Skew surge, however, is arguably a more accurate measure of storm surge and most appropriate for long-term planning and estimating extreme flood levels. It is defined as the difference between the maximum observed TWL and the maximum predicted tide during a tidal cycle, even if the observed and predicted tidal peaks are offset (i.e., skewed) from each other (Pugh and Woodworth, 2014). It represents the meteorologically-forced increase of water levels due to the net effect of winds, atmospheric pressure (i.e., inverse barometer effect), nearby river discharge, and wave setup, and is more clearly separated from the astronomically forced-tides and potential complex hydrodynamics of tidesurge interactions (Batstone et al., 2013;Mawdsley and Haigh, 2016;Williams et al., 2016;Stephens et al., 2020). Skew surge levels are consistently less than the measures of maximum NTR up to 30% (Hall et al., 2016;Callahan et al., 2021a).
There have been few studies of skew surge in the Mid-Atlantic. Mawdsley and Haigh (2016) analyzed long term trends of skew surge and Williams et al. (2016) investigated tide-skew surge independence, but only a few Mid-Atlantic tide gages were included in those analyses and neither performed traditional EVA on skew surges. Callahan et al. (2021a) computed skew surge at the same tide gauges as the current study but only analyzed TCs.
Specific goals of this study are two-fold. First goal is to estimate extreme skew surges within the Delaware and Chesapeake Bays and investigate sub-bay geographic differences. Many tide gauges in these bays started collecting data in the late 1970s and only recently has there been sufficient geographic coverages of gauges with records of at least 40 years of continuous hourly data. Second goal is to compare the two common traditional EVA approaches by implementing objective criteria for model parameter selection. The BM approach is enhanced to incorporate the GEVr distribution, a slightly modified form of the GEV distribution that allows for the inclusion of multiple values (the r-largest orders) per year instead of only the annual maximum (see Skew Surge Return Levels section for details), addressing the primary issue with the traditional BM approach of the low number of data incorporated in the model. It is not the intent of this paper to determine the "best" EVA approach to use in all cases, but rather to better understand the differences between them and to increase reliability of projections of extreme water levels due to storms, ultimately helping in long-term planning of mitigation and implementation of adaptation measures.

Study Region
The Delmarva Peninsula, located in the US Mid-Atlantic, is flanked on both sides by the Delaware and Chesapeake Bays (Figure 1). Tidal water levels and storm surges are influenced by the geomorphological environment, geometry of the coastline, bathymetry, bottom friction/dissipation effects, and reflection of the wave near the head of the bay . Storm surge is additionally influenced by storm size and direction of travel, duration, atmospheric pressure, wind speed and wind direction relative to the coastline (Ellis and Sherman, 2015). The Delaware Bay has a classical funnel shape, with pockets of deep scour in the wider lower bay, amplifying tidal range and storm surge in the northern regions (Wong and Münchow, 1995;Lee et al., 2017;Ross et al., 2017). The Chesapeake Bay, by contrast is longer, shallower, exhibits a more dendritic tributary landscape, and its lowest tidal ranges are toward the center (Zhong and Li, 2006;Lee et al., 2017;Ross et al., 2017). Although coastal storms threaten the region year-round, mean water levels follow a bimodal seasonal distribution with the maximum in fall (Oct) and secondary maximum in late spring (May-Jun), primarily caused by periodic fluctuations in atmospheric weather systems and coastal water steric effects (NOAA CO-OPS, 2020a). The largest coastal flood events typically occur either during peak hurricane season (Sept-Nov) or during the winter/early spring from nor'easters (Dec-Mar).

Water Level Data and Computation of Skew Surge
Tide gauges selected for this study were limited to NOAA operational tide gauges in and immediately around the Delaware and Chesapeake Bays. Requirements were that each gauge maintained nearly continuous record of hourly water levels for the time period 1980-2019, evenly located throughout the region, a set of harmonic constituents identified for making tidal predictions, and a vertical tidal datum conversion factor to North American Vertical Datum of 1988 (NAVD88). In all, 12 gauges were selected; 5 associated with the Delaware Bay and 7 with the Chesapeake (Figure 1; Table 1). All selected gauges are part of NOAA NWLON and PORTS networks. Hourly and High/Low water level data were obtained from the NOAA Center for Operational Oceanographic Products and Services (NOAA CO-OPS, 2020b). High/Low data represent the exact time and magnitude of each Higher-High, High, Low, and Lower-Low tidal peak. Hourly data represent the observed water level on each hour (e.g., 21:00, 22:00). The 40 years of hourly data at each gauge were manually inspected for errors and inconsistencies. A few small data clusters (of 2-16 h) within larger gaps of missing data were removed (on seven occasions across all gauges) and small data gaps of 1-2 h (<10 across all gauges) were filled using linear interpolation. Table 1 lists the number of data gaps that spanned 745 h (∼1 month) or greater as well as the percentage of valid hourly data points used in the analysis. Wachapreague had the largest amount of missing data due to a 2.5-year period  when valid Hourly and High/Low data were unavailable.
Skew surge was computed at each tidal peak over 1980-2019 using modeled predicted time series as reference. Total count was a maximum of 28,231 tidal peaks over the study time period, less any missing data. The observed maximum TWL at each peak was extracted from the High/Low dataset; the maximum hourly value was used if High/Low data were not available. The observed and predicted peaks were aligned within ±3 h of each other, which was extended to ±6 h if no High/Low or TWL peak alignment was found, such as due to prolonged surge; this occurred for <100 peaks over the entire study time period and only for gauges in the Chesapeake Bay.
Predicted tides were generated through Harmonic Analysis (HA) based on hourly water levels. The HA incorporated 37 tidal constituents defined by NOAA for their official tide predictions in this region (NOAA CO-OPS, 2020c) and seven tidal constituents noted by Harris (1991) relevant for the US East Coast. Computations were performed in 1-year increments (3-year increments if greater than 1 month of data were missing within a year). Annual computations minimize timing errors that can lead to the leakage of tidal energy into the non-tidal residual (Merrifield et al., 2013). It also essentially removes the SLR trend and minimizes inherent constituent biases when computed over long time periods, which could result from changing physiographical environmental conditions (Ross et al., 2017) or from changing seasonal weather patterns that strongly influence the Sa (solar annual) and SSa (solar semiannual) constituents (NOAA CO-OPS, 2007). More details on the computation of skew surge can be found in Callahan et al. (2021a).
Mean and standard deviation of skew surge, as well as maximum TWL for comparison, were computed to get a sense of the overall parent distribution. To help achieve stationarity and independence of data samples required by EVA, two further processes were performed on each gauge's time series. First, the data were linearly detrended about the 1980-2019 mean ( Table 2). Second, maximum peaks of skew surge were separated temporally by 30 h to ensure at least two high tides between each extreme event. Specifically, multiple surges above each selected threshold (defined following approaches in Block Maxima/GEVr Approach and Points-Over-Threshold/GP Approach sections) within 30 h of each other were treated as from a single event and only the maximum value was chosen.

Block Maxima/GEVr Approach
The BM approach of modeling extreme values is to select the maximum value within equal, independent blocks of time over the study period, which are usually fit to the GEV distribution. One-year blocks are commonly chosen (as in the current study) since a common ultimate goal is to estimate water levels of multiyear-based return periods for long-term planning purposes. Using the BM approach in this traditional way results in 40 data points over the years 1980-2019. The GEV distribution actually represents the combined generalized form of the Fréchet, Weibull, and Gumbel distributions, , with location parameter µ, scale parameter σ > 0, and shape parameter ξ. The shape parameter controls the shape of the tail. The second line of Equation 1 (ξ = 0) represents the Gumbel distribution and is found by taking the limit as ξ → 0. When ξ > 0 (Frechet), the tail is thicker than the Gumbel (i.e., "heavy-tailed") with no upper bound, whereas for ξ < 0 (Weibull), the distribution has a hard upper limit at µσ/ξ. Coles (2001) provides a detailed description of the BM/GEV approach.
A drawback of this approach is the limited number of data points (i.e., one per year) used to fit the model. Therefore, this method was generalized to include more than one value for each independent block of time by Weissman (1978) and later justified for use in hydrological studies, including modeling sea level extremes, by Tawn (1988). This extension of the BM approach allows for the use of the r-largest order statistics per year, permitted that r << total number of events per year. The key distinction of fitting data to the GEVr distribution, as opposed to the GEV distribution, is the choice of r. At r = 1, the GEV and GEVr are identical distributions. Since r is not a specific parameter in the GEVr probability density function, it cannot be estimated in the same way as µ, ξ , or σ .
Several orders of r were tested from 1 to 20 events per year. For each r, model parameters were estimated, and a series of hypothesis tests run. The upper limit choice of 20 was subjective but reasonable, as it would increase the number of data points significantly (20 × 40 years = 880, ∼3% of all tidal peaks over 1980-2019) while keeping r << 730, the maximum number of twice-daily skew surge events per year. Ideally, r should be large enough to include enough points to improve the robustness of the model but not large enough to introduce bias from the parent distribution and contaminating the EVD model fit.
A set of rules were developed by G'Sell et al. (2016) and furthered by Bader et al. (2017) to automate the selection of an optimum value of r. These rules are based on the sequential hypothesis tests for each r using the ForwardStop score and unadjusted p value generated from parametric bootstrap and entropy difference tests. The ForwardStop score is an adjusted p-value to control for the incremental false discovery rate, similar to a weighted mean of p-values of all tests on previous r values (Bader et al., 2017). The over-riding principal here is to start with a minimum number of data points and slowly increase the sample size until the data points do not satisfactorily fit the GEVr distribution. Following guidance provided in Bader et al. (2017), the following procedure was adopted to identify the optimum r.
1. Start with r = 1 and note the ForwardStop score from the parametric bootstrap test. Incrementally increase r by 1 until the ForwardStop score fails hypothesis test at the α = 0.05 level. If a failure occurs, that r is rejected and select the r just prior to the failed test. 2. If no r values are rejected after traversing all 20, use the ForwardStop score from the entropy difference test and repeat Step 1. 3. If no r values were rejected following Step 2, then repeat Steps 1-2 using the unadjusted p-values computed for each r instead of the ForwardStop score. 4. If no r values were rejected following Step 3, increase α to 0.10 and repeat Steps 1-3.
Using these guidelines, an optimum r was selected for each gauge.
The Goodness-of-Fit (GoF) was then tested between the Gumbel distribution (ξ = 0) fit and the Fréchet/Weibull distribution (ξ = 0) fit using the negative log-likelihood ratio test (ratio must be greater than 0.95) and Akaike Information Criterion (AIC) test (the difference in AIC score between sequential tests must be > 2, described in Burnham and Anderson, 2004). Maximum likelihood estimation (MLE) was used for all GEVr model fits.
Temporal declustering of skew surge peaks was performed on an annual basis in order for each of the r-largest orders per year to be an independent event.

Points-Over-Threshold/GP Approach
In contrast to the BM approach, the POT approach is a more natural way of statistically modeling the upper tail of a parent distribution. The entire study period is treated as a single block and the EVD includes only observations over a certain threshold value (i.e., exceedances) regardless of time the event occurred. The threshold is derived from a suitably high quantile level (e.g., 97% quantile is commonly used). Exceedances are then fit to the Generalized Pareto (GP) distribution. Like the GEV, the GP distribution represents a family of three distributions, differentiated by the model shape parameter, the CDFs of which are in Equation 2.2.
where the quantity inside the brackets y = max y , 0 , suitably high threshold µ, threshold-dependent scale parameter σ µ > 0, and shape parameter ξ. The condition is that all values of x must be larger or equal to the threshold µ. Behavior of the parameters is similar to that in the GEV. The shape parameter controls the shape of the tail. The second line of Equation 2.2 is found by taking the limit as ξ → 0, resulting in the Exponential distribution. A heavy tail occurs when ξ > 0 (Pareto distribution) with no upper bound, whereas a thinner tail and a fixed upper bound occurs when ξ < 0 (Beta distribution). Coles (2001) provides a detailed description of the POT/GP approach.
Threshold quantiles were tested from 90-99.5% exceedance probabilities in increments of 0.5% (from 1 to 20 thresholds), resulting in the maximum number of possible skew surge peaks used to model GP to be ∼2,920 (90%) to 146 (99.5%).
A threshold should be chosen to include enough upper tail exceedances that will improve the robustness of the model but not too many exceedances such that the lower values introduce bias from the parent distribution. Scarrott and MacDonald (2012) reviewed various methods on selecting the optimum threshold, including numerical tests and graphical diagnostics, such as Quantile-Quantile and Mean Residual Life plots. Many of these selection methods are subjective, timeconsuming when investigating many sites, and often result in multiple acceptable answers. Diagnostic plots were used in the current study (Supplementary Figures 1-24), however, to better compare results with BM/GEVr approach, a similar standardized methodology was employed for selecting an optimum threshold.
The rules developed by G'Sell et al. (2016) and Bader et al. (2017) were applied to automate the selection of the optimum threshold of the POT/GP approach in Bader et al. (2018). Unadjusted p-values from Anderson-Darling test were chosen in Bader et al. (2018) for threshold sequential hypothesis testing after a comparison among several other GoF tests. Although, Bader et al. (2018) recommends using ForwardStop score, based on skew surge data in the current study, ForwardStop rejects very few thresholds and the unadjusted p-values performed well in Bader et al. (2018) tests. Using the same over-riding principal here as with the BM/GEVr approach, start with the least number of data points and slowly increase the sample size until the data points do not satisfactorily fit the GP model. This is essentially working backwards, from the highest to lowest threshold, noted as the RawDown approach in Bader et al. (2018). A RawUp approach, working upwards from the minimum threshold (i.e., most data points) until a hypothesis test was accepted, was also described in Bader et al. (2018) but carries a higher chance for contaminating the EVD than the RawDown approach. Ultimately, the following rules were adopted to identify the optimum threshold.
1. Start with highest threshold percentage (99.5%) and note the unadjusted p-value from the Anderson-Darling test.
Incrementally decrease the threshold percentage by 0.5% until the GP model fit fails hypothesis test at α = 0.05 level. If a failure occurs, that threshold is rejected and select the threshold just prior to the failed test. 2. If the highest threshold (99.5%) is rejected on the first test but the second (99.0%) is not rejected, then skip the highest threshold and continue working downward until next rejection occurs. This allows for the opportunity to include more exceedances in the model and assumes the rejection occurred by chance. 3. If no thresholds were rejected following Step 2, increase α to 0.10 and repeat Steps 1-2.
Using these guidelines, an optimum threshold was selected for each gauge. Temporal declustering was performed separately for each threshold on all exceedances over the entire study period at once. Declustering therefore significantly reduced the actual number of skew surge events used in fitting the GP model by ∼30-70%.
As with the GEVr model, MLE was used for all GP model fits.

Skew Surge Return Levels
Lastly, return level (RL) skew surges were estimated for 1.1, 3, 5, 10, 25, 50, and 100-year return periods for each EVA modeling approach. A RL represents a threshold that the probability of exceedance in any 1 year is the inverse of the return period. For example, 100-year RL has a 1.0% (0.01) probability of being exceeded in any 1 year. Since the 1-year RL is undefined within the BM/GEVr approach, 1.1-year was used instead for comparison. Although probability quantiles can be easily extracted from the GP theoretical distribution using the fitted parameters, they cannot be viewed as annual probabilities of return levels, such as can be done using the BM/GEVr approach. Therefore, an estimate of the probability of a skew surge exceeding a selected threshold in a year on average must be included in RL calculations using the POT/GP approach. This is found by dividing the total number of declustered skew surge events above the selected threshold by the total number of years (40).
A qualitative review was performed on the estimated model parameters and return levels, with their 95% standard errors (SE) modeled using the selected optimum r (BM/GEVr) and threshold (POT/GP). Differences between the EVA modeling approaches and spatial variations were noted.
The harmonic analysis and tidal data processing work was done using the U-Tide package (Codiga, 2011) and standard modules in the Matlab programming environment. Temporal declustering was performed using the POT package (Ribatet and Dutang, 2019) and the EVA model fitting and RL extraction were performed using the eva package (Bader and Yan, 2020), both of the R statistical computing software environment.

Skew Surge
Mean skew surges are consistent and very close to zero across all tide gauges whereas TWL shows much larger geographic variation ( Table 2). Although differences are minor, largest skew surges (0.2 m) are at PHL and the open ocean gauges at ATL and WAC. TWL is consistently higher in the Delaware Bay than the Chesapeake Bay. Within each bay, the Delaware Bay upper regions have higher max TWL than the lower regions, whereas this pattern is reversed in the Chesapeake Bay. Spatial pattern of max TWL aligns with the Mean Higher-High Water (MHHW) and Great Diurnal Range (GT) tidal datums (Figure 3), which do not align with the spatial pattern of skew surge. Standard deviations of skew surge show slightly more geographic variation (ranging 0.14-0.19 m) with a similar spatial pattern to the max TWL and Mean Sea Level (MSL) tidal datum. Largest deviations are in the upper bays and lowest in the central Chesapeake Bay.
None of the gauges showed statistically significant trends in skew surge except for PHL, which showed a slight negative trend of ∼ −0.3 mm/year, nevertheless, all gauge time series were detrended. Note that for comparison, all gauges showed statistically significant increasing trends in max TWL consistent with local SLR rates (further analysis was not performed on max TWL within this study.) As an example of the parent vs. upper tail distributions,   Normal distribution fit the parent distribution at p < 0.01, yet significantly underestimates the empirical data in much of the upper tail, emphasizing the importance of modeling extremes of skew surge separately from the parent distribution. Figure 3 shows an example of the GEVr sequential hypothesis testing for the LEW gauge. No rejections occurred (below α = 0.05) using ForwardStop score from either the parametric bootstrap or entropy difference procedures. Starting from r = 1 and sequentially comparing the unadjusted p-values, the first rejection occurs at r = 15, resulting in the optimum r = 14. Testing for the optimum threshold in the POT/GP approach worked in the same way, albeit starting on the right side of the unadjusted p-values plot and working downward until a rejection occurs following guidance in Points-Over-Threshold/GP Approach Section. Table 3 and Figure 4 show resultant model parameters estimated after selecting the optimum r in the BM/GEVr approach at each gauge. The number of skew surge events per year that were fit to the GEVr distribution, ranges from N = 120 (r = 3 at CAP, ANN, SEW) to N = 560 (r = 14 at LEW). Both the location and scale parameters have small, consistent SE relative to their magnitude across all sites. The shape parameter is the most uncertain of all the parameters, although SE is relatively consistent across all sites. Uncertainty is inversely related to the total number of skew surge events ultimately used in the EVA after declustering; the lower the number of events, the smaller the SE. Shape parameter is positive at all sites except at WAC where it is slightly negative. Based on the negative log-likelihood ratio and AIC difference tests, none of sites favor the use of the GEVr

R is the number of largest maxima per year included in the analysis. Npks is the number of skew surge events after 30-h temporal declustering and is equal to r multiplied by the number of years of data. Location, scale, and shape are model parameters fit using maximum likelihood estimation with 95% standard error in parentheses.
Gumbel (ξ = 0) distribution over the GEVr Fréchet/Weibull (ξ = 0) distribution. Similarly, Table 4 and Figure 3 summarize the results after selecting the optimum threshold using the POT/GP approach at each gauge. Threshold percentages range from 94.5% (ATL, N = 732) to 99.0% (LEW,ANN,KIP,and SEW,N = 160,194,139,and 142,respectively.) Gauges that have the same optimum threshold still result in different total number of skew surge events due to temporal declustering. Scale parameter SE is low  while the shape parameter SE is relatively high across all sites. Shape parameter is positive at all sites except at KIP where it is slightly negative. Spatial patterns and relative uncertainties of both the scale and shape parameter estimates are generally similar between the two approaches. Supplementary Figures 1-12 (BM/GEVr) and 13-24 (POT/GP) show diagnostic plots of the model fit using the optimally selected r and threshold values at each tide gauge. Included are probability-probability (PP) and quantile-quantile (QQ) plots of the modeled vs. empirical data, and histograms overlaid with model fit PDF curve. The PP plots and histograms show good agreement between the model and observations. For most gauges, the QQ plots show a few outliers with the observed skew surge levels higher than modeled quantile estimates. The LEW gauge did not show this behavior but rather at the largest values, the modeled quantiles were larger than the observed data.

Skew Surge Return Levels
Skew surge return levels for 1.1, 3, 5, 10, 25, 50, and 100-year return periods with 90% confidence intervals (i.e., uncertainty) are shown in Table 5 for BM/GEVr and Table 6 for POT/GP. For the sake of brevity and ease of comparison, only the mean values are plotted in Figure 5. Note the more traditional continuous RL curves with confidence intervals are included in Supplementary Figures 25-26, and additionally plotted with empirical data in panel 4 of Supplementary Figures 1-24. RLs increase in a consistent manner with longer return periods at all sites under both modeling approaches. For BM/GEVr, 100-year RLs range from 1.07 m (LWS) to 1.79 m (LEW) with generally largest values starting in the lower bay regions, decreasing to a minimum in the central regions, then increasing toward the upper regions. This pattern is similar across all RLs. LEW and SEW have the largest RLs for most return periods, except for the 1.1-year return period, where the maximum RL is at ATL (although several other sites are very close). Longer return periods demonstrate more spatial variation in RLs. Using POT/GP, 100-year RLs range from 1.08 m (LWS) to 1.56 m (LEW), with approximately the same spatial pattern as with BM/GEVr.
There are few differences in RLs between approaches (Table 7; Figure 6). The most noticeable is that the 1.1-year RLs using  1,3,5,10,25,50, and 100-year return periods modeled using the BM/GEVr approach for tide gauges in the Mid-Atlantic region.   Generally, uncertainties under both approaches are very similar to each other at shorter return periods. At longer return periods in the Delaware Bay, uncertainties are smaller using POT/GP for most sites. At longer return periods in the Chesapeake Bay, generalization is more difficult; BAL (−0.44 m at 100-year) and WAC (−0.87 m at 100-year) have significantly smaller uncertainties using BM/GEVr while many other sites have smaller uncertainties using POT/GP.

DISCUSSION
The focus areas of the current study is to investigate the magnitude and geographic variation within the Delaware and Chesapeake Bays of estimated return levels of skew surge for ∼1-year to 100-year return periods and to compare the two most common traditional EVA approaches. Although skew surge is arguably one of the best and simplest measures of the meteorological drivers of coastal flooding, as it is the portion of flood depth above the high tide level, its wider use in literature has only recently gained attention. This work was done strictly through empirical data (rather than using simulated events or scenario-based projections) over the past 40 years and statistically analyzed through stationary EVA on detrended skew surges. Observational data showed minimal trends over this time period, hence results from this study should not be appreciably different than non-stationary EVA that allows for temporally varying or multivariable dependent location and scale parameters. To test the robustness of the methods, the POT/GP approach presented here was applied to max TWL and resulting return levels compared against those found in U. S. Army Corps of Engineers (2014), who performed a similar EVA using longer time periods. Seven tide gauges were analyzed in both studies: LEW, CAP, ATL, BAL, ANN, CAM, and SEW. Although they used different thresholds, detrending methods, and reference periods, the general spatial and temporal trends were consistent between the two studies, including the intervals between the 50year and 100-year RLs. Largest differences were found for longer return periods at BAL and ANN, sites that have experienced extreme coastal flooding outliers (Callahan et al., 2021b). Due to the approximate independence of skew surge to SLR and tidal phase (i.e., likely minor influences of tide-surge For example, if a structure is designed to withstand the 50year flood event, the 50-year return levels can be added to the expected high tide (either the MHWW or Highest Astronomical Tide datum) under an appropriate SLR planning scenario. It should be noted that skew surge computed in this study includes contributions from other aspects associated with extreme flooding events, such as nearby river discharge and wave setup, which ultimately likely benefits many long-term planning activities. As well, trends in skew surge were minor and mean values were within a few centimeters of zero over 1980-2019 (Table 2), however, an adjustment to skew surge RLs due to detrending could be performed. Care is warranted when making linear adjustments too far into the future due to the expected exponential increase in SLR and uncertainty in future extreme storm conditions along the Mid-Atlantic (Callahan et al., 2017(Callahan et al., , 2021b. Largest return levels across most return periods occur within the bay boundaries in the lower regions, and not in the upper regions of the bays and ocean coast sites that typically show higher surges and TWLs. Specially, LEW and SEW gauges, both located on the southwest side of the mouth of each bay, consistently show the largest RLs throughout the region. One explanation is that many large coastal flood events are associated with ETCs, often as traditional nor'easters. For these storms, the low-pressure center off the coast brings strong northeast winds, which drives enhanced surges into the bays through Ekman transport as well as direct winds piling up water on the southwest sides of the lower bays. This would be most effective in the lower Delaware Bay, where the width of the Bay reaches 45 km. The upper Delaware Bay, although it experiences large tidal ranges and increased surges (due to conical shape of coastline and from the increased volume of water entering the bay from southeasterly to easterly winds), may not experience the worst impacts from the most extreme storm events and may actually see decreases in surges from northerly winds that also occur during nor'easters. The upper Chesapeake Bay does not exhibit the same high TWL, MHHW, or surges as in the upper Delaware Bay (primarily due to the overall size, shape, and depth of the Chesapeake Bay), however, extreme skew surge RLs in both upper regions are comparable to each other. This supports results in Callahan et al. (2021a), which found the upper bays were highly correlated with each other from TC-caused skew surges, more so than with their respective lower bay regions. TCs can account for close to 40-60% of the largest (top 10) coastal flooding events in the Mid-Atlantic, with smaller relative percentages over larger number of events (Booth et al., 2016;Callahan et al., 2021b). In particular for the upper Chesapeake Bay, Hurricane Isabel in 2003 caused extreme coastal flooding compared to other events, directly influencing higher return period RLs and their uncertainties.
RLs tend to be at a minimum within each bay closer to the central regions, CAM and LWS in the Chesapeake and RDY in the Delaware Bays. These areas have the lowest mean skew surges throughout the region and typically do not experience the worst wind-driven impacts from coastal storms. Likewise, these areas also exhibit the smallest uncertainties throughout the region across many return periods.
A secondary focus of this study is to provide insight into the two most common approaches of stationary EVA applied to Mid-Atlantic skew surge. To that end, the GEVr distribution was combined with the BM approach to address the small sample size of the traditional annual max BM/GEV approach, and a standardized method for selecting optimum r and threshold was incorporated. The use of GEVr increases the robustness of the model fit and puts the number of data points more comparable to the POT/GP approach, however, there are some disadvantages of using a BM approach. Large surge events could be missed, for example, if an individual year has more major coastal flood events than the selected optimum r (i.e., false positives). At the other end, non-extreme surge events could be included, for example, if an individual year has less major coastal flood events than the selected optimum r, introducing bias from the parent distribution (i.e., false negatives). Use of the POT/GP approach circumvents these issues as it is irrespective of time, solely focused on the upper tail of the parent distribution. A potential trade-off is if the majority of extreme events occur toward either end of the study time period, direct interpretation of annual return levels from the mean number of events per year is more difficult. From review of the data used in the current study, clustering of major skew surge events occurring on either end of the time period was not present.
The choice of optimum r or threshold is a tricky problem to address. It is usually a subjective process, including graphical and numerical diagnostic information, and choosing among multiple appropriate candidates. The current study incorporates a standardized methodology of sequential hypothesis texting that can be applied to all sites simultaneously while allowing for variable r/threshold selection per site. Although stopping rules and goodness of for tests are still subjective within this methodology, they are data-driven, based on statistical results from Bader et al. (2017Bader et al. ( , 2018. Choice of stopping rules influences the number of data points (r-largest orders or threshold exceedances), and hence, directly influence the uncertainty in model parameters. Uncertainty in RLs do not consistently show strong dependence on the number of peaks included in the model fit. This potential relationship of RL uncertainty and optimum r/threshold should be explored further in future work.
Changes in storm frequency and intensity ("storminess"), either observed or projected due to climate change, were not addressed in this study. As stated above, skew surge is closely related to the meteorological characteristics driving the flooding and relatively independent of SLR or tides.
Trends in skew surge are therefore influenced by oscillations and trends in oceanic-atmospheric circulation patterns that support enhanced cyclogenesis or steer storms along the coast. Common teleconnections associated with the frequency or magnitude of surges in the Mid-Atlantic region include the North Atlantic Oscillation, Pacific-North American oscillation, El Nino/Southern Oscillation, and Atlantic Meridional Overturning Circulation, several of which have been included as covariates in non-stationary EVA or joint-probability models of surges in recent years (Ezer et al., 2013;Sweet et al., 2014;Hamlington et al., 2015;Wahl and Chambers, 2015;Kopp et al., 2019;Little et al., 2019;Rashid Md et al., 2019). The 40-year time period of the current study is long enough to capture several oscillations of many of these teleconnections, essentially averaging out their influence. Extreme RLs then can be viewed as based on relative average synoptic conditions, however, the probability of occurrence of an extreme surge event in any single year is dependent upon the presence and strength of teleconnection patterns, and assessment should be performed as near to the year in question as possible.
Other aspects of this study could have influenced extreme surge estimates. Most notably is the length of the data record, as is usually the case in EVA. Although there exists general agreement with results in U. S. Army Corps of Engineers (2014), 40 years of data to estimate 100-year RL is not ideal. Comparing EVA results on skew surge using the methods presented in the current study on a subset of gauges with much longer records, perhaps one gauge per sub-bay region, could offer insight into the robustness of the current study statistical results. Additionally, the set of 44 constituents used in the HA computation of the predicted tide may not capture all the tidal oscillations present at every site, thereby impacting the magnitude of skew surge (albeit these changes likely would be minimal). The choice of 30 hours was subjective and may not be optimum at all sites to separate individual skew surge events, although it is rare for a single storm event to reach extreme surge levels multiple times separated by two or more high tides.

CONCLUSION
Understanding extreme events is important because of their potential for disproportionate damage and threat to public health, which will aid in mid-to long-term planning of many coastal communities and critical ecosystems along their shores. Across most of the return periods, the largest return levels occur at LEW and SEW, which are within the bay boundaries in the southwest side of the lower regions of their respective bay. Likewise, minimum return levels occur near the central regions of each bay, at RDY and LWS. This may seem counterintuitive for the Delaware Bay where the upper bay (PHL) experiences high water levels and tidal ranges but is consistent with wind and ocean current patterns for offshore storms in this region.
Determination of which approach is "best" for modeling extreme skew surge events in the Mid-Atlantic is not a goal of the current study. Nevertheless, differences between the approaches are highlighted and some general recommendations can be made.
Overall, both approaches provide similar results. Confidence in model parameters is good and consistent across all sites between both approaches, with narrow confidence intervals for the location and scale parameters. Confidence is less for shape parameter but is generally the same for both approaches. Not many differences in magnitude of RLs exist, especially for 3-year to 100-year return periods, which helps justify comparisons of extreme levels of surge among previous EVA studies in this region (at least for skew surge).
For the 1.1-year return period, the POT/GP approach provides more consistent values in respect to other return periods across both bays. This is likely due to the effects of estimating RLs from the GEVr distribution close to 1-year. For the Delaware Bay at longer return periods, the POT/GP also seems to perform well (lower uncertainty) at most sites and therefore could be used at all return periods up to 100 years. This finding is supported by U. S. Army Corps of Engineers (2014) statements that traditional BM/GEV approach tends to overestimate and have larger uncertainties when compared to POT/GP. Recommendations are more mixed for the Chesapeake Bay for return periods at 3-year and above. Results at ANN and LWS are nearly identical for both approaches. For KIP and SEW, sites in the lower Chesapeake Bay, lower uncertainties and slightly lower RLs tend toward the POT/GP approach. Conversely, BAL (upper region) and WAC (ocean coast) tend toward the use of the BM/GEVr approach.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: the dataset of extreme skew surge return levels, estimated using both BM/GEVr and POT/GP approaches, at each tide gauge analyzed for this study can be found at the figshare repository, https://figshare.com/ account/home#/projects/101024.

AUTHOR CONTRIBUTIONS
JC and DL conceived the idea of investigating extreme coastal flood levels in the U.S. Mid-Atlantic Delaware and Chesapeake Bays. JC designed the analysis framework, obtained all the necessary data, performed the statistical analysis, and wrote the manuscript. DL helped with results interpretation and manuscript revisions. All authors contributed to the article and approved the submitted version.

ACKNOWLEDGMENTS
The authors would like to thank the National Oceanic and Atmospheric Administration for providing a large database of water level data for public use. The authors would also like to thank Dr. Daniel L. Codiga, Graduate School of Oceanography, University of Rhode Island, and Dr. Richard Pawlowicz, Department of Earth, Ocean and Atmospheric Sciences, University of British Columbia, for their help in understanding the operation of U-Tide and performing harmonic analysis.

BENEFIT OF RESEARCH TO SCIENTIFIC COMMUNITY
Extreme storm surges can overwhelm many coastal flooding protection measures in place and cause severe damages to private communities, public infrastructure, and natural ecosystems, particularly in the Mid-Atlantic region. However, by definition, extreme events are rare and difficult to model due to small sample sizes. Results from this study provide estimates of extreme skew surge, a less often studied but robust measure of the meteorological component of coastal flood levels, for return periods of 1.1 to 100 years. This study also aims to increase understanding and reliability of projections of extreme water levels using methods commonly found in the literature and ultimately help in long-term planning of mitigation and implementation of adaptation measures.