Sensitivity of Skill Score Metric to Validate Lagrangian Simulations in Coastal Areas: Recommendations for Search and Rescue Applications

Search and rescue (SAR) modeling applications, mostly based on Lagrangian tracking particle algorithms, rely on the accuracy of met-ocean forecast models. Skill assessment methods are therefore required to evaluate the performance of ocean models in predicting particle trajectories. The Skill Score (SS), based on the Normalized Cumulative Lagrangian Separation (NCLS) distance between simulated and satellite-tracked drifter trajectories, is a commonly used metric. However, its applicability in coastal areas, where most of the SAR incidents occur, is difficult and sometimes unfeasible, because of the high variability that characterizes the coastal dynamics and the lack of drifter observations. In this study, we assess the performance of four models available in the Ibiza Channel (Western Mediterranean Sea) and evaluate the applicability of the SS in such coastal risk-prone regions seeking for a functional implementation in the context of SAR operations. We analyze the SS sensitivity to different forecast horizons and examine the best way to quantify the average model performance, to avoid biased conclusions. Our results show that the SS increases with forecast time in most cases. At short forecast times (i.e., 6 h), the SS exhibits a much higher variability due to the short trajectory lengths observed compared to the separation distance obtained at timescales not properly resolved by the models. However, longer forecast times lead to the overestimation of the SS due to the high variability of the surface currents. Findings also show that the averaged SS, as originally defined, can be misleading because of the imposition of a lower limit value of zero. To properly evaluate the averaged skill of the models, a revision of its definition, the so-called SS∗, is recommended. Furthermore, whereas drifters only provide assessment along their drifting paths, we show that trajectories derived from high-frequency radar (HFR) effectively provide information about the spatial distribution of the model performance inside the HFR coverage. HFR-derived trajectories could therefore be used for complementing drifter observations. The SS is, on average, more favorable to coarser-resolution models because of the double-penalty error, whereas higher-resolution models show both very low and very high performance during the experiments.

Search and rescue (SAR) modeling applications, mostly based on Lagrangian tracking particle algorithms, rely on the accuracy of met-ocean forecast models. Skill assessment methods are therefore required to evaluate the performance of ocean models in predicting particle trajectories. The Skill Score (SS), based on the Normalized Cumulative Lagrangian Separation (NCLS) distance between simulated and satellite-tracked drifter trajectories, is a commonly used metric. However, its applicability in coastal areas, where most of the SAR incidents occur, is difficult and sometimes unfeasible, because of the high variability that characterizes the coastal dynamics and the lack of drifter observations. In this study, we assess the performance of four models available in the Ibiza Channel (Western Mediterranean Sea) and evaluate the applicability of the SS in such coastal risk-prone regions seeking for a functional implementation in the context of SAR operations. We analyze the SS sensitivity to different forecast horizons and examine the best way to quantify the average model performance, to avoid biased conclusions. Our results show that the SS increases with forecast time in most cases. At short forecast times (i.e., 6 h), the SS exhibits a much higher variability due to the short trajectory lengths observed compared to the separation distance obtained at timescales not properly resolved by the models. However, longer forecast times lead to the overestimation of the SS due to the high variability of the surface currents. Findings also show that the averaged SS, as originally defined, can be misleading because of the imposition of a lower limit value of zero. To properly evaluate the averaged skill of the models, a revision of its definition, the so-called SS * , is recommended. Furthermore, whereas drifters only provide assessment along their drifting paths, we show that trajectories derived from high-frequency radar (HFR) effectively provide information about the spatial distribution of the model performance inside the HFR

INTRODUCTION
Search and rescue (SAR) operators need the most accurate met-ocean forecast models to respond the most effectively to an emergency. In case of incident at sea, they run trajectory models, mainly based on Lagrangian discrete particle algorithms, to predict the drift of their target induced by the effect of ocean currents, waves, and winds and define a search area (Breivik et al., 2013;Barker et al., 2020). The skill of drift prediction is highly dependent on the accuracy of the met-ocean forecast data used to advect the Lagrangian model. In order to cover diverse spatiotemporal scales and guarantee near-real-time availability, different forecast products are made available to SAR operators. When all model predictions agree (i.e., their simulated trajectories are similar), there is a high level of confidence in the prediction, and the search area is reduced. But if different forecast models result in disparate trajectories, there are multiple viable outcomes, and the level of confidence decreases. Therefore, SAR operators need skill assessment methods to assess, within the shortest possible time, which model is likely to give the most accurate prediction at the moment and in the region of the incident.
Discrepancies between ocean forecast systems can be attributed to different numerical models, horizontal and vertical grids, domains, forcing mechanisms (meteorological, tidal, etc.), or data assimilation schemes. Field observations such as satellite-tracked drifters, satellite observations, moorings, and high-frequency radar (HFR) surface currents can all be used to estimate errors in the forecast outcomes. Usually, trajectories from drifter observations are used to evaluate the drift prediction accuracy (e.g., Vastano and Barron, 1994;Smith et al., 1998;Thompson et al., 2003;Price et al., 2006;Barron et al., 2007;Brushett et al., 2011). Virtual particles are launched where satellite-tracked drifters are observed, and the separation distance between simulated and observed trajectories is evaluated during a determined forecast time. This Lagrangian separation distance is a direct measure of trajectory model skill: the smaller the separation distance, the better the model skill. However, Liu and Weisberg (2011) have shown that this metric fails to indicate the relative model performance in areas characterized by different dynamics. Indeed, the separation distance tends to be larger in regions of strong currents than in regions of weak currents.
To overcome this, Liu and Weisberg (2011) proposed a new Skill Score (SS) based on the cumulative Lagrangian separation distance normalized by the cumulative observed trajectory length (i.e., the cumulative distance traveled by the observed drifter). The SS is a dimensionless index ranging from 0 to 1: the higher the SS value, the better the model performance. After being applied in the context of the Deepwater Horizon oil spill (Liu and Weisberg, 2011;Mooers et al., 2012;Halliwell et al., 2014), this metric has been one of the most widely used statistics for trajectory evaluation. The SS has been used to evaluate different parameterizations in operational oil spill trajectory models (Ivichev et al., 2012;Röhrs et al., 2012;De Dominicis et al., 2014;Berta et al., 2015;Wang et al., 2016;French-McCay et al., 2017;Janeiro et al., 2017;Chen et al., 2018;Zhang et al., 2018;Tamtare et al., 2019), to assess the impact of data assimilation in the model's Lagrangian predictability (Sperrevik et al., 2015;Phillipson and Toumi, 2017), to estimate the accuracy of the gap-filled method for HFR data (Fredj et al., 2017), to test the ability of ocean models in simulating surface transport (Sotillo et al., 2016), and to evaluate the relative performance of ocean models and HFR surface currents in predicting trajectories for SAR operations (Roarty et al., 2016(Roarty et al., , 2018. In the context of SAR, the SS is indeed of particular interest, as it provides information in a user-friendly way, by means of an easily interpretable metric. However, to the best of our knowledge, a sensitivity analysis of this metric to the length of the time window during which trajectories are compared is still lacking. The aforementioned studies all used different forecast horizons, ranging from 9 h (Fredj et al., 2017) to 5 days (Pereiro et al., 2018), without discussing the SS sensitivity to the choice of this time window. Liu and Weisberg (2011) applied this method with a forecast horizon of 1, 3, and 5 days and showed that there are more uncertainties for shorter time simulations (e.g., 1 day), especially in the continental shelf region (weak current region), because of small values of the trajectory length. However, in SAR operations, the emergency response has to be given in a shorter time frame (few hours). Moreover, most SAR cases occur in coastal areas, within 40 km offshore, and almost 50% within less than 3 km offshore (statistical analysis from the data shown in Figure 1), where small-scale dynamics often lead to complex trajectories, mainly due to intense inertial oscillations (Millot and Crépon, 1981;Salat et al., 1992;Tintoré et al., 1995), sea-land breeze-induced currents, or semidiurnal tides. These trajectories may lead to the convergence of the simulated and observed trajectories after a certain time, which could result in an overestimation of the SS. The determination of the appropriate forecast time is therefore of primary importance, and the dependence of the SS to the forecast horizon needs to be examined.
At the time of an emergency, SAR operators need to receive information about the reliability of the different ocean model predictions in the most efficient way, to make decisions as quickly as possible. This information should therefore be understandable at first glance. An effective way to display this information could be by means of a model ranking based on the averaged model performance over the area of interest. This can be addressed by computing the spatiotemporal average of SS considering all available drifter observations during a specific period of observations (as in Röhrs et al., 2012;De Dominicis et al., 2014;Roarty et al., 2016;Sotillo et al., 2016;French-McCay et al., 2017;Phillipson and Toumi, 2017;Roarty et al., 2018). However, we show in this article that averages of the SS should be applied and interpreted with caution. Moreover, drifter observations only provide an estimation of the model's performance along their drifting paths, i.e., over the area and during the time period when they are available. The reliability of this method thus strongly depends on the availability of trajectory observations at the location and at the moment of the incident, or shortly after it (near-real-time evaluation). As this is not always possible because of the scarcity of drifter observations in coastal regions (Tintoré et al., 2019), most of the time there is no alternative but to use the latest observations available in the region, which are not necessarily representative of the dynamics occurring at the moment of the incident. However, in coastal regions, HFR surface currents are now available over wide areas (Figure 1) and provide continuously gridded velocities at the resolution required to tackle small-scale processes. Trajectories derived from HFR could therefore be used to complement drifter observations for evaluating the ocean model predictions in these areas.
In this study, we assess the skill of four ocean forecast systems available in the Ibiza Channel (Western Mediterranean Sea) and evaluate the applicability of the SS in such coastal risk-prone regions seeking for a functional implementation in the context of SAR operations. The surface circulation in this region, mainly driven by local winds (Lana et al., 2016), is highly variable over a wide range of temporal and spatial scales (Robinson et al., 2001;Pinot et al., 2002;Ruiz et al., 2009) and strongly influenced by the intense frontogenesis driven by the confluence of the Liguro-Provençal-Catalan Current and the Atlantic inflow (García-Lafuente et al., 1995;Heslop et al., 2012;Sayol et al., 2013;Hernández-Carrasco et al., 2018a. We use 22 drifters deployed during three experiments carried out by the Balearic Islands Coastal Observing and forecasting System (SOCIB) and analyze the sensitivity of the SS to different forecast horizons ranging from 6 to 72 h. These experiments were conducted during different seasons, under distinct oceanic conditions, allowing to evaluate the SS sensitivity under diverse dynamical scenarios. Furthermore, we evaluate the validity of applying averages to the SS for comparing the relative average performance of the models, and make recommendations for obtaining an accurate model ranking with this method by introducing the novel SS * . Finally, addressing the sparseness of drifter observations that reduces the robustness of the methodology and our capacity to assess the skill of operational systems, we evaluate the use of HFR-derived trajectories for assessing the model performance.
The article is organized as follows. Section "Data and Models" presents the multiplatform observations used for the assessment and the ocean forecast systems evaluated. In section "Model Skill Assessment Methodology, " we provide a description of the methodology developed by Liu and Weisberg (2011) and the skill assessment design applied in this article. The results are presented and discussed in section "Results" and the conclusions are drawn in section "Summary and Discussion."
On September 30, 2014 (Tintoré et al., 2014), 13 drifters were released at four different locations in the Ibiza Channel (Figure 2), as described by Lana et al. (2016). Most of the drifters went northward, trapped by the Balearic current along the northern coast of Ibiza, whereas two drifters went northeastward and crossed the narrow central isthmus between Ibiza and Formentera. Those drifters were of different types (Figure 3): four CODE (cylindrical hull of 10 cm in diameter with a drogue consisting in four sails made of plasticized nylon arranged in a cross-like shape, Davis, 1985), five MDO3i drifters (cylinder of 10 cm in diameter and 32 cm in length, with only 8 cm above the sea surface, with a drogue attached at 0.5-m depth, Callies et al., 2017), and four ODI (spherical hull of 20 cm in diameter and low weight (3 kg), with a drogue of 2 kg attached at 0.5-m depth, Callies et al., 2017). All these configurations ensure that the drifter's path represents the current within the first meter of the water column, and we indeed do not find clear differences in our results in terms of drifter types (not shown).
On July 28, 2016 (Reyes et al., 2020c), four ODI drifters were launched at four different locations (Figure 2). That time, all drifters went northward, trapped in the Balearic Current, and got outside the Ibiza Channel in less than 3 days. After 2 days, one of the drifters stopped emitting data (after July 29, 22:00), so we can only consider its trajectory during the first 2 days. Consequently, its trajectory cannot be considered for forecast times longer than 48 h. To ensure a correct comparison of the models at different forecast times of up to 3 days, this drifter is excluded from the sensitivity analysis.
On November 15, 2018 (Reyes et al., 2020b), five CARTHE drifters (Novelli et al., 2017;D'Asaro et al., 2018) were released within a circular area of 2 km in diameter, offshore of Formentera Island (Figure 2). CARTHE-type drifters are biodegradable drifters (Figure 3) and are representative of the upper 0.6 m. In this experiment, the drifters were entrapped in small-scale circulation patterns. During the first week, all drifters followed the same path and went eastward and then northward. Then, two drifters went westward and then took opposite directions, whereas the three other drifters got trapped in eddies in the northern part of the Ibiza Channel.
As the purpose of our article is not to evaluate the Lagrangian model simulation, we do not consider the errors in drifter observations (e.g., O'Donnell et al., 1997). However, qualitycontrol tests have been applied, following the procedure of the SOCIB-Data Centre Facility (Ruiz et al., 2018). CODE drifters provide positions every 10 min, whereas MDO3i and ODI drifters provide positions not evenly distributed in time, every 5-15 min. Linear interpolation is therefore applied to the drifter positions in order to work with equally spaced hourly time series.
In the case of an emergency, SAR operators need to know the accuracy of the model predictions in an area of up to a few kilometers around the location of the incident. Moreover, the ocean being highly variable in both space and time, the evaluation of ocean models needs to be performed over a reduced area and period. For this reason, and to ensure the availability of multiplatform observations, we focus on the region around the (Figure 2), where an HFR system is available. Moreover, in 2014 and 2018, we only consider the first 14 days of the experiment, including 88% and 76% of the original dataset, respectively. However, the same conclusions are reached when considering the whole trajectories of the drifters, which extend further north over the Balearic Sea.

High-Frequency Radar System of the Ibiza Channel
High-frequency radar is a shore-based remote-sensed technique that relies on the Bragg resonance phenomenon (Barrick et al., 1977). The HFR of the Ibiza Channel is part of the multiplatform observing system operated by SOCIB (Tintoré et al., , 2019). The HFR system consists of two CODAR SeaSonde radial stations, transmitting at a center frequency of 13.5 MHz with a bandwidth of 90 KHz (Lana et al., 2015(Lana et al., , 2016. Hourly radial velocity maps (i.e., velocities toward or away from one antenna) from both stations are combined to form vector surface current velocities on a regular 3 × 3 km grid, with a range of up to 65 km from the shore (Tintoré et al., 2020).
The increased availability of continuous high-resolution HFR surface currents is providing new possibilities for the application of Lagrangian methods to understand small-scale coastal transport processes (Berta et al., 2014;Solabarrieta et al., 2016;Rubio et al., 2017;Hernández-Carrasco et al., 2018b). However, hardware and software failures or environmental conditions can result in incomplete spatial coverage or periods without data. This particularly occurs at the baseline which joins the two stations and at the outer-edge domain, where the Geometrical Dilution of Precision (GDOP) error is higher Barrick, 2006).
With the purpose of obtaining HFR-derived Lagrangian trajectories (see section "Use of HFR-Derived Trajectories"), the generation of HFR gap-filled products are required (Solabarrieta et al., 2016;Hernández-Carrasco et al., 2018a,b). To do this, we use the Open-Boundary Modal Analysis (OMA) method, implemented by Lekien et al. (2004) and further optimized by Kaplan and Lekien (2007), using the modules of the HFR Progs MATLAB package 1 . OMA is based on a set of linearly independent velocity modes that are fit to the radial data. These modes (189 in our case) describe all possible current patterns inside a two-dimensional domain (considering the open boundaries and the coastline). OMA considers the kinematic constraints imposed by the coast by setting a zero normal flow. Depending on the constraints of the methodology, it can be limited in representing localized small-scale features, as well as flow structures near open boundaries. Also, difficulties may arise when the horizontal gap size is larger than the minimal resolved length scale (Kaplan and Lekien, 2007) or when only data from one antenna are available, which was not the case during the periods considered here. To avoid the occurrence of unphysically fitted currents, gaps are filled only if they are smaller than the smallest spatial scale of the modes (Kaplan and Lekien, 2007).

Global and Regional Ocean Models
Four ocean forecast systems available operationally in the Ibiza Channel are evaluated: the global model (Lellouche et al., 2018) from the Copernicus Marine Environmental Monitoring Service (CMEMS) 2 , the two CMEMS regional models of the IBI (Iberia-Biscay-Ireland; Sotillo et al., 2015) and MED (Mediterranean) regions (Simoncelli et al., 2014;Clementi et al., 2017), and the regional model of the Western Mediterranean Operational (WMOP) Forecasting System from SOCIB (Juza et al., 2016;Mourre et al., 2018). Table 1 summarizes the main characteristics of these models. They will be referred hereinafter to as GLO, IBI, MED, and WMOP.
IBI and MED are nested inside GLO, whereas MED is the parent model of WMOP. CMEMS models are forced with 3-hourly atmospheric fields from the European Center of Medium Weather Forecast (ECMWF) Integrated Forecast System, whereas WMOP was forced for the periods considered in this study by 5-km resolution HIRLAM model atmospheric fields (3-hourly in 2014 and 2016 and hourly in 2018) from the Spanish Meteorological Agency (AEMET). All these systems are run operationally, producing several-day forecasts of a number of physical parameters, including ocean currents. In this article, the historical best estimates of 24-h-forecast horizon are used, except in 2014 when CMEMS forecast estimations were no longer available and their reanalysis is used. While GLO and MED include data assimilation from satellite altimetry and Argo floats over the whole period of study, IBI and WMOP prediction systems have only implemented data assimilation in March 2018 and November 2018, respectively. The drifter trajectories are not assimilated in any of the models. All models include river runoff, and none of the models includes tidal forcing except IBI, with 11 tidal harmonics (M2, S2, N2, K1, O1, Q1, M4, K2, P1, Mf, and Mm) built from FES2004 (Lyard et al., 2006) and TPXO7.1 (Egbert and Erofeeva, 2002) tidal model solutions.

MODEL SKILL ASSESSMENT METHODOLOGY Normalized Cumulative Lagrangian Separation Distance and SS
The Normalized Cumulative Lagrangian Separation (NCLS) distance, developed by Liu and Weisberg (2011), is defined as the cumulative summation of the separation distance between simulated and observed trajectories (i.e., the cumulative separation distance, referred to as D), weighted by the length of the observed trajectory accumulated in time (i.e., the cumulative observed trajectory length, referred to as L), as follows: where d i is the separation distance between observed and simulated trajectories at time step i, lo i is the length of the observed trajectory at time step i, and N is the forecast horizon. All distances are calculated using the plane sailing formula.
In contrast with the simple separation distance, the NCLS is able to estimate the relative performance of a model in strong and weak current regions. However, as a measure of trajectory model performance, the NCLS is counterintuitive to the conventional model skill scores (e.g., Willmott, 1981;Liu et al., 2009). Indeed, the smaller the NCLS value, the better the model performance. Thus, Liu and Weisberg (2011) proposed a new SS defined as follows: where n is a nondimensional, positive number that defines the tolerance threshold for no skill. This threshold corresponds to the criterion that the simulation is considered as having no skill (i.e., SS = 0) if the growth of the predictability error is larger than n times the mean flow displacement, i.e., D ≥ n × L. Larger (smaller) n values correspond to lower (higher) requirements to the model. In general, n is set to 1 (e.g., Röhrs et al., 2012;De Dominicis et al., 2014;Berta et al., 2015;Roarty et al., 2016). However, some studies selected the n value heuristically, in such a way that a small percentage of the SS values are negative and set to zero (e.g., n = 2 in Sotillo et al., 2016). Hence, the formulation of the SS contains two preliminary defined parameters: the forecast horizon N, and the threshold n. In this study, we address the importance of the proper choice of both parameters, based on a sensitivity analysis.

Design of the Skill Assessment Method
Virtual particles are launched hourly along the successive positions of the observed drifters and for a forecast horizon of N = 72 h. Pairs of observed and simulated trajectories are then compared hourly at each time step i = 1, 2, . . . N, and the SS is computed as a function of N. To ensure an exact comparison at different forecast times, the exact same temporal length of the observed trajectories is considered for all forecast times (i.e., the observed trajectories minus the last 72 h of observation).
The virtual trajectories are computed using the COSMO Lagrangian model described by Jimeìnez Madrid et al. (2016), which is a free software available in GitHub repository 3 (version from June 5, 2019, doi: 10.5281/zenodo.3522268). COSMO consists of a fifth-order Runge-Kutta advection scheme (RK5) with a time step of 1 h, with a bicubic interpolation in space and a third-order Lagrange polynomial interpolation in time, and does not include diffusion, Stokes drift, or beaching. This article focuses on the methodology for comparing the performance of different ocean models in predicting trajectories, but does not focus on the performance of the trajectory model itself.
For this reason, we use a very simple trajectory model and do not consider the errors in Lagrangian modeling (e.g., van Sebille et al., 2018). However, it has been verified that the trajectories simulated by COSMO using WMOP currents agreed with the ones simulated using CDrift (Sayol et al., 2014), as shown in Reyes et al. (2020a).

Spatiotemporal Distribution of the SS
We start analyzing the spatiotemporal distribution of the SS of each model considering the case of a high skill model requirement (n = 1) and two typical time horizons for model predictions (N = 72 h and N = 6 h). Figure 4 shows that the SS at 72 h exhibits high spatiotemporal variability, with very different values, depending on the model and the period considered.
In 2014, WMOP outperforms the other models in the region of strong frontal dynamics along the northern coast of Ibiza (SS > 0.6, Figure 4D). On the contrary, GLO shows no skill in this region (SS ∼ 0, Figure 4B), but obtains the best performance among models in the south of the Ibiza Channel, and shows similar skill to WMOP in the northeastern side of the Ibiza-Formentera Islands (SS ∼ 0.4 − 0.8, Figures 4B,D). IBI also outperforms its parent-model GLO along the northern coast of Ibiza ( Figure 4A), but shows similar or even lower skill elsewhere. In the Ibiza Channel, WMOP shows both instances of low (SS ∼ 0) and very high performance (SS > 0.8, Figure 4D), revealing the high temporal variability of the model predictive capability. Indeed, low SS values are obtained over the positions observed at the beginning of the experiment (September 30 to October 2, 2014; see Figure 2), whereas very high SS values (the best among models) are obtained a few days later, on October 6-8, 2014. We come back to this point in section "The Averaged SS".
In 2016, all models perform better than in 2014 and 2018, probably because all drifters were entrapped in the densitydriven Balearic current (Ruiz et al., 2009;Mason and Pascual, 2013), which is generally better represented in the models. However, spatial differences are also observed: WMOP shows high performance north of 38.8 • N (SS > 0.7) but lower performance in the southern part (SS ∼ 0.3, Figure 4H), in agreement with past studies that have shown that WMOP has a larger bias in the southern part of the Ibiza Channel (e.g., Aguiar et al., 2020). On the other hand, MED shows higher performance in the Ibiza Channel (SS > 0.7) than in the northern part (SS ∼ 0.4, Figure 4G). Similarly, GLO exhibits lower SS values in the westernmost part (Figure 4F), and IBI has lower skill in the vicinity of Ibiza (Figure 4E).
In 2018, it is striking that GLO shows the highest SS over the entire region (SS ∼ 0.3 − 0.7, Figure 4J), despite its lower spatial resolution compared to the other models (Table 1). On the opposite, its child-model IBI shows the lowest performance (SS < 0.4 in most cases, Figure 4I). In some areas, WMOP obtains very high SS [e.g., in the northern part, where drifters got trapped in an eddy (Figure 4L)], being able to improve its parentmodel MED (Figure 4K). However, WMOP also obtains lower SS than MED in other regions (e.g., on the westernmost part of the trajectories).
We now compute the SS with a forecast horizon of N = 6 h (Figure 5), which is more consistent with the duration of the search that maximizes survivors in SAR missions (US Coast Guard Addendum, 2013). Although the regional differences observed in Figure 4 are still present (e.g., WMOP shows better performance along the northern coast of Ibiza than in the southern part of the Ibiza Channel in both 2014 and 2016, Figures 5D,H), the SS exhibit lower values and more irregular patterns. The case of GLO in 2018 is still striking: whereas GLO clearly exhibits the best performance with a forecast horizon of 72 h (right panels of Figure 4), its performance is much lower at 6-h forecast (Figure 5J vs. Figure 4J), providing only slightly higher SS than the other models (right panels of Figure 5).

Sensitivity to the Forecast Horizon
To further understand the SS dependence on forecast horizon, we analyze the cumulative separation distance D, the cumulative observed trajectory length L, the NCLS, and the SS obtained along each drifter path, for each model, and for different forecast horizons N. Since GLO in 2018 exhibits a striking rise of the SS, we show in Figure 6 this analysis for GLO along the path of one drifter in 2018. However, similar results are obtained for all drifters and all models (not shown).
As observed in Figure 6A, L is, on average, always larger than D, and L grows with advection time at a much higher rate than D: whereas they are of the same order of magnitude after 6 h of simulation (D ranges from 0.7 to 26 km with a mean value of 8.8 km, and L ranges from 1.7 to 35 km with a mean value of 11.3 km), L is almost twice D after 72 h of advection (D ranges from 245 to 1403 km with a mean value of 753 km, whereas L ranges from 694 to 2,220 km, with a mean value of 1,372 km). In other words, the cumulative absolute dispersion of the observed drifter grows much faster than the cumulative separation distance. Figure 6B shows the evolution of the NCLS distance (Eq. 1). At 6-h forecast, the NCLS shows spikes, which are much less pronounced after 24-h forecast, and nonexistent after 72 h. Those spikes, as already noted by Liu and Weisberg (2011) for a 1-day simulation, are due to the sudden decrease of the trajectory length L compared to the cumulative separation distance D (Figure 6A). Particularly, they are related to small-scale dynamical features that induce slow currents and consequently a small displacement of the observed drifter and that are not well represented by the models, such as inertial oscillations. However, removing the inertial oscillations applying a 40-h low-pass filter leads to even smaller values of L and consequently more occurrences of NCLS spikes (not shown). As L becomes much higher than D as the forecast time increases, there are more situations of significantly large NCLS values at shorter forecast time than at longer forecast time ( Figure 6B). Considering n = 1, SS = 0 if NCLS > 1. As the NCLS spikes are higher and more frequent at shorter forecast times, the SS get more zero values and is thus more irregular after 6-h forecast than after 24 or 72 h ( Figure 6C). Figure 6B is the decrease of the NCLS distance with forecast time in most cases, dropping from an average of 0.84 at N = 6 h to 0.55 at N = 72 h, resulting in the increase of the SS (Figures 6C,E,F,G), rising from an average of 0.25 at N = 6 h to 0.45 at N = 72 h. This tendency is directly related to the higher rate of L with respect to D, as seen before (Figure 6A), and can result from two different scenarios. First, this can result from the convergence of the simulated and observed trajectories as the forecast time increases, inducing a decrease of the separation distance. For instance, during the first days of the experiment, GLO simulates northwestward trajectories, whereas the observed drifter goes eastward then northward and then westward ( Figure 6D). The simulated trajectories launched during the first hours of the experiment thus finally converge with the observed drifter path after a certain time. As a result, the SS increases from SS ∼ 0.1 at 6-h forecast to SS ∼ 0.5 at 72-h forecast.

Another observation from
Second, the higher rate of L with respect to D can result from rapidly evolving local structures such as small-scale eddies that induce a large displacement of the observed drifter. As an example, we show in Figure 7A a drifter trajectory between November 25, 2018, 19:00 and November 28, 2018, 19:00, and the corresponding simulations from the four models. While GLO performs the best during the first 12 h and thus obtains the highest SS (Figure 7D), its simulated trajectory deviates from the observed one during the following hours. However, its SS increases from SS ∼ 0.4 at 12-h forecast to SS ∼ 0.6 after 72 h because the distance traveled by the drifter strongly increases when the drifter becomes entrapped in a highly energetic structure, whereas the separation distance increases only slightly (Figures 7A-C). Similarly, MED and IBI exhibit an increase of the SS with forecast time (Figure 7D), although their simulated trajectories go in the opposite direction than the observed one ( Figure 7A). On the other hand, the SS of WMOP increases with forecast time because the resemblance between the simulated and observed trajectories is better after the first 6 h of simulation. This example illustrates the high dependence of the SS model ranking to the forecast horizon: whereas the SS of GLO is the highest among models for a simulation time < 24 h, WMOP and GLO obtain a similar value of SS ∼ 0.6 from 36 h of simulation ( Figure 7D).
The SS is therefore strongly dependent on the forecast horizon, and an overly long forecast time can lead to an overestimation of the SS. Hence, although a short forecast time leads to more irregular SS patterns because of not well-resolved small-scale dynamics, in coastal areas characterized by high variability of the surface currents, a short forecast time should be applied (e.g., 6 h). In the rest of this article, we therefore consider a forecast horizon of 6 h.

The Averaged SS
We now examine the best way to obtain a model ranking based on the spatiotemporal average of SS considering the whole period of observations. However, in the definition of the SS, the negative values are imposed to zero. The average of SS thus overestimates the averaged model's skill and can be misleading. In fact, the negative values indicate how far is the NCLS to the tolerance threshold n and thus give information on how far the simulated trajectory is from the observation. These values should therefore be included in the calculation.
For each experiment and each model, Figure 8 shows the histogram of SS * , which is defined as the SS, but without the imposition of the negative values to zero, such as: As in the previous section, we use n = 1. The average SS and the average SS * are given in Figure 8 in blue and red, respectively, together with their confidence intervals. These confidence intervals have been estimated in two ways, providing similar results. The first approach uses the bootstrap distribution of SS * from 10,000 resamples, whereas the second method is based on the Student t distribution.
In 2014, the SS * of IBI, GLO, and MED exhibit a nearly bellshaped distribution with a maximum occurrence of near-zero values (Figures 8A-C). The average is therefore close to zero, sometimes even negative, as for GLO (SS * = − 0.06 ± 0.02). On the other hand, WMOP obtains a left-skewed distribution, with a higher percentage of values in the range of [0.6-1] compared to the other models, but also more occurrences of large negative values (SS * < − 1). Consequently, WMOP obtains the lowest averaged SS * with a negative value of SS * = − 0.41 ± 0.05, strongly penalized by large negative values obtained during the first days of the experiment, whereas it obtains the highest performance a few days later, as seen previously in Figure 5D.
In 2016, differences between SS * and SS are much smaller because all models perform better during this period, as already seen in Figure 5, and only a few negative values are obtained (middle panels of Figure 8). Both calculations indicate GLO as the best model on average, although the SS * differences between GLO, IBI, and MED are not statistically significant at the 95% confidence level. In 2018, on the other hand, SS * and SS are strongly different. GLO is again significantly better than the other models, obtaining a positive averaged score of SS * = 0.14 ± 0.02, whereas the other models obtain negative averaged SS * (right panels of Figure 8).
These results show that the imposition of a lower limit value of zero for SS < 0 is not adequate if spatiotemporal averages are computed afterward, as it would lead to a strong overestimation of the averaged SS and could lead to an erroneous model ranking. However, in the context of SAR operations, the SS * could be less intuitive than the SS as it is in the range of [−∞ 1]. A solution could therefore be to determine the appropriate threshold n such that 100% of the SS * values are positive, obtaining a userfriendly metric from 0 to 1. Figure 9 shows the percentage of positive SS * values obtained by each model as a function of n in the range [0.5-3]. As observed, the minimum value of n from which 100% of positive values are obtained differs for each model and experiment. While in 2016, a threshold of n = 2 is sufficient, in 2014 and 2018 a tolerance threshold of n > 3 would be necessary.
This scenario dependence of the appropriate threshold n makes difficult its use in operational applications, where different models have to be simultaneously evaluated in a systematic routine manner. The use of SS * with n = 1 remains therefore the safest solution. The percentage of positive values is an interesting additional metric to provide, as in Phillipson and Toumi (2017), as it is also an indicator of the level of performance. For example, it is striking that in 2014 and 2018, WMOP obtains the highest percentage of positive values among models if a tolerance threshold of n = 0.5 is selected, meaning that WMOP obtains the highest percentage of cases where D < L 2 , although it obtains the lowest score on average in 2014, and the second lowest score in 2018. This again reflects the high variability of WMOP performance, which can be in strong disagreement with the drifters, but also in very good agreement, in this case outperforming the other models.

Use of HFR-Derived Trajectories
We now analyze the use of HFR-derived trajectories to evaluate the model performance. First, we evaluate the skill of HFR in reproducing the drifter trajectories (Figure 10), and compare the SS * of HFR with the one from the models ( Table 2), considering only the drifter observations available inside the HFR coverage. As shown in Figure 2, this corresponds to the period of September 30, 2014, to October 10, 2014 (10 days); July 28, 2016, to July 31, 2016andNovember 15, 2018, to November 30, 2018 (15 days).
In 2014, HFR shows much higher skill than models, with SS * = 0.32 ± 0.02 and 84% of positive values, whereas the best model, IBI in this case, obtains SS * = −0.01 ± 0.02 with 57% of positive values. In 2016, on the other hand, MED significantly outperforms HFR (SS * = 0.63 ± 0.04 and SS * = 0.42 ± 0.06, respectively), and GLO and IBI also exhibit higher performance than HFR on average, although the confidence intervals slightly overlap. In 2018, GLO shows similar skill than HFR (SS * = 0.12 ± 0.03 and SS * = 0.10 ± 0.04, respectively), but GLO obtains more positive values (79% against 68%). This indicates that models show sometimes better skill than HFR in reproducing the drifter observations, as already noted in Roarty et al. (2016). However, in 2018, the spatiotemporal distribution of the SS * reveals that HFR performs better than any of the models in the center of the domain (Figure 10C vs. the right panels of Figure 5), but shows very low performance in the outer edges, where uncertainties in HFR data are higher because of larger GDOP errors  and where data gaps usually occur.
Despite this limitation, HFR allows obtaining a larger number of trajectories, improving not only the robustness of the SS statistics but also the spatial and temporal assessment of the model performance. From now on, we therefore use HFR-derived  For each period, the sample size l is indicated, and the best performance is marked in bold.
trajectories to evaluate the models (Figure 11). The methodology consists in selecting the HFR grid points where there is 80% of data availability during the period considered (as recommended by Roarty et al., 2012). We obtain 287 points in 2014, 338 in 2016, and 277 in 2018. Virtual particles are launched hourly from these grid points for a 6-h horizon forecast, and the trajectories simulated by the models are compared against the HFR-derived trajectories. In order not to mix different dynamics, we apply this method during a short time frame of 6 h. As an example, and for comparison with the results obtained using the drifters, we select the first 6 h of the experiments, i.e., from 13:00 to 19:00 on September 30, 2014, and from 13:00 to 19:00 on November 15, 2018. In 2016, this period corresponds to an exercise at the radial sites, and the transmission was off for 4-5 h, decreasing the data availability. We therefore choose the period just after the exercise: from 00:00 to 06:00 on July 29, 2016. However, we reach similar conclusions when computing the SS * hourly during the first 3 days of observations instead of the first 6 h (not shown). Figure 11 shows the spatial distribution of the temporally averaged SS * of each model during these 6-h periods. We use a tolerance threshold of n = 1, and a lot of values obtained are negative. In order to not conceal the negative values and observe the spatial variability, the color bar is adjusted to [−1 1]. To ease interpretation, the corresponding averaged surface currents during the same periods from the different models and HFR are shown in Figure 12.
In 2014, IBI shows the best performance ( Figure 11A), being able to improve its parent-model GLO ( Figure 11B), with SS * values around 0.5 in most of the domain, in agreement with the relatively good correspondence between IBI surface currents and HFR observations (Figures 12A,B). On the other hand, MED and its child-model WMOP simulate a northward circulation (Figures 12D,E) and show therefore a strong disagreement, especially in the western part, indicated by very low score values (SS * < − 1, Figures 11C,D). Interestingly, SS * values are higher (albeit close to zero) in the southeastern part, because the models' current amplitudes are much weaker (Figures 12D,E), thus leading to smaller separation distances, even though the model outputs do not match with HFR observations.
In 2016, MED shows the best performance in the central and eastern part of the domain (SS * ∼ 0.6, Figure 11G). The other models all show lower skills in this region because their surface currents overestimate the eastward component (middle panels of Figure 12). In the western part, IBI exhibits the highest scores (SS * ∼ 0.5), while along the island of Ibiza, WMOP shows the greatest resemblance with HFR.
In 2018, IBI shows the best agreement with HFR in the northern and southeastern parts (SS * ∼ 0.5, Figures 11I, 12K,L). It is striking that, over the whole domain, GLO obtains globally the best scores, albeit with low SS * values (Figure 11J), whereas its surface currents do not resolve the small-scale features observed by HFR (Figures 12K,M). This is because GLO surface current intensity is much smaller compared to the other models, leading to the shortest separation distances between GLO and HFR trajectories, whereas the higher-resolution models exhibit very energetic circulation features (Figures 12L,N,O), leading to larger separation distances if the structures are not correctly located. However, although these SS * values are the best among models, they are still low (SS * ≤ 0.1) and therefore do not reflect a good agreement between GLO and HFR.
To compare these results with the ones obtained using drifters, we compute the spatial average of the SS * over the areas covered by drifter observations during these 6-h periods ( Table 3). As the trajectories cover different regions in each experiment, the areas selected do not cover the same region nor have the same size in the three experiments (Figures 11, 12). In 2018, the area selected is small because all drifters cover the same region during the period considered, whereas in 2014 and 2016, the trajectories are far away from each other. However, as the dynamics is quite consistent over the region (Figure 12), the area covering all trajectories is selected. Furthermore, similar results are obtained if smaller regions are selected (not shown).
In all experiments, the same model ranking is obtained as when using the drifter trajectories (Table 3). In 2014, GLO outperforms the other models in the region selected, with SS * = 0.21 ± 0.10 and SS * = 0.34 ± 0.05 obtained from drifter and HFR, respectively. However, GLO and IBI skills are not statistically different, in agreement with their very similar surface currents in this area (Figures 12B,C). In 2016, MED obtains the highest score over the region, with SS * = 0.55 ± 0.16 and SS * = 0.46 ± 0.04 obtained from drifter and HFR, respectively. In 2018, GLO performs the best with SS * = 0.01 ± 0.02 and SS * = −0.06 ± 0.01 obtained from drifter and HFR, respectively. However, as seen previously, these values are low, in agreement with the strong difference between GLO and HFR surface currents in the area, in nearly opposite directions (Figures 12K,M). None of the models is able to adequately represent the ocean dynamics of the area during this period.

SUMMARY AND DISCUSSION
We have assessed the Lagrangian performance of four ocean models in the Ibiza Channel applying the methodology developed by Liu and Weisberg (2011) and using 22 drifter trajectories from three different periods under diverse dynamical conditions. The sensitivity of the SS has been analyzed for different forecast horizons ranging from 6 to 72 h, further examining the best way to obtain an average model performance, avoiding biased conclusions, thus introducing the novel SS * . Finally, we have evaluated the skill of the HFR surface currents in reproducing the drifter trajectories and analyzed the use of HFR-derived trajectories for assessing the relative performance of the models.
Our analysis shows (see sections "Spatiotemporal Distribution of the SS" and "Sensitivity to the Forecast Horizon") that the model performance is highly scenario-dependent and can exhibit highly variable SS values over the same region if the available trajectory observations sample different periods dominated by diverse dynamical conditions. This highlights the need of developing skill assessment methods to quantify the models' performance in a systematic routine manner. We have also shown that the SS is very sensitive to the forecast length, so the forecast horizon should be chosen with caution. At short forecast times (i.e., 6 h), the SS exhibits a much higher variability, with more occurrence of zero values. This is due to the relatively short trajectory lengths L observed compared to the separation distance D obtained at short temporal and spatial scales not properly resolved by the models. Indeed, for short forecast horizons, L can frequently be much smaller than D in case of small-scale dynamical features that induce slow currents and consequently a small displacement of the  observed drifter and that are not well represented by the models, such as inertial oscillations. However, removing the inertial oscillations applying a 40-h low-pass filter leads to even smaller values of L and consequently a higher variability of the SS (not shown).
We have also shown that the SS increases with forecast time in most cases (see section "Sensitivity to the Forecast Horizon") because the cumulative length of the observed drifter grows faster than the cumulative separation distance (i.e., L >> D), without necessarily implying a good model performance. This tendency is explained, on the one hand, to the decrease of the separation distance with time due to high variability of the surface currents that lead to the convergence of the simulated and observed trajectories after a certain time and, on the other hand, by the strong increase of trajectory length L in case of intense smallscale dynamics, which occur very often in coastal areas. An overly long forecast time thus leads to an overestimation of the SS. Hence, in coastal areas characterized by high variability of the surface currents, a short forecast time should be applied (e.g., 6 h), although it leads to more irregular SS patterns.  Figure 11 and indicate the regions where the averages of Table 3 are applied. For each case, the sample size l is indicated, and the best model performance is marked in bold.
For SAR operation purposes, we examined the best way to identify the model with the highest mean performance over an area and along a specified period (see section "The Averaged SS"). We have shown that the averaged SS can be misleading because of the previous imposition of the negative values to zero (Eq. 2). Indeed, whereas the original definition of the SS is correct for analyzing its spatiotemporal distribution (as in section "Spatiotemporal Distribution of the SS"), it is not adequate if averages are going to be applied afterward, especially if a large percentage of negative values are obtained. One solution can be to determine the appropriate tolerance threshold n such that 100% of positive values are obtained. However, the appropriate threshold is strongly scenario-dependent. Therefore, although defining an appropriate n value is feasible for a specific case study, it is not appropriate for an operational routine of model skill assessment. In the aim of providing an easily interpretable metric that could be averaged to provide an accurate model ranking, we thus propose the introduction of the novel SS * , which is defined as the SS, but without the imposition of the negative values to zero (Eq. 3). Moreover, the percentage of positive SS * values is an interesting metric to provide, as in Phillipson and Toumi (2017). In addition, being the model performance highly scenario-dependent, the average should be applied over the shortest possible period in order not to mix different dynamics.
Whereas drifters can only provide assessment along their drifting paths, and only when drifter observations are available, HFR data have the advantage to be continuous in space and time and to cover wide coastal areas. HFR-derived trajectories could therefore be used to evaluate the models, providing further information about the spatial distribution of the model performance (see section "Use of HFR-Derived Trajectories"). However, models show sometimes better skill than HFR in reproducing the drifter observations, as already noted in Roarty et al. (2016). The chaotic nature of the ocean implies that small differences in the velocity field at the initial conditions can drastically modify particle trajectories (Griffa et al., 2004). Hence, the discrepancies between HFR-derived trajectories and drifter observations can be due to Lagrangian model errors (spatial and temporal interpolation), or because of observational errors, especially along the baseline between the two HFR stations and at the domain outer edge, due to larger GDOP errors. Moreover, there can be errors in the reconstruction of the total velocity vectors from radials, and HFR OMA data may contain significant errors for some areas and/or periods. Besides, there are drifter's measurements errors: drifters are equipped with GPS receivers with an accuracy of approximately 5-10 m (Berta et al., 2014) and are subject to windage and slippage, which for CODE drifters are estimated to be within 1-3 cm/s for winds up to 10 m/s (Poulain et al., 2009). Furthermore, the discrepancies between HFR-derived trajectories and drifter observations could be because of the different nature of the measurements: HFR velocities are assumed to be uniform over cells of 3 km and over time intervals of 1 h (Lipa and Barrick, 1983;Graber et al., 1997), vertically integrated and representative of the upper 1 m of the water column (at the operation central frequency of the HFR used here), whereas the drifters represent the uppermost layer and the very local dynamics (Berta et al., 2014).
Despite this, the model ranking based on HFR-derived trajectories provides similar results to drifter observations, demonstrating the great potential of this method for estimating the model's performance at operational basis. HFR could therefore be used in combination with the available drifter trajectories, if any. As HFR data are continuous in time, this method can be applied in near-real time, which is a strong advantage for evaluating extremely scenario-dependent models. This is a very promising result and opens up the possibility for implementing short-term prediction of HFR surface currents (Zelenke, 2005;Barrick et al., 2012, Frolov et al., 2012Orfila et al., 2015, Solabarrieta et al., 2016, Vilibić et al., 2016Abascal et al., 2017) for SAR applications, although this is outside the scope of this study.
Our analysis shows that downscaling models (e.g., IBI and WMOP) are sometimes able to improve the Lagrangian predictability of their parent models, as in 2014, but this is not the case in the other experiments. This is in part because of the double penalty error: regional higher-resolution models generally produce better defined mesoscale structures, but due to the chaotic nature of the ocean and the sparsity of measurements available for assimilation, particularly in coastal areas, they may contain small-scale features which are not present in the real ocean or are mislocated in space or time (Sperrevik et al., 2017;Mourre et al., 2018, Aguiar et al., 2020. Moreover, higherresolution models tend to create much more energetic dynamics, thus leading to larger separation distances if the structures are not correctly located. Consequently, the SS is on average more favorable to coarser-resolution models, because the large-scale structures are generally better represented in these models, and because a smoother dynamic has less probability to be in strong disagreement with the drifter observations. This is why WMOP obtains both very low and very high performance during the experiments, whereas GLO obtains higher performance on average in 2016 and 2018. Finer temporal resolution may also contribute to a better representation of the dynamics of the region. The higher performance of IBI and WMOP in 2014 compared to their respective parent models can be attributed to the low (daily) temporal resolution of GLO and MED in 2014 compared to the hourly and 3-hourly resolution of their respective child models. In 2016 and 2018, on the other hand, WMOP might be penalized by its 3-hourly resolution compared to the hourly resolution of the other models. Moreover, its high spatial resolution of 2 km may be smaller than the distance traveled by a particle in 3 h, generating numerical diffusion problems (not resolved advective scales).
In this article, we do not consider the natural dispersion of ocean particles, i.e., the fact that particles deployed at the same location may follow different paths (e.g., Schroeder et al., 2012). It is indeed not reasonable to talk about the path of a single drifter, as that path is almost surely unequalled. To ensure more robust statistical results, both single-and multiple-particle statistics are required. The methodology described in this article should therefore be applied to particle clusters, as in Bouffard et al. (2014), applying the SS * to each individual trajectory and then applying the average.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
ARé and ER conceived the idea of this study in the context of the IBISAR CMEMS User-Uptake project and designed the methodology. ARé collected the data, performed the computation of the Lagrangian model, applied the methodology, conducted the formal analysis, and took the lead in writing the manuscript. ER contributed to the writing of the manuscript, collected the information from the Spanish Maritime Safety and Rescue Agency, edited the Figure 1, applied the HFR gap-filled methodology, and collected the metadata for the DOI of the datasets. ER and JT lead the funding acquisition leading to this publication. BM and IH-C aided in interpreting the results and significantly contributed to the main structure of the manuscript. ARu, PL, CF, JM, EÁ-F, and JT provided critical feedback and helped shape the final version of the manuscript. All authors have read and agreed to the published version of the manuscript.