Accurate Locations of Felt Earthquakes Using Crowdsource Detections

We present a methodology that uses crowdsourced detections as an initial location to obtain fast and reliable hypocenter parameters for felt earthquakes using arrival-time data from the GEOFON Program. We derive selection criteria for issuing an alert message using a 3-year-long training set from the trial runs at the European-Mediterranean Seismological Centre (EMSC) to identify accurate event locations at a high confidence level. Since an event may have several crowdsourced detections, we also develop a methodology dealing with multiple triggers. We validate the selection criteria using real-time processing of recent data and demonstrate that 95% of the selected events are within 50 km distance from the traditional seismic location published by the EMSC. Since CsLoc remains essentially a seismic location algorithm, the selection criteria measure the quality of the seismological network coverage used in the location, not the method itself. We show that our methodology provides accurate locations much faster than those published by conventional seismic methods. On average, the EMSC CsLoc service can provide rapid and accurate locations within a minute after the occurrence of a felt earthquake, thus it can provide timely and accurate information on a felt earthquake to the civil protection services and the general public.


INTRODUCTION
Earthquake crowdsourced detections are based on following eyewitnesses' immediate reactions to felt earthquakes on various social media platforms, such as Twitter (Earle et al., 2011), traffic on the EMSC website (Bossu et al., 2014), and the number of launches of the EMSC smartphone app, LastQuake (Bossu et al., 2018). While other crowdsourced approaches in seismology (e.g., Cochran et al., 2009; Minson et al., 2015; Finazzi, 2016; Kong et al., 2016; Cochran, 2018) have focused on using accelerometers in smartphones or dedicated sensors that are maintained by the public, our approach exploits the public's search for information and their online reactions. In other words, a crowdsourced earthquake detection reflects a public desire for information. Offering a very fast earthquake location is a way to answer this desire. It is also instrumental for the rapid engagement of eyewitnesses and for ensuring efficient collection of their felt reports, which are in turn essential for rapid impact assessment (Bossu et al., 2015). It can also be exploited as a "heads-up" for civil protection services, which might save lives in a period where every minute counts; this is why seismic networks around the world have been constantly pushing for ever faster earthquake information (Kanamori, 2005).
Crowdsourced detections typically appear very fast in social media, almost immediately after the earthquake occurrence in densely populated areas. Hence, they can be used as an initial estimate of the earthquake location. This initial guess triggers our seismic data analysis to obtain a reliable earthquake location with a state-of-the-art event location algorithm. Steed et al. (2019) demonstrated that the crowdseeded location (CsLoc) approach produces quicker results than traditional earthquake alert algorithms, and that it can provide reliable locations even with a limited number of seismic phase arrivals.
This paper focuses on the conditions that would allow our method to enter into routine operational service, providing fast, reliable locations of felt earthquakes. This information can then be provided to the civil protection services and disseminated to the public. The public's appreciation for high accuracy is much less than its dislike of false alarms, so one of the crucial aspects of our effort is to minimize the number of events with inaccurate locations whilst providing accurate locations on average. Hence, our objective is to achieve 50 and 80 km location accuracy (measured as the distance from the traditional seismic network location) at the 95 and 98% confidence levels, respectively, while maximizing the number of events that pass the publication criteria. To derive the selection criteria, we use a training set of 3-year data, and validate the results on 4-month data from current real-time processing.

Crowdsourced Detection
We rely on three different crowdsourced detection methodologies to start a CsLoc analysis. Note that they may trigger CsLoc independently, therefore several triggers may exist for the same earthquake. CsLoc is initiated by the detection of increased traffic at the EMSC website, www.emsc-csem.org (Bossu et al., 2014); the detection of an increased number of launches of the EMSC LastQuake smartphone application (Bossu et al., 2018); and the detection from the Twitter Earthquake Detection (TED; Earle et al., 2011) system, which follows the keyword "earthquake" in 59 languages in tweets of fewer than seven words, because people tend to react to stressful events such as earthquakes in just a few words. The TED system was developed by the United States Geological Survey National Earthquake Information Center (NEIC), and it is currently used in the EMSC crowdsourced detection system.
To detect an event, the number of app launches or website visits is monitored as counts/minute at 5 s intervals and a short-term average/long-term average (STA/LTA) algorithm is applied to these curves to detect peaks in the traffic. The latest count/minute is compared to a baseline created from an average of the last half an hour of traffic and if the difference reaches a preset threshold then a peak is declared. Various procedures are used to increase the signal-to-noise ratio and to eliminate false detections (such as those caused by automated scans of IP addresses or the website). For instance, only visitors that have not been seen within 30 min are included in the analysis, as this helps to remove frequent users from the data such as researchers from institutes. We also bin our users by country of origin so that the background noise level is reduced. As the EMSC becomes better known by the public, we will probably need to adjust our triggering system to take account of greater levels of traffic, but the current system has worked well since 2014.
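The baseline comparison described above can be illustrated with a minimal Python sketch. It is not the EMSC implementation: the function name `detect_peaks`, the threshold value, and the choice of a plain trailing mean are assumptions made for illustration. The only values taken from the text are the 5 s sampling interval and the half-hour baseline (360 samples).

```python
from collections import deque


def detect_peaks(counts_per_minute, baseline_samples=360, threshold=50.0):
    """Return indices of samples that exceed the trailing baseline.

    counts_per_minute: traffic rate sampled every 5 s.
    baseline_samples:  360 samples x 5 s = 30 min, as stated in the text.
    threshold:         hypothetical preset difference (counts/minute).
    """
    window = deque(maxlen=baseline_samples)  # holds the last 30 min of samples
    peaks = []
    for i, count in enumerate(counts_per_minute):
        # Only test once a full half hour of history is available.
        if len(window) == window.maxlen:
            baseline = sum(window) / len(window)
            if count - baseline >= threshold:
                peaks.append(i)
        window.append(count)
    return peaks
```

In this sketch the peak sample itself enters the baseline afterwards; an operational system would probably treat that differently, for example by freezing the baseline during a detection.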
Crowdsourced detections are typically obtained before the first seismic location is made, therefore the CsLoc procedure starts without having a location provided by local or regional seismic networks. Once a crowdsourced detection is made, the centroid of the largest cluster of geolocations of the users within 120 s before the detection time and within the country where the detection was made is passed to the CsLoc association module. The cluster centroid and the crowdsourced detection time serve as an initial guess for the earthquake location, and as noted above, several CsLoc processes could be initiated for the same event. The system collects arrival picks within 1000 km (for regions with sparse networks up to 2000 km) distance of the crowdsourced initial location from the global GEOFON Program (73 FDSN networks as used in GEOFON Data Centre, 2019; Steed et al., 2019) that includes some 800 stations. The P-wave arrival picks are received in real time from 210 s before until 120 s after the crowdsourced detection time using the GEOFON HTTP Message Bus (Heinloo, 2016).
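As an illustration of the pick collection window, the following Python sketch filters picks by epicentral distance and by time relative to the crowdsourced detection. The pick representation (dictionaries with `lat`, `lon`, and `time` in seconds) and the function names are hypothetical. The haversine formula on a spherical Earth stands in for whatever distance computation the operational system uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def candidate_picks(picks, trig_lat, trig_lon, trig_time,
                    max_km=1000.0, before_s=210.0, after_s=120.0):
    """Keep picks inside the distance and time windows quoted in the text.

    max_km may be raised to 2000.0 for regions with sparse networks.
    """
    selected = []
    for pick in picks:
        dt = pick["time"] - trig_time
        if -before_s <= dt <= after_s:
            dist = great_circle_km(trig_lat, trig_lon, pick["lat"], pick["lon"])
            if dist <= max_km:
                selected.append(pick)
    return selected
```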

CsLoc Association and Location
The CsLoc association process is optimized for speed and it uses the crowdsourced initial guess as the event hypothesis for finding corroborating arrivals. Hence, CsLoc is a seismic location algorithm that exploits the fact that we already know from crowdsourcing that an earthquake occurred, and we have a rough idea where and when the earthquake has struck. We assume that for our spatial range of interest the first P-wave arrival is a Pn phase, and we search for first-arriving P phases that, given the hypocenter origin hypothesis, provide a reasonably good fit to the ak135 (Kennett et al., 1995) Pn travel-time curve. Only those arrivals that are within three times the median absolute deviation (MAD) of the Pn travel-time curve are passed to the locator.
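The 3 * MAD selection can be sketched in Python as follows. This is a simplified illustration, not the CsLoc code: it approximates the ak135 Pn travel-time curve by a straight line with the 8.04 km/s Pn velocity, and it assumes that the intercept of the best fitting line is estimated as the median of the reduced travel times. The function name `associate_pn` is hypothetical.

```python
from statistics import median

PN_VELOCITY_KM_S = 8.04  # ak135 Pn velocity


def associate_pn(distances_km, arrival_times_s, k=3.0):
    """Return indices of picks within k * MAD of a line with Pn slope.

    distances_km:    distance of each station from the trial epicenter.
    arrival_times_s: first-arriving P pick times (any common time origin).
    """
    # Remove the Pn moveout; consistent picks then share a common intercept.
    reduced = [t - d / PN_VELOCITY_KM_S
               for d, t in zip(distances_km, arrival_times_s)]
    intercept = median(reduced)  # assumed robust fit of the line offset
    residuals = [r - intercept for r in reduced]
    mad = median([abs(r) for r in residuals])
    # Note: if mad is zero, only picks lying exactly on the line are kept.
    return [i for i, r in enumerate(residuals) if abs(r) <= k * mad]
```

Because both the intercept and the spread are medians, a few picks from another event (for instance an aftershock) do not drag the fitted line away from the majority of consistent arrivals.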
Using the selected arrivals, we apply the iLoc (Bondár and Storchak, 2011;Bondár et al., 2018) location algorithm to locate the event. iLoc accounts for correlated travel time prediction errors due to unmodeled 3D velocity structures (Bondár and McLaughlin, 2009) and thus provides robust location estimates even for unfavorable network geometries. It is an iterative linearized inversion method that obtains an improved hypocenter estimate using a neighborhood algorithm (Sambridge, 1999).
As new data arrives and the location changes, it is necessary to repeat the association and location procedures several times until an acceptable solution is reached. Figure 1 illustrates the iterative association-location steps for the 2016-08-24, magnitude 6.2 Central Italy event. The crowdseeded location triggered by the EMSC website traffic is some 450 km away from the earthquake epicenter. The association algorithm considers P picks arriving in the time interval shown in green lines, and selects those that are within the 3 * MAD of the best fitting line with a slope of 8.04 km/s, the ak135 Pn velocity. On the map, green triangles show the seismic stations that iLoc used in the location and the iLoc solution is shown as a red circle. In the two next iterations, as the iLoc solution improves, the 3 * MAD interval for the candidate associations shrinks drastically and even after the first iteration the iLoc solution is very close to the final EMSC seismic location. Steed et al. (2019) executed 10 iterations of the association and location cycle with 15-s delays between each step. In this paper we focus on the determination of the set of conditions that will allow us to stop as soon as some quality assurance criteria are met. The selection criteria will also allow us to fully automate the CsLoc procedures.
FIGURE 1 | The CsLoc association and location cycle, for iterations (A) 0, (B) 1, and (C) 2. Top row: The initial crowdsourced trigger (yellow circle) may be far away from the EMSC seismic location (green circle), but iLoc (red circle) converges fast to the traditional seismic location. Yellow, blue and green triangles show the seismic stations considered, associated and used in the locations, respectively. Bottom row: First-arriving P phase picks are considered in a time window (green lines) before the crowdsourced trigger. Those within 3*MAD (blue lines and blue diamonds) of the best fitting travel time curve (red line) with the slope of the ak135 Pn velocity, 8.04 km/s, are passed to iLoc.
The three types of crowdsourced detections (web traffic, LastQuake app, and TED) can each trigger the CsLoc procedure. For the web triggers the geolocation is based on the user's IP address that varies from country to country and it is often accurate to the city level or less. If the website is accessed via a mobile phone, the geolocation often gives the location where the mobile network is connected to the internet. Thus, as Figures 1, 2 illustrate, the physical location of the users can be quite inaccurate and often biased by large cities and therefore the centroid of the crowdsourced detections often coincides with a large city, such as Istanbul, Athens, Milan, etc. This is always true for IP locations and tweets.
FIGURE 3 | Location map of events in the (A) training and (B) validation data sets. Circles color coded by depth denote the events that pass the selection criteria described later in the text; empty circles represent the events that did not pass the criteria. Histograms of (C) depths and (D) magnitudes of events in the training (blue) and validation (red) data sets. Filled bars in the histogram represent events that pass the selection criteria.
The LastQuake app asks for the user's permission to access their mobile phone's location, otherwise it determines the user's location using triangulation or wifi. Some 80% of users allow the use of location services, therefore the app triggers are considered the most accurate. Furthermore, the website and app detection systems are monitored in each country separately. The Twitter detection system determines the location of the user from the profile of the author found in each tweet. It also tries to divine the user's location based on the language used in the tweet. Therefore, the accuracy of TED triggers may also exhibit a large scatter.
FIGURE 5 | (A) Histogram (blue) and cumulative distribution (red line) of the distance of CsLoc locations from published EMSC locations for the validation data set. Green lines mark the 95 and 98% confidence levels and the 50 and 80 km location accuracy targets, respectively. The green line at the 50% confidence level indicates that 50% of the locations are within 10 km from the EMSC location. (B) Event mislocation by crowdsource triggers that first satisfied the publication criteria. Only 1 event was located with a larger than 80 km location error.
Frontiers in Earth Science | www.frontiersin.org
Because of the various triggers, it is not uncommon that there are several crowdsource detections for the same event. CsLoc is robust enough to reach accurate locations, even if the initial location is far off. However, it helps to identify these multiple strains early on. We analyzed our data set to find reasonable criteria to decide if two crowdsourced detections are generated by the same event. We found that events with a large number of seismic arrivals and those with just a few seismic arrivals require separate logic. We rely on the assumption that if two solutions share a fair amount of common seismic arrival picks then the events are likely to be the same. For candidate events for multiple triggers we check the number of common seismic arrivals for each event pair. If the number of common seismic arrival picks is larger than 20, we declare the two events common. For events with just a few picks, we require at least three common seismic arrival picks and that 20% of the seismic phases be shared between the events to declare them the same. Figure 2 shows examples of CsLoc event location trajectories starting from several different crowdsourced detections. Recall that the crowdsourced detection is the barycenter of the eyewitness locations. Green trajectories denote web-based triggers, red lines LastQuake app triggers and blue trajectories TED triggers. One of the major strengths of our method is that regardless of the trigger type and the initial mislocation, CsLoc is capable of obtaining a final solution that is closely consistent with the final EMSC solution of the event. Steed et al. (2019) executed 10 iterations of the association and location cycle with 15-s delays between each step and developed publication criteria based on the combination of acceptance thresholds of six different parameters. Exploiting the accumulated wealth of data, we aim to simplify the original publication criteria and focus on the determination of the set of conditions that will allow us to stop as soon as some quality assurance criteria are met.
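The common-event test for multiple triggers can be written as a short Python sketch. The thresholds (more than 20 common picks, or at least three common picks and a 20% share) come from the text. Two details are assumptions: picks are represented by hashable identifiers such as (station, pick time) tuples, and the 20% share is measured against the smaller of the two pick sets, since the text does not state the denominator. The function name `same_event` is hypothetical.

```python
def same_event(picks_a, picks_b, many_common=20, min_common=3,
               min_shared_fraction=0.20):
    """Decide whether two CsLoc solutions belong to the same earthquake.

    picks_a, picks_b: iterables of hashable pick identifiers,
                      e.g. (station_code, pick_time) tuples.
    """
    a, b = set(picks_a), set(picks_b)
    common = len(a & b)
    # Rule for events with many arrivals: a large absolute overlap suffices.
    if common > many_common:
        return True
    # Rule for events with few arrivals: small absolute overlap plus a
    # relative overlap (assumed here to be relative to the smaller set).
    smaller = min(len(a), len(b))
    if smaller == 0:
        return False
    return common >= min_common and common / smaller >= min_shared_fraction
```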

RESULTS
To determine the new selection criteria, we use a training set of crowdsourced detections between January 2016 and May 2019 including 708 events triggered by the EMSC website traffic, 782 events triggered by the LastQuake app, and 648 events triggered by TED. Note that the same earthquake may initiate several triggers and the data set represents 2,138 unique events. To validate the selection criteria, we use the data set between 10 October 2019 and 12 December 2019 that was not used in the creation of the training data set. We consider only those events that produced a location at the last, 10th iteration. The validation data set contains 288 events, of which 123 were triggered by the EMSC website traffic, 97 by the LastQuake app, and 68 by TED. Figure 3 shows the location map of the training and validation sets, as well as their depth and magnitude distributions. The training set provides a fairly good representation of the global seismicity of felt earthquakes, while the validation data set, owing to its much shorter time window, has events mostly from Europe and South America. Nevertheless, the depth and magnitude distributions of the events in the training and validation sets are quite similar. Note that both sets have subcrustal and intermediate depth events, and the magnitudes span from small to large events.
We consider the secondary azimuthal gap in the network used in the location, and the MAD of the residuals after the iLoc location in each iteration. The secondary azimuthal gap is obtained by calculating the largest azimuthal gap when removing one station from the network and it is a good indicator of reliable, accurate locations (Bondár et al., 2004). The MAD of the residuals helps remove outliers due to noisy data or associations from other events, typically aftershocks. We use the distance between the published EMSC location and the CsLoc location as the metric to measure the performance of CsLoc. These parameters measure the seismic network coverage that ultimately controls the location accuracy.
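The secondary azimuthal gap defined above can be computed with a few lines of Python. The sketch below assumes that the input is the list of event-to-station azimuths in degrees; the function name is hypothetical. Removing a station merges the two gaps on either side of it, so the secondary gap is the largest sum of two adjacent gaps.

```python
def secondary_azimuthal_gap(azimuths_deg):
    """Largest azimuthal gap (degrees) obtainable by removing one station."""
    az = sorted(a % 360.0 for a in azimuths_deg)
    n = len(az)
    if n < 3:
        # With at most one station left, the network leaves the event unbounded.
        return 360.0
    # Gaps between consecutive stations, including the wrap-around gap.
    gaps = [az[i + 1] - az[i] for i in range(n - 1)]
    gaps.append(az[0] + 360.0 - az[-1])
    # Removing station i merges the gap before it with the gap after it.
    return max(gaps[i - 1] + gaps[i] for i in range(n))
```

For four stations at azimuths 0, 90, 180, and 270 degrees the primary gap is 90 degrees, while the secondary gap is 180 degrees.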
Our design goal is to achieve 50 km location accuracy at the 95% confidence level and less than 80 km mislocation at the 98% confidence level while maximizing the number of events that pass the criteria, and to stop the iterations as soon as possible to facilitate quick but reliable earthquake alert information. This means that only 5 and 2% of the events would have a location error larger than 50 km and 80 km, respectively; all the rest will be much more accurately located. We calculate the metric for a series of secondary azimuthal gap thresholds between 180 and 300 degrees (the smaller the secondary azimuthal gap, the more favorable the network geometry to produce accurate locations) and MAD residual thresholds of 3, 4, 5, and 100 s (the latter being no constraint on MAD). We found that setting the MAD threshold to 4 s is a reasonable choice that excludes obvious outliers while keeping most events.
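The threshold scan can be illustrated with a Python sketch that, for one candidate pair of thresholds, reports how many events are selected and the mislocation at the median, 95th, and 98th percentiles. The event representation (dictionaries with `sgap_deg`, `mad_s`, and `misloc_km`), the function names, and the nearest-rank percentile definition are assumptions for illustration; the text does not say how the percentiles were computed.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_criteria(events, max_sgap_deg, max_mad_s=4.0):
    """Summarize the mislocation of events passing the two thresholds."""
    passed = [e["misloc_km"] for e in events
              if e["sgap_deg"] <= max_sgap_deg and e["mad_s"] <= max_mad_s]
    if not passed:
        return {"n_selected": 0, "fraction": 0.0, "meets_goal": False}
    p95 = nearest_rank_percentile(passed, 95)
    p98 = nearest_rank_percentile(passed, 98)
    return {
        "n_selected": len(passed),
        "fraction": len(passed) / len(events),
        "p50": nearest_rank_percentile(passed, 50),
        "p95": p95,
        "p98": p98,
        # Design goal from the text: 50 km at 95% and 80 km at 98%.
        "meets_goal": p95 <= 50.0 and p98 <= 80.0,
    }
```

Calling `evaluate_criteria` for each secondary azimuthal gap threshold between 180 and 300 degrees, and keeping the loosest threshold that still meets the goal, mirrors the trade-off between accuracy and the number of published events.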
As noted previously and illustrated in Figure 2, the different triggers represent different levels of reliability, therefore we develop the selection criteria for each trigger type separately. The web traffic and TED crowdseeded initial locations can be far away from the final solution, and they may need a few iterations for CsLoc to close in on the right location. On the other hand, the LastQuake app crowdseeded location can be quite accurate, therefore the final CsLoc solution might be obtained in just one iteration. Thus, we also set thresholds for the minimum number of iterations CsLoc has to perform before we apply the selection criteria. Figure 4 summarizes our results. The figure shows the cumulative distributions of the distance of the CsLoc location from the published EMSC solution for each trigger type for the series of secondary azimuthal gap thresholds for MAD ≤ 4 s. Note that Figure 4 shows only the upper 20 percentiles, from 80 to 100%, as we focus on location errors in the top 10 percentiles. We found that for the web traffic and TED triggers we should execute at least two iterations to allow for a warm-up period for CsLoc before testing for the criteria; for the LastQuake triggers we can apply the selection criteria right away.
We list our final publication criteria for each trigger type below. Note that these criteria measure the seismic network performance, not the quality of the crowdsource detection. That is only used as the initial guess for the location using observations from seismological stations. Once the selection criteria are met at any iteration after the prescribed number of iterations, the CsLoc association-location iteration cycle stops and an earthquake alert can be issued. The selection criteria for the web traffic triggers select 69% (488 out of 708) of the events with a median mislocation of 9.2 km from the EMSC solution and with a location accuracy of 41 and 77 km at the 95 and 98% confidence levels, respectively. For the LastQuake app triggers, they select 73.5% (575 out of 782) of events with a location accuracy of 10.4, 47, and 74 km at the median, 95 and 98% percentiles, respectively. For the TED triggers, the criteria select 68% (441 out of 648) of events with a mislocation of 13.2, 48, and 65 km at the median, 95 and 98% confidence levels, respectively.
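The stopping rule can be sketched in Python as follows. The sketch is illustrative only: the secondary azimuthal gap threshold is left as a required argument because its per-trigger value is not given in the text above, the trigger labels and field names are hypothetical, and "at least two iterations" is interpreted as skipping iterations 0 and 1 of the numbering used in Figure 1.

```python
# Iterations to complete before testing (web and TED: two; LastQuake app: none).
MIN_ITERATIONS = {"web": 2, "ted": 2, "app": 0}


def first_publishable_iteration(trigger_type, iterations, max_sgap_deg,
                                max_mad_s=4.0):
    """Return the index of the first iteration that may be published.

    iterations: per-iteration quality metrics, each a dict with
                'sgap_deg' (secondary azimuthal gap) and 'mad_s' (residual MAD).
    Returns None if no iteration satisfies the criteria.
    """
    for index, metrics in enumerate(iterations):
        if index < MIN_ITERATIONS[trigger_type]:
            continue  # still inside the warm-up period for this trigger type
        if metrics["sgap_deg"] <= max_sgap_deg and metrics["mad_s"] <= max_mad_s:
            return index  # stop the association-location cycle here
    return None
```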
Applied to the validation data set, the publication criteria for web traffic triggers select 60.2% (74 out of 123) of events with a mislocation of 7.5, 42, and 52 km at the median, 95 and 98% confidence levels, respectively. The publication criteria for the LastQuake triggers select 56% (54 out of 97) of events with 8.7, 38, and 40 km mislocation at the median, 95 and 98% confidence levels, respectively. For the TED triggers, the publication criteria select 37% (25 out of 68) of events with a location accuracy of 8.5, 51, and 71 km at the median, 95 and 98% percentiles, respectively.
We indicated those events that passed our selection criteria in Figure 3 as the events color coded by depth. The events that did not pass the selection criteria are shown as empty circles, and concentrate in regions with somewhat poorer station coverage. The depth and magnitude distributions do not show any particular bias for events passing (colored bars) or failing the selection criteria (empty bars) either. Figure 5 shows the distribution of the CsLoc location differences from the published EMSC locations as well as the mislocations by the trigger types that first reached the publication criteria. The green lines show our target design criteria of 50 and 80 km location accuracy at the 95 and 98% confidence level, respectively. They indicate that the validation data set confirms that our publication criteria are indeed able to identify accurate locations for all trigger types that satisfy our design goals of minimizing the number of poorly located events and maximizing the number of accurately located events when issuing an earthquake alert to the public. The selection criteria will also allow us to fully automate the CsLoc procedures and the automatic publication of fast and reliable locations even using very limited data sets.

DISCUSSION
Aiming at fast and accurate locations for an operational centre such as the EMSC, the first issue to address is the identification of the single event to trigger among the various triggers for the same event. Thus, we check at each iteration if the event has already satisfied the publication criteria from another trigger, by applying the test for common events. If the event proves to be a common event by an earlier trigger and is already published, we simply abandon the trigger and stop processing the event. While other triggers may later result in slightly more accurate locations, our objective is to issue an alert at the earliest possible time with the stated location accuracy at high, 95 and 98% confidence levels.
Our crowdsourced detections carry no information on event depth, yet with the CsLoc procedures we are able to determine the depth with reasonable accuracy. Recall that CsLoc employs the iLoc location algorithm (Bondár and Storchak, 2011;Bondár et al., 2018) that provides robust depth estimates. In the CsLoc procedures the local networks typically provide sufficient resolution for depth determination. Figure 6 shows the histograms of the deviation of the CsLoc depth and origin time from the published EMSC values for the validation data set. The vast majority of CsLoc event depths are within 10 km of the EMSC depth, and the origin times are within 2 s from the published EMSC origin time.
In principle, CsLoc can also provide magnitude estimates. We plan to publish magnitudes alongside the hypocenters as that would be a fairly trivial task; all we need to do is to get the automatic amplitude measurements along with the first-P arrival picks and calculate the magnitude. Since we collect phase picks up to 1,000 km (for sparse networks up to 2,000 km) this would allow us to calculate local magnitude, ML. However, ML starts saturating relatively early at medium moment magnitudes, therefore for some cases ML would underestimate the magnitude. For these events we will not publish ML at all. Attenuation along the ray path and possible interference with the Lg phase pose further problems that might bias the ML estimate. Obviously, we will have to rely on generic attenuation relations the same way as the most popular programs, such as Antelope and SeisComp3, do. Nevertheless, we believe that besides producing rapid, accurate locations for felt earthquakes it is also important to publish magnitudes for small events that may not be recorded at teleseismic distances.

CONCLUSION
We successfully developed a methodology that can be used to identify accurately located events at a high confidence level. The selection criteria are quite robust against the various crowdsource triggers and facilitate the handling of multiple triggers for the same event. The location accuracy is better than 10 km for 50% of the events, which is comparable to the average location error of 9.4 km in the EHB bulletin (Engdahl et al., 1998). The EHB bulletin is the groomed ISC bulletin and it is considered amongst the highest quality global bulletins and thus the preferred source for doing global and regional tomography. The location error is larger than 50 and 80 km for only 5 and 2% of the events, respectively. Similarly, the CsLoc depth and origin time estimates are within 5 km and 1 s of the EMSC solution for 50% of the events, and differ by more than 25 km and 3 s for only 10% of the events.
Our selection criteria for publication allow us to significantly reduce the publication latency times compared to those cited in Steed et al. (2019), as the majority of events can be published right after the third iteration and, notably, it was never necessary to wait for the full ten iterations. Figure 7 shows the publication delay after the origin time for the EMSC published hypocenter and the CsLoc locations that satisfy the publication criteria. The median delay time for the EMSC is 5.6 min, while the median delay in publication time is reduced to 55, 53, and 72 s for the web traffic, LastQuake and TED triggers, respectively. Overall, the median delay in publication time for the CsLoc locations is reduced to 60 s, hence providing a significant improvement over the 103 s median delay reported by Steed et al. (2019).
The selection criteria allow us to reduce the EMSC publication delay after the event origin time by as much as 4 min on average and publish 75% of the events within 2 min after their occurrence. The performance of the CsLoc services depends on both population and station density as well as information timeliness. To further improve the CsLoc services we plan to improve the network coverage by complementing the current real-time seismic phases obtained from the GEOFON Program with more openly accessible stations, without significantly increasing the data latency.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
IB developed the phase association algorithm and its code, developed the event acceptance criteria, and is also the author of the iLoc location algorithm, available for download at https://seiscode.iris.washington.edu/projects/iloc. RS and JR developed the CsLoc implementation, and created both the training and the validation data sets. RB formulated the overarching research goals, led and supervised the project, and acquired funding. AH developed the HMB messaging bus. AS and JS provided feedback during the discussions as well as phase and seismic detections from the GEOFON Program, both historically and in real time, using the HTTP Message Bus (HMB). All authors contributed to the article and approved the submitted version.

FUNDING
This article was partially funded by the European Union's Horizon 2020 Research and Innovation Programme under grant agreement RISE No. 821115 and TurnKey No. 821046. Opinions expressed in this article solely reflect the authors' view; the EU was not responsible for any use that may be made of information it contains.