Induced Seismicity Completeness Analysis for Improved Data Mining

The study of induced seismicity at sites of fluid injection is paramount to assess the seismic response of the earth’s crust and to mitigate the potential seismic risk. However statistical analysis is limited to events above the completeness magnitude m c , which estimation may significantly vary depending on the employed method. To avoid potential biases and optimize the data exploitable for analysis, a better understanding of completeness, detection capacity and censored data characteristics is needed. We apply various methods previously developed for natural seismicity on 16 underground stimulation experiments. We verify that different techniques yield different m c values and we suggest using the 90% quantile of the m c distribution obtained from high-resolution mapping, with m c defined from the mode of local magnitude frequency distributions (MFD). We show that this distribution can be described by an asymmetrical Laplace distribution and the bulk MFD by an asymmetric Laplace mixture model. We obtain an averaged Gutenberg-Richter parameter b = 1.03 ± 0.48 and a detection parameter k = 3.18 ± 1.97 from mapping, with values subject to high uncertainties across stimulations. We transfer Bayesian m c mapping developed for natural seismicity to the context of induced seismicity, here adapted to local three-dimensional seismicity clouds. We obtain the new prior parameterization m c , p r e d = 1.64 l o g 10 ( d 3 ) − 1.83 , with d 3 the distance to the 3rd nearest seismic station. The potential use of censored data and of m c prediction is finally discussed in terms of data mining to improve the monitoring, modeling and managing of induced seismicity.


INTRODUCTION
The evaluation of the completeness magnitude m c , above which the Gutenberg-Richter law is verified as all the data is by definition observed, is a prerequisite to virtually all statistical analyses of seismicity. This includes the study of induced seismicity at sites of underground stimulation by fluid injection. Underestimating m c yields to biased estimates of the slope of the Gutenberg-Richter law, the b-value, and overestimating it may lead to unnecessary under-sampling. Selection of m c has therefore indirect consequences on seismic hazard assessment. Most published works provide an estimate of m c but rarely explain how it has been calculated and rarely, if ever, provide a sensitivity analysis.
The present study aims at filling this gap by an in-depth analysis of the magnitude frequency distribution (MFD) at multiple sites. To the best of our knowledge, this is the first study dedicated to completeness magnitude analysis in the induced seismicity context. We will test different m c techniques (e.g., Wiemer and Wyss, 2000;Amorèse, 2007) and transfer two recent models originally developed by the author for natural seismicity: The Asymmetric Laplace MFD model to describe the incomplete part of seismicity (Mignan, 2012;Mignan, 2019), and the Bayesian Magnitude of Completeness (BMC) method for robust m c mapping (Mignan et al., 2011), based on m c being the mode of local MFDs, in agreement with the Asymmetric Laplace formulation.
The BMC method has been successfully applied in various regions of the world, but so far only in the context of natural seismicity: Taiwan (Mignan et al., 2011), Mainland China , Switzerland (Kraft et al., 2013), Lesser Antilles arc (Vorobieva et al., 2013), California (Tormann et al., 2014), Greece (Mignan and Chouliaras, 2014), Iceland (Panzera et al., 2017), South Africa (Brandt, 2019) and Venezuela (Vásquez and Bravo de Guenni, n.d.) 1 . It becomes urgent to apply it to induced seismicity, which requires a reformulation of the model. Based on the new parameterization and additional information on incomplete (socalled censored) data, we will discuss how such information could improve induced seismicity data mining, or in other words, how it could improve knowledge on the underground feedback activation and the management of the associated risk.
Depending on the parameters provided (see Table 1), different completeness analysis levels are achievable. When earthquake coordinates are not included, the study is limited to the bulk MFD analysis (Woessner and Wiemer, 2005;Mignan and Woessner, 2012) and to the application of the Asymmetric Laplace distribution (Mignan, 2019); when earthquake coordinates are included, observed completeness magnitude m c,obs mapping is performed (e.g., Wiemer and Wyss, 2000). In the ideal situation in which the coordinates of the seismic stations are also given, posterior completeness magnitude m c,post maps are also generated using the BMC method (Mignan et al., 2011).
Since this study is solely dedicated to seismicity completeness, data such as total volume injected, flow rate profile, or injection/postinjection windows are not considered (only mentioned in the discussion, Discussion and Perspectives on Data Mining). For statistical analyses related to the fluid injection process at different sites, the reader can refer to, e.g., Dinske and Shapiro (2013), van der Elst et al. (2016), Mignan et al. (2017) or Bentz et al. (2020).

Standard Magnitude Frequency Distribution Analysis
The bulk magnitude frequency distribution (MFD) of an earthquake catalog can be described by a probability density function that takes the form: where m is the earthquake magnitude, f GR (m) the Gutenberg-Richter law (Gutenberg and Richter, 1944), q(m) a detection function that controls the shape of the MFD and c a normalization constant so that p(x)dx 1. The noncumulative MFD, defined as the number of earthquakes per magnitude bin m, is simply n(m) ΔmN tot p(m) with N tot the total number of events and Δm 0.1 the magnitude bin. The cumulative MFD is more commonly formulated as N( ≥ m) 10 a− bm where b β/log10 and a is the overall seismicity activity. We should have the condition q( ≥ m c ) 1 by definition, although it may only tend to 1 if q is unbounded, for example if defined as a cumulative Normal distribution (Ringdal, 1975;Ogata and Katsura, 1993), a log-normal distribution (Martinsson and Jonsson, 2018), or a power-law (so that p can be represented by a gamma distribution; Kijko and Smith, 2017). Mignan (2012), Mignan (2019), in contrast, consider the gradual curvature of the MFD to be due to the sum of "angular" MFDs, each of constant m c , with q a bounded exponential function and p an asymmetric Laplace distribution (see below). "Curved" q functions would then be fitting proxies not representative of the spatially varying and scale-variant detection process (Mignan and Chen, 2016).
Various methods have been proposed to estimate m c from the bulk MFD, independently of the function q(m) (see reviews by Woessner and Wiemer, 2005;Mignan and Woessner, 2012). We here consider two non-parametric techniques, the mode of the MFD (also known as "maximum curvature" method; Wiemer and Wyss, 2000) and the Median-Based Analysis of the Segment Slope (MBASS) method (Amorèse, 2007). The b-value of the Gutenberg-Richter law can then be estimated with the maximum likelihood estimation (MLE) method (Aki, 1965) for the complete magnitude range (m c − Δm/2, +∞). It is important to note that m c values obtained from the bulk MFD can vary significantly across methods, which hampers the evaluation of b. A spatial analysis can limit the potential ambiguity (Mignan and Chouliaras, 2014).
Spatial heterogeneities in m c , due at first order to the seismic network spatial configuration (Bayesian Magnitude of Completeness Mapping Method), can be evaluated by a simple mapping procedure. We perform a three-dimensional mapping of m c,obs (x, y, z) in cubic cells 100-m wide. No smoothing kernel (e.g., Wiemer and Wyss, 2000) is used in order to minimize m c heterogeneities in individual cells. The parameter is estimated by using the mode of the distribution of magnitudes m in each cell (x, y, z). The mode is a reasonable choice for localized data where no significant spatial variations in m c is expected (Mignan, 2012;Mignan and Chen, 2016). It also yields robust results for sample sizes as low as n min 4 earthquakes (Mignan et al., 2011), the threshold used in the present mapping procedure. The set of m c estimates for all cells is then represented by the vector m c,obs , which distribution explains the curvature characteristics of the bulk MFD. Different quantiles of m c,obs can be tested to evaluate b. The map of m c,obs (x, y, z) is also used as input for the BMC method described in Bayesian Magnitude of Completeness Mapping Method.
Local MFDs of cells (x, y, z) can be used to estimate both b and the parameter k of the detection function q. We consider the asymmetric Laplace probability density function: with mode m c and the detection parameter k κ/log(10) also estimated using the MLE method (Mignan, 2012; see also equation Asymmetric Laplace Mixture Model). This parameter has been shown to be relatively stable with k ≈ 3 for natural seismicity in Southern California and Nevada (Mignan, 2012). We apply the same approach to test how this parameter behaves in the context of induced seismicity. We only consider cells with n min 50 for those calculations. The asymmetric Laplace distribution is the basic component of the mixture model presented below (Asymmetric Laplace Mixture Model). It also explains why the mode is used to compute m c in the BMC method (Bayesian Magnitude of Completeness Mapping Method).

Asymmetric Laplace Mixture Model
The sum of local "angular" MFDs of different m c which forms the bulk "curved" MFD can be approximated by mixture modeling instead of a mapping procedure. This is particularly practical if earthquake coordinates are unavailable with only access to a magnitude vector. The Asymmetric Laplace Mixture Model (ALMM) (Mignan, 2019) is defined as: with K the number of Asymmetric Laplace mixture components ordered by m c value (m c,1 < m c,2 < / < m c,i < / < m c,K ) and w i the mixing weight of the i th component such that Parameters κ and β are assumed constant across components. Any MFD shape can be fitted by the flexible ALMM based on the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The initial parameter values are estimated by applying K-means (MacQueen, 1967) Frontiers in Earth Science | www.frontiersin.org March 2021 | Volume 9 | Article 635193 with χ κ − β the slope of the incomplete part of the asymmetric Laplace distribution in a log-linear plot. At each EM iteration j, a deterministic version of the expectation step (E-step) attributes a hard label i to each event magnitude from the parameter set θ (j−1) i {m c,i , κ, β} defined in the previous iteration j − 1 (j 0 corresponding to the K-means estimate). Hard labels are assigned as: The maximization step (M-step) then updates the component parameters. The best number of components K is finally selected from the lowest Bayesian Information Criterion estimate BIC −LL + 1/2(2 + K)ln(N tot ) (Schwarz, 1978) where LL is the loglikelihood of the ALMM. Details of the full method are given in Mignan (2019). For MFD mixture modeling based on a lognormal component, the reader may refer to Martinsson and Jonsson (2018).

Bayesian Magnitude of Completeness Mapping Method
The last method to be tested in the present study is the Bayesian Magnitude of Completeness (BMC) method that consists in using Bayesian inference to estimate m c based on incomplete information and prior belief. The incomplete information is the m c,obs map (see Standard Magnitude Frequency Distribution Analysis), which presents gaps in cells of low seismicity and is highly uncertain when estimated from a limited number of earthquake magnitudes. BMC is constrained by a prior model m c,pred f (d k ) relating the spatial heterogeneities in m c to the density of seismic stations, approximated by the distance d k to the k th nearest station (Mignan et al., 2011). Priors were defined so far in the literature for two-dimensional m c mapping. We here define a new prior based on three-dimensional distance, which is a requirement for fluid injections characterized by a threedimensional seismicity cloud centered at the borehole depth and detected by a combination of surface stations and downhole stations. The distance between a cell and a station of coordinates (x sta , y sta , z sta ) is thus d (x − x sta ) 2 + (y − y sta ) 2 + (z − z sta ) 2 . We additionally improve the functional form of the prior, moving from m c,pred c 1 d c 2 k + c 3 (Mignan et al., 2011) to the form m c,pred c 1 log 10 (d k ) + c 2 , a simpler attenuation function reduced to two free parameters.
Following Bayes' Theorem, we obtain the posterior completeness magnitude m c,post and standard deviation σ post : where σ obs and σ pred are the standard deviations of m c observations (based on 100 bootstraps) and of the prior model, respectively. Note that all the aforementioned parameters depend on location (x, y, z), except for σ pred which is constant.

Results of a Standard m c Analysis
We first apply the standard methods of m c evaluation, based on bulk MFD analysis and m c mapping. This is the first systematic comparison of completeness level for different induced seismicity sequences. Figure 1 shows the cumulative bulk MFD for the 16 fluid injections and the matching m c,obs distribution. Figure 1 also shows the estimates m c,mode (dotted vertical line) and m c,MBASS (dashed vertical line) which are often close to the m c,obs median. More conservative estimates of m c , such as the 75% or 90% quantiles of m c,obs seem to provide reasonable b-values. We use q 90 (m c,obs ) to estimate the Gutenberg-Richter slope b in Figure 1. Note that the m c,obs distribution shape matches the curvature of the bulk MFD, which verifies that it is due at first order to spatial heterogeneities. Table 2 lists m c,bulk estimates obtained from different approaches with their respective b-values for comparison. For most cases, the m c range for induced seismicity is comprised between -2 and 1. It goes down to -4 for the Äspö Hard Rock Laboratory experiment where picoseismicity is detected. Such low m c values have been reached at other underground laboratories (e.g., Villiger et al., 2020). The range of b-values is consistent with the ones obtained by Dinske and Shapiro (2013) for the 5 datasets common to both studies. The authors however only provided one estimate while our Table 2 shows its sensitivity to the minimum magnitude cutoff. Figure 2 shows m c,obs maps at selected depths z for the two stimulations the richest in induced seismicity (N tot > 10, 000): S93 and CB12. Other maps will be shown in Bayesian Magnitude of Completeness Prior & Posterior m c Maps when used as input for BMC mapping. Local MFDs for cells that include more than 400 earthquakes are also displayed with their asymmetric Laplace distribution fit. Considering all cells of all sites together, assuming that k and b variations in space are random (Mignan, 2012;Kamer and Hiemer, 2015), we obtain for induced seismicity k 3.18 ± 1.97 and b 1.03 ± 0.48, which is consistent with natural seismicity regimes but here with significantly larger uncertainties. The plots of Figures 2B,D confirm that the mode of the local MFD is a reasonable choice to estimate m c .
This so-called standard m c analysis highlights the importance to test several techniques to minimize possible biases in the b-value. Mapping remains the best approach to evaluate the m c range. Reasonable b-values are obtained when using conservative m c,obs quantiles (e.g., 75% or 90%).

Asymmetric Laplace Mixture Model Fits
We then apply the ALMM to the 16 magnitude vectors but only get reasonable fits for 9 of them. We find that the ALMM requires Frontiers in Earth Science | www.frontiersin.org March 2021 | Volume 9 | Article 635193 n min > 300 for statistically significant component modeling. It means that the ALMM is not applicable for KTB94, GS08, A15 and P17. It fails for 3 other cases, S00, S03, B06, due to anomalous fluctuations in the observed non-cumulative MFD, which will be discussed in another paragraph. Figure 3 shows the 9 ALMM fits (for S93, PV94, S04, S05, G07, CB12, NB12, SG13 and E18). Parameters max(m c,i ) and b are listed in Table 2 for comparison with the techniques tested in Results of a Standard m c Analysis. Those values range between estimates obtained with the MBASS method and q 75 (m c,obs ) so the method does not seem to provide any new insight into which method to prefer. We observe that the number of K components reflects the gradual curvature of the bulk MFD. For instance, only 2 components suffice to fit the almost angular SG13 MFD while 13 components are needed for the wide S05 MFD, proving the flexibility of the ALMM to fit different MFD shapes. It also verifies that bulk MFDs can be described by the sum of angular MFDs with m c as modes. We obtain k {7.6, 3.3, 2.1, 2.8, 6.5, 9.1, 4.5, 3.7, 3.1}, respectively, with median 3.7 and mean 4.7.
The ALMM is highly sensible to abnormal fluctuations in the non-cumulative MFD, which are often not visible from the cumulative MFD. In the case of Soultz-sous-Fôrets, the S00 noncumulative MFD shows significant drops in the number of events inconsistent with any model monotonously increasing up to m c and monotonously decreasing above m c . In the latter, we observe n i  Table 2). In regards of the Basel catalog, a zig-zag pattern is observed on the non-cumulative MFD, suggesting an  error in rounding between odd and even magnitude digits, which confuses the ALMM algorithm. Those cases indicate more problems with the magnitude vectors than with the ALMM. This suggests that seismologists preparing earthquake catalogs should analyze the non-cumulative distribution of magnitudes to check for potential errors and/or explain the origin of those anomalies incompatible with the Gutenberg-Richter law.

Bayesian Magnitude of Completeness Prior and Posterior m c Mmaps
We define a BMC prior model for induced seismicity by combining the relation between m c and the distance d 3 to the 3rd nearest seismic station, observed for the earthquake catalogs that come with seismic network information ( Table 1). We choose d 3 (over e.g. d 4 or d 5 ) since this metric shows the minimal residual error (see σ pred below). We assume that m M L M w so that seismicity clouds from different depth levels can be combined to fit one model constrained on a relatively wide d 3 range. Figure 4 represents the BMC prior derived from 7 datasets: S93, S04, S05, GS08, CB12, SG13, and P16. The model, represented by the solid curve, is defined as m c,pred f prior (d 3 ) 1.64log 10 (d 3 ) − 1.83; σ pred 0.37 with distance d 3 in km. Note that the uncertainty σ pred is greater than the ones obtained from natural seismicity (σ pred (0.25; e.g., Mignan et al., 2011;Mignan et al., 2013;Kraft et al., 2013;Mignan and Chouliaras, 2014;Tormann et al., 2014). Several reasons may be advanced: different sites are here combined, representative of different soil conditions and thus potentially of different seismic attenuation functions; considering the depth component may add uncertainty on distance measures; finally, the model is constrained on far shorter distance (d 3 < 10 km) compared to up to hundreds of kilometers in regional catalogs. It is interesting to compare the model prediction to the pico-seismicity completeness level m c ≈ − 4 observed at Äspö (A15). We learn from Kwiatek et al. (2018) that sensors were located between a few meters and 100 m from the injection borehole. We independently predict m c,pred (10m) −5.1 and m c,pred (100m) −3.5, which is a reasonable approximation. Adding further datasets to the model will help better constraining it.
Two datasets, S00 and S03, were not included in this analysis as event declaration depended in those cases on two triggering conditions from both the downhole and surface networks  (EOST and GEIEEMC, 2018a;EOST and GEIEEMC, 2018b), which is likely inconsistent with the simple d 3 metric. Testing with d 3 leads to a systematic bias requiring a correction f prior (d 3 ) + 1. Use of such formula would however be inadequate. It remains unclear if the magnitude scale used for S00 and S03, duration magnitude m D , could also play a role in the observed m c shift upward. We then combine the m c,obs with the prior model to derive the posterior m c,post maps. We show some examples taken from S93 and CB12 in Figure 5. The BMC methods fills all the gaps in m c,obs , and provides completeness levels expected for future seismicity, e.g., during cloud development as more fluids get injected, which can be of use to the injection operators. The BMC method also decreases m c uncertainties, as can be observed when comparing σ post to σ obs . Note finally that the BMC method is consistent with the Asymmetric Laplace detection model previously described. It makes use of the mode of local MFDs so that the number of cells with m c,obs values is maximized while f prior explains how m c evolves in space, from which the bulk FMD, approximated by the ALMM, emerges.

DISCUSSION AND PERSPECTIVES ON DATA MINING
We reviewed some standard approaches to estimate the completeness magnitude m c and ported the recent ALMM mixture and BMC mapping methods to the induced seismicity context. We provided various estimates of m c , b ( Table 2) and detection parameter k so that better informed choices could be made in future statistical analyses of induced seismicity. We observed that the k-value for induced seismicity is compatible with the one obtained for natural seismicity, suggesting a common detection process although uncertainties are high. We also provided the first parameterization of the BMC prior for three-dimensional seismicity clouds. The present study could help refine future seismic hazard analyses, since the parameter m c is a prerequisite to the estimation of the hazard inputs: the a-and b-values of the Gutenberg-Richter law. In contrast to the tectonic regime, the a-value is normalized to the total injected volume V for comparisons across stimulations, so that N( ≥ m) V10 a fb − bm with a fb the normalized a-value, called underground feedback parameter in Mignan et al. (2017) and seismogenic index in poroelasticity parlance (e.g., Dinske and Shapiro, 2013). The term a fb is agnostic, while alternatives to poroelasticity exist (e.g., Mignan, 2016). A priori knowledge of the Gutenberg-Richter parameters is required in pre-stimulation risk assessment (e.g., Mignan et al., 2015;Broccardo et al., 2020), and the parameterization may be updated during stimulations via a dynamic traffic light system (e.g., Broccardo et al., 2017;Mignan et al., 2017). Note also that the maximum magnitude m mas relates directly to b (e.g., van der Elst et al., 2016;Broccardo et al., 2017).
We first showed the impact of m c values on b and selected q 90 (m c,obs ) as conservative estimates. We also found that the ALMM does not provide any new insight to the problem and is hampered by fluctuations in the non-cumulative MFD observed in some experiments. As a consequence, m c mapping remains the best alternative and is simple enough to implement.
While m c also alters a fb via b, we can consider another aspect that may improve our knowledge of the underground feedback. It has been observed that a fb significantly varies across sites and across stimulations at a same site (e.g., Dinske and Shapiro, 2013;Mignan et al., 2017) which may lead to risk aversion of potential investors in geo-energy for instance . One may difficultly infer a fb from the literature when no information about completeness is given, which is especially true for early articles. However, we can now estimate a fb despite the total number of events induced N( ≥ m ? ) being potentially ambiguous. To illustrate the problem posed, let us consider the 1988 stimulation at Hijiori, Japan. We learn from Sasaki (1998) that N( ≥ m ? ) 65 FIGURE 3 | Non-cumulative MFD (in blue) of 9 underground stimulations for which an Asymmetric Laplace Mixture Model (ALMM) fit is available, shown in red, with the mixture components shown in orange. See Table 2 for some values.
Frontiers in Earth Science | www.frontiersin.org March 2021 | Volume 9 | Article 635193 micro-earthquakes were observed above m ? −4 (their Figure 6) for an injected volume V 2, 000 m 3 . The equation a fb log 10 (N( ≥ m ? )/V) + bm ? is valid only if m ? ≥ m c . Information in Sasaki (1998) is however ambiguous, and we may have m ? min(m) < m c instead, which would underestimate the underground feedback activation since the data would then be incomplete. Considering all datasets of Table 1 with N tot > 200, we can estimate from their censored data the metrics δm m c − min(m) and c N( ≥ m c )/N( ≥ min(m)) which range on the intervals [0.8, 1.9] and [0.20, 0.37], respectively (with no trend observed). The distributions are shown in Figure 6A alongside the corrected underground feedback parameter a fb,corrected log 10 (cN( ≥ min(m))/V) + b(min(m) + δm). Assuming δm and c representative (and b 1, see Results of a  Despite the ambiguity, an estimate may therefore still be provided. A review of the literature could provide additional values from other fluid injections to better constrain the range of a fb to be considered as a priori information in risk assessment, which is so far potentially biased toward high a fb values (e.g., Mignan et al., 2017). Finally, if the BMC method allows defining robust m c maps (no spatial gap, uncertainty constrained by the seismic network configuration), BMC may be even more useful for seismic network planification (e.g., Kraft et al., 2013) prior to new stimulations. Seismic safety criteria can be mapped into magnitude thresholds not to be crossed , which tell us the completeness magnitude level required for sound statistical analysis. One can then use the BMC prior f prior (d 3 ) to test how a completeness level can be achieved given a seismic network configuration. Figure 6B illustrates such an application. The two approaches presented in Figure 6 demonstrate how induced seismicity data mining can be done from completeness magnitude knowledge, which in turn can improve induced seismicity monitoring, modeling and managing.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here See Table 1 and reference list.

AUTHOR CONTRIBUTIONS
AM did all the research and writing.