Errors in Statistical Inference Under Model Misspecification: Evidence, Hypothesis Testing, and AIC

The methods for making statistical inferences in scientific analysis have diversified even within the frequentist branch of statistics, but comparison has been elusive. We approximate analytically and numerically the performance of Neyman-Pearson hypothesis testing, Fisher significance testing, information criteria, and evidential statistics (Royall, 1997). This last approach is implemented in the form of evidence functions: statistics for comparing two models by estimating, based on data, their relative distance to the generating process (i.e., truth) (Lele, 2004). A consequence of this definition is the salient property that the probabilities of misleading or weak evidence, error probabilities analogous to Type 1 and Type 2 errors in hypothesis testing, all approach 0 as sample size increases. Our comparison of these approaches focuses primarily on the frequency with which errors are made, both when models are correctly specified, and when they are misspecified, but also considers ease of interpretation. The error rates in evidential analysis all decrease to 0 as sample size increases even under model misspecification. Neyman-Pearson testing, on the other hand, exhibits great difficulties under misspecification. The real Type 1 and Type 2 error rates can be less than, equal to, or greater than the nominal rates depending on the nature of model misspecification. Under some reasonable circumstances, the probability of Type 1 error is an increasing function of sample size that can even approach 1! In contrast, under model misspecification an evidential analysis retains the desirable properties of always having a greater probability of selecting the best model over an inferior one and of having the probability of selecting the best model increase monotonically with sample size. We show that the evidence function concept fulfills the seeming objectives of model selection in ecology, both in a statistical as well as scientific sense, and that evidence functions are intuitive and easily grasped. We find that consistent information criteria are evidence functions but the MSE minimizing (or efficient) information criteria (e.g., AIC, AICc, TIC) are not. The error properties of the MSE minimizing criteria switch between those of evidence functions and those of Neyman-Pearson tests depending on models being compared.


Background
In the twentieth century, the bulk of scientific statistical inference was conducted with Neyman-Pearson hypothesis tests, a term which we broadly take to encompass significance testing, P-values, generalized likelihood ratio tests, and other special cases, adaptations, or generalizations. The central difficulty with interpreting NP tests is that the Type 1 error probability (usually denoted α) remains fixed regardless of sample size, rendering problematic the question of what constitutes evidence for the model serving as the null hypothesis (Aho et al., 2014; Murtaugh, 2014; Spanos, 2014). The fixed null error rate of hypothesis testing lies at the core of why model selection procedures based on hypothesis testing (such as stepwise regression and multiple comparisons) have always had the reputation of being jury-rigged contraptions that have never been fully satisfactory (Gelman et al., 2012). An additional problem with hypothesis tests arises from the "Type 3" error of model misspecification, in which neither the null nor the alternative hypothesis model adequately describes the data (Mosteller, 1948). The influence of model misspecification on all types of inference is underappreciated.
In an apparently separate statistical development, the concept of statistical evidence was refined in light of the shortcomings of using as evidence quantities such as P-values that emerge from frequentist hypothesis testing (Royall, 1997, 2000; Taper and Lele, 2004, 2011; Taper and Ponciano, 2016). Crucial to the evidence concept was the idea of an evidence function (Lele, 2004). An evidence function is a statistic for comparing two models that has a suite of statistical properties, among them two critical properties: (a) both error probabilities (analogous to Type 1 and Type 2 error probabilities in hypothesis testing) approach zero asymptotically as the sample size increases, and (b) when the models are misspecified and the concept of "error" is generalized to be the selection of the model "farthest" from the true data-generating process, the two error probabilities still approach zero as sample size increases.
Despite widespread current usage of AIC-type indexes in ecology, the inferential basis and implications of the use of information criteria are not fully developed, and what is developed is commonly misunderstood (see the forum edited by Ellison et al., 2014). AIC-type indexes are used for different purposes: in some contexts in place of hypothesis testing, in some as evidence for model identification, in some as estimates of pseudo-Bayesian model probabilities, and in some purely as criteria for prediction. Of concern is that few ecologists can explain the inferences they are conducting with AIC, as Akaike's (Akaike, 1973, 1974) mathematical argument is not an easy one, and more recent accounts (Bozdogan, 1987; Burnham and Anderson, 2002; Claeskens and Hjort, 2008) are heavily mathematical as well. A clear and accessible inferential concept is needed to promote confidence in and appropriate uses of the information-theoretic criteria. We believe that the concept of statistical evidence can serve well as the inferential basis for the uses of and distinctions among the AIC-type indexes. This paper contrasts the concept of evidence with classical statistical hypothesis testing and demonstrates that many information-based indexes for model selection can be recast and interpreted as evidence functions. We show that the evidence function concept fulfills many seeming objectives of model selection in ecology, both in a statistical as well as scientific sense, and that evidence functions are intuitive and easily grasped. Specifically, the difference of two values of an information-theoretic index for a pair of models possesses in whole or in part the properties of an evidence function and thereby grants to the resulting inference a scientific warrant of considerable novelty in ecological practice. Because statistical models are quantitative hypotheses for how data arose, the evidence function concept can serve as a scientifically satisfying basis for model comparison in observational and experimental studies.

Method of Analysis and Notation
For convenience we label as Neyman-Pearson (NP) hypothesis tests a broad collection of interrelated statistical inference techniques, including P-values for likelihood ratios, confidence intervals, and generalized likelihood ratio tests, that are connected to Neyman and Pearson's original work (Neyman and Pearson, 1933) and that form the core of modern applied statistics. We distinguish Fisher's use of P-values as a measure of the adequacy of the null hypothesis from the use of P-values in likelihood ratio hypothesis tests.
NP hypothesis tests and evidential comparisons are conducted in very different fashions and operate under different warrants. Thus, comparison is difficult. However, they both make inferences. One fundamental metric by which they can be compared is the frequency with which inferences are made in error. In this paper we seek to illuminate how the frequency of errors made by these methods is influenced by sample size, the differences among models being compared, and also the differences between candidate models and the true data generating process. Both of these inferential approaches can be, and generally are, constructed around a base of the likelihood ratio (LR). By studying the statistical behavior of the LR, we can answer our questions regarding frequency of error in all approaches considered.
Throughout this discussion, one observation (datum) is represented using the random variable X, with g(x) being the probability density function representing the true, data-generating process and f(x) being the probability density function of an approximating model. If the observed process is discrete, g(x) and f(x) will represent probability mass functions. For simplicity we refer to these functions in both the discrete and continuous cases as pdf's, thinking of the abbreviation as "probability distribution function." The likelihood function under the true model, for n independent and identically distributed (iid) observations x1, x2, …, xn, is written as

L_g = g(x1) g(x2) ⋯ g(xn), (1)

whereas under the approximating model it is

L_f = f(x1) f(x2) ⋯ f(xn). (2)

In cases where there are two candidate models f1(x) and f2(x), we write the respective likelihoods as L1 and L2 to avoid double subscript levels.
We make much use of the Kullback-Leibler (KL) divergence, one of the most commonly used measures of the difference between two distributions. The KL divergence of f(x) from g(x), denoted K(g, f), is defined as the expected value of the log-likelihood ratio of g and f (for one observation), given that the observation came from the process represented by g(x):

K(g, f) ≡ E_g[ log{ g(X)/f(X) } ] = ∫ g(x) log{ g(x)/f(x) } dx (or the corresponding sum in the discrete case). (3)

Here E_g denotes expectation with respect to the distribution represented by g(x). The expectation is a sum or integral (or both) over the entire range of the random variable X, depending on whether the probability distributions represented by g(x) and f(x) are discrete or continuous (or both, such as for a zero-inflated continuous distribution). The functions must give positive probability to the same sets (along with other technical mathematical requirements which are usually met by the common models of ecological statistics).
The KL divergence is interpreted as the amount of information lost when using model f(x) to approximate the data generating process g(x). Its publication (Kullback and Leibler, 1951) was a high point in the golden age of the study of "information theory." The KL divergence is always positive if g(x) and f(x) represent different distributions and is zero if the distributions are identical ("identical" in the mathematical sense that the distributions give the same probabilities for all events in the sample space). The KL divergence is not a mathematical distance measure in that K(g, f) is not in general equal to K(f, g).
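As a numerical companion to this definition, the following Python sketch (not part of the original analysis; the Bernoulli success probabilities are illustrative choices) evaluates K(g, f) for two Bernoulli models and displays the asymmetry just noted.

```python
# Sketch: Kullback-Leibler divergence between two Bernoulli models (Equation 3).
# The example probabilities are illustrative only.
import numpy as np

def kl_bernoulli(p_ref, p_approx):
    """K(reference, approximating) for Bernoulli(p) models."""
    return (p_ref * np.log(p_ref / p_approx)
            + (1.0 - p_ref) * np.log((1.0 - p_ref) / (1.0 - p_approx)))

p_g, p_f = 0.75, 0.50          # generating process and approximating model
print(kl_bernoulli(p_g, p_f))  # K(g, f) ~ 0.1308
print(kl_bernoulli(p_f, p_g))  # K(f, g) ~ 0.1438: the divergence is not symmetric
```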
The relevant KL divergences under correct model specification are those for f1(x) and f2(x) with respect to each other:

K12 ≡ K(f1, f2) = E_1[ log{ f1(X)/f2(X) } ], (4)

K21 ≡ K(f2, f1) = E_2[ log{ f2(X)/f1(X) } ]. (5)

By reversing numerator and denominator in the log function in Equation (5), one finds that

E_2[ log{ f1(X)/f2(X) } ] = −K21. (6)

The convention for which subscript is placed first varies among references; we put the subscript of the reference distribution first as it is easy to remember.
The likelihood ratio (LR) and its logarithm figure prominently in statistical hypothesis testing as well as in evidential statistics. The LR is

L1/L2 = [ f1(x1) f1(x2) ⋯ f1(xn) ] / [ f2(x1) f2(x2) ⋯ f2(xn) ], (7)

and the log-LR is

log(L1/L2) = log{ f1(x1)/f2(x1) } + log{ f1(x2)/f2(x2) } + ⋯ + log{ f1(xn)/f2(xn) }. (8)

In particular, the log-LR considered as a random variable is a sum of iid random variables, and its essential statistical properties can be approximated using the central limit theorem (CLT). The CLT (Box 1) provides an approximate normal distribution for a sum of iid random variables and requires the expected value (mean) and the variance of one of the variables. Under correct model specification, the observations came from either f1(x) or f2(x), and Equations (4)-(6) above give the expected value of one of the random variables in the sum as K12 or −K21, depending on which model generated the data. Let σ1² and σ2² be the variances of log[f1(X)/f2(X)] with respect to each model:

σ1² = V_1[ log{ f1(X)/f2(X) } ], (9)

σ2² = V_2[ log{ f1(X)/f2(X) } ]. (10)

One can envision cases in which these variances might not exist, but we do not consider such cases here. The CLT, which requires that the variances be finite, provides the following approximations. If the data arise from f1:

log(L1/L2) ∼ approximately normal(nK12, nσ1²), (11)

(1/n) log(L1/L2) ∼ approximately normal(K12, σ1²/n), (12)

(√n/σ1)[ (1/n) log(L1/L2) − K12 ] ∼ approximately normal(0, 1). (13)

If instead the data arise from f2, the corresponding approximations hold with −K21 in place of K12 and σ2² in place of σ1²:

log(L1/L2) ∼ approximately normal(−nK21, nσ2²), (14)

(1/n) log(L1/L2) ∼ approximately normal(−K21, σ2²/n), (15)

(√n/σ2)[ (1/n) log(L1/L2) + K21 ] ∼ approximately normal(0, 1). (16)
The device of using the CLT to study properties of the likelihood ratio is old and venerable and figures prominently in the theory of sequential statistical analysis (Wald, 1945).
A model, f, can be said to be misspecified if the distribution of data implied by the model (under best possible parameterization) differs in any way from the distribution of data under the true generating process. In the Kullback-Leibler divergence setting within which we are working, f is misspecified if K(g, f) > 0. A model set can be said to be misspecified if all of its member models are misspecified. Misspecification can have a host of causes, including omission of real covariates, inclusion of spurious covariates, incorrect specification of functional form, incorrect specification of process error structure, and incorrect specification of measurement error structure.
The approximate behavior of the LR under misspecification can also be represented with the CLT. To our two model candidates f1(x) and f2(x), we add the pdf g(x), defined as the best possible mathematical representation of the distribution of data stemming from the actual stochastic mechanism generating the data, the unknown "truth" sought by scientists. We denote by ΔK the difference of the KL divergences of f2(x) and f1(x) from g(x):

ΔK = K(g, f2) − K(g, f1). (17)

We note that ΔK could be positive, negative, or zero: if ΔK is positive, then f1 is "closer" to truth, if ΔK is negative, then f2 is closer to truth, and if ΔK is zero, then both models are equally distant from truth. To deploy the CLT, we need the mean and variance of the single-observation LR under misspecification. For the mean we have

E_g[ log{ f1(X)/f2(X) } ] = K(g, f2) − K(g, f1) = ΔK. (18)

The rightmost equality is established by adding and subtracting E_g[log g(X)]. We denote the variance by σg², which becomes

σg² = V_g[ log{ f1(X)/f2(X) } ]. (19)

And now by the CLT, if the data did not arise from f1(x) or f2(x), but rather from some other pdf g(x), we have:

log(L1/L2) ∼ approximately normal(nΔK, nσg²), (20)

(1/n) log(L1/L2) ∼ approximately normal(ΔK, σg²/n), (21)

(√n/σg)[ (1/n) log(L1/L2) − ΔK ] ∼ approximately normal(0, 1). (22)
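The following Python sketch (an illustration added here, with Bernoulli candidate models and a generating probability chosen arbitrarily) simulates the average log-LR when the data come from neither candidate model and checks it against the normal(ΔK, σg²/n) approximation of Equations (20)-(22).

```python
# Sketch: average log-likelihood ratio under misspecification.
# Two Bernoulli candidates f1, f2 and a Bernoulli generating process g are
# illustrative assumptions; (1/n) log(L1/L2) should be approximately
# normal(DeltaK, sigma_g^2 / n) by the CLT.
import numpy as np

rng = np.random.default_rng(1)
p1, p2, pg, n, reps = 0.75, 0.50, 0.65, 200, 10_000

# Per-observation log-likelihood ratio log[f1(x)/f2(x)] for x = 0 or 1
llr = lambda x: x * np.log(p1 / p2) + (1 - x) * np.log((1 - p1) / (1 - p2))

x = rng.binomial(1, pg, size=(reps, n))        # data from g, not from f1 or f2
avg_loglr = llr(x).mean(axis=1)                # (1/n) log(L1/L2), one per replicate

delta_k = pg * llr(1) + (1 - pg) * llr(0)      # E_g[log(f1/f2)] = DeltaK (Equation 18)
sigma_g2 = pg * (1 - pg) * (llr(1) - llr(0))**2  # single-observation variance (Equation 19)

print(avg_loglr.mean(), delta_k)               # both near 0.0210
print(avg_loglr.var(), sigma_g2 / n)           # both near sigma_g^2 / n
```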
Critical to the understanding, both mathematical and intuitive, of inference on models is an understanding of the topology of models. Once one has a concept of distances between models, a topology is implied. A model with one or more unknown parameters represents a whole family or set of models, with each parameter value giving a completely specified model. At times we might refer to a model set as a model if there is no risk of confusion.
Two model sets can only be arranged as nested, overlapping, or non-overlapping. A set of models can be correctly specified or misspecified depending on whether or not the generating process can be exactly represented by a model in the model set. Thus, there are only six topologies relating two model sets to the generating process (Figures 1, 2).

Correctly Specified Models
In the canon of traditional statistical practices for comparing two candidate models, f 1 (x) and f 2 (x) say, with or without unknown parameters involved, the assumption that the data arose from either f 1 (x) or f 2 (x) is paramount. In this section we adopt this assumption of correctly specified models and compare the properties of statistical hypothesis testing with those of the evidence approach. The correct model assumption is the home turf, so to speak, of hypothesis testing, and so the comparison should by rights highlight the strengths of traditional statistical practice. To focus the issues with clarity we concentrate on the case in which f 1 (x) and f 2 (x) are statistically simple hypotheses (a.k.a. completely specified models, not to be confused with correctly specified models). In other words, we assume for now there are no unknown parameters in either model, deferring until later in this paper a discussion of unknown parameters. Neyman and Pearson (1933) proved in a famous theorem (the "Neyman-Pearson Lemma") that basing a decision between two completely specified hypotheses (H 1 : the data arise from model f 1 (x), and H 2 : the data arise from model f 2 (x)) on the likelihood ratio had certain optimal properties.

Neyman-Pearson Statistical Hypothesis Tests
Neyman and Pearson's LR decision rule has the following structure: choose H1 if L1/L2 > c, and choose H2 if L1/L2 ≤ c. (23)

Here the cutoff quantity (or critical value) c is determined by setting an error probability equal to a known constant (usually small), denoted α. Specifically, the conditional probability of wrongly deciding on H2 given that H1 is true is the "Type 1 error probability" and is denoted as α:

α = P(L1/L2 ≤ c | H1 is true). (24)
Often for notational convenience, in lieu of the statement "Hi is true" we will simply write "Hi." Now, such a data-driven decision with fixed Type 1 error probability is the traditional form of a statistical hypothesis test. A test with a Type 1 error probability of α is said to be a size α test. The other error probability ("Type 2"), the conditional probability of wrongly deciding on H1 given H2, is usually denoted β:

β = P(L1/L2 > c | H2). (25)

This decision rule, the likelihood ratio test, is a cornerstone of most contemporary introductions to mathematical statistics (Rice, 2007; Samaniego, 2014).
With the central limit theorem results (Equations 11-16), the error properties of the NP test can be approximated. To find the critical value c, we have under H1:

α = P(L1/L2 ≤ c | H1) = P( (√n/σ1)[ (1/n) log(L1/L2) − K12 ] ≤ (√n/σ1)[ (1/n) log(c) − K12 ] | H1 ), (26)

and so the CLT tells us that

α ≈ Φ( (√n/σ1)[ (1/n) log(c) − K12 ] ), (27)

where Φ(z) is the cumulative distribution function (cdf) of the standard normal distribution. The approximate critical value c required for a size α test is then found by solving Equation (27) for c:

c = exp( nK12 − √n σ1 z_α ). (28)

Here z_α = Φ⁻¹(1 − α) = −Φ⁻¹(α) is the value of the 1 − α quantile of the standard normal distribution. Thus, for error rate α to be fixed, the critical value as a function of n is seen to be a rapidly moving target.
The error probability β is approximated in similar fashion. We have, under H2, that the event L1/L2 > c is the event

(√n/σ2)[ (1/n) log(L1/L2) + K21 ] > (√n/σ2)[ (1/n) log(c) + K21 ], (29)

so that, after substituting for c,

β ≈ 1 − Φ( (√n/σ2)(K12 + K21) − (σ1/σ2) z_α ). (30)

It is seen that β → 0 as sample size n becomes large. Here K12 + K21 is an actual distance measure between f1(x) and f2(x) (Kullback and Leibler, 1951; sometimes referred to as the "symmetric" KL distance) and can be regarded as the "effect size" as used in statistical power calculations.
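A short Python sketch of these approximations may help. The model summaries below are computed for an illustrative Bernoulli pair (the same p1 = 0.75 vs. p2 = 0.50 comparison used later in the text), and the α = 0.05 choice is arbitrary.

```python
# Sketch: CLT approximations to the NP critical value c (Equation 28) and the
# Type 2 error probability beta (Equation 30). Inputs are illustrative.
import numpy as np
from scipy.stats import norm

def np_test_approx(K12, K21, sigma1, sigma2, n, alpha=0.05):
    z_alpha = norm.ppf(1.0 - alpha)
    log_c = n * K12 - np.sqrt(n) * sigma1 * z_alpha       # Equation (28), on log scale
    beta = 1.0 - norm.cdf(np.sqrt(n) / sigma2 * (K12 + K21)
                          - (sigma1 / sigma2) * z_alpha)  # Equation (30)
    return log_c, beta

# Illustrative Bernoulli comparison: p1 = 0.75 vs p2 = 0.50
log_ratio = np.log(0.75 * 0.5 / (0.25 * 0.5))   # log[p1(1-p2)/((1-p1)p2)]
K12 = 0.75 * np.log(1.5) + 0.25 * np.log(0.5)
K21 = 0.50 * np.log(0.5 / 0.75) + 0.50 * np.log(0.5 / 0.25)
sigma1 = np.sqrt(0.75 * 0.25) * log_ratio
sigma2 = np.sqrt(0.50 * 0.50) * log_ratio

for n in (10, 25, 50, 100):
    print(n, np_test_approx(K12, K21, sigma1, sigma2, n))  # beta shrinks as n grows
```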
Five important points about the Neyman-Pearson Lemma are pertinent here. First, the theorem itself is just a mathematical result and leaves unclear how it is to be used in scientific applications. The prevailing interpretation that emerged in the course of 20 th century science was that one of the hypotheses, H 1 , would be accorded a special status ("the null hypothesis"), having its error probability α fixed at a known (usually small) value by the investigator. The other hypothesis, H 2 , would be set up by experiment or survey design to be the only reasonable alternative to H 1 . The other error probability, β, would be managed by study design characteristics (especially sample size), but would remain unknown and could at best only be estimated when the model contained parameters with unknown values.
The hypothesis H 1 would typically play the role of the skeptic's hypothesis, as in the absence of an effect (absence of a difference in means, absence of influence of a predictor variable, absence of dependence of two categorical variables, etc.) under study. The other hypothesis, H 2 , contains the effect under study and serves as the hypothesis of the researcher, who has the scientific charge of convincing a reasoned skeptic to abandon H 1 in favor of H 2 .
Second, the theorem in its original form does not apply to models with unknown parameters. Various extensions were made during the ensuing decades, among them Wilks' (Wilks, 1938) and Wald's (Wald, 1943) theorems. The Wilks-Wald extension allows the test of two composite models (models with one or more unknown parameters) in which one model, taken as the null hypothesis, is formed from the other model (the alternative) by placing one or more constraints on the parameters. An example is a normal (μ, σ 2 ) distribution with both mean μ and variance σ 2 unknown as the model for the alternative hypothesis H 2 , within which the null hypothesis model f 1 constrains the mean to be a fixed known constant: μ = μ 1 .
In such scenarios, the null model is "nested" within the alternative model, that is, the null is a special version of the alternative in which the parameters are restricted to a subset of the parameter space (set of all possible parameter values). Wilks' (Wilks, 1938) and Wald's (Wald, 1943) theorems together provide the asymptotic distribution of a function of the likelihood ratio under both the null and alternative hypotheses, with estimated parameters taken into account. The function is the familiar "generalized likelihood ratio statistic," usually denoted G², given by

G² = −2 log(L1/L2) = 2[ log(L2) − log(L1) ], (31)

where L1 and L2 are the likelihood functions, respectively, for models f1 and f2, with each likelihood maximized over all the unrestricted parameters in that model. The resulting parameter estimates, known as the maximum likelihood (ML) estimates, form a prominent part of frequentist statistics theory (Pawitan, 2001). Let θ2 be the vector of unknown parameters in model f2, formed by stacking the subvectors θ21 and θ22. Likewise, let θ1 under the restricted model f1 be formed by stacking the subvectors θ11 and θ12, where θ11 is a vector of fixed, known constants (i.e., the values corresponding to θ21 are fixed under f1) and θ12 is a vector of unknown parameters. Wald's (1943) theorem (after some mathematical housekeeping: Stroud, 1972) gives the asymptotic distribution of G² as a non-central chisquare(ν, λ) distribution, with degrees of freedom ν equal to the difference between the number of estimated parameters in f2 and the number of estimated parameters in f1, and non-centrality parameter λ being a statistical (Mahalanobis) distance between the true parameter values under H2 and their restricted versions under H1:

λ = n (θ21 − θ11)′ Σ⁻¹ (θ21 − θ11).
Here Σ is a matrix of expected log-likelihood derivatives (details in Severini, 2000). Technically the true values θ 21 must be local to the restricted values θ 11 ; the important aspects for the present are that λ increases with n as well as with the effect size represented by the distance (θ 21 − θ 11 )′Σ −1 (θ 21 − θ 11 ). With the true parameters equal to their restricted values, that is with H 1 governing the data production, the non-centrality parameter becomes zero, and Wald's theorem collapses to Wilks' theorem, which gives the asymptotic distribution of G 2 under H 1 to be an ordinary chisquare(ν) distribution. For linear statistical models in the normal distribution family (regression, analysis of variance, etc.), G 2 boils down algebraically into monotone functions of statistics with exact (non-central and central) t-or F-distributions, and so the various statistical hypothesis tests can take advantage of exact distributions instead of asymptotic approximations.
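The following Python sketch illustrates the Wilks-Wald machinery for the nested normal example just described. The simulated data, the constrained value μ1, and the use of scipy's chi-square tail probability are our illustrative choices, not material from the original text.

```python
# Sketch: generalized likelihood ratio (G^2) test for a nested normal pair
# (null: mu = mu1 with sigma^2 unknown, vs. both parameters free).
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(2)
x = rng.normal(loc=0.4, scale=1.0, size=50)   # illustrative data
mu1 = 0.0                                     # constrained mean under H1

# ML fits: under H2 both parameters are free; under H1 the mean is fixed at mu1.
mu_hat, sd_hat = x.mean(), x.std()                 # H2 ML estimates (ddof = 0)
sd_null = np.sqrt(np.mean((x - mu1) ** 2))         # H1 ML estimate of sigma

logL2 = norm.logpdf(x, mu_hat, sd_hat).sum()
logL1 = norm.logpdf(x, mu1, sd_null).sum()

G2 = 2.0 * (logL2 - logL1)                         # Equation (31)
p_value = chi2.sf(G2, df=1)                        # Wilks: chi-square(1) under H1
print(G2, p_value)
```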
The concept of a confidence interval or region for one or more unknown parameters follows from Neyman-Pearson hypothesis testing in the form of a region of parameter values for which hypothesis H1 would not be rejected at fixed error rate α. We remark further that although a vast amount of everyday science relies on the Wilks-Wald extension of Neyman-Pearson testing (and confidence intervals), frequentist statistics theory prior to the 1970s had not provided much advice on what to do when the two models are not nested.
Certainly nowadays one could set up a model f1(x) as H1 in a hypothesis test against a non-overlapping model f2(x) taken as H2 and obtain the distributions of the generalized likelihood ratio under both models with simulation/bootstrapping.
Third, the Neyman-Pearson Lemma provides no guidance in the event of model misspecification. The theorem assumes that the data were generated under either H1 or H2. However, the "Type 3" error of basing inferences on an inadequate model family is widely acknowledged to be a serious (if not fatal) scientific drawback of the Neyman-Pearson framework (and parametric modeling in general, see Chatfield, 1995). Modern applied statistics rightly stresses rigorous checking of model adequacy with various diagnostic procedures, such as the standard battery of residual analyses in regression models. Deciding between two models based on diagnostic qualities has been a standard workaround in the situation mentioned above for which the two models are not nested. For instance, one might choose the model with the more homoscedastic residuals.
Fourth, the asymmetry of the error structure has led to difficulties in scientific interpretation of Neyman-Pearson hypothesis testing results. The difficulties stem from α being a fixed constant. A decision to prefer hypothesis H2 over H1 because the LR (Equation 23) is smaller than c is not so controversial. The H2 over H1 decision has some intuitively desirable statistical properties. For example, the error rate β asymptotically approaches 0 as the sample size n grows larger. Further, β asymptotically approaches 0 as model f2 becomes "farther" from f1 (in the sense of the symmetric KL distance K12 + K21 as seen in Equation 30). Mired in controversy and confusion, however, is the decision to prefer H1 over H2 when the LR is larger than c. The value of c is set by the chosen value of the error rate α, using the probabilistic properties of model f1. If a larger sample size is used, the LR has more terms, and the value of c necessary to attain the desired value of α changes. In other words, c depends on sample size n and moves in such a way as to keep α fixed (at 0.05 or whatever other value is used; Equation 28). The net effect is to leave the Neyman-Pearson framework without a mechanism to assess evidence for H1, for no matter how far apart the models are or how large a sample size is collected, the probability of wrongly choosing H2 when H1 is true remains stuck at α.
Fifth, scientific practice rarely stops with just two models. In an analysis of variance, after an overall test of whether the means are different, one usually needs to sort out just who is bigger than whom. In a multiple regression, one is typically interested in which subset of predictor variables provide the best model for predicting the response variable. In a categorical data analysis of a multiway contingency table, one is often seeking to identify which combination of categorical variables and lower and higher order interactions best account for the survey counts. For many years (through the 1980s at least), standard statistical practice called for multiple models to be sieved through some (often long) sequence of Neyman-Pearson tests, through processes such as multiple pairwise comparisons, stepwise regression, and so on. It has long been recognized, however, that selecting among multiple models with Neyman-Pearson tests plays havoc with error rates, and that a pairwise decision tree of "yes-no's" might not lead to the best model among multiple models (Whittingham et al., 2006 and references therein). Using Neyman-Pearson tests for selection among multiple models was (admittedly among statisticians) a kludge to be used only until something better was developed.

Fisher Significance Analysis
R. A. Fisher never fully bought into the Neyman-Pearson framework, although generations of readers have debated what exactly Fisher was arguing for, owing to the difficulty of his writing style and the opacity of his mathematics. Fisher rejected the scientific usefulness of the alternative hypothesis (likely in part because of the lurking problem of misspecification) and chose to focus on single-model decisions (resulting in lifelong battles with Neyman; see the biography by Box, 1978). Yea or nay, is model f1 an adequate representation of the data? As in the Neyman-Pearson framework, Fisher typically cast the null hypothesis H1 in the role of a skeptic's hypothesis (the lady cannot tell whether the milk or the tea was poured first). It was scientifically sufficient in this approach for the researcher to develop evidence persuading the skeptic that the null model is inadequate. The inferential ambitions here are necessarily more limited, in that no alternative model is enlisted to contribute more insights for understanding the phenomenon under study, such as an estimate of effect size. As well, Fisher's null hypothesis approach preserves the Neyman-Pearson incapacitation when the null model is not contradicted by data, in that at best, one will only be able to say that the data are a plausible realization of observations that could be generated under H1.
Fisher's principal tool for the inference was the P-value. For Fisher's preferred statistical distribution models, the data enter into the maximum likelihood estimate of a parameter in the form of a statistic, such as the sample mean. The implication is that such a statistic carries all the inferential information about the parameter; knowing the statistic's value is the same (for purposes of inference about the parameter) as knowing the values of all the individual observations. Fisher coined the term "sufficient statistic" for such a quantity. The null model in Fisher significance analysis is formed by constraining a parameter to a prespecified value. In the tea tasting example, the probability of correct identification is constrained to one half. Fisher's P-value is the probability that data drawn from the model H1 yield a sufficient statistic as extreme or more extreme than the sufficient statistic calculated from the real data.
In the absence of an alternative model, Fisher's strict P-value accomplishes an inference similar to what is called a goodness of fit test (or model adequacy test) in contemporary practice, as the inference seeks to establish whether or not the data plausibly could have arisen from model f1. Accordingly, just about any statistic (not only a sufficient statistic) can be used to obtain a P-value, provided the distribution of the statistic can be derived or approximated under the model f1. Goodness of fit tests therefore tend to multiply, as witnessed by the plethora of tests available for the normal distribution. To sort out the qualities of different goodness of fit tests, one usually has to revert to a Neyman-Pearson two-model framework to establish for what types of alternative models a particular test is powerful.

Neyman-Pearson Testing With P-values
P-values are, of course, routinely used in Neyman-Pearson hypothesis testing, but the inference is different from that made with Fisher significance. A P-value in the context of the generalized LR test above (Equation 31) is defined as the probability that, if the data generation process were to be repeated, the new value of the LR would exceed the one already observed, provided that the data were generated under H1. Hinkley (1987) interprets the P-value as the Type 1 error rate that an ensemble of hypothetical experiments would have if their critical level c were set at the observation of this experiment. In the generalized LR test, the approximate P-value would simply be the area to the right of the observed value of G² under the chisquare pdf applicable for H1-generated data. For Fisher's preferred statistical distributions (those with sufficient statistics, nowadays called exponential family distributions), the generalized LR statistic G² algebraically reduces to a monotone function of one or more sufficient statistics for the parameter or parameters under constraint in the model f1. In the generalized likelihood ratio framework, the hypothesis test decision between H1 and H2 can be made by comparing the P-value to the fixed value of α, rejecting H1 as a plausible origin of the data if the P-value is ≤ α.
In both Neyman-Pearson hypothesis testing and Fisher significance analysis, the P-value provides no evidence for model H 1 . The P-value in the two-model framework has been thought of as an inverse measure of the "evidence" for H 2 , as the distribution of the P-value under data generated by H 2 becomes more and more concentrated near zero as sample size becomes large or as model f 2 becomes "farther" from f 1 . In the Fisher one-model framework an alternative model is unspecified. Consequently, a low P-value has been interpreted as "evidence" against H 1 . However, the P-value under data generated by H 1 has a uniform distribution (because a continuous random variable transformed by its own cumulative distribution function has a uniform distribution) no matter what the sample size is or how far away the true data generating process is. Hence, as with NP tests, Fisher's P-value has no evidential value toward f 1 , as any P-value is equally likely under H 1 .
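The uniformity of the P-value under H1, and its concentration near zero under an alternative, can be seen in a short simulation. The one-sided z-test, sample size, and effect size below are arbitrary illustrative choices, not part of the original text.

```python
# Sketch: distribution of a P-value under H1 versus under an alternative.
# A one-sided z-test for a normal mean (sigma known) is used purely for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, reps = 50, 20_000

def pvalues(mu_true):
    x = rng.normal(loc=mu_true, scale=1.0, size=(reps, n))
    z = np.sqrt(n) * x.mean(axis=1)          # test statistic, standardized under H1: mu = 0
    return norm.sf(z)                        # one-sided P-values

p_h1 = pvalues(0.0)    # data generated under H1: roughly uniform on (0, 1)
p_h2 = pvalues(0.3)    # data generated under an alternative: piles up near 0

print(np.quantile(p_h1, [0.25, 0.5, 0.75]))  # close to (0.25, 0.50, 0.75)
print(np.quantile(p_h2, [0.25, 0.5, 0.75]))  # all small
```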
Ecologists use and discuss hypothesis testing in both the Fisher sense and the Neyman-Pearson sense, sometimes referring to both enterprises as "null hypothesis testing." The use of P-values, strongly argued for by some (Hurlbert and Lombardi, 2009), does not in and of itself distinguish the two approaches. Rather, a specific alternative hypothesis, an estimable effect size, and (most controversially) a decision rule fixing a Type 1 error rate (i.e., comparing a P-value to α) identifies the analysis as more Neyman-Pearsonian than Fisherian. While Fisher himself originated the P ≤ 0.05 tradition for judging whether a deviation is significant [… "it is convenient to draw the line at about the level at which we can say: 'Either there is something in the treatment, or a coincidence has occurred, such as does not occur more than once in twenty trials.'" Fisher (1926)], he was mostly casual about the cutoff and viewed P-values more as evidence against the null hypothesis in question. In ecology, null hypotheses in the Fisherian sense are seen, for instance, in analyses of species assembly patterns in ecological communities, such as in testing whether bird species groups on offshore islands could be modeled as randomly drawn from the mainland (Connor and Simberloff, 1979). By contrast, a field experiment aimed at demonstrating the existence of competition and estimating an effect size (Underwood, 1986) would take on a Neyman-Pearsonian flavor.

Equivalence Testing and Severity
Attempts have been made to modify the Neyman-Pearson framework to accommodate the concept of evidence for H1. In some applied scientific fields, for example in pharmacokinetics and environmental science, regulatory practice has created a burden of proof around models normally regarded as null hypothesis models: the new drug has an effect equal to the standard drug, the density of a native plant has been restored to equal its previous level (Anderson and Hauck, 1983; McDonald and Erickson, 1994; Dixon, 1998). Equivalence testing and non-inferiority testing (e.g., Wellek, 2010) are statistical methods designed to address the problem that "absence of a significant effect" is not the same as "an effect is significantly absent." In practice, the equivalence testing methods reverse the roles of null and alternative hypotheses by specifying a parameter region that constitutes an acceptably small departure from the parameter's constrained value and then casting that region as the alternative hypothesis. Typically, two statistical hypothesis tests are required to conclude that the parameter is within the small region containing the constraint, such as two one-sided t-tests (to show that the parameter is bounded by each end of the region).
Another proposed solution for the evidence-for-the-null-hypothesis problem is the concept of severity (Mayo, 1996, 2018; Mayo and Spanos, 2006) and the closely related method of reverse testing (Parkhurst, 2001). Severity is a sort of P-value under a specified (or possibly estimated) version of the alternative hypothesis. It is the probability that a test statistic more extreme than the one observed would be obtained if the experiment were to be repeated, if the data were arising from model f2 (with the particular effect size specified). In the generalized likelihood ratio framework, the severity would be calculated as the area to the right of the observed value of G² under the non-central chisquare pdf applicable for data generated under model f2, with the non-centrality parameter set at a specified value. Thus, severity is a kind of attained power for a particular effect size. Also, severity is mostly discussed in connection with one-sided hypotheses, so that its calculation under the two-sided generalized likelihood ratio statistic is at best an approximation. However, if the effect size is substantial, the probability contribution from the "other side" is low, and the approximation is likely to be fine. In general, the severity of the test is related to the size of the effect, so care needs to be taken in the interpretation of the test.
For a given value of the LR, if the effect size is high, the probability of obtaining stronger evidence against H 1 is high, and the severity of the test against H 1 is high. "A claim is severely tested to the extent that it has been subjected to and passes a test that probably would have found flaws, were they present" (Mayo, 2018).
For both equivalence testing and severity, we are given procedures in which consideration of evidence requires two statistics and analyses. In the case of equivalence testing, we have a statistical test for each side of the statistical model specified by H1, and for severity we have a statistic for H2 and a statistic for H1. Indeed, Thompson (2007), section 11.2, considers that for P-values to be used as evidence for one model over another, they must be used in pairs. There is evidence for H1 relative to H2 if the first P-value, say P1, is large and the second P-value, say P2, is small. The requirement for two analyses and two interpretations seems a disadvantageous burden for applications. More importantly, the equivalence testing and severity concepts do not yet accommodate the problems of multiple models or non-nested models.

Evidence
The LR (Equation 7), as discussed by Hacking (1965) and Edwards (1972), can be regarded as a measure of evidence for H1 and against H2 (Edwards 1972 termed it support, but the word has a different technical meaning in probability and is better avoided here), or equivalently, an inverse measure of evidence for H2 and against H1. The evidence concept here is post-data in that the realized value of the LR itself, and not a probability calculated over hypothetical experiment repetitions, conveys the magnitude of the empirical scientific case for H1 or H2. However, restricting attention to just the LR itself leaves the prospect of committing an error unanalyzed; while scientists want to search for truth, they strongly want (for reasons partly sociological) to avoid being wrong. Royall (1997, 2000) argued forcefully for greater use of evidence-based inferences in statistics, and to Hacking's and Edwards' frameworks he added formal procedures and consideration of errors. Royall's basic setup uses completely specified models as in Neyman-Pearson, but the conclusion about which model is favored by the data is based on fixed thresholds for the LR value, not thresholds determined by any error rate. The idea is to conclude there is strong evidence in favor of model H1 when L1 is k times L2 and strong evidence in favor of H2 when L2 is k times L1. Royall's conclusion structure in terms of the LR then has a trichotomy of outcomes:

L1/L2 ≥ k: strong evidence for H1;
1/k < L1/L2 < k: weak or inconclusive evidence;
L1/L2 ≤ 1/k: strong evidence for H2.
For k, values of 8, 20, or 32 are mentioned. The k value is chosen by the investigator, but unlike α in the Neyman-Pearson framework, k is not dependent on sample size. Viewed as evidence, the LR is a post-data measure. The inference does not make appeals to hypothetical repeated sampling. Royall (1997, 2000) moreover defines pre-data error rates which are potentially useful in experimental design and serve as reassurance that the evidential approach will not lead investigators astray too often. Suppose the data were generated by model f1. It is possible that the LR could take a wayward value, leading to one of two possible errors in conclusion that could occur: (1) the LR could take a value corresponding to weak or inconclusive evidence (the error of weak evidence), or (2) the LR could take a value corresponding to strong evidence for H2 (the error of misleading evidence). Given the data are generated by model f1, the probabilities of the two possible errors are defined as follows:

P(weak evidence | H1) = P(1/k < L1/L2 < k | H1) = W1,
P(misleading evidence | H1) = P(L1/L2 ≤ 1/k | H1) = M1.

Similarly, given the data are generated under H2,

P(weak evidence | H2) = P(1/k < L1/L2 < k | H2) = W2,
P(misleading evidence | H2) = P(L1/L2 ≥ k | H2) = M2.
The error probabilities M1, M2, W1, and W2 can be approximated with the CLT results for L1/L2 (Equations 11-16). Proceeding as before with the Neyman-Pearson error rates, we find that

M1 ≈ Φ( −(√n/σ1)[ (1/n) log(k) + K12 ] ), (38)

M2 ≈ Φ( −(√n/σ2)[ (1/n) log(k) + K21 ] ), (39)

W1 ≈ Φ( (√n/σ1)[ (1/n) log(k) − K12 ] ) − Φ( −(√n/σ1)[ (1/n) log(k) + K12 ] ), (40)

W2 ≈ Φ( (√n/σ2)[ (1/n) log(k) − K21 ] ) − Φ( −(√n/σ2)[ (1/n) log(k) + K21 ] ). (41)

The error probabilities M1, M2, W1, and W2 depend on the models being compared, but it is easy to show that all four probabilities, as approximated by Equations (38-41), converge to zero as sample size n becomes large. For either hypothesis Hi (i = 1, 2), the total error probability given by Mi + Wi is additionally a monotone decreasing function of n, as for instance

M1 + W1 ≈ Φ( (√n/σ1)[ (1/n) log(k) − K12 ] ), (42)

in which the argument of the cdf Φ(·) is seen (by ordinary differentiation, assuming k > 1) to be monotone decreasing in n (the expression for M2 + W2 would have σ2 and K21 in place of σ1 and K12).
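A short Python sketch of the H1-side quantities follows. The values of K12, σ1, and k are illustrative (rounded values matching the Bernoulli comparison, p1 = 0.75 vs. p2 = 0.50, used in the illustration below), and the function is simply a transcription of the CLT approximations above.

```python
# Sketch: CLT approximations (Equations 38, 40, 43) to the probabilities of
# misleading, weak, and strong evidence when the data arise from model f1.
import numpy as np
from scipy.stats import norm

def evidential_errors(K12, sigma1, n, k=8.0):
    a = np.sqrt(n) / sigma1
    M1 = norm.cdf(-a * (np.log(k) / n + K12))            # Equation (38)
    W1 = norm.cdf(a * (np.log(k) / n - K12)) - M1        # Equation (40)
    V1 = 1.0 - M1 - W1                                   # Equation (43)
    return M1, W1, V1

K12, sigma1 = 0.1308, 0.4757    # rounded, illustrative (Bernoulli 0.75 vs 0.50)
for n in (5, 10, 25, 50, 100):
    print(n, evidential_errors(K12, sigma1, n))   # M1 + W1 shrinks and V1 grows with n
```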
The probability V1 of strong evidence for model f1(x), given the data are indeed generated by model f1(x), becomes

V1 = 1 − M1 − W1, (43)

with V2 = 1 − M2 − W2 defined in kind. Here V stands for veracity or veridicality (because of context, there should be no confusion with the variance operator). It follows from the monotone property of Mi + Wi that Vi is a monotone increasing function of n. Furthermore, it is straightforward to show that Vi > Mi, i = 1, 2.
Note that V1, M1, and W1 are not in general equal to their counterparts V2, M2, and W2, nor should we expect them to be; frequencies of errors will depend on the details of the model generating the data. One model distribution with, say, a heavy tail could produce errors at a greater rate than a light-tailed model. The asymmetry of errors suggests possibilities of pre-data design to control errors. For instance, instead of LR cutoff points 1/k and k, one could find and use cutoff values k1 and k2 that render M1 and M2 nearly equal for a particular sample size and particular values of σ1, K12, σ2, and K21. Such design, however, will induce an asymmetry in the error rates (defined below) for misspecified models.
Interestingly, as a function of n, Mi (i = 1, 2) increases at first, rising to a maximum value before decreasing asymptotically to zero. The value ñ1 at which M1 is maximized is found by maximizing the argument of the normal cdf in Equation (38):

ñ1 = log(k) / K12, (44)

with the corresponding maximum value of M1 being

M̃1 = Φ( −(2/σ1) √( K12 log(k) ) ). (45)

Expressions for ñ2 and M̃2 are similar and substitute K21 and σ2 in place of the H1 quantities. That the Mi functions would increase with n initially is counterintuitive at first glance. With just a few observations, the variability of the likelihood ratio is not big enough to provide much chance of misleading evidence, although the chance of weak evidence is high. As the sample size increases, the chance of misleading evidence grows at first, replacing some of the chance of weak evidence, before decreasing. It is the overall probability of either weak or misleading evidence, Wi + Mi, that decreases monotonically with sample size.
2.1.6. Illustration of the Concept of Evidence-We illustrate the error properties of evidence under correct model specification with an example. Suppose the values x1, x2, …, xn are zeros and ones that arose as iid observations from a Bernoulli distribution with P(X = 1) = p. The pdf is f(x) = p^x (1 − p)^(1−x), where x is 0 or 1. The sum of the observations of course has a binomial distribution. We wish to compare hypothesis H1: p = p1 with H2: p = p2, where p1 and p2 are specified values. The log-likelihood ratio is

log(L1/L2) = (x1 + x2 + ⋯ + xn) log(p1/p2) + [n − (x1 + x2 + ⋯ + xn)] log[(1 − p1)/(1 − p2)]. (46)

From Equations (4) and (9) we find that

K12 = p1 log(p1/p2) + (1 − p1) log[(1 − p1)/(1 − p2)], (47)

σ1² = p1(1 − p1) { log[ p1(1 − p2) / ((1 − p1) p2) ] }². (48)

In the top panel of Figure 3, simulated values of the probability of strong evidence for model H1, given by V1 = 1 − M1 − W1, are compared with the values as approximated with the CLT (Equations 38, 40). The simulated values create a jagged curve due to the discrete nature of the Bernoulli distribution but are well-characterized by the CLT approximation. The lower panel of Figure 3 portrays the probability of misleading evidence given by M1 as a function of n. The discrete serrations are even more pronounced in the simulated values of M1, and the CLT approximation (Equation 38) follows only the lower edges; the approximation could likely be improved (i.e., set toward the middle of the serrated highs and lows) with a continuity correction. The CLT nonetheless picks up the qualitative behavior of the functional form of M1.
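The following Python sketch is our own illustration of the same comparison, using the p1 = 0.75 and p2 = 0.50 values reported later for the Figure 3 example and k = 8 as an illustrative threshold; it reproduces the qualitative behavior of Figure 3 by comparing simulated V1 and M1 with their CLT approximations.

```python
# Sketch: simulated vs. CLT-approximated probabilities of strong evidence (V1)
# and misleading evidence (M1) for H1: p = 0.75 versus H2: p = 0.50,
# with data generated under H1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
p1, p2, k, reps = 0.75, 0.50, 8.0, 100_000
log_ratio = np.log(p1 * (1 - p2) / ((1 - p1) * p2))
K12 = p1 * np.log(p1 / p2) + (1 - p1) * np.log((1 - p1) / (1 - p2))
sigma1 = np.sqrt(p1 * (1 - p1)) * log_ratio

for n in (10, 30, 60, 100):
    s = rng.binomial(n, p1, size=reps)                  # sufficient statistic under H1
    loglr = s * np.log(p1 / p2) + (n - s) * np.log((1 - p1) / (1 - p2))  # Equation (46)
    V1_sim = np.mean(loglr >= np.log(k))                # strong evidence for H1
    M1_sim = np.mean(loglr <= -np.log(k))               # misleading evidence
    a = np.sqrt(n) / sigma1
    V1_clt = 1.0 - norm.cdf(a * (np.log(k) / n - K12))  # 1 - (M1 + W1)
    M1_clt = norm.cdf(-a * (np.log(k) / n + K12))       # Equation (38)
    print(n, round(V1_sim, 3), round(V1_clt, 3), round(M1_sim, 4), round(M1_clt, 4))
```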
2.1.7. P-values, Severity, and Evidence-The concept of evidence allows reinterpretation of P-values in a clarifying manner. Suppose we denote by l1/l2 the realized (i.e., post-data) value of the LR, the lower case signaling the actual outcome rather than the random variable (pre-data) version of the LR denoted by L1/L2. The classical P-value is the probability, given the data arise from model H1, that a repeat of the experiment would yield a LR value more extreme than the value l1/l2 that was observed. In our CLT setup, we can write

P = P(L1/L2 ≤ l1/l2 | H1) ≈ Φ( −(√n/σ1)[ (1/n) log(l2/l1) + K12 ] ). (49)
Comparing P with the expression for M 1 (Equation 38), we find that P is the probability of misleading evidence under model f 1 if the experiment were repeated and the value of k were taken as l 2 /l 1 .
If the value of l 1 /l 2 is considered to be the evidence provided by the experiment, the value of P is a monotone function of l 1 /l 2 and thereby might be considered to be an evidence measure on another scale. P however is seen to depend on other quantities as well: for a given value of l 1 /l 2 , P could be greater or less depending on the quantities n, K 12 , and σ 1 . Furthermore, K 21 and σ 2 are left out of the value of P, giving undue influence to model f 1 in the determination of amount of evidence, a finger on the scale so to speak. The evidential framework therefore argues for the following distinction in the interpretation of P: the evidence is l 1 /l 2 , while P, like M 1 , is a probability of misleading evidence, except that P is defined post-data.
In fairness to both models, we can define two P-values based on the extremeness of evidence under model f1 and under model f2:

P1 = P(L1/L2 ≤ l1/l2 | H1) ≈ Φ( −(√n/σ1)[ (1/n) log(l2/l1) + K12 ] ), (50)

P2 = P(L1/L2 ≥ l1/l2 | H2) ≈ Φ( −(√n/σ2)[ (1/n) log(l1/l2) + K21 ] ). (51)

These are interpreted as the probabilities of misleading evidence under models 1 and 2, respectively, if the value of k were taken to be l2/l1. The quantity 1 − P2 in this context is the severity as defined by Mayo (1996, 2018) and Mayo and Spanos (2006). Taper and Lele (2011) termed P1 or P2 a local probability of misleading evidence (M_L in their notation), as opposed to a global, pre-data probability of misleading evidence (M_G in their notation; M1 and M2 here) characterizing the long-range reliability of the design of the data-generating process.
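For completeness, a small Python sketch computes P1 and P2 from an observed log-LR using Equations (50) and (51); the observed value and the (rounded) model summaries are hypothetical.

```python
# Sketch: post-data probabilities of misleading evidence P1 and P2
# (Equations 50-51) for an observed log-likelihood ratio. Inputs are illustrative.
import numpy as np
from scipy.stats import norm

def local_misleading_probs(obs_loglr, n, K12, K21, sigma1, sigma2):
    """P1, P2: misleading-evidence probabilities if k were set at the observed LR."""
    a1, a2 = np.sqrt(n) / sigma1, np.sqrt(n) / sigma2
    P1 = norm.cdf(-a1 * (-obs_loglr / n + K12))   # extremeness under model 1
    P2 = norm.cdf(-a2 * ( obs_loglr / n + K21))   # extremeness under model 2
    return P1, P2

# Bernoulli(0.75) vs Bernoulli(0.50), n = 30, observed log(l1/l2) = 2.5 (hypothetical)
print(local_misleading_probs(2.5, 30, 0.1308, 0.1438, 0.4757, 0.5493))
```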

Misspecified Models
George Box's (Box, 1979) oft-quoted aphorism that "all models are wrong, but some are useful" becomes pressing in ecology, a science in which daily work and journal articles are filled with statistical and mathematical representations. Ecologists must therefore assume that Type 3 errors are prevalent, even routine, in their work. Here we compare Neyman-Pearson hypothesis testing with evidential statistics to try to understand how analyses can go wrong, and how analyses can be made better, in ecological statistics. For a statistical method of choosing between f1(x) and f2(x), we now ask how well the method performs toward choosing the model closest to the true model g(x) when both candidate models are misspecified.

Neyman-Pearson Hypothesis Testing Under Misspecification
Statisticians have long cautioned about the prospect that both models f1 and f2 in the Neyman-Pearson framework, broadly interpreted to include testing composite models with generalized likelihood ratio and other approaches, could be misspecified, and as a result that the advertised error rates (or by extension the coverage rates for confidence intervals) would become distorted in unknown ways (for instance, Chatfield, 1995). The approximate behavior of the LR under the CLT under misspecification (Equations 20-22) allows us to view directly how the error probabilities α and β can be affected in Neyman-Pearson testing when the models are misspecified.
The critical value c (Equation 28) is chosen as before, under the assumption that the observations are generated from model f1. We ask the following question: "Suppose the real Type 1 error is defined as picking model f2 when the model f1 is actually closest to the true pdf g(x) (that is, when ΔK > 0). What is the probability, let us say α′, of this Type 1 error, given that f1 is the better model?" We now have, after substituting for c (Equation 28),

α′ = P(L1/L2 ≤ c | ΔK > 0) = P( (√n/σg)[ (1/n) log(L1/L2) − ΔK ] ≤ (√n/σg)(K12 − ΔK) − (σ1/σg) z_α ), (52)

and so the CLT (Equation 22) tells us that

α′ ≈ Φ( (√n/σg)(K12 − ΔK) − (σ1/σg) z_α ). (53)

In words, the Type 1 error realized under model misspecification is generally not equal to the specified test size. Note that Equation (53) collapses to the nominal size α, as desired, if f1 = g.
Whether the actual Type 1 error probability α′ is greater than, equal to, or less than the advertised level α depends on the various quantities arising from the configuration of f1(x), f2(x), and g(x) in model space. Because the standard normal cdf Φ(·) is a monotone increasing function, we have α′ > α if and only if

(√n/σg)(K12 − ΔK) − (σ1/σg) z_α > −z_α,

that is, if and only if √n (K12 − ΔK) > (σ1 − σg) z_α. Three cases result:

if σ1 > σg, then α′ > α when √n (K12 − ΔK)/(σ1 − σg) > z_α;
if σ1 = σg, then α′ > α when K12 > ΔK;
if σ1 < σg, then α′ > α when √n (K12 − ΔK)/(σ1 − σg) < z_α.
The ratio (K 12 − ΔK) / (σ 1 − σ g ) compares the difference between what we assumed about the LR mean (K 12 ) and what is the actual mean (ΔK) with the difference between what we assumed about the LR variability (σ 1 ) with what is the actual variability (σ g ). The left-hand inequalities for each case are reversed if α′ < α.
The persuasive strength of Neyman-Pearson testing always revolved around the error rate α being known and small, and the P-value, if used, being an accurate reflection of the probability of more extreme data under H 1 . When L 1 /L 2 ≤ c in the Neyman-Pearson framework with correctly specified models, the reasoned observer is forced to abandon H 1 as untenable. However, in the presence of misspecification, the real error rate α′ is unknown, as is a real P-value for a generalized likelihood ratio test. Furthermore, α′ is seen in Equation (53) to be an increasing function of n if K 12 > ΔK (remember that for a generalized LR test the Type 1 error is predicated on ΔK > 0), with 1 as an upper asymptote. If model f 2 is very different from model f 1 (K 12 large) but is almost as close to truth as f 1 (ΔK small), then Type 1 errors will be rampant, more so with increasing sample size.
That greater sample size would make error more likely seems counterintuitive, but it can be understood from the CLT results for the average log-LR given by (1/n) log (L 1 /L 2 ) (Equations 12, 21). If the observations arise from f 1 (x) (correct specification), the average log-LR has a mean of K 12 and its distribution becomes more and more concentrated around K 12 as n becomes large. If however the observations arise from g (x) (misspecification), the average log-LR has a mean of ΔK and its distribution becomes more and more concentrated around ΔK as n becomes large. A Neyman-Pearson test based on a statistic that has a null hypothesis mean of K 12 will become more and more certain to reject the null hypothesis when the true mean is ΔK. Thus, the Neyman-Pearson framework can be a highly unreliable approach for picking the best model in the presence of misspecification.
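The following Python sketch evaluates α′ (Equation 53) as a function of n for an illustrative misspecified configuration (the Bernoulli setup with p1 = 0.75, p2 = 0.50, and pg = 0.65 used later in the text); any configuration with K12 > ΔK > 0 shows the same qualitative rise toward 1.

```python
# Sketch: realized Type 1 error probability alpha' (Equation 53) as a function
# of n when both candidate models are misspecified. Inputs are illustrative.
import numpy as np
from scipy.stats import norm

def alpha_prime(K12, deltaK, sigma1, sigma_g, n, alpha=0.05):
    z_alpha = norm.ppf(1.0 - alpha)
    return norm.cdf(np.sqrt(n) / sigma_g * (K12 - deltaK)
                    - (sigma1 / sigma_g) * z_alpha)     # Equation (53)

p1, p2, pg = 0.75, 0.50, 0.65
log_ratio = np.log(p1 * (1 - p2) / ((1 - p1) * p2))
K12 = p1 * np.log(p1 / p2) + (1 - p1) * np.log((1 - p1) / (1 - p2))
deltaK = np.log((1 - p1) / (1 - p2)) + pg * log_ratio   # DeltaK for Bernoulli models
sigma1 = np.sqrt(p1 * (1 - p1)) * log_ratio
sigma_g = np.sqrt(pg * (1 - pg)) * log_ratio

for n in (10, 50, 200, 1000, 5000):
    print(n, round(alpha_prime(K12, deltaK, sigma1, sigma_g, n), 3))  # climbs toward 1
```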
The error probability β′ is defined and approximated in similar fashion. If model f2 is closer to truth, we have ΔK < 0, and from Equations (28-30) the event L1/L2 > c is now the event

(√n/σg)[ (1/n) log(L1/L2) − ΔK ] > (√n/σg)(K12 − ΔK) − (σ1/σg) z_α.
The CLT then gives

β′ ≈ 1 − Φ( (√n/σg)(K12 − ΔK) − (σ1/σg) z_α ). (59)

As a function of n, β′ goes to zero as n becomes large, preserving that desirable property of β from Neyman-Pearson testing under correct specification. However, if the experiment or survey is being planned around the value of β, under misspecification the actual value as defined by β′ could be quite different. In particular, if β′ > β, we must have

(√n/σg)(K12 − ΔK) − (σ1/σg) z_α < (√n/σ2)(K12 + K21) − (σ1/σ2) z_α.
As with α′, this inequality can be resolved into cases according to the relative sizes of the variances, with the inequalities reversed for β′ < β. The degree to which β′ departs from β is seen to depend on a tangled bank of quantities arising from the configuration of f1(x), f2(x), and g(x) in model space.

P-values, Equivalence Testing, and Severity Under Misspecification
The problems with α and β, and with P-values as defined for the generalized LR setting in Equations (50) and (51), under misspecification highlight problems that might arise in significance testing, equivalence testing, or severity analysis. With misspecification, the true P-value (P′ say) can differ greatly from the P-value (Equation 49) calculated under H1 and thereby could promote misleading conclusions (P′ is formed from Equation (49) by substituting σg for σ1 and −ΔK for K12). Equivalence testing, being retargeted hypothesis testing, will take on all the problems of hypothesis testing under misspecification. Severity is 1 − P2 as defined by Equation (51), but with misspecification the true value of P2 is Equation (51) with σg substituted for σ2 and −ΔK substituted for K21. With misspecification, the true severity could differ greatly from the severity calculated under H2. One might reject H1 falsely, or one might fail to reject H1 falsely, or one might fail to reject H1 and falsely deem it to be severely tested. Certainly, in equivalence testing and severity analysis, the problem of model misspecification is acknowledged as important (for instance, Mayo and Spanos, 2006; Spanos, 2010) and is addressed with model evaluation techniques, such as residual analysis and goodness of fit testing.

Evidence Under Misspecification
To study the properties of evidence statistics under model misspecification, we redefine the probabilities of weak evidence and misleading evidence in a manner similar to how the error probabilities were handled above in the Neyman-Pearson formulation. We take W1′ and M1′ to be the probabilities of weak and misleading evidence, given that model f1 is closer to truth, that is, given that ΔK > 0:

P(weak evidence | ΔK > 0) = P(1/k < L1/L2 < k | ΔK > 0) = W1′,
P(misleading evidence | ΔK > 0) = P(L1/L2 ≤ 1/k | ΔK > 0) = M1′.
Similarly, given model f2 is closer to truth, that is, given that ΔK < 0,

P(weak evidence | ΔK < 0) = P(1/k < L1/L2 < k | ΔK < 0) = W2′,
P(misleading evidence | ΔK < 0) = P(L1/L2 ≥ k | ΔK < 0) = M2′.

We thus obtain, for ΔK > 0,

M1′ ≈ Φ( −(√n/σg)[ (1/n) log(k) + ΔK ] ). (69)

The other error probability under misspecification, with ΔK < 0, likewise becomes

M2′ ≈ Φ( −(√n/σg)[ (1/n) log(k) + |ΔK| ] ). (70)

The expression is identical to Equation (69), where ΔK > 0, and so we may write

Mi′ ≈ Φ( −(√n/σg)[ (1/n) log(k) + |ΔK| ] ), i = 1, 2. (71)
In words, for models with no unknown parameters under misspecification, the error probabilities M 1 ′ and M 2 ′ are identical. Using different LR cutoff points k 1 and k 2 to control error probabilities M 1 and M 2 under correct specification would break the symmetry of errors under misspecification. The consideration of evidential error probabilities in study design forces the investigator to focus on what types of errors and possible model misspecifications are most important to the study.
The symmetry of error rates is preserved for weak evidence, for which we obtain

Wi′ ≈ Φ( (√n/σg)[ (1/n) log(k) − |ΔK| ] ) − Φ( −(√n/σg)[ (1/n) log(k) + |ΔK| ] ), i = 1, 2. (72)

The formulae for α′ (Equation 53), β′ (Equation 59), and Mi′, Wi′, i = 1, 2 (Equations 71, 72) allow the investigation of how these error rates change as a function of the sample size n. However, given that these formulae also involve ΔK, K12, and K21, multiple configurations should be explored in model space. Figure 4 illustrates how changing parameters can change KL divergences. For instance, the generating process and the approximating models could be aligned in space (see Figure 4A) or not (Figure 4B). Other configurations are explored in Figures 4C,D. The error rates for each one of these configurations are shown in Figure 5.
Four properties of the error probabilities under misspecification are noteworthy. First, M 1 ′, M 2 ′, W 1 ′, and W 2 ′ all asymptotically approach zero as n becomes large provided ΔK ≠ 0 (that is, provided one of the models is measurably better than the other), consistent with their behavior under correct specification. Second, for a given value of |ΔK|, that is, for a given difference in the qualities of models H 1 and H 2 in representing truth, M 1 ′ is equal to M 2 ′, and W 1 ′ is equal to W 2 ′. Thus, neither model has special standing. Third, M 1 ′ and W 1 ′ asymptotically approach M 1 and W 1 as model f 1 becomes better at representing truth (i.e., as K (g, f 1 ) → 0), and likewise M 2 ′ and W 2 ′ approach M 2 and W 2 as f 2 becomes better. Fourth, if ΔK = 0, that is, if both models are equal in quality, then M 1 ′ and M 2 ′ each approach 1/2, and W 1 ′ and W 2 ′ each approach zero, as n becomes large. The above four properties are intuitive and sensible.
The total error probability under misspecification, given by M i ′ + W i ′ (i = 1, 2), is identical for both models and remains a monotone decreasing function of n. The probability of strong evidence for model f i when f i is closer to g, given by V i ′ = 1 − M i ′ − W i ′, thus remains a monotone increasing function of n with an asymptote of 1. As was the case for correctly specified models, V i ′ > M i ′. Also, M i ′ increases at first as a function of n, rising to a maximum value before decreasing asymptotically to zero. The value n i ′ at which M i ′ is maximized is n i ′ = log(k)/|ΔK|, with the corresponding maximum value of M i ′ being Φ( −(2/σ g )√(log(k)·|ΔK|) ). The expressions for n i ′ and M i ′ revert to their counterparts ñ i and M̃ i when one of the models is correctly specified. If both models are of equal quality, that is, ΔK = 0, then the probabilities M i ′ can be considered as probabilities of evidence favoring (wrongly, as the models are a tossup in quality) one or the other model. When ΔK = 0, M i ′ as a function of n has no local maximum and asymptotically approaches 1/2 as sample size increases. The possibility that M i ′ might be as great as 1/2 seems distressing, but this only occurs when the two models become equally good (not necessarily identical) approximations of the generating process.

Illustration of Neyman-Pearson Testing and Evidence Under Misspecification-An extension of the Bernoulli example from Figure 3 serves to sharply contrast the error properties of NP testing and evidence analysis. We construct as before two candidate Bernoulli models with respective success probabilities p 1 and p 2 . Suppose however that the data actually arise from a Bernoulli distribution with success probability p g .
From Equation (17), the value of ΔK becomes ΔK = [p g log(p g /p 2 ) + (1 − p g ) log((1 − p g )/(1 − p 2 ))] − [p g log(p g /p 1 ) + (1 − p g ) log((1 − p g )/(1 − p 1 ))] = log((1 − p 1 )/(1 − p 2 )) + p g log( p 1 (1 − p 2 ) / [(1 − p 1 ) p 2 ] ). Note that ΔK is here a simple linear function of p g . In the Figure 3 example, p 1 = 0.75 and p 2 = 0.50. If we take p g = 0.65, we have a situation in which model 1 is slightly closer to the true model than model 2. As well, we readily calculate that K 12 = 0.130812 and ΔK = 0.02095081, so that K 12 > ΔK, a situation in which we expect α′ to be an increasing function of n (as dictated by Equation 53).
The top panel of Figure 6 should give pause to all science. Shown is the probability of a Type 1 error under misspecification (α′, Equation 53, together with simulated values), which increases with sample size and approaches 1. In the bottom panel of Figure 6, the probability of misleading evidence for model 2 (M 2 ′), that is, of picking the model farther from truth, increases at first but eventually decreases to zero (the bottom panel shows simulated values as well as the CLT approximation given by Equation 70). Under evidence analysis, the probability of wrongly picking the model farther from truth converges to 0 as sample size increases.
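A rough Monte Carlo companion to Figure 6 can be written in a few lines. The nominal size α = 0.05, the evidence threshold k = 8, and the set of sample sizes are assumptions chosen only for illustration; the NP rejection rule is implemented through the small-count tail of the sufficient statistic, one standard way of realizing the size-α LR test in this setting.

```python
# Rough simulation of the Bernoulli illustration: NP Type 1 error rate vs. the
# probability of misleading evidence when data come from Bernoulli(p_g).
# alpha, k, and the sample sizes are illustrative assumptions.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
p1, p2, pg = 0.75, 0.50, 0.65        # model 1 is closer to truth (dK > 0)
alpha, k, reps = 0.05, 8.0, 20000

for n in (25, 100, 400, 1600):
    S = rng.binomial(n, pg, size=reps)            # sufficient statistic under truth
    # NP test of H1: p = p1 vs H2: p = p2 rejects H1 for small S (LR increases with S)
    s_crit = binom.ppf(alpha, n, p1)              # approximately size-alpha cutoff under H1
    type1 = np.mean(S <= s_crit)                  # rejects H1 even though H1 is closer to truth
    # Evidential comparison: log LR of model 1 over model 2
    loglr = S * np.log(p1 / p2) + (n - S) * np.log((1 - p1) / (1 - p2))
    misleading = np.mean(loglr <= -np.log(k))     # strong evidence for the worse model
    print(f"n={n:5d}  NP rejection rate ~ {type1:.3f}   misleading evidence ~ {misleading:.3f}")
```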
The example illustrates directly the potential effect of misspecification on the results of the Neyman-Pearson Lemma. The lemma is of course limited in scope, and we should in all fairness note that a classical extension of the lemma to one-sided hypotheses seemingly ameliorates the problem in this particular example. Suppose the two models are expanded: model 1 is the Bernoulli distribution with p ≥ 0.75, with model 2 becoming the Bernoulli with p < 0.75. Then, the "Karlin-Rubin Theorem" (Karlin and Rubin, 1956) finds the LR test to be the uniformly most powerful test of size α (or less) for model 1 vs. model 2. Three key ideas enter the proof of the theorem. First, for any particular value p 2 such that p 2 < p 1 , the Neyman-Pearson Lemma gives the LR test as most powerful. Second, the cutoff point c for the Neyman-Pearson LR test does not depend on the value of p 2 . Third, the LR is a monotone function of a sufficient test statistic given by (x 1 + x 2 + … + x n )/n. The upshot is that α would remain constant in the expanded scenario, and β would decrease toward zero as advertised.
However, the one-sided extension of our Bernoulli example expands the model space to eliminate the model misspecification problem. We regard the hypotheses H 1 : p ≥ 0.75 and H 2 : p < 0.75 to be a case of two non-overlapping models (Figures 1, 2, bottom) which may or may not be correctly specified. The Karlin-Rubin Theorem would govern if the models are correctly specified. Misspecification in this one-sided context would be exemplified, for instance, by data arising from some other distribution family besides the Bernoulli(p), such as an overdispersed family like a beta-Bernoulli (Johnson et al., 2005). Under misspecification, Karlin-Rubin lacks jurisdiction.

Evidence Functions-Lele (2004) took Royall's (Royall, 1997) approach to using the LR for model comparison and generalized it into the concept of evidence functions. Evidence functions are developed mathematically from a set of desiderata that effective measures of evidence intuitively should satisfy (see Taper and Ponciano, 2016).
The basic insight is that the log-LR emerges as the function to use for model comparison when the discrepancy between models is measured by the KL divergence (Equation 3). The reason is that (1/n) log (L 1 /L 2 ) is a natural estimate of ΔK, the difference of divergences of f 1 (x) and f 2 (x) from truth g (x). However, numerous other measures of divergence or distance between statistical distributions have been proposed (see Lindsay, 2004;Pardo, 2005;Basu et al., 2011), the KL divergence merely being the best known. Each measure of divergence or distance would give rise to its own evidence function. Lele (2004) defines an evidence function for a given divergence measure as a data-based estimate of the difference of divergences of two approximating models from the underlying process that generated the data. The motivating idea is to use the data to estimate which of two models is "closer" in some sense to the data generating process. The evidence function concept requires a measure of divergence of a model f (x) from the true data generating process g (x) and a statistic, the evidence function, for estimating the difference of divergences from truth of two models f 1 (x) and f 2 (x). Important among the desiderata for evidence functions (Taper and Ponciano, 2016) is that the probabilities of strong evidence as defined under misspecification should asymptotically approach 1 as sample size increases (and so the error probabilities as embodied in M 1 ′, M 2 ′, W 1 ′, and W 2 ′ would approach zero). It is noteworthy that the prospect of model misspecification is baked into the very definition of an evidence function. Lele (2004) further proved an optimality property of the LR as evidence function similar to the optimality of the LR in the Neyman-Pearson Lemma. Lele's Lemma states that, asymptotically (that is, for large sample sizes), the probability of strong evidence is maximized among all evidence functions by the LR. The result combines the Neyman-Pearson Lemma of hypothesis tests with Fisher's lower bound for the variance of estimators (see Rice, 2007), extending both. Thus, the information in the data toward quantifying evidence is best captured by the LR statistic or, equivalently, the KL divergence. Other divergence measures, however, have desirable properties, such as robustness against outliers. Modified profile likelihood and conditional likelihood also lead to desirable evidence functions that can account for nuisance parameters, although these modifications to the original LR statistics remain unexplored in terms of their optimality.
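As a concrete, if minimal, illustration, the sketch below computes the LR evidence function (1/n) log(L 1 /L 2 ) for simulated Bernoulli data and compares it with the closed-form ΔK given earlier for that example; the threshold k = 8 and the sample size are assumptions made only for illustration.

```python
# Minimal sketch of the LR evidence function as an estimate of dK.
# Reuses the Bernoulli setting; k and n are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
p1, p2, pg, k, n = 0.75, 0.50, 0.65, 8.0, 500

x = rng.binomial(1, pg, size=n)
loglr = np.sum(x * np.log(p1 / p2) + (1 - x) * np.log((1 - p1) / (1 - p2)))

dK = np.log((1 - p1) / (1 - p2)) + pg * np.log(p1 * (1 - p2) / ((1 - p1) * p2))
print(f"(1/n) log(L1/L2) = {loglr / n:+.4f}   true dK = {dK:+.4f}")

if loglr >= np.log(k):
    print("strong evidence for model 1")
elif loglr <= -np.log(k):
    print("strong evidence for model 2 (misleading here: model 1 is closer to truth)")
else:
    print("weak / inconclusive evidence")
```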

Information-Theoretic Model Selection Criteria
The latter part of the 20th Century saw some statistical developments that made inroads into the problems of models with unknown parameters (composite models), multiple models, model misspecification, and non-nested models; among the more widely adopted of these were the model selection indexes based on information criteria. The work of Akaike (Akaike, 1973, 1974; Figure 7) revealed a novel way of formulating the model selection problem and ignited a new statistics research area. Akaike's ideas found immediate use in the time series models of econometrics (Judge et al., 1985), were studied and disseminated for statistics in general by Sakamoto et al. (1986) and Bozdogan (1987), and were popularized, especially in biology, by Burnham and Anderson (2002).
The information criteria are model selection indexes, the most widely used of which is the AIC (originally, "an information criterion," Akaike, 1981; now universally "Akaike information criterion"). The AIC is minus two times the maximized log-likelihood for a model, the maximization taken across unknown parameters, with a penalty for the number of unknown parameters added in: AIC i = − 2log L i + 2r i , where L i is the maximized likelihood for model H i , and r i is the number of unknown parameters in model H i that were estimated through the maximization of L i . We are now explicitly considering the prospect of more than two candidate models, although each evidential comparison will be for a pair of models.
Akaike's fundamental intuition was that it would be desirable to select models with the smallest "distance" to the generating process. The distance measure he adopted is the KL divergence. The log-likelihood is an estimate of this distance (up to a constant that is identical for all candidate models). Unfortunately, when parameters are estimated, the maximized log-likelihood as an estimate of the KL divergence is biased low. The AIC is an approximate bias-corrected estimate of an expected value related to the distance to the generating process. The AIC is an index where goodness of fit as represented by maximized log-likelihood is penalized by the number of parameters estimated. Penalizing likelihood for parameters is a natural idea for attempting to balance goodness of fit with usefulness of a model for statistical prediction (which starts to break down when estimating superfluous parameters). To practitioners, AIC is attractive in that one calculates the index for every model under consideration and selects the model with the lowest AIC value, putting all models on a level playing field so to speak.
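To make the mechanics concrete, a minimal sketch of AIC-based ranking follows; the candidate families (exponential, gamma, lognormal) and the simulated data are illustrative assumptions rather than an analysis taken from this paper.

```python
# A minimal sketch of AIC-based ranking: fit each candidate by maximum likelihood,
# compute AIC_i = -2 log L_i + 2 r_i, and favor the smallest value.
# The candidate families and the simulated data are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.0, scale=1.5, size=200)      # stand-in data

candidates = {
    "exponential": (stats.expon, dict(floc=0)),    # 1 free parameter (scale)
    "gamma":       (stats.gamma, dict(floc=0)),    # 2 free parameters (shape, scale)
    "lognormal":   (stats.lognorm, dict(floc=0)),  # 2 free parameters (shape, scale)
}

for name, (dist, fit_kw) in candidates.items():
    params = dist.fit(x, **fit_kw)                 # ML fit with location fixed at 0
    loglik = np.sum(dist.logpdf(x, *params))
    r = len(params) - 1                            # fixed location is not counted
    aic = -2 * loglik + 2 * r
    print(f"{name:12s} logL = {loglik:9.2f}  r = {r}  AIC = {aic:9.2f}")
```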
Akaike's inferential concept underlying the AIC represented a breakthrough in statistical thinking. The idea is that in comparing model H i with model H j using an information criterion, both models are assumed to be misspecified to some degree. The actual data generating mechanism cannot be represented exactly by any statistical model or even family of statistical models. Rather, the modeling process seeks to build approximations useful for the purpose at hand, with the left-out details deemed negligible by scientific argument and empirical testing.
Although AIC is used widely, the exact statistical inference presently embodied by AIC is not widely understood by practitioners. What Akaike showed is that under certain conditions −AIC i /(2n) is (up to an unknown constant) an approximately unbiased estimator of E g [K(g(x), f i (x, θ̂ i ))], where θ i is a vector of unknown parameters and θ̂ i is its ML estimate, the parameter penalty in AIC being the approximate bias correction. The expectation has two variability components: (1) the distribution of f i (X, θ̂ i ) given the value of the ML estimate, and (2) the distribution of the ML estimate, both expectations taken with respect to truth g (x). (In Akaike's formulation, truth was a model f(·) with some high-dimensional unknown parameter, while all the candidate models are also of the same form f(·) except with the parameter vector constrained to a lower-dimensional subset of parameter space. Truth in Akaike's approach is as unattainable as g (x).) The double expectation is termed the "mean expected log-likelihood." The difference AIC i − AIC j then is a point estimate of which model is closer on average to truth, in the sense of estimating (−2n) times the difference of mean expected log-likelihoods. The approximate bias correction incorporated in AIC is technically correct only if f i (x, θ̂ i ) is rather "close" to g (x); Takeuchi (1976) subsequently provided a mathematically improved (but statistically more difficult to estimate) approximation. "Information theoretic" indexes for model selection have proliferated since, with different indexes refined to perform well for different sub-purposes (Claeskens and Hjort, 2008).
In practice, the AIC-type inference represents a relative comparison of two models, not necessarily nested or even in the same model family, requiring only the same data and the same response variable to implement. The inference is post-data, in that there are (as yet) no appeals to hypothetical repeated sampling and error rates. All candidate models, or rather, all pairs of models, can be inspected simultaneously simply by obtaining the AIC value for each model. But, as is the case with all point estimates, without some knowledge of sampling variability and error rates we lack assurance that the comparisons are informative.

Differences of Model Selection Indexes as Evidence Functions
We propose that information-based model selection indexes can be considered as generalizations of LR evidence to models with unknown parameters, for model families obeying the usual regularity conditions for ML estimation. The evidence function concept clarifies and makes accessible the nature of the statistical inference involved in model selection. Like LR evidence, one would use information indexes to select from a pair of models, say f 1 (x, θ 1 ) and f 2 (x, θ 2 ), where θ 1 and θ 2 are vectors of unknown parameters.
Like LR evidence, the selection is a post-data inference. Like LR evidence, the prospect of model misspecification is an important component of the inference. And critically, like LR evidence, the error probabilities W i and M i (i = 1, 2) can be defined for the information indexes and can in principle be calculated (or simulated) as discussed below. Additionally, as discussed below, many of the existing information indexes retain the desirable error properties of evidence functions. Oddly, the AIC itself does not.

Nested Models, Correctly Specified
As noted earlier, the generalized LR framework of two nested models under correct model specification is a workhorse of scientific practice and a prominent part of applied statistics texts. It is worthwhile then in studying evidence functions to start with the generalized LR framework, in that the model selection indexes are intended in part to replace the hierarchical sequences of generalized LR hypothesis testing (stepwise regression, multiple comparisons, etc.) for finding the best submodel within a large model family.
The model relationships diagrammed in the top portion of Figure 1 depict the two cases. In case 1 (top left), a parameter vector in model f 1 identifies the true model giving rise to the data. Technically the parameter vector is contained in model f 2 as well, but the scientific interest focuses on whether the additional parameters in the unconstrained parameter space of f 2 can be usefully ignored. Case 2 (top right) portrays the situation in which the true parameter vector is in the unconstrained parameter space of model f 2 ; model f 1 is too simple to be useful.
Suppose we decide to use ΔAIC 12 = AIC 1 − AIC 2 as an evidence function. For convenience, we have defined this AIC-based evidence function to vary in the same direction as G 2 (Equation 31) in NP hypothesis testing, so that large values of ΔAIC correspond to large evidence for f 2 (opposite to the direction for the ordinary LR-evidence function given by Equation 33). For instance, the early rule of thumb in the AIC literature was to favor model f 1 when ΔAIC 12 ≤ −2 and to favor model f 2 when ΔAIC 12 ≥ 2. Note that ΔAIC 12 = G 2 − 2ν, where ν = r 2 − r 1 , the difference of the numbers of unknown parameters in the two models.
The behavior of our candidate evidence function ΔAIC 12 can be studied using the Wilks/Wald results for the asymptotic distribution of G 2 . Under case 1, ΔAIC 12 has (approximately) a chisquare(ν) distribution that has been location-shifted to begin at −2ν instead of at 0 (top of Figure 8). Under case 2, ΔAIC 12 has (approximately) a non-central chisquare(ν, λ) distribution with the same −2ν location shift (bottom of Figure 8). The areas under the shifted chisquare pdf in the intervals (−2, +2) and (+2, ∞) are respectively the generalized error probabilities W 1 and M 1 (Figure 8, top). Likewise, the areas under the shifted non-central pdf in the intervals (−2ν, −2) and (−2, +2) are respectively the generalized error probabilities M 2 and W 2 (Figure 8, bottom).
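These areas can be computed directly from the shifted chisquare and non-central chisquare distributions; in the sketch below, ν and the values of the non-centrality parameter λ (which grows in proportion to n) are illustrative assumptions.

```python
# Generalized error probabilities for dAIC_12 = G^2 - 2*nu under nested, correctly
# specified models, via the (non-central) chisquare limits described above.
# nu and the lambda values are illustrative assumptions.
from scipy.stats import chi2, ncx2

nu = 3            # difference in the number of estimated parameters
thresh = 2.0      # "strong evidence" threshold on the dAIC scale

# Case 1: data from f1, dAIC ~ chisq(nu) shifted to start at -2*nu
W1 = chi2.cdf(2 * nu + thresh, nu) - chi2.cdf(2 * nu - thresh, nu)
M1 = chi2.sf(2 * nu + thresh, nu)
print(f"case 1 (any n):  W1 = {W1:.4f}  M1 = {M1:.4f}  (static in n)")

# Case 2: data from f2, dAIC ~ noncentral chisq(nu, lambda) with the same shift;
# lambda is proportional to n, so larger lambda stands in for larger samples
for lam in (5.0, 20.0, 80.0):
    M2 = ncx2.cdf(2 * nu - thresh, nu, lam)
    W2 = ncx2.cdf(2 * nu + thresh, nu, lam) - M2
    print(f"lambda = {lam:5.1f}  M2 = {M2:.4f}  W2 = {W2:.4f}")
```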
As sample size increases, the error probabilities W 1 and M 1 for the AIC-based evidence function do not go to zero but rather remain positive (Figure 8, top). The value of n appears nowhere in the location-shifted chisquare pdf for ΔAIC 12 , and so the error probabilities W 1 and M 1 remain static. Thus, for the AIC, the probabilities of weak and misleading evidence given model f 1 generates the data both behave like the Type 1 error probability α in Neyman-Pearson testing. The simulation results of Aho et al. (2014) showing a Type-1-like behavior of the AIC with increasing sample size for particular statistical models are thereby explained (see also Taper and Ponciano, 2016).
As sample size increases, the error probabilities W 2 and M 2 for the AIC-based evidence function do go to zero (Figure 8, bottom). The non-centrality parameter λ in the location-shifted non-central chisquare pdf for ΔAIC 12 is proportional to the value of n, and the mean (ν + λ) of the non-central distribution increases faster than the standard deviation ([2(ν + 2λ)]^(1/2)), driving the error probabilities W 2 and M 2 to zero. Thus, for the AIC, the probabilities of weak and misleading evidence given model f 2 generates the data both behave like the Type 2 error probability β in Neyman-Pearson testing.
Thus, within the generalized likelihood ratio framework, the AIC appears to bring no particular improvement in the sense of evidence to ordinary Neyman-Pearson testing using G 2 . Indeed, at least in the Neyman-Pearson approach, the value of α is fixed by the investigator and is therefore known if the models are correctly specified. The error probabilities attending the use of AIC however are unknown, as they generally are in evidence functions, although they in principle can be estimated with simulation. AIC-based model selection does not have the error properties of an evidence function within the classical milieu of nested statistical models.
Other information-theoretic indexes used for model selection, however, do have performance characteristics of evidence functions. Consider the Schwarz information criterion (SIC; also known as Bayesian information criterion or BIC) given by SIC i = −2 log L i + r i log(n). The index originally had a Bayesian-based derivation (Schwarz, 1978), but its frequentist error properties when employed as an evidence function become apparent with the methods used above for the AIC. As with the AIC, the evidence function version of the SIC would use the difference of SIC values: ΔSIC 12 = SIC 1 − SIC 2 = G 2 − ν log(n).
As with the AIC also, the asymptotic distributions of the SIC evidence function under model f 1 and model f 2 are, respectively, location-shifted chisquare and non-central chisquare distributions. For the SIC though, the location of the lower bound of the two distributions at −ν log(n) decreases as sample size increases (Figure 9, top). If the data arise from model f 1 , the chisquare distribution is pulled to the left, and the areas under the pdf corresponding to W 1 and M 1 eventually decrease asymptotically to zero. If the data arise from model f 2 , although the non-central chisquare distribution is also pulled to the left at a rate proportional to log(n), the mean is pulled to the right at a rate proportional to n, and the coefficient of variation around the mean goes to zero at a rate proportional to 1/√n. The areas under the pdf corresponding to W 2 and M 2 eventually decrease asymptotically to zero (Figure 9, bottom). Thus, unlike the AIC, for nested, correctly specified models the SIC possesses a key quality of an evidence function: all the probabilities of weak and misleading evidence eventually decrease asymptotically to zero.
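The contrast with the AIC can be made concrete from the same distributional results; in the sketch below, ν, the per-observation non-centrality δ (so that λ = nδ), and the use of ±2 as evidence thresholds for both indexes are illustrative assumptions.

```python
# Misleading-evidence probabilities for dAIC vs. dSIC as n grows, for nested,
# correctly specified models. Under f1, G^2 ~ chisq(nu); under f2, G^2 ~
# noncentral chisq(nu, n*delta). nu, delta, and the thresholds are assumptions.
import numpy as np
from scipy.stats import chi2, ncx2

nu, delta, thresh = 3, 0.05, 2.0

for n in (50, 200, 800, 3200):
    lam = n * delta
    # AIC: location shift -2*nu, so the thresholds map to G^2 = 2*nu -/+ thresh
    M1_aic = chi2.sf(2 * nu + thresh, nu)            # evidence for f2 when f1 is true
    M2_aic = ncx2.cdf(2 * nu - thresh, nu, lam)      # evidence for f1 when f2 is true
    # SIC: location shift -nu*log(n)
    shift = nu * np.log(n)
    M1_sic = chi2.sf(shift + thresh, nu)
    M2_sic = ncx2.cdf(shift - thresh, nu, lam)
    print(f"n={n:5d}  AIC: M1={M1_aic:.3f} M2={M2_aic:.3f}   "
          f"SIC: M1={M1_sic:.4f} M2={M2_sic:.4f}")
```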

Misspecified Models
To be fair, AIC as well as evidence functions were forged in the fiery world of misspecified models. Does the AIC difference gain the properties of an evidence function when neither f 1 nor f 2 gives rise to the data?
If the models are nested or overlapping, the answer is no. To understand this, we must appeal to modern statistical advances in the theory of maximum likelihood estimation and generalized likelihood ratio testing when models are misspecified. The relevant and general theory can be found in White (1982), Nishii (1988), Vuong (1989), and references therein.
Suppose a model with pdf f (x, θ) is fitted using ML estimation to observations that came from a distribution with pdf g (x). Under a variety of regularity conditions on the pdfs, the ML estimate has an asymptotic multivariate normal distribution centered on a value θ*, where θ* is the value of θ that minimizes K (g (x), f (x, θ)) (White, 1982). The multivariate normal distribution furthermore concentrates around θ* as n becomes large, reflecting the fact that the ML estimate under misspecification is a statistically consistent estimate of (converges in probability to) θ*.
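A small simulation makes this concrete; the lognormal generating process and the exponential approximating model below are assumptions chosen because the pseudo-true value θ* (here, the KL-minimizing exponential rate) can be written down in closed form.

```python
# Sketch of White's (1982) result: under misspecification the ML estimate converges
# to the pseudo-true value theta* minimizing K(g, f(., theta)). The lognormal truth
# and the exponential approximating model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 0.0, 0.75                        # generating lognormal parameters
rate_star = 1.0 / np.exp(mu + sigma**2 / 2)  # KL-minimizing exponential rate = 1 / E_g[X]

for n in (100, 1000, 10000, 100000):
    x = rng.lognormal(mu, sigma, size=n)
    rate_hat = 1.0 / x.mean()                # ML estimate of the exponential rate
    print(f"n={n:6d}  rate_hat = {rate_hat:.4f}   rate* = {rate_star:.4f}")
```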
Now, any two models f 1 (x, θ 1 ) and f 2 (x, θ 2 ) being compared will be in one of nested, overlapping, or non-overlapping configurations (see Figure 2). Under misspecification in each case, the truth g (x) is out there, somewhere. We now ask of an evidence function: "Which model contains a parameter set that brings it closer to truth? Is K (g (x), f 1 (x, θ 1 *)) smaller than K (g (x), f 2 (x, θ 2 *)) or vice versa?" The question needs modification in the nested and overlapping cases. If f 1 is nested within f 2 , K (g (x), f 1 (x, θ 1 *)) cannot be smaller than K (g (x), f 2 (x, θ 2 *)). The modified question becomes "Is f 1 (x, θ 1 *) as close to truth as f 2 (x, θ 2 *)?" The question in the nested case is a natural extension of the question asked under correct specification. In the nested case, K (g (x), f 1 (x, θ 1 *)) being the same as K (g (x), f 2 (x, θ 2 *)) signifies that f 1 (x, θ 1 *) and f 2 (x, θ 2 *) are the same model. If f 1 overlaps f 2 , the model closest to truth could be in the overlapping region, in which case K (g (x), f 1 (x, θ 1 *)) would be the same as K (g (x), f 2 (x, θ 2 *)), and f 1 (x, θ 1 *) and f 2 (x, θ 2 *) would be the same model. However, in the overlapping case, K (g (x), f 1 (x, θ 1 *)) being the same as K (g (x), f 2 (x, θ 2 *)) does not necessarily signify that f 1 (x, θ 1 *) and f 2 (x, θ 2 *) are the same model. The question in the overlapping case becomes "Is the best model in the overlapping region?" Vuong (1989) derived the asymptotic distributions of G 2 under the nested, overlapping, and non-overlapping cases in the presence of misspecification. His main results relevant here are the following, presented in our notation:
A. When f 1 (x, θ 1 *) and f 2 (x, θ 2 *) are the same model (either f 1 is nested within f 2 , or f 1 overlaps f 2 , and the best model is in the nested or overlapping region), the asymptotic distribution of G 2 is a "weighted sum of chisquares" of the form ∑ j a j Z j 2 (the sum running over j = 1, …, r 1 + r 2 ), in which the Z j are independent, standard normal random variables (each Z j 2 being chisquare with 1 df) and the a j values are eigenvalues of a square matrix (with r 1 + r 2 rows and columns) of expected values of various derivatives of the two log-pdfs with respect to the parameters (a generalization of the Fisher information matrix). The point is, the asymptotic distribution of G 2 does not depend on n. ΔAIC 12 and ΔSIC 12 , along with evidence functions formed from other information indexes, then have location-shifted versions of the weighted sum of chisquares distribution. The error probabilities M 1 ′ and W 1 ′ defined for AIC become static and do not decrease to zero as n becomes large. The error probabilities M 1 ′ and W 1 ′ defined for SIC do decrease to zero, because the location quantity decreases as n becomes large, pulling the weighted sum of chisquares pdf to the left (similar to the chisquare distribution in Figure 9). This scenario is simulated and then plotted in Figure 10A; a small numerical sketch of this scenario also follows result B below.
B. When f 1 (x, θ 1 *) and f 2 (x, θ 2 *) are not the same model (the non-overlapping case, or the nested or overlapping cases when the best model lies outside the shared region), the asymptotic distribution of G 2 is approximately normal with mean involving ΔK* (Equation 78). The result parallels the CLT results (Equations 20-22) for completely specified models, with the added condition that each candidate model is evaluated at its "best" set of parameters. In this situation, the mean of G 2 increases or decreases in proportion to n, while the standard deviation increases only in proportion to √n. All of the error probabilities, M 1 ′, M 2 ′, W 1 ′ and W 2 ′, defined for ΔAIC 12 as well as for ΔSIC 12 do decrease to zero as n becomes large. This scenario is simulated and plotted in Figure 10B.
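The following rough simulation illustrates result A: a nested pair of normal models fitted to data from a scaled Student-t generating process, so that both models are misspecified while their best members coincide. The t distribution, the number of replicates, and the ±2 evidence thresholds are assumptions made only for illustration.

```python
# Result A sketch: nested, misspecified models with a shared best member.
# Truth: scaled Student-t (variance 1); f1 = Normal(mu, 1), f2 = Normal(mu, sigma^2).
# The dAIC threshold probability stays roughly static in n; the dSIC one shrinks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
reps, df = 2000, 5
scale = np.sqrt((df - 2) / df)          # scales the t so its variance is exactly 1

for n in (100, 800, 6400):
    g2 = np.empty(reps)
    for i in range(reps):
        x = rng.standard_t(df, size=n) * scale
        m = x.mean()
        s = x.std()                                 # ML estimate of sigma under f2
        ll1 = stats.norm.logpdf(x, m, 1.0).sum()    # f1: sigma fixed at its best value 1
        ll2 = stats.norm.logpdf(x, m, s).sum()      # f2: sigma free
        g2[i] = 2 * (ll2 - ll1)
    d_aic = g2 - 2 * 1                              # nu = 1 extra parameter
    d_sic = g2 - 1 * np.log(n)
    print(f"n={n:5d}  P(dAIC12 > 2) = {np.mean(d_aic > 2):.3f}   "
          f"P(dSIC12 > 2) = {np.mean(d_sic > 2):.3f}")
```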
We must point out that a generalized Neyman-Pearson test (via simulation/bootstrap) of two non-overlapping models with misspecification can suffer the same fate as the completely specified models in the Neyman-Pearson Lemma. The large sample distribution of G 2 , assuming model 1 generates the data, would have a mean involving K 12 (evaluated at the true parameter value in model 1 and the best parameter value in model 2); the cutoff point c and other test characteristics would be obtained from this distribution. Under misspecification, the true asymptotic distribution of G 2 has a mean involving ΔK* (Equation 78). As was the case for the two models in the Neyman-Pearson Lemma (Figure 6), discrepancy between K 12 and ΔK* can cause the generalized Neyman-Pearson test to pick the wrong model with Type 1 error probability approaching 1. The Karlin-Rubin Theorem and the forceful language of uniformly most powerful tests do not rescue Neyman-Pearson testing from derailment when inadequate models are deployed.
Error probabilities going to zero can alternatively be derived as a consequence of the (weak or strong) "consistency" of the model selection index. Consistency here means that the index asymptotically picks the model closest to truth as sample size becomes large. Nishii (1988) studied information indexes of the form −2 log L i + c n r i , where the parameter penalty coefficient c n is a possible function of n. The parameter penalty determines whether an information-theoretic index behaves like an evidence function. If c n grows at a rate slower than n but faster than log log(n), then the information-theoretic index will asymptotically pick the model closest to truth (Nishii, 1988). The difference of such indexes will therefore behave as an evidence function, as the probabilities of picking any model other than the one closest to truth go to zero. If, however, the penalty term is constant or asymptotically constant, and the model closest to truth is in a parameter region common to two or more models, then the probabilities of weak and misleading evidence are or become constant. The problematic error properties of Neyman-Pearson testing from the standpoint of evidence are thereby preserved in such model selection indexes. For instance, the small-sample corrected index (Hurvich and Tsai, 1989) is AICc i = AIC i + 2r i (r i + 1)/(n − r i − 1), in which the correction term is designed to improve the behavior of the index under small sample sizes. However, the correction term asymptotically approaches zero as n becomes large, and so AICc reverts to AIC, with all its asymptotic error properties, for large samples.
Thus, for either correctly specified or misspecified models in which the best model is in a region of model space that does not overlap any other model under consideration, ΔAIC 12 indeed behaves like an evidence function. However, many model selection problems, such as in multiple regression, involve collections of models in which model pairs can be nested or overlapping as well as non-overlapping. ΔAIC 12 will behave more like Neyman-Pearson hypothesis testing for models within overlapping regions and therefore will not possess evidence function properties. Differences of information indexes that adjust G 2 with a constant or asymptotically constant location shift, such as the TIC and AICc, will share the Neyman-Pearson properties of ΔAIC 12 and cannot be regarded as evidence functions. Differences of those information indexes, such as the SIC, that produce a location shift decreasing to −∞ as n increases (provided that rate is within the Nishii (1988) bounds) will have the error properties of evidence functions.

Comparing Approaches to Statistical Inference
We have shown that key inferential characteristics for Fisher significance analysis, Neyman-Pearson hypothesis testing, and evidential comparison differ substantially. Evidence has inferential qualities that match or surpass Fisher significance and Neyman-Pearson tests (see Table 1):
• Equal status for both models. In Fisher significance analysis, there is only one model under consideration. Neyman-Pearson testing compares two models but one of them is accorded special status as the null model and endowed with a fixed error rate (α). Evidence analysis compares two models without giving either model special status.
• Evidence for the null. Neither Fisher significance analysis nor the conventional form of Neyman-Pearson testing provides evidence for the null hypothesis. Extra analyses (equivalence testing, severity) have been proposed to quantify evidence for the null hypothesis, but such approaches reverse model roles and give special status to the alternative hypothesis. In evidence analysis, one statistic, called an evidence function, quantifies the evidence for one model and against the other, for every pair of models in the model set.
• Accommodates multiple models. Under Fisher significance analysis, the P-values for different models are based on different sufficient statistics and are not strictly comparable. One could compare multiple P-values using a shared goodness of fit statistic (not necessarily sufficient), such as the Kolmogorov-Smirnov. However, pure goodness of fit favors overparameterization (overfitting). Neyman-Pearson testing has been jury-rigged in various forms (stepwise regression, multiple comparisons) to sort through multiple models, but the results at best have only had fair statistical properties. With evidence analysis, all pairs of candidate models can be compared, and thereby all candidate models can be ranked.

• All error rates go to zero. Neyman-Pearson testing fixes the Type 1 error probability to be constant, thereby structuring the error rate to be constant regardless of sample size. Fisher significance analysis acquires such a constant error rate when the decision to reject a model is based on a threshold for the P-value. Under evidence analysis all error rates approach zero asymptotically with increasing sample size.
• Total error monotonically decreasing. In evidence analysis, the total error under each model (1 minus the probability of strong evidence under the model) decreases monotonically and asymptotically to zero with increasing sample size. Because of the special status of the null hypothesis in Neyman-Pearson testing, the total error rate is the Type 1 error rate which remains constant. Fisher significance analysis dons the Type 1 error properties of Neyman-Pearson testing if the decision to reject the model is based on a P-value threshold.
• Non-nested models. Fisher significance analysis deals with one model at a time, so the idea of comparing two non-nested models is not applicable. The standard extensions (such as generalized likelihood ratio) of the original Neyman-Pearson framework to models with unknown parameters assume that one of the models is nested within the other. Evidence analysis compares two models regardless of their nested or non-nested configuration.
• Evidence and error rates distinguished. The interpretation of a P-value has long been a source of confusion among scientists. Because the P-value is calculated under the properties of just one model, it is not satisfactory as a measure of evidence for one model over another (Royall, 1986, 1997). Evidence analysis regards error rates and evidence as separate concepts. The evidence approach clarifies P-values as error rates defined post-data (see section 2.1.7).

• Robustness to model misspecification. Evidence functions are defined in terms of the misspecification of two candidate models. Evidence functions are statistical estimates of which of two models is closer to the true data-generating process. The error rates of evidence analysis, defined robustly as the probabilities of wrong conclusions about which model is closer, go to zero as sample size increases, even under model misspecification. Under model misspecification, Neyman-Pearson testing can fail spectacularly: the Type 1 error rate, defined as the probability of wrongly picking the alternative hypothesis model when the null hypothesis model is just as close to truth, can approach 1 asymptotically as sample size increases. Fisher significance analysis, being in essence a test of whether a given model is misspecified, can be considered to be defined under a presumption of misspecification.
• Promotes exploration of new models. Perhaps the most important property of evidence analysis in scientific endeavors is that it explicitly encourages discovery of new models that are closer to truth than models already analyzed. An evidence analysis leaves "room at the top," or the possibility that a new approach could yield a much better model for the data. In the scientific world, the daily t-tests and regressions under Neyman-Pearson testing produce an inertia, a perfunctory routine in statistical analysis often characterized by working scientists as "cookbook" in nature. Barnard's (1949) observation had Bayesian statistics as its target, but his excruciating words apply to any kind of modeling: "To speak of the probability of a hypothesis implies the possibility of an exhaustive enumeration of all possible hypotheses, which implies a degree of rigidity foreign to the true scientific spirit. We should always admit the possibility that our experimental results may be best accounted for by a hypothesis which never entered our own heads."

Prediction Efficiency-AIC and its asymptotic relatives like AICc are built around statistical prediction. The difference of mean expected log-likelihoods is different from what we have defined above as ΔK*. The mean expected log-likelihood has a second, predictive layer of expectation in its definition, the idea being to identify the model that could best predict a new observation from g (x), taking into account the uncertainty in the estimation of unknown parameters. For this reason these criteria have been termed the efficient, asymptotically efficient, or prediction-efficient criteria (Shibata, 1980;Hurvich and Tsai, 1990).
The tendency for AIC-related criteria to overfit is a natural consequence of their design goal of prediction mean square error (MSE) minimization. When parameters are estimated, the increase in prediction MSE due to adding a spurious covariate is generally less than the reduction in prediction MSE caused by including a relevant covariate.
The tendency of stepwise regression to overfit using Neyman-Pearson testing has long been noted (Wilkinson and Dallal, 1981;Hurvich and Tsai, 1990;Harrell, 2001;Rao et al., 2001;Blanchet et al., 2008;Mundry and Nunn, 2008). The fixed Type 1 error rate as a criterion for entry (or exit) of a variable is at the heart of the overfitting problem, and methods for altering the Type 1 error rate based on the number of model parameters have been proposed (e.g., Foster and George, 1994). Such interventions without sample size in the recipe do not produce error rates that universally converge to zero as sample size becomes large.
Model selection with AIC or AICc improves somewhat on the Neyman-Pearson overfitting problem in that the misleading error probabilities both go to zero as sample size increases when two non-overlapping models are being compared. However, overlapping models, in which AIC and AICc are prone to overfit, are typically a substantial subset of the models in contention in multiple regression. The AIC and AICc indexes will tend to include spurious variables too often and thus represent only a partial improvement over stepwise regression.
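The overfitting tendency can be seen in a small regression simulation; the single spurious covariate, the assumed effect size, and the use of the profile likelihood ratio with thresholds 2 (AIC) and log(n) (SIC) are illustrative assumptions.

```python
# Rough sketch: how often AIC vs. SIC adds a pure-noise covariate to a regression.
# The generating model (one real covariate, one spurious) is an assumption.
import numpy as np

rng = np.random.default_rng(6)
reps = 2000

def rss(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return np.sum(resid**2)

for n in (50, 200, 800, 3200):
    pick_aic = pick_sic = 0
    for _ in range(reps):
        x1, x2 = rng.normal(size=n), rng.normal(size=n)     # x2 is spurious
        y = 1.0 + 0.5 * x1 + rng.normal(size=n)
        X_small = np.column_stack([np.ones(n), x1])
        X_big = np.column_stack([np.ones(n), x1, x2])
        # profile log-likelihoods, up to an additive constant that cancels in the difference
        ll_small = -0.5 * n * np.log(rss(y, X_small) / n)
        ll_big = -0.5 * n * np.log(rss(y, X_big) / n)
        g2 = 2 * (ll_big - ll_small)
        pick_aic += g2 > 2            # AIC favors the bigger model (one extra parameter)
        pick_sic += g2 > np.log(n)    # SIC favors the bigger model
    print(f"n={n:5d}  AIC adds noise covariate: {pick_aic/reps:.3f}   SIC: {pick_sic/reps:.3f}")
```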

Identifying Causal Structure-Scientific prediction, however, can be broader than pure statistical prediction. The scientist often desires to predict the outcome of a system manipulation: what will happen if harvest rate is increased, or if habitat extent is halved? Modeling such manipulation might translate as a structural change in a statistical model of the system. The predictive quality of the model then lies more in getting mechanisms in the model as right as possible.
The consistent criteria will asymptotically select the generating process if it is in the model set. If the generating process is not in the model set, the consistent criteria will asymptotically select the model in the set that under best possible parameterization is closest (in the KL sense) to the generating process. The estimation of ΔK* by the difference of SIC values represents a quest for a different kind of prediction that might come from a structural understanding of the major forces influencing the system under study. The tendency of the prediction-efficient criteria to include spurious covariates promotes a misunderstanding of the generating mechanism (Taper, 2004).
Certainly, the finite-sample properties of SIC and other consistent indexes require substantial further study, but the property that more data should be able to distinguish among candidate models with fewer errors seems an important property to preserve.
The scientific allure of information-theoretic indexes resided in the idea that all models were evaluated on a level playing field. One would calculate the index for each model and select the model with the best index, a procedure which promised considerably more clarity over hierarchical sequences of Neyman-Pearson tests, such as stepwise regression.

Uncertainty in Evidence
AIC and its descendants were originally built around concepts of statistical point estimation. The statistical inference represented by AIC is that of an approximately unbiased point estimate of the mean expected log-likelihood. The statistical concepts of errors and variability in information indexes have by contrast not often been emphasized. Partly as a result, model selection with information indexes has been somewhat of a black box for investigators, as achieving a good understanding of the inferences represented by model selection analyses is a mathematical challenge (see Taper and Ponciano, 2016).

Evaluating Model Adequacy-
We have illustrated that, unlike the error rates in Neyman-Pearson hypothesis testing, all of the error rates of evidence analysis converge to zero as sample size increases. However, the errors we have discussed deal only with the determination of which of two models is closer to truth; the error rates do not shed light on whether either model is close enough to truth to be scientifically or managerially valuable. This question is the realm of model adequacy analysis.
Whether the statistical inference is a hypothesis test, equivalence analysis, severity analysis, or evidence analysis, whether for a pair of models or multiple pairs of models, a follow-up evaluation of model adequacy looms ever more important as a crucial step (Mayo and Spanos, 2004;Spanos, 2010). Lindsay (2004) and Markatou and Sofikitou (in review) discuss ideas about the statistical evaluation of model adequacy. Mac Nally et al. (2018) give an impassioned editorial plea for routine model adequacy evaluation in scientific model selection. Ponciano and Taper (2019) show how to directly incorporate model adequacy evaluation into information criterion based model selection.
Considering the likely prevalence of model misspecification in ecological statistics, analysts will need to consider how a candidate model could be misspecified as well as the effects of such misspecification on the intended uses of the model. Practically, the analyst can introduce models formulated in diverse fashions and let the model identification process itself reduce model misspecification. Further experimental or observational tests of model predictions (e.g., Costantino et al., 2005) and their associated error rates are necessary to map the conditions under which a given model is reliable.
The error properties of evidence analysis are more difficult to calculate than classical NP tests because model misspecification is involved. But once calculated, the rates are likely to be more accurate than classical tests that pretend misspecification does not exist.

Approaches to Estimating Post-data Error Rates-Error rates are different pre- and post-data. W, M and α are pre-data error rates calculated under a model that is assumed to be true. The P-value is a post-data error rate. The pre-data error rates are useful for experimental design, but should be viewed with suspicion as a post-data inference tool because, as we have shown, these error rates are only accurate if the generating process is the assumed model. Little work has been performed on evidential error rates under the realistic assumption of model misspecification (but see Royall and Tsou, 2003). This area is an important field for future work.
Non-parametric bootstrapping shows great promise for calculating evidential error rates, for data structures that allow bootstrapping. In work in preparation, we (Taper, Lele, Ponciano, and Dennis) show that bootstrapping greatly aids in the interpretation of evidential results. Figures 4, 5 indicate that evidential error rates depend on the structure of the model space. Taper and Ponciano (2016) and Ponciano and Taper (2019) show that given data and a set of models, estimation of the model space structure including the location of the unknown generating process is feasible. This gives a direct measure of model adequacy. Future extensions of this work may allow for the direct estimation of realistic error rates as well.

How Should One Use Evidential Statistics in Practice?
A basic recommendation is to stop using NP tests for inference and be cautious about using the AIC family of information criteria for model selection. These are known as the "efficient" or "MSE minimizing" criteria and include the AIC, the AICc, the TIC, many forms of ICOMP and the EIC. These criteria are recognized by a complexity penalty whose expectation is asymptotically constant. Asymptotically equivalent to the AIC is the use of leave-one-out cross-validation (Stone, 1977); cross-validation will have model selection properties similar to AIC but has the advantage that it can be calculated in the absence of a likelihood function.
There is no reason that the multiple comparisons inference from traditional ANOVAs cannot be made using information criteria (e.g., Kemp et al., 2004;Jerde et al., 2019).
Classical methods will work well for state description and less well for process identification. Unbiased scientific inferences of process are better made using consistent information criteria (see Jerde et al., 2019;Lorah and Womack, 2019 for examples).
Analysts have a convenient spectrum of choices for many standard modeling situations in a suite of consistent information criteria: The HQIC (also known as the HQC, Hannan and Quinn, 1979), the HIC (aka BIC* and HBIC, Haughton, 1988), the SIC (aka BIC and SBC, Schwarz, 1978), and the CAIC (Bozdogan, 1987). The analyst can opt for a criterion that matches her goals. The sample size multiplier in the HQIC grows at the minimal rate to generate a consistent form. As a consequence the HQIC will behave very much like the AIC, selecting models with low MSE of prediction by capturing real but small effects at the cost of including spurious covariates. The HIC tends to balance underfitting and overfitting errors. The SIC and CAIC both favor compact models, with all the included components well-supported, and both tend to underfit. The CAIC has the strongest complexity penalty and thus makes the most underfitting errors and the fewest overfitting errors.
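A quick tabulation of the per-parameter complexity penalties conveys how these criteria order themselves as n grows; the penalty forms used below are common statements of these criteria (the HIC form in particular is an assumption) and should be checked against the original sources before use.

```python
# Per-parameter complexity penalties as functions of n for the criteria named above.
# These penalty forms are assumed common statements, not quotations from the sources:
# AIC: 2, HQIC: 2*log(log n), HIC: log(n/(2*pi)), SIC: log(n), CAIC: log(n) + 1.
import numpy as np

penalties = {
    "AIC":  lambda n: 2.0,
    "HQIC": lambda n: 2.0 * np.log(np.log(n)),
    "HIC":  lambda n: np.log(n / (2 * np.pi)),
    "SIC":  lambda n: np.log(n),
    "CAIC": lambda n: np.log(n) + 1.0,
}

print("n      " + "".join(f"{name:>8s}" for name in penalties))
for n in (50, 500, 5000, 50000):
    row = "".join(f"{f(n):8.2f}" for f in penalties.values())
    print(f"{n:<7d}{row}")
    # Nishii-style consistency: the penalty must outgrow log(log n) yet grow slower than n
```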
Besides being influenced by inferential goals, the choice of evidence function should depend on the modeling framework. Information criteria had their beginnings as a tool for variable selection in linear regression with independent observations. In such situations, as derived by Akaike, the number of parameters is a good first order bias correction to the observed likelihood. But, statistics is a world of special cases. The dizzying diversity of information criteria in the literature produces the desire to optimize the bias correction under different modeling frameworks. For instance, in mixed models, even the meaning of the number of parameters or the number of observations becomes ambiguous due to the dependence structure of mixed models. Information criteria have been developed using estimates of the effective number of parameters (e.g., Vaida and Blanchard, 2005;You et al., 2016). Similarly, information criteria have been constructed using estimates of the effective number of observations (e.g., Jones, 2011;Berger et al., 2014).
If the generating process is in the model set, or in flat model spaces, such as those in linear regression, the ΔAIC is an unbiased estimate of 2nΔK regardless of how near or far each of the approximating models is to the generating process (Burnham and Anderson, 2002;Choi and Kiefer, 2011). In curved model spaces (as in Efron, 1975), ΔAIC is not unbiased, and the estimation is only good if both approximating models are close to the generating process. Takeuchi's information criterion, the TIC (Takeuchi, 1976;Shibata, 1989), is nearly unbiased even for curved models at great distances from the generating process (Burnham and Anderson, 2002;Choi and Kiefer, 2011). Optimal multiplicative coefficients of bias adjustment for the AIC and TIC have been given (Ogasawara, 2016). Also, Ogasawara showed that when the penalty term in TIC (a random variable, not a constant) is negatively correlated with the main term, the higher-order asymptotic variance of the TIC becomes smaller than that common to the AIC and BIC. Unfortunately, the complexity penalty for the TIC must be estimated from data and cannot be specified a priori, as with the other criteria mentioned. The uncertainty in penalty estimation makes the use of the TIC impractical unless sample size is large. A second problem with the TIC is that, like the AIC, it is not consistent, but any efficient information criterion can be made consistent either by multiplying the complexity penalty by a consistent multiplier (Nishii, 1988) or by averaging the penalty with a consistent penalty (Lorah and Womack, 2019). Lorah and Womack (2019) also report on testing a list of various model selection criteria. In a nutshell, model selection criteria made into evidence functions as a whole give reasonable and responsible results, with none of the criteria being universally best. Which evidence function is better depends on the nature of the problem at hand, that is, the characteristics of the model space being investigated. The technical difficulties of criterion selection aside, the most important aspect of applying evidential statistics is approaching problems evidentially.

CONCLUSION
Evidence is not so much a new statistical method for model selection as it is a new way of thinking about the inference involved with existing model selection methods. The evidential way of thinking has two main components: (1) A post-data trichotomy of outcomes (strong evidence for model f i , weak or inconclusive evidence, strong evidence for model f j ).
(2) A framework of pre-data error probabilities, which are assured to go to zero as sample size increases. The evidential approach invites exploration of the error probabilities, usually via simulation, to aid in study design, the selection of evidence thresholds, the effects of different types of misspecification, and the interpretation of study results.
We have proposed here a different way of thinking about statistical analyses and model selection, based on the concept of evidence functions. Evidence is an intuitive way to decide between two models that avoids the famously upside-down logic that accompanies Neyman-Pearson testing. Evidential thinking has helped us reveal the shortcomings of Fisher significance analysis and Neyman-Pearson testing. The errors that can arise in evidence analysis are straightforward to explain, and the frequentist properties of such errors as functions of sample size and effect size are easy to understand and highly compelling in a scientific chain of argument. The information indexes, when differenced, represent a collection of potential evidence functions that extend the evidence ideas to models with unknown parameters. The desirable error properties are preserved in the presence of model misspecification, when the model choice is generalized to be an inference about which model is closer to the stochastic process that generated the data. The error properties of AIC and AICc are similar to those of Neyman-Pearson testing when the candidate models are nested or overlapping and so the AIC-type indexes are not satisfactory evidence functions in those common circumstances. The indexes like SIC in which the parameter penalty is an increasing function of sample size retain the frequentist error properties of evidence functions for all model pairs.
Evidence works well for science in part because its explicit conditioning on the model set invites thinking about new models. Evidence has inferential qualities that match or surpass Fisher significance analysis and Neyman-Pearson tests. Evidence represents a compelling scientific warrant for formulating statistical analyses as model selection problems.

FIGURE 1 |.
Model topologies when models are correctly specified. Regions represent parameter spaces. Star represents the true parameter value corresponding to the model that generated the data.
Top: a nested configuration would occur, for example, in the case of two regression models if the first model had predictor variables R 1 and R 2 while the second had predictor variables R 1 , R 2 , and R 3 . Middle: an overlapping configuration would occur if the first model had predictor variables R 1 and R 2 while the second had predictor variables R 2 and R 3 . Three locations of truth are possible: truth in model 1, truth in model 2, and truth in both models 1 and 2. Bottom: an example of a non-overlapping configuration is when the first model has predictor variables R 1 and R 2 while the second model has predictor variables R 3 and R 4 .