The Self-Organization of a Spoken Word

Holden, John G.; Rajaraman, Srinivasan

doi:10.3389/fpsyg.2012.00209

ORIGINAL RESEARCH article

Front. Psychol., 06 July 2012

Sec. Psychology of Language

Volume 3 - 2012 | https://doi.org/10.3389/fpsyg.2012.00209

The self-organization of a spoken word

John G. Holden*

Srinivasan Rajaraman

Department of Psychology, CAP Center for Cognition, Action, and Perception, University of Cincinnati, Cincinnati, OH, USA

Pronunciation time probability density and hazard functions from large speeded word naming data sets were assessed for empirical patterns consistent with multiplicative and reciprocal feedback dynamics – interaction dominant dynamics. Lognormal and inverse power law distributions are associated with multiplicative and interdependent dynamics in many natural systems. Mixtures of lognormal and inverse power law distributions offered better descriptions of the participant’s distributions than the ex-Gaussian or ex-Wald – alternatives corresponding to additive, superposed, component processes. The evidence for interaction dominant dynamics suggests fundamental links between the observed coordinative synergies that support speech production and the shapes of pronunciation time distributions.

Reading scientists concern themselves with the cognitive activity governing the translation of printed words into meaningful language. Speech production scientists seek a coherent description of the processes that guide production in speaking aloud. Traditionally, both these dimensions of reading aloud are studied as separate component process of cognition. Nevertheless, both aspects of performance are in play as an individual pronounces even a single visually presented word in a speeded naming experiment.

Speaking aloud, whether reading or in conversation, is a fundamentally dynamic, time varying activity. The simplest of utterances require speakers to adaptively coordinate an untidy throng of anatomical and physiological variables. Self-organizing coordinative synergies tame the inherent complexity of speaking, however (Turvey, 2007). This article describes how pronunciation times derived from a speeded naming task map into the coordinative speech activity along empirical and theoretical dimensions.

Historically, cognitive scientists used ideal probability distributions as models to explore cognitive performance. The most widely invoked working hypothesis was that a model’s parameters may selectively refer to the activity of specific cognitive processes (e.g., Townsend and Ashby, 1983; Luce, 1986). The term component dominant dynamics refers to this long-standing hypothesis in cognitive science: Behavioral measurements reflect the activity of isolable cognitive components, themselves, their time-course, and their functional details (e.g., Sternberg, 1969; Simon, 1973). A corollary assumption is that computational or symbolic cognitive processing operations occupy the bulk of the elapsed time between signal and response.

This article explores an alternate tack: The parameters of an ideal “cocktail” mixture distribution are hypothesized to collectively reflect the relative stability and flexibility of coordinative structures – the dynamics of the self-organized synergies that support reading and speaking aloud in speeded naming performance. The term interaction dominant dynamics codifies this alternative hypothesis: Behavioral measurements reflect emergent, irreducible, coordination and coupling among the processes that support the act of word naming.

Large samples of pronunciation times, obtained from many different individuals, are used to contrast the assumptions of the component dominant and interaction dominant accounts of the empirical patterns. As we describe next, speech production was shown, via independent sources of evidence, to be governed by self-organizing coordinative synergies. Coordinative synergies rely on interaction dominant dynamics. Since word naming entails a speech act, it offers a unique opportunity to directly asses how that coordinative activity is reflected in the shapes of pronunciation time distributions.

Coordinative Synergies Behind the Spoken Word

Pronouncing-aloud an individual word may seem effortless and prosaic. It is well-established, however, that successfully uttering even a single phoneme requires a speaker to produce a complicated spatiotemporal arrangement among at least 70 muscles (Kelso et al., 1984, 1986; Turvey, 1990, 2007). Human speech presents an example of Bernstein’s (1967/1996) “degrees of freedom” problem; any theory of speech control that posits direct command of the relevant individual processes and variables is quickly overwhelmed and rendered implausible by the sheer number of processes that must be controlled. Coordinative synergies resolve the paradox. Synergies are couplings among relevant component processes; couplings that reduce the degrees of freedom among the possible arrangements of the components of mind and body to produce speech (Tuller et al., 1982; Turvey, 1990, 2007). Synergies impose constraints that compress the potential degrees of freedom in the possibilities for articulation or action.

A mechanical example of a coordinative link is a tie-rod that connects the two front wheels of a vehicle. Two degrees of freedom are required to independently control each wheel. But any wheel arrangement that points them in different directions is dysfunctional. A tie-rod coordinates their movements and reduces the control problem to a single degree of freedom, instantiated in the position of the steering wheel. Of course biological coordinative synergies are more subtle, flexible, and abstract, but the essentials of the solution are the same (Tuller et al., 1982; Riley et al., 2011).

Kelso et al. (1984) conducted a seminal test of the synergy hypothesis in speech production. The task required each participant to say “bab” and “baz” despite an unpredictable, occasional, mid-utterance, force perturbation to their jaw. The applied force altered the participant’s articulatory trajectory. In the case of “bab,” participants’ upper and lower lip movements immediately compensated for the jaw perturbation such that no distortion in the utterance could be detected (<30 ms, arguably faster than the minimum lag of a sensory-motor loop). Similarly, the same jaw perturbation while saying “baz” yielded virtually instantaneous and appropriate adaptation in the tongue’s trajectory that again left no distortion in the final utterance. The linkages that compose synergies predict such online, reciprocal compensation among the components of speech articulation.

What is most remarkable is the local, online compensation that emerges to render the intended utterance fully intact. The linkages that compose the synergies allow simultaneous reciprocal compensation among the elements of the pronunciation of “bab” and “baz.” Speech acts, such as word pronunciation, are accomplished through coordinative synergies (Kelso et al., 1984, 1986; Saltzman et al., 1998; Turvey, 2007; for a review, see Van Lieshout, 2004). Even elementary speech gestures require cooperative relationships among ensembles of elemental perceptual, cognitive, and neuromuscular variables. The online, reciprocal compensation that is observed at the behavioral level is an expression of interaction dominant dynamics (Jensen, 1998; Van Orden et al., 2003a, 2005; Kello et al., 2008;Holden et al., 2009, 2011). This category of coordinative behavior is also indicative of biological self-organization in general, and represents an established instance of self-organization in speaking aloud.

Linking Synergies to a Distribution’s Shape

We now explain how measures of pronunciation time are fundamentally related to the coordinative synergies that are known to govern natural speech production. Pronunciation time distributions offer a probabilistic event-based assessment of the coordinative dynamics that arise to support performance for each target item, or so we claim. Pronunciation times, themselves, measure the time required for an utterance to coordinate and unfold to a point where a participant’s audible voice-amplitude exceeds a threshold of measurement.

While this coordinative activity unfolds over time, it is nevertheless sampled as a discrete “pronunciation time” once an utterance triggers a voice key. As such, pronunciation times supply point-samples of the evolving multidimensional dynamics of physiological, anatomical, and acoustic variables as they unfold and give rise to a spoken word. Instantaneous pronunciation time measures are thus substituted for an underlying, continuous dynamic trajectory and that trajectory is formally collapsed into a point process (Lowen and Teich, 2005). Clearly, some portion of each pronunciation time is unique to the particulars of the target word: Its unique acoustic profile, measurement uncertainty, and other idiosyncrasies. However, all aspects of the dynamic must enfold to yield a successful, audible word pronunciation.

It is well known that cognitive, kinematic, and articulatory manipulations routinely influence mean pronunciation and response time. This fact establishes that both pronunciation and response time measurements are sensitive to a broad spectrum of performance dynamics (e.g., Abrams and Balota, 1991; Van Orden and Goldinger, 1994; Van Orden et al., 2003b; Perry et al., 2010). In fact, the discovery that coordinative synergies support speech performance implicates speech as dynamically assembled and governed. If so, dynamical systems theory dictates that the trajectories entailed by the larger “speaking-a-printed-word” system are in evidence in the pronunciation time measurements. This is key consequence of the interconnectedness of dynamical systems.

Events that affect the dynamics of one process reverberate through and change the dynamics of other process because they are reciprocally intertwined. This coupling across degrees of freedom means that appropriate measurements of just a single observable normally reveals information about the dynamics of the system as a whole (e.g., Takens, 1981). In fact, Takens’ embedding theorem and related findings gave rise to an entire signal-processing discipline dedicated to uncovering and describing the dynamics of complex systems based, in many cases, on just a single observable (Abarbanel, 1995; Kantz and Schreiber, 1997; Gao et al., 2007). Embedding theorems are as foundational to the statistics of non-linear dynamic systems as the Central Limit theorem is to the statistics of stochastic linear systems.

Our working hypothesis is that if coordinative synergies underlie acts of reading and speaking words aloud, it will be corroborated in the shape of pronunciation distributions. There are two ideal probability distributions that are symptomatic of discrete samples of the aforementioned mutually contingent, coupled dynamics (Montroll and Shlesinger, 1982; West and Deering, 1995; Holden et al., 2009). The first is a lognormal distribution (Evans et al., 2000; Limpert et al., 2001). It is a positively skewed distribution that appears as a symmetric Gaussian distribution after a logarithmic transform of the measured variable. The second, an inverse power law distribution, expresses a more pronounced positive skew than a lognormal distribution (Clauset et al., 2009). Power law behavior is symptomatic of self-organizing physical systems poised are near a critical point (Bak, 1996; Jensen, 1998).

Article Overview

Relating standard pronunciation time measurements to an underlying, largely unobserved dynamic flow requires a basis in evidence. It must be demonstrated that pronunciation times are plausibly associated with interaction dominant dynamics. The detailed analyses of this article are directed at establishing whether pronunciation time distributions conform to shapes that are consistent with event-based samples of coordinated dynamic flows. Thus, we seek evidence of interaction dominant dynamics from the pronunciation time distributions of individual participants. If pronunciation time distributions reasonably conform to either idealized lognormal, inverse power law, or mixtures of these distributions then pronunciation events are likely contingent on interaction dominant dynamics. After all, both classes of distributions are widely identified as symptomatic of interaction dominant dynamical systems in nature (Montroll and Shlesinger, 1982; West and Deering, 1995; Bak, 1996; Jensen, 1998; Limpert et al., 2001; Holden et al., 2009).

Over the course of this article, a refined incarnation of the Holden et al. (2009) cocktail description based on mixtures of lognormal and inverse power law distributions is introduced. Its key advantage is its suitability for maximum-likelihood fitting techniques. The refined description facilitates density estimation and hazard function tests that contrast common exponential and inverse power law descriptions of individual participant’s pronunciation time distributions. The outcomes of our tests indicate that power law behavior is likely a more accurate description of the stretched, slow tails of individual participant’s pronunciation time distributions than exponential behavior.

So far, we have detailed how pronunciation times, traditionally viewed as indexes of processing time, can be reinterpreted as indicators of the dynamic coupling entailed in the act of speaking printed words. In the next section, we introduce and contrast distributions that are symptomatic of superposed, component dominant dynamics, such as an exponential or a Gaussian, from those symptomatic of interaction dominant dynamics, such as an inverse power law or a lognormal distribution. The correspondence test section details the findings of our exponential versus power law statistical contrasts on a 30 participant pronunciation time data set from Experiment 2 of Holden et al. (2009). Following that, the same contrasts are generalized to a much larger 470 participant pronunciation time data set from the English Lexicon Project (ELP) described by Balota et al. (2007). In both cases, mixtures of lognormal and inverse power law distributions described the empirical patterns reliably better than ex-Gaussian or ex-Wald alternatives. Finally, our general discussion illustrates how interaction dominant dynamics might impact the interpretation of several standard empirical word recognition results. We illustrate how the understanding of linear item-level regression analyses on large-scale pronunciation time data sets must be conditioned by the fact the cognitive dynamics word recognition tend to unfold on faster times-scales than articulatory dynamics.

Distinguishing Exponential from Power Law Behavior

In science, some distinctions matter more than others. Critical differences involve the identification of patterns with broad implications for the core assumptions of a research program. For cognitive scientists, determining whether exponential decay or inverse power law decay best characterizes the slow tails of pronunciation time distributions is just such a crucial question – the answer could at once render large classes of cognitive models as either plausible or implausible.

Exponential Density Functions

Historically the positive skew commonly expressed in both pronunciation and standard response time distributions is approximated as a form of exponential decay. That is, past the distribution’s mode the probability density is thought to decay as p(t) ≈ (1/λ)e^−t where t is just the time axis of pronunciation time. Exponential decay is, for instance, the dominant term in the slow tails of the ex-Gaussian, the ex-Wald, the Gamma, and the inverse Gaussian distributions (Moscoso Del Prado Martín, 2010). Each distribution has been put forward as either a description or a model of response time distributions (e.g., see Luce, 1986; Van Zandt, 2000; Schwarz, 2001). In all that follows, we adopt the ex-Gaussian and the ex-Wald distributions as canonical examples of pronunciation time descriptions that express exponential decay.

An exponential tail could signify, for example, cognitive and perceptual processes that conform to stochastic “counting” or queuing output processes: The steady, reliable accrual of cognitive or perceptual information, as characterized by the mean (λ) of the distribution (e.g., Townsend and Ashby, 1983; Balakrishnan and Ashby, 1991). Furthermore, if an exponential rate parameter is sufficient to characterize a cognitive process then it could, in principle, be identified and discriminated from other processes with different characteristic rate parameters or distribution functions. Finding in favor of an exponential description of the slow tails of pronunciation time distributions supports the hypothesis that cognitive components, themselves, dominate the transactions associated with reading and speech performance – a finite set of independent, superposed processes that conform to a characteristic time-scale – indicated by a set of characteristic exponential rate parameters, for instance. Given this outcome, the reductive methods of linear analysis may best reveal and individuate the fundamental processes that govern reading, speaking, and related cognitive activities.

The ex-Gaussian description

The ex-Gaussian distribution represents the convolution of a Gaussian distribution and an exponential distribution. The convolution operation yields a “child” random variable that represents the sum of these two independent “parent” random variables. The original description of the ex-Gaussian was rooted in a hypothesis that response times represented the serial output of two processing stages, a perceptual-decision stage that yields an exponential variate and a motor-execution stage that results in a Gaussian variate (Hohle, 1965). An opposing interpretation was also proposed, where the motor response generated an exponential and the perceptual-decision stage was Gaussian (McGill and Gibbon, 1965). Some contemporary authors limit their use of the ex-Gaussian to a purely descriptive role (e.g., Heathcote et al., 1991; Balota and Yap, 2011; Yap et al., 2012) while others have interpreted the ex-Gaussian parameters in terms of distinct cognitive process (e.g., Balota and Spieler, 1999). Still others question the motivation for using the ex-Gaussian at all, given that it lacks the peaked hazard function that is widely indicated in empirical response time distributions (e.g., Schwarz, 2001; Holden et al., 2009; Van Zandt, unpublished manuscript).

The ex-Wald description

The ex-Wald distribution represents an additive combination of a Wald (a.k.a., an inverse Gaussian) and an exponential distribution (Schwarz, 2001). It is aimed at describing cognitive processes that entail an additive combination of a bottom-up accumulation of information that outputs variation in the form of a Wald distribution, plus a non-decision process associated with a response threshold that yields exponentially distributed variability. The resulting convolution of a Wald and an exponential results in an ex-Wald distribution.

The ex-Wald distribution is typically implemented as an additive, three parameter convolution of the parent Wald and exponential distributions which facilitates selective influence tests (Schwarz, 2001). We implemented a more flexible four-parameter version of the ex-Wald that includes an onset threshold for the exponential portion of the distribution. The added parameter allows the model to capture a broader potential set of functional forms than its three parameter counterpart. Since the ex-Gaussian uses three parameters, we added a fourth parameter to the ex-Wald to insure that a lack of flexibility resulting from too few parameters could not be the sole reason that an exponential description could succeed or fail to capture the empirical patterns.

Inverse Power Law Density Functions

More recently cognitive scientists introduced descriptions of pronunciation and response time distributions that posit inverse power law decay as the dominant term describing the positive skew in the slow tails of response time distributions. If the extreme slow tail of a distribution decays as a power function, then the probability of observing a particular response time, p(t), is the inverse of the pronunciation time value, t, itself, raised to a scaling exponent α, i.e., p(t) ≈ t^-α. An inverse power law entails a more dramatic positive skew than an exponential. Examples include the Pareto distribution, Fieller’s distribution (Moscoso Del Prado Martín, 2010), the Lévy distribution (Rhodes and Turvey, 2007), power law tail behavior (Sigman et al., 2010), and the cocktail description (Holden et al., 2009). We adopt a refined version of the Holden et al. cocktail description as a canonical example of pronunciation and response time models that express power law behavior in their slow tails.

An inverse power law tail implies the processes supporting reading and speaking aloud are capable of spanning a wide range of time scales, from very short to very long. Power law behavior is associated with complex systems composed of processes that interact to self-organize their behavior across multiple temporal or spatial scales (Jensen, 1998; Van Orden et al., 2003a). The behavior of countless complex systems relies on strongly coupled, interdependent processes that yield emergent patterns, such as power law distributions (Bak, 1996; Johnson, 2007). If power law behavior is commonly expressed in naming performance then the tools of complexity science may best aid investigations concerning the fundamental principles and patterns expressed in speaking printed words aloud.

The catch, however, is that in practice exponential and power law functions, such as probability densities, are notoriously difficult distinguish empirically since both functions are so similar (Clauset et al., 2009). In the context of favorable parameter settings each function can be made to closely mimic the other. Our statistical analyses are aimed at making just such a distinction regarding the tail behavior of pronunciation time distributions derived from the speeded word naming task. As we explain, the hazard functions of the exponential and power law distributions are quite distinct, despite the similarity of their density functions. Thus, a distinction between exponential and power law decay in the slow tails of a distribution does have the potential to be determined.

Pronunciation Times as Mixtures of Lognormal and Power Law Samples

Each naming trial requires participants to pronounce aloud an individual word that appears on a visual display. The resulting pronunciation time, the elapsed time between the presentation of the word and the moment a participant begins to utter the word, is then recorded. Normally, participants are asked to name hundreds of words in a single session, or perhaps thousands of words across several sessions.

Individual pronunciation times produced by different individuals are heterogeneously distributed. Figure 1 allows a cursory contrast of three individual ELP participant’s pronunciation time distributions. It reveals an apparently broad continuum of probability density functions. The distribution in the left plot is compact and nearly symmetric, the distribution in the right plot reveals a clear and potent positive skew, and the shape of the distribution in the center plot falls between these two extremes. In lay terms, some participants are just faster more often than others. However the differences in the shapes of the distributions across participants are striking. Arguably, any accurate characterization of the variety of shapes of pronunciation time distributions is a critical first step in providing a basis for understanding the process that governs the act of speaking aloud a printed word. This and other concerns motivated the so-called cocktail model. It mixes samples from two idealized distributions, proportionally, like the molecules of different liquids in a cocktail.

FIGURE 1

Figure 1. Three example empirical pronunciation time distributions from the English Lexicon Project data set (Balota et al., 2007). Each plot depicts the probability density function of an individual participant’s empirical pronunciation time distribution. The black lines represent kernel-smoothed empirical probability density functions. Maximum-likelihood fits of the refined cocktail description are depicted in white, behind the empirical density functions. In each plot, the white line depicting the cocktail fit is hardly visible because it closely matches the empirical distribution.

The refined cocktail description

The original Holden et al. (2009) cocktail description approximated empirical pronunciation time distributions as a probabilistic mixture of synthetic “trial” samples from either a lognormal or an inverse power law distribution. The refined cocktail model is a simpler and more compact parametric formulation of the original lognormal and inverse power law mixture distributions. The lognormal mean and standard deviation are treated as unknown free parameters. The inverse power law density entails two unknown free parameters including an onset threshold (a necessarily positive-valued lower bound of support for the distribution) and a scaling exponent.

The refined model is formulated such that it is a pure lognormal distribution for the values of the random variable below the power law threshold (the “front” or left side of the distribution). Above the threshold (the “tail” or right side of the distribution), the refined cocktail is a mixture of a lognormal and an inverse power law distribution with unknown weights. The refined cocktail model is defined so that it is guaranteed to be a continuous and smooth probability density function over the entire positive real line – except at the threshold value – depending on the weights of the mixture components the density may become discontinuous at the threshold. Thus, an interrelationship or constraint among the parameters of the cocktail model components and their mixture weights was imposed to ensure an idealized smooth, unimodal probability density function, that by definition, integrates to unit area.

The refined cocktail model is a function of four independent parameters: the lognormal distribution mean and standard deviation (Ω_LN and σ), the inverse power law distribution scaling exponent (α), and the relative weight of the power law tail in its mixture with the portion of the lognormal that falls above the lognormal mean (ρ_PL). Three remaining parameters are all implicitly dependent on the other four free parameters (see Appendix). For a given pronunciation time dataset, the cocktail model parameters can be estimated with standard parameter estimation techniques, such as maximum-likelihood estimation or maximum spacing estimation. The crux motivation for the cocktail mixture is the hypothesis that the dynamics of word naming vary largely in terms of their relative stability, and span a continuum ranging from relatively stable lognormal behavior to less stable inverse power law behavior.

A Correspondence Test of the Original and Refined Cocktail Models

There are any number of ways one could implement a probabilistic mixture of lognormal and power law samples. Thus, it is important that the results of the refined cocktail mixture display evidence of correspondence to the original cocktail description. The correspondence test amounted to comparing the parameters from the original and refined cocktail descriptions to fits of the pronunciation time data from Experiment 2 of Holden et al. (2009).

We also used the ex-Gaussian and ex-Wald models to approximate the same empirical distributions. Parameter estimates from all three models were then used to introduce and apply a rigorous goodness of fit test recommended by Clauset et al. (2009) for distinguishing heavy-tailed distributions, such as the power law and exponential. The test is based on a Monte Carlo version of the standard Kolmogorov–Smirnov goodness of fit statistic. Following that, we introduce a novel hazard function analysis, using the Bayesian Information Criterion or BIC statistic, that contrasts mean integrated squared error between the empirical and ideal best-fit cocktail, ex-Gaussian, and ex-Wald hazard functions.

Since the experimental procedures used to collect these data are described elsewhere, the bulk of our Method section supplies the details and rationale of our statistical procedures. Our Results and Discussion sections offer brief overviews of each of these procedures, however. The full derivation of the refined cocktail model appears in an Appendix. Matlab^® code is available at http://homepages.uc.edu/∼holdenjn/. Readers that do not require all the details of our statistical procedures may choose to skip the Method section of this experiment.

Method

Participants and procedure

The Holden et al. (2009) Experiment 2 data set is composed of 30 participants’ pronunciation times to 1100 randomly selected single and multisyllabic English words that ranged from 4 to 15 letters in length with an average frequency of occurrence of 70.2 times per million (SD = 295.16) according to the Kuĉera and Francis (1967) norms. Readers may refer to the original article for a more complete description of the word stimuli and laboratory methods.

Distribution fitting

Standard maximum-likelihood estimation methods were applied to the empirical distributions to estimate a set of density function parameters for each participant’s pronunciation time distribution. Each participants’ distribution was approximated using the refined cocktail distribution, the ex-Wald distribution, and the ex-Gaussian distribution.

One typically uses a Kolmogorov–Smirnov (K–S) goodness of fit test to determine the relative match between an empirical distribution and a candidate model distribution. The test returns the maximum distance (D) between the two distributions over the interval of their cumulative density functions. If D is small enough, there is insufficient evidence to reject the model as a description of the data.

However, the assumptions of the above testing scheme are valid only if the model is known a priori – and that criterion is so far unmet for pronunciation times. If the D value is derived from a contrast between the best-fit model and the empirical distribution, then a correlation between the data and model arises. Thus, we used a more rigorous Monte Carlo procedure to generate an unbiased goodness of fit test (Clauset et al., 2009). First, for each model, and each participant, an ideal best-fit cumulative distribution function was inverted to generate simulated pronunciation times of the same size as the participant’s empirical distribution. For a given model, 2500 synthetic data sets were generated using the best-fit model parameters. Next, each synthetic distribution was compared to the idealized model distribution function, using the same parameters, and a D goodness of fit statistic was generated for each synthetic data set. This operation yielded a distribution of 2500 D values resulting from the synthetic fits. Finally, we compared the D statistic derived from the given model’s best-fit of the participant’s empirical data set to the distribution of 2500 D values derived from all the synthetic fits.

Goodness of fit was defined in the following manner: If the observed D statistic fell on or below the 90th percentile (i.e., p ≥ 0.1) of the distribution of synthetic D statistics, the model could not be reasonably ruled out as a plausible description of the empirical data. In our Section Results, we report the number and percent of empirical distributions that passed this test. Next, we describe an additional test, aimed at contrasting the hazard functions of each model.

Hazard function estimation

We estimated empirical hazard functions by first computing a variable-width Gaussian-kernel probability density and cumulative distribution estimates for each individual’s pronunciation time distribution (Silverman, 1986). At each point on the increasing X-axis of pronunciation time t, the probability density is f(t) and the cumulative distribution function is F(t). The empirical hazard function, h(t) was computed as h(t) = f(t)/[1 − F(t)]. Likewise, ideal model hazard functions were computed on the density and cumulative distribution functions specified by the maximum-likelihood fits of each participant’s empirical distribution using the cocktail, ex-Wald, and ex-Gaussian distributions. Hazard functions were computed over 1024 equally spaced points on the interval, beginning with the minimum pronunciation time censoring value and ending with the largest observed pronunciation time in each participant’s empirical distribution. The empirical and ideal hazard functions were computed over the same pronunciation time intervals for each participant. We contrasted empirical and model hazard functions by first computing the mean squared integrated error (MISE) between them (i.e., MISE, Silverman). We compared the three models with a BIC statistic (i.e., BIC, e.g., Schwarz, 1978; Wagenmakers and Farrell, 2004; Chatterjee and Hadi, 2006) was computed as:

BIC = n \times ln (\frac{MISE}{n - 1}) + θ \times (ln n)

(1)

where n is the number of pronunciation times in the empirical distribution, MISE is the mean integrated squared error, θ is the number of free parameters in the model (three for the ex-Gaussian, four for the cocktail and ex-Wald). In this form, lower BIC values indicate relatively better fits than larger values. The model with the lowest BIC score was classified as the winning description.

We also address a need for an alternative and more flexible hazard function routine than the classic Miller and Singpurwalla (1980) method (see Discussion). We implemented a routine based on a Gaussian-kernel-smoothed probability density estimate that used a variable-width kernel. The empirical kernel-smoothed density functions were transformed into hazard functions and used in contrasts with ideal hazard functions based on the best-fit model for each candidate model. The main advantage of using the Gaussian-kernel method lies in its ability to represent a wider range of hazard function shapes than the randomized smoothing methods that were the historical standard in response time analyses.

The hazard contrast represents a separate, quasi-independent statistical test from the maximum-likelihood fits of the models. On one hand, it depends on the fits, on the other hand, the model and empirical hazard functions are generated and compared by a wholly different means. We adopted procedures to guard against biasing the analyses in favor of either the exponential or power law hypothesis. First, the asymptotic tail behavior of hazard functions computed in the manner described is to increase toward positive infinity as the denominator [1 − F(t)] approaches zero, but an exponential hazard is constant and a power law hazard decays toward zero. Second, we applied the BIC statistic to compensate for the fact that both the cocktail and ex-Wald entail one more free parameter than the ex-Gaussian distribution. Given a distribution with 1100 observations the BIC penalty for the ex-Gaussian distribution is 21.00 and 28.01 for the more flexible cocktail and ex-Wald models which represents a 25% increase in the penalty score for one additional free parameter. Next we describe, in turn, the outcome of the maximum-likelihood fitting and hazard discrimination analyses.

Results

Distribution fitting outcomes

All statistical analyses included both correct and incorrect pronunciation times that fell between 350 and 3500 ms. The small proportion of error pronunciation times (2.45%) were included because they made no difference in the outcome of our analyses. Twenty-seven of the 30 participants distribution’s were reasonably approximated by the refined cocktail model (90%). Using the same Monte Carlo standard for goodness of fit described in the methods, the ex-Gaussian plausibly described 57% of the data sets. The Monte Carlo test for goodness of fit revealed that the four-parameter ex-Wald distribution plausibly described only 50% of the data sets. Our procedure for assessing goodness of fit was described in the Method section.

At the level of the density function, the cocktail model successfully approximated the individual participant’s pronunciation time distributions within the confines of the rigorous statistical standards recommended by Clauset et al. (2009). The ex-Gaussian captured every distribution that was successfully approximated by the ex-Wald. Neither the larger ex-Wald parameter set, nor the lack of a penalty for more parameters enhanced its ability to capture empirical patterns at the level of the density function.

The fitting outcome is informative on several points. First, the cocktail distribution successfully approximated a much larger portion of the empirical distributions than either the ex-Gaussian or ex-Wald models. The success of the cocktail distribution is not likely due exclusively to the fact that it has more free parameters than the ex-Gaussian. The ex-Wald entailed the same number of free parameters as the cocktail distribution, but it actually fared worse than even the three parameter ex-Gaussian. It therefore does not appear that we are exclusively observing the impact of fitting more or less flexible candidate descriptions of the pronunciation time distributions.

Cocktail correspondence assessment

The key location parameters estimated using the original and refined versions of the cocktail model agreed with each other surprisingly well. The average values of the four free refined cocktail parameters were: lognormal mean Ω_LN, 6.31, or 550 ms, (SD = 0.12), lognormal SD, σ, 0.12 (SD = 0.01), scaling exponent, α, 6.68 (SD = 1), and the power law proportion tail, ρ_PL, 0.33 (SD = 0.16). Among the 26 distributions that were plausibly approximated by both the original and refined cocktail, the lognormal mean (Ω_LN), the power law scaling exponent (α), power law threshold (Ω_PL) were strongly correlated with the same parameters derived from the original cocktail description r(24) = 0.97, p < 0.05 and r(24) = 0.91, p < 0.05 and r(26) = 0.93, p < 0.05 respectively. Regression analyses revealed slopes very near 1 and intercepts near 0 for these three parameters.

The free parameters that correspond to variability, such as the lognormal SD (σ) and the power law weight parameter (ρ_PL) were also positively correlated, but less so, r(24) = 0.53, p < 0.05 and r(24) = 0.65, p < 0.05 for σ and ρ_PL, respectively. The primary difference between the two descriptions was that the refined cocktail model successfully approximates the tails of the more heavy-tailed empirical distribution’s with a lower weight or proportion of samples from a lognormal distribution in the distribution’s right tail. The main discrepancy between the two descriptions loads on variables that describe variability because the original cocktail fits were derived from contrasts of synthetic and empirical kernel density functions, both of which relied on automatic smoothing parameters that increase relative variability as the square of the smoothing parameter (Silverman, 1986, p. 37). Overall the correspondence between the original and refined implementations of the cocktail description is quite remarkable given they were arrived at by such disparate means.

Next we present an additional evaluation, focused on describing the tail behavior of the distributions. It is based on hazard function contrasts. Both the ex-Gaussian and the ex-Wald descriptions of pronunciation time entail exponential tails but the cocktail model entails power law decay. Transforming the density and distribution functions into the hazard function domain amplifies their differences and often renders a categorical distinction between exponential and power law tail behavior.

Hazard function contrasts

Mathematical psychologists articulated several empirical criteria that any general description of response time distributions should approximate. The consensus regarding the empirical standards that must be demonstrated, in order of relative difficulty, are: (1) capture the descriptive statistics of response times, e.g., mean, SD, and skew. (2) Recover the cumulative distribution and probability density functions of response time. (3) Express the three characteristic hazard functions of response time (e.g., see Luce, 1986; Townsend, 1990). To avoid the pitfalls of ad hoc statistical mimicking, all this must be accomplished in the context of a compelling theoretical motivation that arises for reasons beyond the observed empirical patterns (Van Zandt and Ratcliff, 1995).

A probability density function depicts the instantaneous probability of observing an event within an interval on the X-axis. Similarly, a hazard function describes an instantaneous event rate, per unit time, given that it has not yet occurred. Hazard functions tend to amplify subtle quantitative differences among distributions so they become visible, qualitative differences. Figure 2 illustrates that while the probability density functions of the exponential and power law distributions are reasonably similar, their hazard functions are quite distinct.

FIGURE 2

Figure 2. The ideal cumulative distribution, probability density, and hazard functions of a pure power law distribution (solid white line, top three plots), and a pure exponential distribution (dashed-gray line, bottom three plots). The cumulative distribution and probability density functions of each distribution are visually quite similar, but the hazard functions are qualitatively distinct. The power law hazard, itself, decays as a power law. By contrast, the exponential hazard is constant. The BIC mean squared integrated error analysis capitalizes on the qualitative and quantitative difference in the hazard domain.

An exponential distribution’s hazard function is a constant function of time. That is, it rises to a constant, asymptotic rate as depicted in bottom right plot of Figure 2. This fact is the basis of the characterization of an exponential process as “memory less.” The instantaneous event rate, given that an event has not yet occurred, is constant. That is, the ratio between elapsed events and potential remaining events is constant for any given interval of time. Hence the saying “used is as good as new” is directed at tools and machinery that break or otherwise fail as an exponential function of time. For a given interval of time, a brand new tool is just as likely to fail as a very old tool, so there is no benefit to paying the typical premium for “newness.”

The ex-Gaussian distribution yields a hazard function that rises monotonically to a constant asymptote. This behavior straightforwardly reflects its constituent Gaussian and exponential ingredients. A Gaussian distribution’s hazard function is increasing and the exponential distribution’s hazard function is constant.

The ex-Wald hazard function’s behavior is more complex. It can display a peaked then decaying hazard that resembles a lognormal hazard function. The basis of its peak is the Wald distribution, which itself has a peaked hazard function. In the context of certain parameters, the ex-Wald hazard function can (1) rise above an ex-Gaussian fit to the same data (2) closely mimic the increasing then constant ex-Gaussian hazard function, and (3) display a peaked and decreasing hazard function that resembles that of the cocktail model. This is the basic pattern that must be expressed to capture typical empirical pronunciation time hazard functions. Thus the superposition of a Wald distribution and a variable threshold exponential tail allows the ex-Wald to exhibit the hazard behavior of both its constituents. However, the normal tendency of the ex-Wald hazard is to maintain higher overall hazard rates than a comparable cocktail hazard function, due to its basis in exponential rather than power law decay.

Lognormal and inverse power law distributions both yield peaked hazard functions, they rise quickly to a maximum and then decay toward zero past that point. The peak and rate of decay of the lognormal hazard function is governed largely by the standard deviation of the distribution. The ideal hazard function of an inverse power law distribution itself decays as a power law with a scaling exponent of 1, regardless of the scaling exponent of the distribution’s probability density function, as in the top right plot of Figure 2. Thus, an ideal power law hazard function, itself, is scale-free, as is the power law density function. While specific details of the hazard functions depend on the distribution’s parameter values, the cocktail model often yields peaked hazard functions that decay more rapidly than the exponentially based hazard functions.

Hazard functions are normally assessed qualitatively, for peaks and other asymptotic behavior. This is due in part to a lack well-established quantitative methods that can be used to distinguish hazard functions. Our Method section describes the details of a test we developed to contrast the hazard functions of the three models. It adopts a BIC statistic to evaluate the MISE between the empirical hazard function and each best-fit model’s hazard function (Silverman, 1986). According to the BIC hazard test the cocktail hazard was a better description of 24 of the 30 (80%) empirical hazard functions. The ex-Gaussian and ex-Wald each successfully captured 3 of the 30 hazard functions for an exponential total of six (20%) of the empirical distributions.

Figure 3 illustrates the empirical and the three ideal hazard functions for each of the nine participants in the experiment. In each plot, the solid black line represents an individual participant’s empirical pronunciation time hazard function. The solid white line depicts the ideal cocktail hazard function corresponding to the best-fit parameters, as estimated from the empirical distribution. The lighter solid and dashed-gray lines depict the hazard functions of the best-fit ex-Gaussian and ex-Wald distributions, respectively. In four cases both exponential hazard functions fell nearly atop one another, which resembles a single solid gray line on the plots. The hazard functions of the leading edges of all the model distributions were highly similar. However the hazard functions of the exponential and power law descriptions tended to diverge in the slow tails of the model distributions. Overall, the pattern of decay in the tails of the empirical distributions was better represented by the power law behavior entailed in the cocktail description than by the exponential behavior entailed in the ex-Gaussian and ex-Wald descriptions.

FIGURE 3

Figure 3. The X-axes of each plot track pronunciation time and the Y-axes indicate the hazard rate, in events per millisecond. The heavy solid black line represents the empirical hazard functions for each of the nine participants. The ideal hazard functions for cocktail description are depicted as heavy white lines. The hazard functions of the ex-Gaussian and ex-Wald models are depicted as lighter, solid gray, and dashed-gray lines, respectively. Cases that appear to only depict one solid gray line indicate the ex-Gaussian and ex-Wald hazard functions were effectively identical. The majority of the empirical hazard functions entail a peak and their slow tails more closely match the power law decay indicated in the cocktail description than the exponential decay indicated by either the ex-Gaussian or ex-Wald alternatives. The distribution in the top left plot yielded the smallest power law scaling exponent (α = 3.44), the bottom right plot yielded the largest scaling exponent (α = 11.36). α = power law scaling exponent.

Benchmarking simulations

We assessed our BIC hazard discrimination routine by replacing, in turn, each of the 30 participant’s empirical pronunciation times with synthetic data derived from the best-fit parameters of the cocktail, the ex-Gaussian, and ex-Wald models (also, see Wagenmakers et al., 2004 for a similar technique). Replacing the empirical data with data from known models allowed us to assess the ability of the hazard routine to discriminate the power law from exponential models. Twenty-five realizations of the 30 participant simulations were run for each ideal model. Correct classification (hit) rates and incorrect classification (false alarm) rates were averaged across the 25 realizations of each 30 participant simulation.

The winning model was that model with the lowest BIC score for the given simulated data, as specified by the parameters of one of the three models. If either the ex-Gaussian or ex-Wald model were the true models, we treated a win for either model as a win for an exponential description. The ex-Gaussian description often captured ex-Wald data since they are functionally similar and the ex-Gaussian entails one less parameter. Similarly, if the cocktail description was the true model, we treated a win for either the ex-Gaussian or ex-Wald as an exponential false alarm.

When the cocktail model was the true model, it was correctly classified as a power law 89% of the time, the exponential false alarm rate was 11%. Given an ex-Gaussian as a true model, it was accurately classified as exponential 90% of the time, with a power law false alarm rate of 10%. A true ex-Wald was classified as exponential 98% of the time, with a power law false alarm rate of only 2%. These hit and false alarm rates, in turn, allowed us to estimate the sensitivity of the hazard classification routine with the help of a d-prime (d′) analysis.

The ability of the BIC hazard classification routine to distinguish the expression of exponential and power law behavior, its sensitivity, was estimated by computing a d′ statistic for each hit and false alarm rate. The d′ computation works by transforming hit and false alarm rates into a standardized z-score that represents the distance between the means of two hypothetical Gaussian distributions that represent the “signal” (true model) and “noise” (alternate model) distributions (Green and Swets, 1988). The values of d′ for the power law versus exponential contrasts using synthetic data were 3.95, 2.56, and 2.49 for the ex-Wald, the ex-Gaussian, and the cocktail description, respectively.

So, given the ranges of parameters represented by this data set, the ex-Wald was the easiest to classify as an exponential model. This was despite the ex-Wald being the most flexible exponential model, and likely resulted from the BIC penalty imposed for the additional parameter. The ex-Gaussian was a bit more difficult to distinguish from the power law, as it was occasionally misclassified as such. Likewise, the cocktail model was occasionally misclassified as an exponential. The fact that the cocktail description and the ex-Gaussian yielded comparable d′ values indicates that, on average, the BIC adjustment produced a reasonably balanced contrast between the simplest exponential model the more complex cocktail description.

Discussion

The maximum-likelihood fits to the three ideal models strongly favored the cocktail description of the pronunciation time distributions over an ex-Gaussian or a four-parameter version of the ex-Wald distribution. Our density fitting methods were conservative in that we followed Clauset et al.’s (2009) criteria for evaluating the plausibility of each model. The technique was explicitly developed to distinguish different classes of heavy-tailed distributions, such as the exponential and power law. Ninety percent (90%) of the participant’s distributions could be plausibly approximated using the cocktail description. By contrast the ex-Gaussian plausibly approximated only 57% of the distributions and the ex-Wald passed muster only 50% of the time.

The BIC hazard classification routine took model complexity explicitly into account by comparing goodness of fit with a BIC statistic that strongly punishes more complex models over simpler models (Wagenmakers and Farrell, 2004). This test specifically examined the hazard function shapes. The hazard shape is a highly diagnostic property, not readily represented in outcomes of goodness of fit tests conducted on density functions. The BIC hazard test classified 80% of the empirical distributions as most plausibly consistent with the cocktail description of pronunciation times, the two exponential models captured the remaining 20% of the distributions.

Alternate censorship criteria

Notably, imposing a common but more liberal response trimming criteria from the ex-Gaussian literature allowed the ex-Gaussian to successfully fit the density functions of 29 of the 30 participant’s distributions (by contrast, the cocktail captured 25 of 30, the ex-Wald, 19 of 30). This censoring method removes errors, observations beyond 200 and 3000 ms as well as any observations beyond 2.5 standard deviations from the distribution’s mean. Consequently, the ex-Gaussian captured five additional distributions that it otherwise failed to fit; all of which previously appeared to express significant power law behavior.

To better understand this outcome, we generated 30 simulated cocktail distributions, based on parameter estimates for the 30 participants and applied the beyond 200 and 3000 and beyond 2.5 SD trimming criteria to the simulated data sets. We then attempted to fit the remaining synthetic observations with the ex-Gaussian, and all 30 data sets were successfully approximated at the p ≥ 0.1 level. Similarly, we generated synthetic data sets that entailed compelling power law behavior, with scaling exponents ranging between 2 and 4. We then applied the same exclusive 2.5 SD trimming criteria to the data, and 43% of the distributions were subsequently approximated by the ex-Gaussian using the same p ≥ 0.1 criterion described earlier. By contrast, the ex-Gaussian never successfully approximated compelling power law behavior if the 2.5 SD trimming step was omitted.

Clearly the 2.5 SD trimming rule found its utility by facilitating ancillary linear analyses (e.g., Yap et al., 2012). However, the historic rationale for adopting a 2.5 SD rule stems directly from the properties of the tame, symmetric Gaussian distribution function and no contemporary cognitive scientist argues that pronunciation or response time distributions conform to a Gaussian. In any case, the 2.5 SD rule is clearly not appropriate in the context of attempting to distinguish exponential from power law behavior in light of its ability to transform unequivocal power law behavior into apparent exponential behavior.

Alternate hazard routines

Response time researchers have repeatedly described typical empirical hazard functions as conforming to one of three categories: They either (1) rise monotonically to an asymptomatic value, (2) rapidly rise to a peak and then decline to an asymptote, or (3) they rapidly rise to a much higher peak and quickly fall off past that point. Taken at face value, the second and third characteristic hazard functions could be consistent with a lognormal or a Wald distribution. Both distributions can produce peaked hazard functions – and the first characteristic hazard function of response time sounds surprisingly consistent with the constant, asymptotic exponential hazard function.

An important caveat must be considered, however. The lion’s share of empirical hazard functions that have appeared in the response time literature used a particular hazard estimation method called the “random smoothing” method, as described by Miller and Singpurwalla (1980). This method uses a smoothing parameter that specifies how many observations to include in each point-by-point hazard estimate. The technique captures hazard function peaks, and works well around the mode of the distribution where observations are plentiful, but it asymptotes to a constant value once the number of remaining observations falls below the value specified by the smoothing parameter.

While the random smoothing method supplies hazard functions that facilitate some types of contrasts, it is challenged in the case of representing decreasing hazard functions. Since the random smoothing method is asymptotically biased to a constant hazard function in the context of sparsely populated tails of a distribution, it was not optimal for even-handed comparisons of the hazard function behavior of models that entail exponential or power law tails. This motivated the use of the variable-kernel BIC procedure described earlier in the Method section. The observed shapes of empirical pronunciation time hazard functions varied widely but the kernel-based method revealed their tendency to resemble the ideal cocktail hazard, as peaked and decreasing, an asymptotic pattern that is consistent with power law behavior. Now that the aforementioned statistical issues surrounding distribution fitting, censorship criteria, and hazard estimation are understood, we turn to a stronger test of the generality of the power law hypothesis by applying our suite of statistical analyses to a much larger pronunciation time data set.

A Test of the Generalizability of the Cocktail Description

The ELP (Balota et al., 2007) maintains a repository of pronunciation times that is available to researchers via the web. A total of 470 participants named two lists of approximately 1500 and 1030 items, respectively, during two experimental sessions. The ELP database includes a vast sample of named items, and a considerable sample of participants, originating from six different universities across the US and Canada. Clearly, it supplies a broad cross-section of word naming performances.

We used these data to determine whether the refined cocktail model could accurately approximate this unusually large sample of participant’s pronunciation time distributions and hazard functions more generally. We were also particularly interested in determining whether the parameters derived for the two different sessions were correlated. The cocktail description is predicated on gaging the dynamics that support cognitive performance and there should be a relationship among the parameters derived from two separate laboratory sessions of a single individual naming different lists of randomly selected words.