Re-evaluating Phoneme Frequencies

Causal processes can give rise to distinctive distributions in the linguistic variables that they affect. Consequently, a secure understanding of a variable's distribution can hold a key to understanding the forces that have causally shaped it. A storied distribution in linguistics has been Zipf's law, a kind of power law. In the wake of a major debate in the sciences around power-law hypotheses and the unreliability of earlier methods of evaluating them, here we re-evaluate the distributions claimed to characterize phoneme frequencies. We infer the fit of power laws and three alternative distributions to 166 Australian languages, using a maximum likelihood framework. We find evidence supporting earlier results, but also nuancing them and increasing our understanding of them. Most notably, phonemic inventories appear to have a Zipfian-like frequency structure among their most-frequent members (though perhaps also a lognormal structure) but a geometric (or exponential) structure among the least-frequent. We compare these new insights with the kinds of causal processes that affect the evolution of phonemic inventories over time, and identify a potential account for why, despite there being an important role for phonetic substance in phonemic change, we could still expect inventories with highly diverse phonetic content to share similar distributions of phoneme frequencies. We conclude with priorities for future work in this promising program of research.


INTRODUCTION
Linguistic theorists seek to reveal causal mechanisms which explain the observable diversity of human language. Good causal hypotheses are often suggested by the mathematical distribution that a linguistic variable is described by, owing to the fact that the distribution can be understood as an emergent outcome of some underlying causal process, and that a given mathematical distribution will be consistent with only certain mathematical kinds of underlying processes. Consequently, it is important for the development of theory that proposed claims about distributions be as sound as possible. For instance, one of the most famous distributions in linguistics is the Zipfian distribution, which technically speaking is a kind of power law. Recently, however, the evaluation of putative power laws across the sciences has come under intense scrutiny and often been found wanting. In response, methodologists have developed more rigorous and secure methods for diagnosing power laws and for distinguishing them from similar but significantly different distributions. For linguists, this creates an opportunity to re-examine our own putative power law distributions, and by doing so to improve the pathway to sound explanatory theorizing.
Here, we re-evaluate the status of mathematical distributions for characterizing phoneme frequencies. Previous studies have proposed that phoneme frequencies follow a particular member of the power law family, the Yule-Simon distribution (Martindale et al., 1996; Tambovtsev and Martindale, 2007). But in the wake of a recent, major debate across the sciences regarding power laws, a reconsideration of this earlier research is timely. In this paper, we apply state-of-the-art maximum likelihood methods for the detection and assessment of power laws to derive a better understanding of the distributions that do and do not describe phoneme frequencies well. Our results clear the way for more informed research into the ultimate, processual causes behind the frequency patterns of phonemes in human language.
Our focus here is on distributions and their potential for shedding light on language. Distributions are properties of variables. A variable can be defined as the set of values that characterize something, be it a sample (e.g. a set of languages), or a real population that the sample is drawn from (e.g. the set of all current languages), or even an idealized population which the real population is believed to approximate (e.g. the set of all possible languages). Often, the ultimate object of scientific interest is an idealized population, and thus its distribution. In empirical work, we cannot directly access this ultimate object of interest, and so we rely on real populations, or very often, a sample. Consequently, although we may have direct access only to the literal distribution of a sample, with its many idiosyncrasies, we tend to be more concerned with an overall pattern which we believe it approximates, one which often is elegantly characterized by a distribution which mathematically is relatively simple.
With these motivations in mind, how can an investigator decide that a certain distribution characterizes a variable satisfactorily? One method is visual inspection. This typically involves observing a close match between two histogram plots, one of the data and one of the candidate distribution, or plotting a regression of the data against the candidate distribution. In work on phoneme frequencies, visual inspection has been the primary method of assessing candidate distributions. A more rigorous alternative is to apply quantitative, statistical tests to evaluate how well a particular distribution model (such as the normal distribution) fits a data sample. The purpose of these tests is not to prove definitively that some variable follows a particular distribution, but rather to quantify the degree to which a sample's distribution is consistent with its having been drawn from a population of a particular distribution. Some of these tests will be the subject of the sections that follow.
A strong motivation for testing the consistency of observed data with a particular distribution is that many distributions can be described as the outcome of certain kinds of processes (Frank, 2014). Thus there is a direct link between the quality of our evaluation of distributions, and the reasonableness of the causal, explanatory hypotheses we subsequently entertain. Consider for example a so-called preferential attachment process, also known as a Yule process or rich-get-richer process. This can be imagined as having a set of urns, into which balls are added one at a time. Specifically, the urn to which a new ball is added is selected with a probability proportional to the number of balls already in the urn. This simple process has an interesting outcome. Initially, each urn is equally likely to be selected, but the distribution will soon skew, as urns with more balls accumulate additional balls faster than the others. If we rank the urns in terms of which has the most balls, then with time, the relationship between an urn's rank and how many balls it contains will come to obey a power law. A power law is a mathematical relationship between two quantities where one varies as a power of the other. Consequently, if a variable can be shown to be consistent with a power law distribution, then this is consistent with there existing a preferential attachment process as the causal mechanism underlying the behaviour of the variable. Udny Yule (1925; see also Albert et al., 2011) made this connection a century ago. Yule showed that among flowering plants the level of species richness within a genus follows a power law, and linked that observation to a preferential attachment mechanism. Since Yule's first demonstration of the link between power law distributions and preferential attachment processes, power laws have been used to characterize the distributions of a diverse array of phenomena in the natural and physical world and in human society (Clauset et al., 2009, 661). City populations (Gabaix, 1999; Levy, 2009; Malevergne et al., 2011), authorship of scientific publications, income distribution (Simon, 1955), the superstar phenomenon in the music industry (Chung and Cox, 1994) and the network topology of the Internet (Faloutsos et al., 1999) are but a few of the phenomena for which power laws have been proposed (see Newman (2005), pp. 327-329 for further examples). And in linguistics, the Zipfian distribution has been used to characterize word frequencies in text corpora (Estoup, 1916; Zipf, 1932, 1949).
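To make this concrete, a minimal simulation (ours, for illustration only) of the variant studied by Simon (1955), in which a new urn is also opened with some small probability at each step, can be run in a few lines of R; the resulting urn sizes show the characteristic heavy-tailed, approximately linear rank-size curve on log-log axes:

    set.seed(1)
    p_new <- 0.05                      # probability of opening a new urn at each step
    urns  <- c(1)                      # begin with a single urn holding one ball

    for (i in 1:20000) {
      if (runif(1) < p_new) {
        urns <- c(urns, 1)             # new urn receives its first ball
      } else {
        j <- sample.int(length(urns), size = 1, prob = urns)
        urns[j] <- urns[j] + 1         # rich-get-richer reinforcement
      }
    }

    ranked <- sort(urns, decreasing = TRUE)
    plot(log(seq_along(ranked)), log(ranked))   # approximately linear, i.e. power-law-like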
Given the apparent pervasiveness of power laws in a diverse range of unrelated contexts, it is little surprise that there is a rich vein of literature dedicated to evaluating power laws (Clauset et al., 2009, 662) as well as a century of theorizing on the mechanistic processes by which they arise (see Newman, 2005, 336-348; Mitzenmacher, 2004, 230-243). However, verifying the presence of a power law is not a straightforward task (Stumpf and Porter, 2012, 666), and validation of earlier power law proposals using increasingly robust and powerful statistical methods is an active line of inquiry across many fields of science (Malevergne et al., 2011). The traditional approach to power law validation, following Pareto's (1897) work on wealth distribution, was visual inspection. When visually inspecting a histogram plotted on log-log scales, a straight line would suggest the presence of a power law. The defining shape parameter (see below), α, could then be obtained by calculating the slope of the straight line using standard linear regression (Clauset et al., 2009, 665; Urzúa, 2011, 254), and the R^2 statistic could give an indication of the goodness of fit of the model. However, it has since been demonstrated that this traditional approach can be systematically unreliable (Clauset et al., 2009, 665). A key assumption when estimating the standard error of the slope is that the noise in the data is normally distributed; however, this is not the case for the logarithms of frequency data (Clauset et al., 2009, 691). Further, the R^2 statistic commonly used to validate the presence of a power law (including by Tambovtsev and Martindale, 2007) has low statistical power. That is, it often fails to distinguish between data truly drawn from a power law distribution and data drawn from other distribution types. This unreliability becomes particularly acute when there is a small number of observations, since the ability to distinguish a power law distribution from other similar distributions, including the log-normal, using R^2 is reduced (Clauset et al., 2009, 691).
To remedy the shortcomings of earlier methods, Clauset et al. (2009) developed power law validation procedures within a more rigorous, maximum likelihood framework. These procedures have since been adopted widely in the literature (for example, Touboul and Destexhe, 2010; Cho et al., 2011; Brzezinski, 2014; and Lee and Kim, 2018), but have not previously been applied to phoneme frequencies. Some brief mathematical preliminaries will be useful here. When we refer to distributions, we are referring to mathematically defined functions that relate one quantity to another. Those functions may in addition have free parameters which can be varied in order to produce a family of closely related distributions. A power law is a relationship between two quantities where one varies as a fixed power of the other, for example y = x^3, or y = x^−2 (which can also be written y = 1/x^2). For present purposes, where we will not be concerned with negative quantities or zeroes, we will use a more narrow definition by Clauset et al. (2009, 662), who define a power law as a relationship in which a quantity, x, is drawn from the distribution

p(x) ∝ x^−α    (1)

where the free parameter α is greater than zero and the variable x likewise is greater than zero. (The symbol '∝' means 'is proportional to'.) For example, x might denote items' frequencies, while p(x) is the probability that a given item has a frequency of x.
In practice, Clauset et al. (2009, 662) observe that the exponent (or 'scaling parameter'), α, typically, though not exclusively, falls in the range 2 < α < 3. They also observe that, in practice, many phenomena will not actually obey a power law for all values of x. Rather, the power law will apply to values only above some minimum threshold value, x_min. For example, in frequency data, it may be that only items whose frequencies meet or exceed a lower threshold will follow a power law. More generally, power law distributions come in a variety of specific forms, with different numbers of free parameters. We detail some of these in greater depth in the Discussion section below.
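For concreteness, and following Clauset et al. (2009) rather than any formula stated elsewhere in this paper, the normalized discrete form of such a model is p(x) = x^−α / ζ(α, x_min) for x ≥ x_min, where ζ(α, x_min) = Σ_{n=0}^{∞} (n + x_min)^−α is the Hurwitz zeta function, serving as the normalizing constant.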
A distinction can be made between power laws that apply to continuous variables and those that apply to discrete ones. Frequency data, including the phoneme frequencies used in this study, are typically discrete. Zipf's Law (2) applies to a discrete set of n observations whose values, x, are ranked by descending magnitude x_1 ≥ x_2 ≥ ... ≥ x_n. For example, x may be the token frequency of n types, with x_k the frequency of the kth-ranked type. In (2), the quantity p(x_k) is the relative frequency of the kth-ranked type (i.e., its frequency scaled such that the n relative frequencies sum to 1):

p(x_k) ∝ 1/k^α    (2)

There is a long history of studying power laws in linguistics; however, the evaluation of statistical support for a power law relationship is far from straightforward and remains topical across a wide range of scientific fields (Stumpf and Porter, 2012). Although several different models have been compared for their goodness-of-fit to the frequencies of phonological segments (Martindale et al., 1996; Tambovtsev and Martindale, 2007), the method used to measure fit (using the R^2 statistic) has been shown to be systematically unreliable (Clauset et al., 2009). Accordingly, the methodological limitations of previous studies and the renewed, general scientific interest in power law phenomena motivate the re-evaluation of a power law model with respect to phoneme frequencies. Our goal here is to verify the presence or otherwise of power law behaviour in the frequency distributions of phonological segments in the lexicons of Australian languages. It is, to the best of our knowledge, the first attempt to validate a power law model for phonological segments using a maximum likelihood framework as suggested by Clauset et al. (2009).

Data
As our data, we take phoneme frequencies in the lexicons of 168 language varieties of Australia. Readers familiar with Australian phonologies may at first find this a curious choice, since Australian languages are known to have similar phonemic inventories across the continent (Capell, 1956; Dixon, 1980; Busby, 1982; Hamilton, 1996; Baker, 2014; Round, 2019, 2020). We prefer to see Australia as an ideal controlled experiment. The phoneme inventories per se may be similar, but the phonemes themselves exhibit considerable variation in their frequency distributions (Gasser and Bowern, 2014; Macklin-Cordes and Round, 2015). Likewise, phonemic bigram frequencies in the large Pama-Nyungan family exhibit diversity with a strong phylogenetic signal (Macklin-Cordes et al.), suggesting that variations in Australian phonological frequencies have evolved over a deep time span.
The phonemic frequencies in this study are extracted from wordlists. Consequently, a difference between our test data and that of earlier work is that, because we extract frequencies of phonological segments from lexicons, each unique word is weighted equally (since each word appears once in a language's wordlist), whereas in text corpora, the frequencies of different words can differ considerably. This difference is meaningful, since in the latter case phoneme frequencies and word frequencies will not be independent of one another. Phonemes in very high frequency words will have their frequencies boosted due to word frequencies, and consequently some degree of Zipfian-like skewing may appear in the distribution of phonological segments due to the confounding contribution of word frequency. Our method effectively controls for such effects and thus removes a potential confound from the phoneme frequency data.
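As a minimal sketch of this weighting scheme (not the actual Ausphon-Lexicon pipeline), suppose each wordlist entry has already been phonemicized and stored as a space-delimited string of segment symbols; relative segment frequencies can then be computed with every word contributing exactly once, for instance in R:

    # Hypothetical, pre-segmented wordlist: one entry per lexeme, segments space-delimited
    wordlist <- c("p a m a", "k u t j a r a", "w a l m a t j a r i")

    segments <- unlist(strsplit(wordlist, " "))            # pool segment tokens across words
    freq     <- sort(table(segments), decreasing = TRUE)   # token counts per segment type
    rel_freq <- freq / sum(freq)                           # relative frequencies summing to 1
    rel_freq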
Our data comes from the Ausphon-Lexicon database, under development by the second author (Author, 2017). Ausphon-Lexicon extends the Chirila resources for Australian languages (Bowern, 2016). It adds additional varieties and applies extensive data scrubbing, manual and automatic error-checking, and phonemic conversion using language-specific orthography profiles (Moran and Cysouw, 2018). A standing challenge for typological phonemic research is the long-recognized fact that phonemic analysis itself is non-deterministic (Chao, 1934; Hockett, 1963; Hyman, 2008; Dresher, 2009). Presented with identical sets of language data, two linguists may produce differing phonological analyses, not due to any error on the part of the linguist but due to differing applications of the multitude of criteria by which decisions are made during the analysis of a phonemic system. As a consequence, cross-linguistic phonological variation can be attributed not only to language facts, but also to variation in linguistic practice. In cross-linguistic research, it is desirable for information to be represented in a comparable way throughout a dataset, and so recent phonological literature has emphasized the value of normalizing source descriptions prior to cross-linguistic analysis (Lass, 1984; Hyman, 2008; Hulst, 2017; Round, 2017; Kiparsky, 2018). Phonemic representations in Ausphon-Lexicon are normalized in this sense. Section S2, Supplementary Materials, details the normalizations applied, together with bibliographic details of original data sources.
To illustrate an example of the phoneme frequencies in our sample, Figure 1 plots the frequencies of phonological segments in the Walmajarri lexicon (Hudson and Richards, 1993). Equivalent plots for every language in our sample can be viewed through an interactive visualization app that we provide in S4 of the Supplementary Materials.
Phonological frequency data differs in some respects from the data types most commonly encountered in scientific power law studies, such as word frequencies or city populations. Typically, in order to understand a population (and some property of it), such as the cities in the United States (and their sizes), or the words of English (and their frequencies), it is impractical to examine every last member of the population, and so the study will examine a sample. Ensuring that a sample is of sufficient size is an important consideration, firstly in order to adequately represent the population and additionally because a sufficiently large sample size is an important requirement in maximum likelihood estimation (Barndorff-Nielsen and Cox, 1994; Newman, 2005). In contrast, the phonemic inventory of any language is relatively small, and it is entirely feasible to examine exhaustive populations of phonemes. An advantage of this is that the sample is highly representative of the population, but a disadvantage is that the number of observations is small and cannot be increased.

Figure 1. Frequency of phonemes in the Walmajarri lexicon (Hudson and Richards, 1993). Plot (A) displays relative frequencies of each segment type. Plot (B) shows the same frequencies on log-transformed x and y axes, the traditional visual device used to identify power laws.
Given a sample of phonemes, we require an estimate, or measurement, of their frequencies. Measurement error is a potential concern in this study. Our segment frequencies are calculated from documented wordlists, which necessarily are limited representations of the complete vocabulary of the languages that the wordlists represent. One concern is that the particular morphology of a language's citation forms may cause certain segments in the language to be overrepresented in a wordlist which contains only citation forms. This would represent a bias, that is, a factor that pushes observations in a certain direction. We have attempted to control for this by removing identifiable citation-form tense morphology from verbal words and noun-class prefixes from nominals. Another source of concern is that wordlists with a smaller number of words will necessarily entail a greater level of uncertainty in the observed segment frequencies. This will be a source of noise in the data. It does not push observed frequencies in any particular direction, but makes them generally less accurate. To address this, in our study, we restrict the language sample to language varieties with a minimum wordlist size of 250 lexical items. We selected 250 lexical items as a cut-off on the basis of Dockum and Bowern (2019), who investigate the effect of wordlist size on phonological segment frequencies. Dockum and Bowern (2019) report accelerating losses in the fidelity of segment frequency estimates as a wordlist drops below 250 items. While more words will always yield better frequency estimates, we select a minimum of 250 as a reasonable compromise. This gives us a sample of 168 Australian language varieties. Wordlist sizes range from 268 to 8742 (median 1072, mean 1438).

Statistical framework
We test for the presence or absence of a power law in the distributions of phonological segments following the maximum likelihood framework described by Clauset et al. (2009). In brief, Clauset et al.'s (2009, 663) proposed procedure consists of three steps:

1. Estimate the parameters x_min and α of the power law model using the maximum likelihood method (Barndorff-Nielsen and Cox, 1994; Newman, 2005).

2. Calculate the goodness-of-fit between the data and the power law using the Kolmogorov-Smirnov (KS) statistic, where a larger value corresponds to a worse fit. Using a Monte Carlo procedure, a bootstrapped p value is calculated and used to evaluate the plausibility of the power law. Namely, if this p value falls below a plausibility threshold of 0.1, the power law model is rejected. Otherwise, the power law model remains an initially plausible hypothesis, and we proceed to step 3.
3. Compare the power law model with a set of models representing alternative hypotheses. For each alternative model, a bootstrapped p value is calculated as in steps 1 and 2 above. A likelihood ratio test is performed, comparing the fit of the alternatives with that of the power law model. If the calculated likelihood ratio is significantly different from zero, this indicates a significant difference in plausibility, and its sign (positive or negative) indicates which model is favoured (Clauset et al., 2009, 680).
We use the poweRlaw package (Gillespie, 2014) in R statistical software (R Core Team, 2017) to infer all maximum likelihood estimates and conduct bootstrapping to derive p values. We run 10,000 bootstrap iterations per language, per distribution type. As a brief point of comparison to prior work, we return to the Walmajarri example and plot the linear relationship between phoneme frequencies and rank on a log-log plot. Tambovtsev and Martindale (2007) find that a Zipfian distribution consistently overestimates the frequency of both high- and low-ranking segments while underestimating the frequency of those in the middle. The dashed black slope on Figure 2 shows a similar pattern. However, when the five lowest-frequency segments (i.e., those with the greatest statistical rank) are removed from the equation, the linear model fits much better (solid blue line). This is consistent with the observation by Clauset et al. (2009) that, in practice, power laws are rarely observed across the whole distribution; rather, there is a threshold, the x_min parameter, below which the power law ceases to apply. Visual inspection of other languages in the dataset indicates that Walmajarri's pattern of phoneme frequencies is common, although there is a good deal of variation (and, consequently, variation in the fit of a linear model). However, given the known limitations of applying a linear model to a log-log plot, we now turn to more reliable methods for validating the presence of a power law, using the maximum likelihood method outlined above.
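In outline, and leaving aside the per-language batching and our full 10,000 iterations, the first two steps of this procedure can be run with poweRlaw roughly as follows; the vector of segment counts here is purely illustrative, not taken from our data:

    library(poweRlaw)

    # Hypothetical segment token counts for one language, in descending order
    counts <- c(950, 720, 640, 510, 400, 330, 260, 190, 150, 120,
                90, 70, 55, 40, 28, 20, 14, 9, 6, 4, 3, 2, 1, 1)

    m_pl <- displ$new(counts)        # discrete power law model
    est  <- estimate_xmin(m_pl)      # step 1: ML estimates of x_min and alpha
    m_pl$setXmin(est)

    bs <- bootstrap_p(m_pl, no_of_sims = 1000, threads = 2)   # step 2
    bs$p                             # power law rejected as implausible if p <= 0.1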

RESULTS
We firstly infer the fit of a power law to the full distribution of phoneme frequencies for each language, without estimating an x_min parameter. In Table 1 we summarize the maximum likelihood estimates of the power law distribution's defining shape parameter, α, the goodness-of-fit of the estimated power law distribution to the observed distribution of phoneme frequencies, and bootstrapped p values for the null hypothesis that the data are plausibly drawn from a power law distribution.
Mean α is 1.38 (SD 0.16). As discussed earlier, the standard range of α is 2 < α < 3 (Clauset et al., 2009, 662). α falls within this range for only 1 language. Furthermore, p values are very low. Just 2 of the 168 languages give a p value above the plausibility threshold.
Throughout this study, the possibility of type I error (false positives) must be taken into consideration. By setting our implausibility range at p ≤ 0.1, we accept a one in ten chance of incorrectly rejecting a power law hypothesis which in fact is plausible; this can occur when the distribution's poor fit is due to chance fluctuation alone. Given 168 tests (one test per language), we would therefore expect to reject H0 incorrectly in around 17 (10%) of those tests. In this instance, though, we have rejected H0 as implausible in 99% of the language sample. Thus it is clear that the power law distribution is not being deemed implausible just by chance. It is genuinely a poor fit for the vast majority of languages. This result accords well with earlier work which has found that a simple, one-parameter form of the power law distribution poorly characterizes phoneme frequencies (Sigurd, 1968; Martindale et al., 1996; Tambovtsev and Martindale, 2007).
As discussed earlier, our dataset of phoneme frequencies is very likely to contain the complete population of phonemes in each language. At the same time, the number of observations per language is low, ranging from 16 to 34 segments in our language sample (mean 24.5, SD 3.7). Such a small set of observations can be a barrier to highly accurate maximum likelihood estimation. Clauset et al. (2009, 669) suggest that a minimum sample size of around 50 is needed to get a maximum likelihood estimate of α accurate to at least 1%. This is simply not possible for most of the world's languages (including all languages in this study) due to the limited size of segment inventories. Thus, in phonemic studies such as ours there is likely to be an unavoidable uncertainty in the estimate of α.

Table 2. Power law distribution (with x_min). Summary of the α parameter, goodness-of-fit and p values for the power law distribution fitted to a subset of more frequent phonemes in each language.

                     Mean   SD     Min    Max
    α                2.72   0.59   1.80   5.74
    goodness-of-fit  0.14   0.03   0.08   0.23
    p                0.61   0.27   0.03   0.99

Power law distribution with x_min
If the power law distribution, as inferred above, is inadequate for characterizing phoneme frequencies, then what other options are there? There are a couple of approaches to this question. One is to add an additional parameter to improve the fit of the power law; the other is to consider alternative distribution types. In this and the following sections we explore both approaches.
Here, we infer the fit of a power law distribution with an additional x_min parameter, whose effect is to remove some of the least-frequent observations from the sample which is being fitted. As above, we use maximum likelihood to infer the best-fitting x_min threshold for each language. Results are summarized in Table 2.
After inferring an x_min parameter, the power law distribution is fitted to an average of only 13.9 segments per language, though there is a wide degree of variation (SD 3.8). In percentage terms, the power law distribution is fitted to an average of 57% of a language's segmental inventory (SD 15%). 124 languages (74%) fall within the normal 2-3 range for α. Having only a small number of included observations above the x_min threshold can drive unreasonably high estimates of the α scaling parameter, and a sizeable portion of our sample (39 languages, 23%) falls in this high range, with α above 3. At the other extreme, 5 languages (3%) have an unusually low α, under 2. Mean α is 2.72 (SD 0.59).
When x_min is included, the power law hypothesis is accepted as plausible (though, to emphasize, not necessarily correct) in the 159 of 168 language varieties for which p > 0.1. p falls below the 0.1 plausibility threshold in the remaining 9 languages. The lowest p value for any language is 0.03. This puts the chance of incorrectly rejecting H0 at around one in thirty, which is still high in a set of 168 tests. Overall, since the number of p values below 0.1 is considerably fewer than the number we would expect to observe through chance, and since there is a high chance that the lowest p value, 0.03, is a type I error, we cannot confidently rule out the power law hypothesis for any language in our sample.
Although we have failed to rule out the power law distribution as implausible for any specific language, this still does not mean that the power law distribution is the correct one for our data, and there are some important caveats to our results so far.
A distribution will always fit a set of data at least as well as the same distribution with one less parameter. Thus, the observation that the power law distribution fits better when x_min is added requires some interpretation. Of greatest interest in this respect is the striking degree of improvement in fit, such that the power law distribution shifts from a largely implausible fit against full phoneme inventories, to a largely plausible fit after we exclude the least-frequent observations from samples. This raises the obvious question of why this might be so. We consider this in our Discussion, after we have also examined distributional alternatives to power laws.
The inclusion of an x_min parameter when fitting power laws is common practice, but its use is most obviously motivated in contexts where there are very many possible observations. For example, Clauset et al. (2009, 684) fit a power law to frequencies of unique words in Moby Dick and find a best-fitting x_min of 7 (±2). Words occurring fewer than 7 times can be disregarded and this still leaves nearly 3,000 unique words to which the power law distribution can be fitted. In contrast to this typical use case, where a large number of observations remain in play and do fit the power law, our use of x_min with phoneme datasets results in the exclusion of data points from an already small sample, leaving an even smaller set of data being fitted. As a general fact, it is inherently difficult to identify the most appropriate distribution for a small collection of observations. Correspondingly, it is not automatically an insightful finding that a power law can be plausibly fitted to such small datasets. However, as mentioned just above, it is noteworthy that the same power law did not fit well to the slightly larger datasets that were being used without the x_min parameter. This suggests that it is not merely the small size of the dataset which is causing the good plausibility of the fit.
p values can be inflated when the sample of observations is small, as it is when investigating phonemes. We have good reason to suspect our p values are being inflated by the low number of observations per language, the evidence being that the number of p values we observe below 0.1 is considerably fewer than we would expect by chance. The difficulty we find in ruling out the power law distribution may reflect this.

Alternative distributions
In addition to considering the merits of adding extra parameters to a distribution, we must also consider whether a completely different distribution would provide an equal or better fit to the data. We consider three alternative distributions, which are not part of the power law family and may suggest different underlying generative processes. These are the lognormal, exponential and Poisson distributions. Like the power law distribution, these distributions can have a sharp initial peak and a rapidly decaying tail.
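Staying with the illustrative counts vector introduced above, the corresponding discrete implementations of these alternatives in poweRlaw can be fitted in the same way; the following is a sketch of the pattern repeated per language, not the full bootstrap run reported below:

    m_ln  <- dislnorm$new(counts)    # discrete lognormal
    m_exp <- disexp$new(counts)      # discrete exponential (geometric)
    m_poi <- dispois$new(counts)     # discrete Poisson

    # Fit each to the full inventory, i.e. with x_min fixed at the smallest observed value
    for (m in list(m_ln, m_exp, m_poi)) {
      m$setXmin(min(counts))
      m$setPars(estimate_pars(m)$pars)
    }

    # Plausibility of, for example, the lognormal fit, as in step 2 above
    bootstrap_p(m_ln, xmins = min(counts), no_of_sims = 1000)$p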

Lognormal distribution
The lognormal distribution is one where the data form a normal distribution when transformed on a log scale. Once again, we use the poweRlaw package (Gillespie, 2014) to estimate parameter values using maximum likelihood. In this instance, the parameters to be estimated are the log mean and log standard deviation, the log-scale equivalents of the two parameters that define a normal distribution. We fit the distribution to the whole set of segment frequencies for each language; we do not estimate an x_min parameter at this stage (though see below). The lognormal distribution narrowly construed is a continuous distribution; however, the poweRlaw package contains a corresponding discretized version, appropriate to phoneme frequency data.
As for the power laws above, we calculate bootstrapped p values to assess the plausibility of the fit of the lognormal distribution for each language. The p values obtained are highly variable throughout the dataset. There are 74 languages (44% of the language sample) for which p falls in the range of implausibility, below 0.1. This is over twice as many as we would expect if the lognormal distribution were plausible for all languages and p ≤ 0.1 values were due to type I error alone. This result is a little difficult to interpret, given the previously discussed difficulties with small samples of observations per language. What seems clear is that, given that the rate of p ≤ 0.1 values is elevated beyond chance, we cannot say that the lognormal distribution plausibly characterizes the segment frequencies of all languages. Nevertheless, for many languages (56% of our sample) we cannot confidently rule out the lognormal distribution. Overall, this makes the lognormal distribution with no x_min a better fit than the power law distribution with no x_min, which we ruled out for up to 99% of languages in the sample. One caveat to keep in mind is that the lognormal distribution is minimally defined by two parameters rather than one, which potentially puts it at an advantage compared to the single-parameter power law distribution.

Exponential distribution
An exponential distribution, and its discrete analogue, a geometric distribution, is one in which frequencies decay at a constant proportional rate. Thus, for the frequencies of any two successively ranked phonemes, x_k and x_(k+1), the distribution is characterized by a rate parameter, λ, so that x_(k+1) = λ·x_k. Here, as above, we use maximum likelihood to estimate the parameter λ and the bootstrapping procedure to obtain a p value.
Bootstrapped p values are above the 0.1 plausibility threshold for 148 of 168 languages. The number of languages for which p ≤ 0.1 is 20, close to the 17 or so that we would expect from type I errors. This, on the face of it, seems to make the exponential distribution quite a plausible model for phonological segment frequencies more generally. It must be noted, however, that there are a few languages for which the exponential distribution is a very poor fit. The most extreme, Warlmanpa, has a goodness-of-fit statistic greater than 0.25 and a p value of just 0.002. The poor quality of fit is visually evident on a log-log plot (see S4, Supplementary Materials).

Poisson distribution
The final distribution we consider is the Poisson distribution, which is related to the exponential distribution. The Poisson distribution is typically used to model the frequency of an event within some interval of time or space. Our case is a bit different, since we are modelling the relationship between the frequencies of many different events (different phonological segments) and their frequency rank in a language's phonological inventory. As with the exponential distribution, we use maximum likelihood to estimate a single parameter, λ, and use bootstrapping to obtain a p value for the plausibility of the distribution.
The Poisson distribution is totally implausible for all languages in our language sample. Goodness-of-fit statistics range from 0.45 to 0.75 (mean 0.59, SD 0.07). We find p values indistinguishable from 0 in all cases.
Table 3. Results summary. For each of the four distributions considered, this table lists the number of languages (and percentage of the total language sample) for which the distribution plausibly fits, as indicated by an uncorrected p > 0.1 value. 'Prop. fitted' gives the average proportion of each language's phoneme inventory above x_min.

Summary of results by individual distribution type
In Table 3, we summarize results for the four distribution types evaluated in this study. For each distribution type, we give the number of languages for which the distribution's fit was deemed plausible (p > 0.1). For completeness, we give results for the exponential, lognormal and Poisson distributions when x_min is included, just as we did for the power law distribution. (Note: for one language, the bootstrapped p value estimation procedure failed to converge for the lognormal distribution with x_min. This is the only distribution we tested which has three free parameters, and in this instance, the algorithmic procedure struggles to differentiate solutions with very similar likelihoods.) Perhaps most noteworthy among these results is the greatly increased inconclusiveness of the method when applied to the reduced set of data points lying above the x_min threshold. When the fitting task is restricted to a subset of only the most frequent segments in a language, it is possible to plausibly fit all but the Poisson distribution to any language, after type I error is factored in. One difference, which we nuance further in the next section, is that power law distributions with x_min are fitted on average to only 57% of a language's phonemes, whereas the lognormal and exponential distributions are fitted to closer to 80%.

Evaluation of comparative best fit
The third and final step in Clauset et al.'s (2009) framework is a likelihood ratio test. This third step may not always be necessary. If the bootstrapping procedure, above, were to show that only one distribution type plausibly fits the data, it would already have been shown to have the best fit among the candidates examined. However, bootstrapping may identify multiple distribution types as plausible. It will be recalled that just because a distribution is judged plausible via the bootstrapping process, this does not mean that it is the correct one, since there may be other equally or more plausible distributions. Accordingly, when there are multiple plausible candidate distributions, Clauset et al. (2009) recommend using Vuong's (1989) likelihood ratio test for model selection, to determine the best-fitting of any pair of competing models. Full results of all likelihood ratio tests described in this section are tabled in Section S3 of the Supplementary Materials. Vuong's (1989) test uses the Kullback-Leibler Information Criterion (Kullback and Leibler, 1951) to calculate the log likelihood of observing the data given a distribution model, and compares this to the log likelihood of observing the same data given a competing distribution model. The test returns a test statistic, which gives an indication of how strongly one model is favoured over another, and a p value, indicating whether the difference in the support for each model is statistically significant.
We begin by comparing distributions without the x_min parameter. As summarized in Table 3, two of these distributions (the power law and Poisson distributions, without x_min) have already been rejected as implausible for all or nearly all languages. Accordingly, we conduct just one likelihood ratio test per language, comparing the fit of the exponential versus lognormal distributions. Overall, we find that Vuong's likelihood ratio test somewhat favours the exponential distribution. Likelihood ratios favour the exponential distribution for 119 languages, and the lognormal distribution for 49 languages. However, after Bonferroni correction, the difference in the likelihood of exponential and lognormal models is statistically significant for only two languages, Thaynakwithi and Linngithigh, both of the Northern Paman subgroup of Pama-Nyungan, both favouring the exponential distribution.
Turning to distributions with the x_min parameter, since we have already rejected the Poisson distribution, we conduct likelihood ratio tests pairwise among the remaining three distributions. In order to compare distributions with x_min parameters, it is necessary to set x_min to the same value in both distributions (Gillespie, 2014). Thus, to make a pairwise comparison, we take the x_min value from distribution A and, using it, re-estimate the other parameters of distribution B, and conduct one likelihood ratio test. Then we take x_min from B, use it to re-estimate the other parameters of distribution A, and conduct a second likelihood ratio test, giving two results for each pair of distributions.
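As a sketch of one such pairwise comparison, again using the illustrative objects from the code above rather than our per-language results, poweRlaw's compare_distributions() implements Vuong's test once both models share the same x_min:

    # Re-fit the lognormal using the power law's estimated x_min
    m_ln2 <- dislnorm$new(counts)
    m_ln2$setXmin(m_pl$getXmin())
    m_ln2$setPars(estimate_pars(m_ln2)$pars)

    comp <- compare_distributions(m_pl, m_ln2)
    comp$test_statistic    # sign indicates which of the two models is favoured
    comp$p_two_sided       # significance of the difference in fit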
Comparing the exponential and lognormal distributions, the likelihood ratios favour the lognormal distribution (141 languages to 27) using x_min from the lognormal fit, and favour the exponential distribution (104 languages to 64) using x_min from the exponential fit; however, none of these comparisons reaches significance after Bonferroni correction.
Comparing the power law and lognormal distributions, likelihood ratios favour the lognormal distribution (145 languages to 23) using x_min from the power law fit, and for all languages when using x_min from the lognormal fit; however, only two of these comparisons reach significance after Bonferroni correction. Yir Yoront favours the power law when using x_min from the power law fit, and Malyangapa favours the lognormal distribution when using x_min from the lognormal fit.
Comparing the power law and exponential distributions, the likelihood ratios favour the power law (138 languages to 30) when taking x_min from the power law fit, though no comparison reaches significance. They favour the exponential distribution, 165 languages to 3, when x_min is taken from the exponential fit. Thirteen of those comparisons reach significance.
In sum, we found earlier that when parameterized without x_min, only the exponential and lognormal distributions were broadly plausible. Vuong's likelihood ratio test marginally favours the exponential distribution over the lognormal when fitted against entire phonemic inventories, but the difference is at most slight. When parameterized with x_min, the power law distribution is fitted to around 60% of languages' phonemes on average, while the exponential and lognormal are fitted to around 80% (Table 3). Pairwise likelihood ratio tests, which apply one distribution's x_min parameter to the other, provide slender evidence of the following. Even when fitted against the small phonemic subsets favoured by the power law, the lognormal distribution may weakly outperform the power law, but the exponential distribution does not. Fitted against the larger subsets favoured by the exponential and lognormal distributions, the power law is outperformed by the exponential and lognormal. The performance of the latter two distributions is indistinguishable.

DISCUSSION
Here we contextualize the current study more fully in the history of research into power laws and phoneme frequencies, before drawing implications and conclusions from the new findings we have obtained.

Power laws in linguistics
Investigation of power laws in the linguistic sphere has a long history. One of the oldest and best-known examples of a power law in any discipline is the distribution of word frequencies in text corpora, first noted by Estoup (1916) and subsequently described by Zipf (1932, 1949). Zipf's Law, as it has come to be known, is a discrete power law distribution. Its exponent parameter, α, is typically very close to 1, in which case the second ranked item will be approximately half as frequent as the first, the third ranked item will be one third as frequent as the first, and so on. Zipf's Law continues to garner considerable attention, for example in Kucera et al. (1967), Montemurro (2001), and more recently in Baayen (2001, 2008). Various modifications to Zipf's formula have been suggested (notably Mandelbrot, 1954) and theoretical explanations put forward (Li, 1992; Naranan and Balasubrahmanyan, 1992, 1998).
Power laws have also been proposed to describe the distribution of phoneme frequencies. The use of Zipf's Law to model the frequencies of phonological segments initially appears to be an attractive prospect (Witten and Bell, 1990, 565-566). Nevertheless, a selection of alternative, non-power law distributions has also been suggested. Sigurd (1968) is an early study evaluating the fit of a Zipfian distribution to phoneme frequencies, where the exponent, α, is set to 1. His evaluation method is a simple visual inspection, comparing observed phoneme frequencies in five languages (selected for their variety in segmental inventory size) with their expected frequencies assuming a Zipfian rank-frequency relationship. Sigurd (1968, 8) observes that the phoneme frequency distributions do not approximate a Zipfian curve, particularly for the most common segments. Rather, Sigurd (1968) finds better approximations using a geometric series equation, so that x_(k+1) = λ·x_k for some rate parameter λ, giving the discrete distribution:

p(x_k) ∝ λ^k    (3)

Good (1969, 577) suggests an alternative method of approximation: following Whitworth (1901), Good calculates the expected frequencies of each phoneme given a process whereby a unit interval probability space [0, 1] is divided into n parts at random (where n is the number of phonemes in the language), following a uniform distribution. This is equivalent to a so-called stick-breaking process: imagine a stick, which represents the unit interval probability space. The stick is broken into n parts; the n − 1 places along the stick at which a break is made are selected randomly and all at once, with any place along the stick equally likely to be selected as any other. When these parts are rearranged by size, from smallest to largest, the expectation of the kth-smallest part follows the distribution:

E(x_(k)) = (1/n)(1/n + 1/(n−1) + ... + 1/(n−k+1))    (4)

Which is to say, for the kth-ranked phoneme by descending frequency:

p(x_k) = (1/n) Σ_{i=k}^{n} 1/i    (5)

In support of this model, Good (1969, 577) provides a table of observed versus expected frequencies of both graphemes and phonemes in English; however, the sample size is modest (1000 words) and does not extend to any other languages. Furthermore, there is no visual or statistical evaluation of the goodness-of-fit. Good (1969) intends for the results to be taken as a curious observation only, with no strong theoretical position or claim of generalisability.
As n in equation (5) grows large, the summation term Σ_{i=k}^{n} 1/i approaches ln(n/k) (… and Rota, 1989, 12), meaning that Good's distribution can be considered an approximation of (6), a discretized negative logarithmic distribution:

p(x_k) ∝ −ln(k/n)    (6)
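As an informal check (ours, not Good's), the stick-breaking expectation in (5) is straightforward to verify by simulation in R:

    set.seed(1)
    n      <- 24                     # number of phonemes
    n_sims <- 10000

    sim <- replicate(n_sims, {
      breaks <- sort(runif(n - 1))                     # n-1 uniform break points on [0, 1]
      sort(diff(c(0, breaks, 1)), decreasing = TRUE)   # the n parts, largest first
    })

    rowMeans(sim)                               # simulated expectation at each rank k
    (1 / n) * rev(cumsum(rev(1 / (1:n))))       # Good's formula: (1/n) * sum_{i=k}^{n} 1/i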
Gusein-Zade (1988) and Borodovsky and Gusein-Zade (1989) visually evaluate the fit of (6) to the graphemes of English, Estonian, Russian and Spanish. (Of course, the statistics of graphemes are different from the statistics of phonological segments. As Bloomfield (1935, 136-137) rather emphatically points out: "If we take a large body of speech, we can count out the relative frequencies of phonemes and of combinations of phonemes. This task has been neglected by linguists and very imperfectly performed by amateurs, who confuse phonemes with printed letters." Nevertheless, the frequencies of graphemes have been of interest historically in many applications; for example, in traditional printing, the development of Morse code, and library cataloguing (Witten and Bell, 1990, 550-551).) The same authors also use the equation to describe the distribution of DNA codons (Borodovsky and Gusein-Zade, 1989). Witten and Bell (1990, 563-566) examine the frequencies of single graphemes, graphemic bigrams and trigrams in the Brown Corpus and compare the fits of Good's distribution and Zipf's Law by comparing expected entropy values for each model to observed entropy scores. They find that the quality of the fit of Good's model declines with bigrams and trigrams compared to single graphemes, although the observed distribution curves are broadly of the same shape (and resemble the shape of Good's distribution rather than the Zipfian distribution). When assessed using metrics based on entropy, Good's distribution fits better than or around equally as well as the Zipfian distribution for all three datasets. Good's distribution and the negative logarithmic distribution it approximates also have the advantage of parsimony, since they are parameter-free: knowing how many unique items (phonemes, graphemes, bigrams, etc.), n, are in the dataset is sufficient to calculate their expected distribution of frequencies; there are no additional parameters to estimate, such as α in (2) or λ in (3).

Martindale et al. (1996) compare the fit of four different distributions to frequencies of both graphemes and phonemes in text corpora from 18 languages. Using the R^2 statistic in a linear regression, they compare the fit of the parameter-free negative logarithmic equation of Borodovsky and Gusein-Zade (1989) in (6) to the Zipfian power law distribution (2), Sigurd's geometric series distribution (3), and the Yule-Simon distribution (Yule, 1925; Simon, 1955), which can be written:

p(x_k) ∝ λ^k / k^α    (7)

The Yule-Simon equation in (7) is the product of the power law in (2) and the geometric equation in (3). Because of the differing rates at which the two parts of the equation decay as k increases, equation (7) produces a distribution which is more like a power law (2) for low values of k (and thus for high-frequency items, for instance) and more like the geometric (3) for high values of k (low-frequency items) (Simon, 1955).
The Yule-Simon equation in (7) has not just one free parameter but two, the exponent α and the rate λ, and the Zipfian and Sigurd equations are effectively special cases of it, each with one parameter fewer. The Zipfian distribution is equivalent to (7) with λ set to 1 (so that λ^k = 1), while the geometric equation is equivalent to (7) with α set to zero (so that 1/k^α = 1). This is important, since as a general fact, if distribution A is a special case of distribution B, with fewer free parameters than it, then B will always perform at least as well as A when fitting the same set of data. Thus, the Yule-Simon distribution will necessarily fit the same set of data at least as well as the Zipfian distribution and Sigurd's geometric distribution. Martindale et al. (1996) find that the Yule-Simon distribution fits best, for both graphemes and phonemes. They find that the Zipfian distribution tends to overestimate both high- and low-frequency items, although the differences they observe between models are only small. On this basis, they conclude that it is "a matter of taste" whether one opts for the more precise Yule-Simon distribution or simpler models with fewer parameters to estimate (Martindale et al., 1996, 111). Tambovtsev and Martindale (2007) expand Martindale et al.'s (1996) study to include phoneme frequencies in 95 languages (90 of these are Eurasian; 2 are from Oceania and 1 each from Australia, Africa and South America). The sample is divided into four language groups (Indo-European, Altaic and Yukaghir-Uralic, plus a miscellaneous group) and a series of pairwise sign tests are conducted to test whether the difference in mean R^2 is significant between different distributions for each language group. Again, they find that the Yule-Simon distribution fits best overall.

Obtaining a better fit by using a distribution with an additional parameter may be relatively trivial mathematically speaking, but this does not mean it is uninteresting. The extra parameter may work to capture a significant real-world nuance in an underlying causal process or describe the effect of one or more secondary processes. A compelling causal explanation of a complex distribution might therefore be formulated by identifying some real-world factor and explaining how its mathematical effect on the distribution is expected to match what we find. It is also important to consider the possibility of equifinality, the fact that multiple, different real-world phenomena may have equivalent mathematical effects. Tests of goodness-of-fit examine only the mathematical aspect, and cannot distinguish between different phenomena whose detectable mathematical contribution is equivalent. Martindale et al. (1996, 111) and Tambovtsev and Martindale (2007, 9) note that a Zipfian distribution describes frequencies of phonological segments less well than it describes frequencies of words, in part because the highest-frequency phonemes are not frequent enough. They speculate that this may be so because, if the most-frequent phonemes did pattern in a Zipfian way, then perception problems could arise for language users owing to the small size of a phonological inventory. This speculation does not meet the criteria for a compelling causal explanation, though. It is not clarified what the linguistic mechanism is that acts to prevent such perceptual problems, and thus we do not have a real-world phenomenon whose mathematical properties could be interrogated. Nor is it explained why, if such a mechanism exists, its mathematical effect would be to contribute something like the extra geometric term λ^k that differentiates the Yule-Simon distribution (7) from the Zipfian (2). The Yule-Simon equation, which Martindale et al. (1996) and Tambovtsev and Martindale (2007) find to be a superior fit, describes a distribution which is most similar to a power law for high-frequency (low k) items, and most like the geometric for low-frequency (high k). The claim that its superior fit is due to non-power-law-like behaviour of high-frequency items is therefore hard to reconcile with the mathematics. We turn next to consider this history of investigation together with our new results.

Findings from a more reliable evaluation procedure
Power laws have attracted wide and sustained scientific interest. Recent debates on their validity have prompted the development and widespread adoption of novel evaluation methods that are more reliable than those used in the past (Clauset et al., 2009; Stumpf and Porter, 2012). In this study, we re-evaluated the plausibility of several distribution types as characterizations of phoneme frequencies using a maximum likelihood statistical framework presented by Clauset et al. (2009) and a sample of 168 Australian language varieties.
Using a more reliable evaluation procedure than previous investigations, we have confirmed the finding that a basic power law distribution, with a single free parameter, is generally insufficient for characterizing phoneme frequencies. Additionally, we reconfirm a result going back to Sigurd (1968), that an exponential (or geometric) distribution, with a single free parameter, is a plausible fit for full phonemic inventories. Furthermore, we find that a lognormal distribution, with two free parameters, is an additional plausible fit, whereas a Poisson distribution, with a single free parameter, is implausible. (We did not attempt to fit the two-parameter Yule-Simon distribution in (7), since to our knowledge, there is no maximum likelihood estimation procedure currently available for estimating its parameters. However, we return to the question of this distribution just below.) A second novel contribution was to consider the addition of an x_min parameter, a practice which is now common in power law research. Notably, while power laws are largely implausible fits for entire phoneme inventories, their plausibility is improved strikingly once a subset of the least-frequent phonemes is removed from the sample. This is despite the fact that the full inventories and the reduced ones share the property of comprising notably small samples. The subset removed in order to achieve maximum likelihood is on average large, at 43%. This result indicates that power laws constitute a plausible characterization for the more-frequent portion of phonemic inventories, and explains why the upper end of a Yule-Simon distribution, which most closely approximates a power law, should be a reasonable fit. We note, however, that the lognormal distribution also performs well in this same, high-frequency region of phonemic inventories. Exponential (or geometric) distributions do not fit the higher-frequency portion of inventories as well as the power law or lognormal do, but they are good fits for entire inventories, suggesting that they fit particularly well in lower-frequency portions. This would explain why the lower end of a Yule-Simon distribution, which most closely approximates a geometric distribution, should be a reasonable fit.
Using an evaluation procedure which has since been shown to be unreliable, Martindale et al. (1996) and Tambovtsev and Martindale (2007) concluded that the two-parameter Yule-Simon distribution fit whole inventories better than a power law or a geometric distribution. Implicitly, this is a conclusion in two parts: more-frequent phonemes are more power-law-like, and less-frequent phonemes are more geometric-like. Here we have not been able to directly evaluate the Yule-Simon distribution using the more reliable, maximum likelihood method. However, we have found evidence supporting a similar conclusion, that the more-frequent and less-frequent portions of phonemic inventories are characterized by different distributional properties. The more-frequent portion better matches a power law, though also a lognormal distribution. The less-frequent portion better matches a geometric distribution. These two findings serve to clarify and qualify the two implicit halves of the main finding of Martindale et al. (1996) and Tambovtsev and Martindale (2007), and here we have arrived at them by more reliable methods. Furthermore, by estimating x_min parameters, we have provided some estimate of where power-law-like behaviour starts to cut out within a phonemic inventory. To understand what these results entail for theory, we return to the question of causal processes.

Implications and conclusions
Linguistic theorizing will be aided by a sound knowledge of which distributions plausibly characterize a variable x (such as phoneme frequency), since those distributions will be consistent with only certain mathematical kinds of underlying causal processes. Thus knowledge of distributions helps us by placing an empirical filter upon viable causal explanations. In this paper, we have improved the certainty of our understanding of the distributions of phoneme frequencies, using state-of-the-art statistical methods. By the same token, we should not expect that this one empirical filter will do all the work. In our case, we are not yet able to decide empirically, for example, whether a lognormal or power law distribution better characterizes high-frequency phonemes. However, it may be possible to distinguish between such options on other grounds. For instance, it may be that the causal processes themselves, which generate such distributions, are differentiable in terms of their plausibility, on some basis other than merely the distribution that emerges from them.
It is beyond the scope of this paper to pursue questions of causation that lie behind the distributions we have uncovered. However, a fruitful next step used widely in other sciences is to explicitly consider mathematical families of stochastic processes, using these as a bridge between real-world candidate causal processes and the mathematical implications they have, such as observed empirical distributions. For example, many discrete systems can be profitably conceptualized in terms of urn processes, which have characteristically associated distributions. As Kuba and Panholzer (2012, 87) remark, "[u]rn models are simple, useful mathematical tools for describing many evolutionary processes in diverse fields of application". There exist well-studied urn processes which yield many kinds of distributions, and it will be profitable in linguistic research to more clearly relate our own theories of change, including change in phonemic inventories, to these mathematically more generalized processes. By doing so, linguists will be able to tap into related mathematical results (such as those relating processes to distributions) that can assist us to further differentiate the theories that are more viable from those that are less so.
In this paper we have also demonstrated a template for future work on distributions themselves. Ideally, such work should begin with critical assessment of links that can be made between existing or new causal hypotheses, including diachronic processes, and particular distributional outcomes. Subsequently, the fit of the hypothesized distribution to real-world data should be evaluated rigorously using robust statistical methods. Lastly, an attempt must be made to rule out competing distribution types and alternative generative mechanisms. As we have demonstrated, this may be challenging, given the inherent limitations of working with small sets of observations. The challenges of small datasets should in turn motivate innovation in the kinds of variables that linguists investigate. There may be gains to be made in combating small dataset sizes through methodological innovations around how frequency data is assembled, for example, through paying attention to phonotactic position, natural classes of phonemes, or by creating aggregated datasets for subgroups or language families. In this study we have focused on phoneme frequencies because they have occupied such a prominent place in the history of investigations of distributions. However, by doing so we emphatically do not suggest that phonemes ought to continue to occupy such a prominent place, when more interesting and possibly more tractable phenomena still await investigation.

Figure 2. Log-log plot of frequencies versus frequency ranks in Walmajarri. When a linear model is fitted to the full distribution (dashed black), high- and low-frequency segments are overestimated and mid-rank segments are underestimated. When the lowest-frequency segments are removed from the model (solid blue), the model appears to fit well.

Table 1. Power law (without x_min). Summary of the α parameter, goodness-of-fit and p values for the power law distribution fitted to each language's full phonemic inventory.