Edited by: Dabao Zhang, Purdue University, United States

Reviewed by: Yanzhu Lin, National Institutes of Health (NIH), United States; Jie Yang, University of Illinois at Chicago, United States; Qin Shao, University of Toledo, United States

*Correspondence: Boon Kin Teh

Siew Ann Cheong

This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

An increasingly important problem in the era of Big Data is fitting data to distributions. However, many stop at visually inspecting the fits, or use the coefficient of determination as a measure of the goodness of fit. In general, goodness-of-fit measures do not allow us to tell which of several distributions fits the data best. Also, the likelihood of drawing the data from a distribution can be low even when the fit is good. To overcome these limitations, Clauset et al. advocated a three-step procedure for fitting any distribution: (i) estimate the parameter(s) accurately, (ii) choose and compute an appropriate goodness-of-fit measure, and (iii) test its significance to determine how likely this goodness of fit is to appear in samples drawn from the distribution. When we perform this significance testing on exponential distributions, we often obtain low significance values despite the fits being visually good. This led to our realization that most fitting methods do not account for the effects of the finite number of elements and the finite largest element. The former produces a sample-size dependence in the goodness of fit, while the latter introduces a bias into the estimated parameter and the goodness of fit. We propose modifications to account for both, and show that these corrections improve the significance of the fits of both real and simulated data. In addition, we used simulations and analytical approximations to verify that the convergence rate of the estimated parameter toward its true value depends on how fast the largest element converges to infinity, and provide fast inversion formulas to obtain

The current era of Big Data has ushered in a new way to look at science: letting the data speak for itself. Because of this, we are now much more concerned with empirical distributions than we were in the past, and with checking what these empirical distributions could be in statistically rigorous ways. In the past, many tests on empirical data were performed against the univariate normal distribution [

Among these tests, the KS and Lilliefors tests can also be applied to non-normal distributions. In fact, many real-world data do not follow normal distributions. For instance, many social systems are known to have power-law distributions [

Since the CSN test can be applied across distributions, we also use it to fit data that appear exponentially distributed. On many occasions, we discovered that the exponential fits look good visually, but have significance values (

We realized there are two issues associated with fitting data to distributions defined over (0, ∞). First, there is the

Sometimes we have reasons to believe that our large data sets may be described by well-known distributions, such as the normal distribution, the power-law distribution, the exponential distribution, and so on, but with best-fit parameter values that we need to determine. Commonly used methods to perform ^{2}) and other forms of distance measure [

In a recent statement, the American Statistical Association warned the scientific community that the

In 2009, Clauset, Shalizi, and Newman (CSN) did precisely this by coming up with a

for _{min}, ∞), with exponent α. The CSN

_{1}, _{2}, …, _{S}}, sorted such that _{i} ≤ _{i+1}, the CSN algorithm (^{(j)}, we estimate α^{(j)} using the MLE method that maximizes the log-likelihood function,

Applying the maximizing condition

where the hat indicates an estimated parameter and
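Although the log-likelihood and its maximizer are not reproduced in full in this excerpt, the continuous power-law MLE is known to have the closed form α̂ = 1 + N[Σᵢ ln(xᵢ/x_min)]⁻¹. A minimal sketch under that assumption (function names are ours):

```python
import numpy as np

def mle_powerlaw_alpha(x, x_min):
    """Closed-form MLE for the exponent of a continuous power law
    defined on [x_min, infinity)."""
    tail = np.asarray(x, dtype=float)
    tail = tail[tail >= x_min]           # keep only elements in the tail
    n = tail.size
    return 1.0 + n / np.sum(np.log(tail / x_min))
```

For power-law samples generated by the inverse-CDF transform x = x_min(1 − u)^{−1/(α−1)}, the estimate converges to the true exponent as the sample grows.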

_{X} with cumulative distribution function _{X}, then its probability integral transform _{X}(_{1} = _{min}, _{2}, …, _{N}} with estimated

between ^{(s)} and _{U}(
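The KS distance of Equation (4) is the largest gap between the empirical CDF S(x) and the model CDF P(x). A minimal sketch, assuming the standard two-sided definition (names are ours):

```python
import numpy as np

def ks_distance(x, model_cdf):
    """Maximum gap between the empirical CDF S(x) and a model CDF P(x)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cdf = model_cdf(x)
    # empirical CDF evaluated just at and just below each sorted point
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    return max(d_plus, d_minus)
```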

_{min}_{min}, (^{(j)} with its corresponding ^{(j)} that yields the lowest KS distance (

_{1}, _{2}, …, _{S}}, we test how plausible it is for

is the fraction of simulated samples whose fits are poorer than that of the data.
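This step amounts to a parametric bootstrap: draw synthetic samples from the fitted distribution, refit each one, and count how often the synthetic KS distance exceeds the data's. A sketch for the EXP case with x_min = 0 and β estimated by the sample mean (function names are ours):

```python
import numpy as np

def ks_stat(x, cdf):
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    c = cdf(x)
    return max(np.max(np.arange(1, n + 1) / n - c),
               np.max(c - np.arange(0, n) / n))

def exp_pvalue(data, n_sims=500, seed=0):
    """Fraction of synthetic EXP samples whose refitted KS distance
    is at least as large as that of the data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)

    def fit_ks(x):
        beta = x.mean()  # MLE scale for EXP with x_min = 0
        return ks_stat(x, lambda t: 1.0 - np.exp(-t / beta))

    d_data = fit_ks(data)
    d_sims = np.array([fit_ks(rng.exponential(data.mean(), data.size))
                       for _ in range(n_sims)])
    return np.mean(d_sims >= d_data)
```

Crucially, each synthetic sample is refitted before its KS distance is measured, so the null distribution accounts for the parameter estimation step.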

Extending the CSN method to other distributions, we performed

_{min}) fitted. The black dots represent empirical data while the blue dashed line represents the fit. All fits are visually good, yet only the _{KS} in percentage) for Taiwan home price is appreciable.

At this stage, we might wonder whether the Taiwan income data would have been better fitted to a truncated EXP (TEXP) distribution

since it is obtained by removing the power-law tail. The Taiwan home price per square foot data was also truncated, but for a different reason: a small number of the largest elements are clearly outliers that would not fit the EXP distribution. Ideally, we should be using untruncated data, like the Straits Times Index data, to illustrate the method that we will describe in the following sections. In the rest of the paper, we will use all three data sets as if they were untruncated, to illustrate how well our method works on different data types. To do so, we will compare the adjusted parameter and test statistic against the unadjusted parameter and test statistic meant for the untruncated EXP distribution.

Here, we will illustrate the effects of FLE using an asymptotic EXP distribution. The same discussion can be generalized to other distributions (see Supplementary Information section

The EXP distribution is defined as

with β as a sole parameter for _{min}, ∞). Maximizing the likelihood function

If we use the mean obtained from the data 〈_{data} as 〈_{unadj}. However, due to the FLE, we can only average up to _{max}. As such, 〈_{data} will be biased downwards, and Equation (8) over-estimates

To adjust for the FLE, we add the truncated part back into 〈_{data}, to define the adjusted 〈_{adj} as

Inserting 〈_{adj} into Equation (8), we obtain a nonlinear equation

that we solve using MATLAB's built-in nonlinear solver function
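Since Equations (9) and (10) are not reproduced in this excerpt, the sketch below assumes the x_min = 0 case, where adding the truncated tail back to the sample mean is equivalent to solving 〈x〉_data = β − x_max/(e^{x_max/β} − 1) for β. In place of MATLAB's solver, we use SciPy's `brentq` (names are ours):

```python
import numpy as np
from scipy.optimize import brentq

def beta_adjusted(x, x_max):
    """FLE-adjusted EXP scale for x_min = 0: solve
    <x>_data = beta - x_max / (exp(x_max / beta) - 1),
    i.e., the truncated-EXP mean equation, for beta."""
    m = float(np.mean(x))
    f = lambda beta: beta - x_max / np.expm1(x_max / beta) - m
    # f(m) < 0 always; the truncated mean approaches x_max / 2 as
    # beta -> infinity, so a root exists whenever m < x_max / 2
    return brentq(f, m, 1e6 * m)
```

For strongly truncated data, the naive sample mean under-estimates β, while the solution of the nonlinear equation recovers the true scale.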

To test the performance of this adjustment formula, we simulated 1,000 sets of EXP distributed data for

This transforms _{i}} to EXP distributed elements {_{i}}. Using this transformation _{min} and ∞ respectively. It is also useful to note that Equation (11) is the inverse of the CDF of the EXP distribution,
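The inverse-CDF transform of Equation (11) can be sketched as follows; restricting u to [0, 1 − δ) then imposes the FLE, capping the sample at x_max = x_min − β ln δ (function names are ours):

```python
import numpy as np

def exp_from_uniform(u, beta, x_min=0.0):
    """Probability-integral (inverse-CDF) transform: standard-uniform
    u in [0, 1) to EXP-distributed x on [x_min, infinity)."""
    return x_min - beta * np.log(1.0 - np.asarray(u, dtype=float))
```

For example, u = 0 maps to x_min and u → 1 maps to infinity, matching the limits noted above.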

To simulate the effect of a FLE with _{min} = 0. Thereafter, we estimated

of _{T}. As we can see from the Figure ^{2} and decreases to 34% for large samples ^{4}. On the other hand, _{max}. In contrast, _{max} as we have accounted for the FLE.

Relative estimation errors of _{T} and _{min} = 0 and _{unadj} remains high (close to the theoretical relative error of ϵ(δ = 0.1, _{min} = 0) = 0.1[1 − ln(0.1)] ≈ 33%) even for large _{adj} decreases rapidly with increasing

In the Supplementary Information section

By defining _{max} in Equation (14) with δ, the theoretical relative estimation error is expressed as

Equation (15) shows that the estimation error has no explicit dependence on sample size. This tells us that the _{T} because of the FLE effect. The convergence rate then depends on how rapidly _{max} approaches infinity (δ approaches zero) with increasing sample size.

For a finite sample, _{EXP}(_{EXP}(_{KS} is obtained by comparing ^{(s)} by a factor of 1/(1−δ). Figure _{KS} measured for the 1,000 sets of EXP distributed data with finite largest element _{T} and sample sizes ^{(s)} using Equation (12). After that, we measure _{KS} with Equation (4) to obtain the unadjusted KS distance _{unadj} and the adjusted KS distance _{adj} using the non-rescaled and rescaled ^{(s)}, respectively. _{unadj} goes from 0.14 for small samples ^{2}, to 0.10 for large samples ^{5}. In contrast, _{adj} decreases from 0.06 for small samples to 0.006 for large samples.
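The rescaling of the model CDF by 1/(1 − δ) can be sketched as follows for the x_min = 0 EXP case (names are ours; in practice β would be the FLE-adjusted estimate):

```python
import numpy as np

def ks_adjusted(x, beta, x_max):
    """Unadjusted and FLE-adjusted KS distances for EXP data with
    x_min = 0: the adjusted version rescales the model CDF by
    1/(1 - delta), delta = exp(-x_max / beta), so that the model CDF
    also reaches 1 at x_max, like the empirical CDF."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    hi = np.arange(1, n + 1) / n
    lo = np.arange(0, n) / n
    p = 1.0 - np.exp(-x / beta)
    delta = np.exp(-x_max / beta)
    ks = lambda c: max(np.max(hi - c), np.max(c - lo))
    return ks(p), ks(p / (1.0 - delta))
```

Without the rescaling, the unadjusted distance is dominated by the gap of size δ between the empirical CDF (which reaches 1 at x_max) and the model CDF (which reaches only 1 − δ there).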

The median KS distances for _{unadj} and _{adj} measured from 1,000 simulated samples using different β_{T} and _{min} is set to 0 and _{unadj} remains above δ = 0.10 while _{adj} converges to zero for large

Until now, we have only discussed adjustments to the estimated parameter and the KS distance to eliminate the bias caused by the FLE. Besides the FLE effect, we also need to consider the bias caused by having a finite number of elements in the sample. As we can see from Figure _{KS} changes as a function of ^{6} samples of various sizes _{KS} using Equation (4), so that for each ^{6} KS distances. In Figure

that we settled for, after experimenting with several functional forms (see Supplementary Information section _{KS} → 0 as _{1} and _{2} from the same distribution, we should compare _{2} > _{1} then naturally _{2} sample fits the distribution better.
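Equation (16) itself is not reproduced in this excerpt. As an illustration of how such a sample-size scaling can be extracted empirically, the sketch below fits a hypothetical power-law form, median(D_KS) ≈ c·N^slope, to simulated standard-uniform samples via linear regression in log-log space:

```python
import numpy as np

def median_ks_scaling(sample_sizes, n_rep=200, seed=0):
    """Median KS distance of standard-uniform samples at several
    sample sizes N, followed by a log-log fit median(D) ~ c * N**slope."""
    rng = np.random.default_rng(seed)
    medians = []
    for n in sample_sizes:
        d = np.empty(n_rep)
        for r in range(n_rep):
            u = np.sort(rng.random(n))
            hi = np.arange(1, n + 1) / n
            lo = np.arange(0, n) / n
            d[r] = max(np.max(hi - u), np.max(u - lo))
        medians.append(np.median(d))
    slope, log_c = np.polyfit(np.log(sample_sizes), np.log(medians), 1)
    return slope, np.exp(log_c)
```

For the standard uniform distribution the median KS distance is known to shrink roughly like N^{−1/2}, which the fitted slope recovers.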

Log-log plot of _{KS} against ^{6} simulations.

In this section, we presented explicitly the procedures to obtain the adjusted parameter, as well as the steps to perform significance testing on this estimated parameter. Although we demonstrated this explicitly using the EXP distribution as an example, one should note that this method can also be applied to other distributions. The inclusion of _{max} when fitting empirical data has been previously considered by [_{max} to infinity, we can compute the probability integral transform to map arbitrary distributions to the standard uniform distribution, and ensure that during statistical significance testing our goodness-of-fit measure can be distribution independent [see

More importantly, fitting data to untruncated distributions defined over [_{min}, ∞) is commonly encountered in practice, where no _{max} is expected from theoretical considerations, but the largest element in our data is finite. If we fit to the truncated versions of the distributions, we might get better estimates of the distribution parameters, but we will not be able to justify inserting these estimates into the untruncated distributions, in the absence of a limiting procedure involving larger and larger _{max}. Moreover, when researchers expect to be dealing with the untruncated distribution, they will not use the truncated distribution for estimation. In contrast, our self-consistent adjustment procedure would be ontologically easier to justify.

Besides having to work with finite samples and finite largest elements, we will also in practice encounter imperfections while collecting samples for various reasons, such as undetected samples, contamination by background noise, and recording errors. We call such noises that occur at the element level

Suppose we now randomly generate a set of EXP data. After adjusting for the FLE, we obtain the distribution parameters and use them to transform this set to _{i−1}, _{i}], such that the probability density is

As the theoretical probability density for _{DN} mathematically to be

where _{0} = 0 and _{N} = 1. We need to weight the deviation of each bin by _{DN} finite.
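Equation (18) is likewise not reproduced in this excerpt. The sketch below is an illustrative stand-in that follows the construction described above: bins bounded by the sorted uniform-transformed values (with the endpoints 0 and 1 appended), per-bin empirical densities compared against the uniform density 1, and each deviation weighted by its bin width so the sum stays finite; the exact weighting and function names are our assumptions:

```python
import numpy as np

def distribution_noise(z):
    """Illustrative width-weighted density-deviation measure:
    sorted uniform-transformed values (endpoints 0 and 1 appended)
    define n + 1 bins; each bin's empirical density 1/((n+1) dz) is
    compared with the uniform density 1, and the deviation is
    weighted by the bin width dz, so zero-width bins contribute
    a finite amount.  The sum simplifies to sum |1/(n+1) - dz|."""
    z = np.sort(np.asarray(z, dtype=float))
    edges = np.concatenate(([0.0], z, [1.0]))
    dz = np.diff(edges)                  # bin widths, n + 1 bins
    mass = 1.0 / dz.size                 # expected mass per bin
    return float(np.sum(np.abs(mass - dz)))
```

Evenly spread values give zero noise, while strongly clustered values give a large value, matching the intuition that this measure is sensitive to deviations at the probability density level.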

Illustration of the distribution noise we would measure if we sample 10 elements from

As in section 3.3, we simulated 10^{6} samples from _{DN} using Equation (18) and plotted its deciles against _{DN} and _{DN} as

where Φ(

is the analytically derived distribution noise, that converges to _{1} and _{2} with _{2} > _{1} from the same distribution, we should compare _{2} sample fits the distribution better if

Relationship between distribution noise _{DN} and sample size ^{6} simulations. The _{DN} value converges to

As measures of statistical deviation, _{DN} and _{KS} are different in that _{DN} measures deviation at the probability density level, whereas _{KS} measures it at the cumulative distribution level. As a result, _{KS} assigns more weight to the tail of the distribution, while _{DN} is more sensitive to deviations in the body of the distribution. Therefore, if we wish to combine these two measures to estimate the significance level, we need to first investigate the relationship between _{KS} and _{DN}. We do this by simulating 10^{6} samples from _{KS} and _{DN} using Equations (4) and (18) respectively, to obtain 10^{6} pairs of _{KS} and _{DN}. We then computed the Pearson correlation between _{KS} and _{DN} and learned that (see Supplementary Information section

As expected, _{KS} is positively correlated with _{DN}. Since _{KS} is a measure at the cumulative level, the random distribution noises cancel one another, and thus the correlation between _{KS} and _{DN} vanishes as

To perform significance testing given _{KS} and _{DN}, we need the percentile values ℘_{KS} and ℘_{DN}. ℘_{KS} can be obtained by inverting Equation (16), as

Similarly, we invert Equation (19), and solve

to get ℘_{DN}, where η = _{DN}_{DN}
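Since Equations (16) and (19) are not reproduced here, we note only that either inversion can be carried out numerically whenever the forward relation is monotone in the percentile. A generic sketch using SciPy's `brentq` (the quadratic example is purely illustrative, not Equation 16):

```python
from scipy.optimize import brentq

def invert_monotone(f, y, lo, hi):
    """Recover p from y = f(p) by root-finding, assuming f is
    monotone on [lo, hi] and y lies between f(lo) and f(hi)."""
    return brentq(lambda p: f(p) - y, lo, hi)
```

For instance, `invert_monotone(lambda p: p ** 2, 0.25, 0.0, 1.0)` recovers p = 0.5; in the paper's setting, f would be the fitted scaling relation and y the measured distance.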

Substituting the empirical KS distance _{KS}_{DN}_{KS}_{DN}

to avoid overestimating the significance level.

We follow the steps outlined in the CSN algorithm (section 2) to fit the empirical data, but with two important modifications: (i) the parameters (

Figure _{max}) is the main reason for the Taiwan wealth data to fail

_{KS/DN}-values (in percentage) are for unadjusted (blue) and adjusted (red) fits. We separate the

There are several limitations one should note while obtaining _{KS/DN} using Equations (22) or (23). First, it is only applicable to large samples (see Figures _{KS}_{min}_{KS}_{KS/DN}. We make the codes for the procedures used in parameter estimation and significance testing available at

All in all, when we test for statistical significance, we need to be aware of finite sample effects, namely the finite largest element effect and the finite number of elements effect. Beyond the KS distance measured at the cumulative distribution level, we also introduce an alternative measure of the goodness of fit based on the distribution noise at the probability density level.

BT, DT, and SC: designed research. BT: performed research. BT, DT, and SL: collected data. BT and DT: analyzed data. All authors wrote and reviewed the paper.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The authors would like to thank Chou Chung-I for directing us to the Taiwanese data sets.

The Supplementary Material for this article can be found online at: