Rank dynamics of word usage at multiple scales

The recent dramatic increase in online data availability has allowed researchers to explore human culture with unprecedented detail, such as the growth and diversification of language. In particular, it provides statistical tools to explore whether word use is similar across languages, and if so, whether these generic features appear at different scales of language structure. Here we use the Google Books $N$-grams dataset to analyze the temporal evolution of word usage in several languages. We apply measures proposed recently to study rank dynamics, such as the diversity of $N$-grams in a given rank, the probability that an $N$-gram changes rank between successive time intervals, the rank entropy, and the rank complexity. Using different methods, results show that there are generic properties for different languages at different scales, such as a core of words necessary to minimally understand a language. We also propose a null model to explore the relevance of linguistic structure across multiple scales, concluding that $N$-gram statistics cannot be reduced to word statistics. We expect our results to be useful in improving text prediction algorithms, as well as in shedding light on the large-scale features of language use, beyond linguistic and cultural differences across human populations.


INTRODUCTION
The recent availability of large datasets on language, music, and other cultural constructs has allowed the study of human culture at a level never possible before, opening the data-driven field of culturomics (Lieberman et al., 2007; Michel et al., 2011; Dodds et al., 2011; Serrà et al., 2012; Blumm et al., 2012; Solé et al., 2013; Tadić et al., 2013; Gerlach and Altmann, 2013; Perc, 2013; Febres et al., 2015; Wagner et al., 2014; Piña-Garcia et al., 2016; Piña-García et al., 2018). In the social sciences and humanities, lack of data has traditionally made it difficult or even impossible to contrast and falsify theories of social behaviour and cultural evolution. Fortunately, digitized data and computational algorithms allow us to tackle these problems with a stronger statistical basis (Wilkens, 2015). In particular, the Google Books N-grams dataset (Michel et al., 2011; Wijaya and Yeniterzi, 2011; Petersen et al., 2012b,a; Perc, 2012; Acerbi et al., 2013; Ghanbarnejad et al., 2014; Dodds et al., 2015; Gerlach et al., 2016) continues to be a fertile source of analysis in culturomics, since it contains an estimated 4% of all books printed throughout the world until 2009. From the 2012 update of this public dataset, we measure frequencies per year of words (1-grams), pairs of words (2-grams), up to N-grams with N = 5 for several languages, and focus on how scale (as measured by N) determines the statistical and temporal characteristics of language structure.
We have previously studied the temporal evolution of word usage (1-grams) for six Indo-European languages: English, Spanish, French, Russian, German, and Italian, between 1800 and 2009 (Cocho et al., 2015). We first analysed the language rank distribution (Zipf, 1932; Newman, 2005; Baek et al., 2011; Corominas-Murtra et al., 2011), i.e. the set of all words ordered according to their usage frequency. By making fits of this rank distribution with several models, we noticed that no single functional shape fits all languages well. Yet, we also found regularities in how ranks of words change in time: Every year, the most frequent word in English (rank 1) is 'the', while the second most frequent word (rank 2) is 'of'. However, as the rank k increases, the number of words occupying the k-th place of usage (at some point in time) also increases. Intriguingly, we observe the same generic behaviour in the temporal evolution of performance rankings in some sports and games (Morales et al., 2016).
To characterize this generic feature of rank dynamics, we have proposed the rank diversity d(k) as the number of words occupying a given rank k across all times, divided by the number T of time intervals considered (for Cocho et al. (2015), T = 210 intervals of one year). For example, in English d(1) = 1/210, as there is only one word ('the') occupying k = 1 every year. The rank diversity increases with k, reaching a maximum d(k) = 1 when there is a different word at rank k each year. The rank diversity curves of all six languages studied can be well approximated by a sigmoid curve, suggesting that d(k) may reflect generic properties of language evolution, irrespective of differences in grammatical structure and cultural features of language use. Moreover, we have found rank diversity useful to estimate the size of the core of a language, i.e. the minimum set of words necessary to speak and understand a tongue (Cocho et al., 2015).
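As an illustration of how the rank diversity can be computed from a rank table, consider the following minimal sketch; the toy data and the helper name rank_diversity are ours for illustration, not the code used for the Google Books data.

```python
# Illustrative sketch: computing rank diversity d(k) = |X(k)| / T from a toy
# rank table, where rank_table[t][k-1] is the word at rank k in year t.
from collections import defaultdict

def rank_diversity(rank_table):
    """Return {k: d(k)}, with d(k) the number of distinct words seen at rank k
    across all T time intervals, divided by T."""
    T = len(rank_table)
    words_at_rank = defaultdict(set)
    for year in rank_table:
        for k, word in enumerate(year, start=1):
            words_at_rank[k].add(word)
    return {k: len(words) / T for k, words in words_at_rank.items()}

# Toy data: rank 1 is always 'the'; rank 2 alternates between two words.
table = [['the', 'of'], ['the', 'and'], ['the', 'of'], ['the', 'and']]
d = rank_diversity(table)
# d[1] == 1/4 (one word across four years), d[2] == 2/4
```

In the same way, d(1) = 1/210 for English in Cocho et al. (2015), since a single word occupies rank 1 in all 210 years.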
In this work, we extend our previous analysis of rank dynamics to N-grams with N = 1, 2, ..., 5 between 1855 and 2009 (T = 155) for the same six languages, considering the first 11,140 ranks in all 30 datasets (to have equal size and avoid potential finite-size effects). In the next section, we present results for the rank diversity of N-grams. We then compare empirical digram data with a null expectation for 2-grams that are randomly generated from the monogram frequency distribution. Results for novel measures of change probability, rank entropy, and rank complexity follow. Next, we discuss the implications of our results, from practical applications in text prediction algorithms, to the emergence of generic, large-scale features of language use despite the linguistic and cultural differences involved. Details of the methods used close the paper.

RESULTS
Figure 1 shows the rank trajectories across time for selected N-grams in French, classified by value of N and their rank of usage in the first year of measurement (1855). The behaviour of these curves is similar for all languages: N-grams in low ranks (most frequently used) change their position less than N-grams in higher ranks, yielding a sigmoid rank diversity d(k) (Figure 2). Moreover, as N grows, the rank diversity tends to be larger, implying a larger variability in the use of particular phrases relative to words. To better grasp how N-gram usage varies in time, Tables S1-S30 in the Supplementary Information list the top N-grams in several years for all languages. We observe that the lowest ranked N-grams (most frequent) tend to be or contain function words (articles, prepositions, conjunctions), since their use is largely independent of the text topic. On the other hand, content words (nouns, verbs, adjectives, adverbs) are contextual, so their usage frequency varies widely across time and texts. Thus, we find it reasonable that top N-grams vary more in time for larger N.

Rank diversity of N-gram usage
As Figure 2 shows, rank diversity d(k) tends to grow with the scale N since, as N increases, it is less probable to find N-grams composed only of function words (especially in Russian, which has no articles). For N = 1, 2 in some languages, function words dominate the top ranks, decreasing their diversity, while the most popular content words (1-grams) change rank widely across centuries. Thus, we expect the most frequent 5-grams to change relatively more in time (for example, in Spanish, d(1) is 1/155 for 1-grams and 2-grams, 7/155 for 3-grams, 15/155 for 4-grams, and finally 37/155 for 5-grams). Overall, we observe that all rank diversity curves can be well fitted by the sigmoid curve

d(k) = Φ_{µ,σ}(log10 k) = (1/2) [1 + erf((log10 k − µ) / (√2 σ))],  (1)

where µ is the mean and σ the standard deviation of the sigmoid, both dependent on language and N value (Table 1).
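The fit of Equation (1) can be sketched as follows, assuming (as in the Methods) that the sigmoid is the cumulative Gaussian in log10(k), fitted by Levenberg-Marquardt non-linear least squares; the data below are synthetic, generated from known parameters, not the N-gram measurements.

```python
# Illustrative fit of the sigmoid of Equation (1) with scipy's default
# Levenberg-Marquardt least squares; synthetic noiseless data.
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import erf

def sigmoid(log_k, mu, sigma):
    # Phi_{mu,sigma}(log10 k): cumulative normal evaluated at log10(k)
    return 0.5 * (1.0 + erf((log_k - mu) / (np.sqrt(2.0) * sigma)))

log_k = np.log10(np.arange(1, 10001))           # ranks 1 .. 10^4
d_synthetic = sigmoid(log_k, 2.0, 0.8)          # known mu, sigma
(mu_fit, sigma_fit), _ = curve_fit(sigmoid, log_k, d_synthetic, p0=(1.0, 1.0))
# mu_fit ~ 2.0, sigma_fit ~ 0.8
```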
In Figure 3 we see the fitted values of µ and σ for all datasets considered. In all cases µ decreases with N , while in most cases σ increases with N , roughly implying an inversely proportional relation between µ and σ. It is interesting to note that for Romance languages (Spanish, French, and Italian), σ increases when moving from 3-grams to 5-grams, while for Germanic languages (English and German) and Russian (a Slavic language), there is a decrease in σ from N = 3 to N = 4.

Null model: random shuffling of monograms
In order to understand the dependence of language use, as measured by d(k), on the scale N, we can ask whether the statistical properties of N-grams can be deduced exclusively from those of monograms, or if the use of higher-order N-grams reflects features of grammatical structure and cultural evolution that are not captured by word usage frequencies alone. To approach this question, we consider a null model of language in which grammatical structure does not influence the order of words. We base our model on the idea of shuffling 1-gram usage data to eliminate the grammatical structure of the language, while preserving the frequency of individual words (more details in Methods, Section 4.2).

Rank diversity in null model
As can be seen in Figure 4, the rank diversity of digrams constructed from shuffled monograms is generally lower than for the non-shuffled digrams, although it keeps the same functional shape of Equation (1) (see fit parameters in Table 1). In the absence of grammatical structure, the frequency of each 2-gram is determined by the frequencies of its two constituent 1-grams. Thus, combinations of high frequency 1-grams dominate the low ranks, including some that are not grammatically valid (e.g. 'the the', 'the of', 'of of') but are much more likely to occur than most others. Moreover, the rank diversity of such combinations is lower than we see in the non-shuffled data, because the low ranked 1-grams that create these combinations are relatively stable over time. Thus, we can conclude that the statistics of higher order N-grams are determined by more than word statistics, i.e. language structure matters at different scales.

z-scores in null model
The amount of structure each language exhibits can be quantified by the z-scores of the empirical 2-grams with respect to the shuffled data. Following its standard definition, the z-score of a 2-gram is a measure of the deviation between its observed frequency in empirical data and the frequency we expect to see in a shuffled dataset, normalized by the standard deviation seen if we were to shuffle the data and measure the frequency of the 2-gram many times (see Section 4.2 for details).
The 2-grams with the highest z-scores are those for which usage of the 2-gram accounts for a large proportion of the usage of each of its two constituent words. That is, both words are more likely to appear together than they are in other contexts (for example, 'led zeppelin' in the Spanish datasets), suggesting that the combination of words may form a linguistic token that is used in a similar way to an individual word. We observe that the majority of 2-grams have positive z-scores, which simply reflects the existence of non-random structure in language (Figure 5). What is more remarkable is that many 2-grams, including some at low ranks ('und der', 'and the', 'e di'), have negative z-scores; a consequence of the high frequency and versatility of some individual words.
After normalizing the results to account for varying total word frequencies between different language datasets, we see that all languages exhibit a similar tendency for the z-score to be smaller at higher ranks (as measured by the median; this is not the case for the mean). This downward slope can be explained by the large number of 2-grams that combine one highly versatile word, i.e. one that may be combined with a diverse range of other words, with relatively low frequency words (for example, 'the antelope'). In such cases, z-scores decrease with rank as z ∼ k^(−1/2) (see Section 4.2).

Next-word entropy
Motivated by the observation that some words appear alongside a diverse range of other words, whereas others appear more consistently with the same small set of words, we examine the distribution of next-word entropies. Specifically, we define the next-word entropy for a given word i as the (non-normalized) Shannon entropy of the set of words that appear as the second word in 2-grams for which i is the first. In short, the next-word entropy of a given word quantifies the difficulty of predicting the following word. As shown in Figure 6, words with higher next-word entropy are less abundant than those with lower next-word entropy, and the relationship is approximately exponential.

Change probability of N-gram usage
To complement the analysis of rank diversity, we propose a related measure: the change probability p(k), i.e. the probability that an N-gram at rank k will change rank in one time interval. We calculate it for a given language dataset by dividing the number of times elements change at a given rank k by the number of temporal transitions, T − 1. The change probability behaves similarly to rank diversity in some cases. For example, if there are only two N-grams that appear at rank 1, d(1) = 2/155. If one word was ranked first until 1900 and then a different word became first, there was only one rank change, thus p(1) = 1/154. However, if the words alternated ranks every year (which does not occur in the datasets studied), the rank diversity would be the same, but p(1) = 1. Figure 7 shows the behavior of the change probability p(k) for all languages studied. We see that p(k) grows faster than d(k) for increasing rank k. The curves can also be well fitted with the sigmoid of Equation (1) (fit parameters in Table 2). Figure 8 shows the relationship between µ and σ of the sigmoid fits for the change probability p(k). As with the rank diversity, µ decreases with N for each language, except for French and German between 3-grams and 4-grams. However, the σ values seem to have a low correlation with N. We also analyze the difference between rank diversity and change probability, d(k) − p(k) (Figure S1). As the change probability grows faster with rank k, the difference becomes negative and then grows together with the rank diversity. For large k, both rank diversity and change probability tend to one, so their difference is zero.
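The change probability for a single rank can be sketched as follows; the helper name change_probability and the toy series are ours for illustration, not the study's code.

```python
# Illustrative sketch: p(k) counts how often the occupant of one rank k
# differs between consecutive years, divided by T - 1 transitions.
def change_probability(occupants):
    """occupants: list of the N-grams holding one rank in successive years."""
    T = len(occupants)
    changes = sum(1 for t in range(T - 1) if occupants[t] != occupants[t + 1])
    return changes / (T - 1)

# One change mid-series vs. an occupant alternating every year: both series
# have the same rank diversity (two elements), but very different p(k).
p_single = change_probability(['a', 'a', 'a', 'b', 'b'])   # 1 change / 4
p_altern = change_probability(['a', 'b', 'a', 'b', 'a'])   # 4 changes / 4
```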

Rank entropy of N-gram usage
We can define another related measure: the rank entropy E(k). Based on Shannon's information, it is simply the normalized information of the elements appearing at rank k during all time intervals (see Methods). For example, if at rank k = 1 only two N-grams appear, d(1) = 2/155. Information is maximal when the probabilities of elements are homogeneous, i.e. when each N-gram appears half of the time, as it is uncertain which of the elements will occur in the future. However, if one element appears only once, information will be minimal, as there will be a high probability that the other element will appear in the future. As with the rank diversity and change probability, the rank entropy E(k) also increases with rank k, in fact even faster, as shown in Figure 9. Similarly, E(k) tends to be higher as N grows, and may be fitted by the sigmoid of Equation (1), at least for high enough k (see fit parameters in Table 3). Notice that, since rank entropy in some cases already has high values at k = 1, the sigmoids can have negative µ values.
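A sketch of the normalized rank entropy follows, assuming (as in the Methods) that the normalization constant is 1/log2 of the alphabet length; the helper name and toy data are ours.

```python
# Illustrative sketch: normalized rank entropy E(k) for one rank. The
# probabilities are the fractions of years each N-gram occupies the rank,
# and the entropy is divided by log2 of the alphabet size |X(k)|.
from collections import Counter
from math import log2

def rank_entropy(occupants):
    counts = Counter(occupants)
    T = len(occupants)
    b = len(counts)               # alphabet length |X(k)|
    if b == 1:
        return 0.0                # a single occupant carries no uncertainty
    return -sum((c / T) * log2(c / T) for c in counts.values()) / log2(b)

# Two occupants, each holding the rank half of the time -> maximal E = 1:
E_max = rank_entropy(['a', 'a', 'b', 'b'])
# One occupant all the time -> minimal E = 0:
E_min = rank_entropy(['a', 'a', 'a', 'a'])
```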
The µ and σ values are compared in Figure 10. The behavior of these parameters is more diverse than for rank diversity and change probability. Still, the curves tend to have a "horseshoe" shape, where µ decreases and σ increases up to N ≈ 3, and then µ slightly increases while σ decreases.

Rank complexity of N-gram usage
Finally, we define the rank complexity C(k) as

C(k) = 4 E(k) [1 − E(k)].  (2)

This measure of complexity represents a balance between stability (low entropy) and change (high entropy) (Gershenson and Fernández, 2012; Fernández et al., 2014; Santamaría-Bonfil et al., 2016). So complexity is minimal for extreme values of the normalized entropy [E(k) = 0 or E(k) = 1] and maximal for intermediate values [E(k) = 0.5]. Figure 11 shows the behaviour of the rank complexity C(k) for all languages studied. In general, since E(k) ≈ 0.5 for low ranks, the highest C(k) values appear for low ranks and decrease as E(k) increases. C(k) also decreases with N. Moreover, C(k) curves reach values close to zero when E(k) is close to one: around k = 10^2 for N = 5 and k = 10^3 for N = 1, for all languages.
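This balance can be verified directly, assuming the definition C = 4E(1 − E) used in Fernández et al. (2014): complexity vanishes at the entropy extremes and peaks at E = 0.5.

```python
# Illustrative sketch of the entropy-complexity balance C(k) = 4 E (1 - E):
# minimal at E = 0 or E = 1, maximal (C = 1) at E = 0.5.
def rank_complexity(E):
    return 4.0 * E * (1.0 - E)

c_stable = rank_complexity(0.0)        # fully stable rank
c_random = rank_complexity(1.0)        # fully changing rank
c_peak = rank_complexity(0.5)          # balance of stability and change
```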

DISCUSSION
Our statistical analysis suggests that human language is an example of a cultural construct where macroscopic statistics (usage frequencies of N-grams for N > 1) cannot be deduced from microscopic statistics (1-grams). Since not all word combinations are grammatically valid, the statistics of 1-grams are not enough to study higher-order N-grams, as shown by the null model results. In other words, N-gram statistics cannot be reduced to word statistics. This implies that multiple scales should be studied at the same time to understand language structure and use in a more integrated fashion. We conclude not only that semantics and grammar cannot be reduced to syntax, but that even within syntax, higher scales (N-grams with N > 1) have an emergent, relevant structure which cannot be exclusively deduced from the lowest scale (N = 1).
While the alphabet, the grammar, and the subject matter of a text can vary greatly among languages, unifying statistical patterns do exist, and they allow us to study language as a social and cultural phenomenon without limiting our conclusions to one specific language. We have shown that despite many clear differences between the six languages we have studied, each language balances a versatile but stable core of words with less frequent but adaptable (and more content-specific) words in a very similar way. This leads to linguistic structures that deviate far from what would be expected in a random 'language' of shuffled 1-grams. In particular, it causes the most commonly used word combinations to deviate further from random than those at the other end of the usage scale.
If we are to assume that all languages have converged on the same pattern because it is in some way 'optimal', then it is perhaps this statistical property that allows word combinations to carry more information than the sum of their parts; to allow words to combine in the most efficient way possible in order to convey a concept that cannot be conveyed through a sequence of disconnected words. The question of whether or not the results we report here are consistent with theories of language evolution (Nowak and Krakauer, 1999; Cancho and Solé, 2003; Baronchelli et al., 2006) is certainly a topic for discussion and future research.
Apart from studying rank diversity, in this work we have introduced measures of change probability, rank entropy, and rank complexity. Analytically, the change probability is simpler to treat than rank diversity, as the latter varies with the number of time intervals considered (T), while the former is more stable (for a large enough number of observations). Still, rank diversity produces smoother curves and gives more information about rank dynamics, since the change probability grows faster with k. Rank entropy grows even faster, but all three measures [d(k), p(k), and E(k)] seem related, as they tend to grow with k and N in a similar fashion. Moreover, all three measures can be fitted relatively well by sigmoid curves (the worst fit has e = 0.02, as seen in Tables 1-3). Our results suggest that a sigmoid functional shape fits rank diversity best at low ranks, as the change probability and rank entropy have greater variability in that region.
In Cocho et al. (2015), we used the parameters of the sigmoid fit to rank diversity as an approximation of language core size, i.e. the number of 1-grams minimally required to speak a language. Assuming that these basic words are frequently used (low k) and thus have d(k) < 1, we consider the core size to be bounded by log10 k = µ + 2σ. As Table 4 shows, this value decreases with N, i.e. N-gram structures with larger N tend to have smaller cores. However, if the number of distinct words found in cores is counted, it increases from monograms to digrams, except for Spanish and Italian. From N = 2, the number of words in cores decreases steadily for all languages. This suggests that core words can be combined to form more complex expressions without the need to learn new words. English and French tend to have more words in their cores, while Russian has the fewest. It is interesting to note that the null model produces cores with about twice as many words as real 2-grams. Also, only within language cores are rank complexity values not close to zero. In other words, only ranks within the core have a high rank complexity.
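The core bound above can be computed directly from the fitted sigmoid parameters; the numbers below are illustrative placeholders, not values taken from Table 4.

```python
# Illustrative sketch: the core-size upper bound follows from the sigmoid fit
# via log10(k_core) = mu + 2*sigma, i.e. k_core = 10 ** (mu + 2*sigma).
def core_size(mu, sigma):
    return 10 ** (mu + 2 * sigma)

# e.g. mu = 2.0, sigma = 0.5 would bound the core at rank 10^3 (toy values):
k_core = core_size(2.0, 0.5)
```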
Our results may have implications for next-word prediction algorithms used in modern typing interfaces, such as those of smartphones. Lower ranked N-grams tend to be more predictable (higher z-scores and lower next-word entropy on average). Thus, next-word prediction should adjust the N value (scale) depending on the expected rank of the recent, already-typed words. If these do not appear in top ranked N-grams, then N should be decreased. For example, on the iOS 11 platform, after typing 'United States of', the system suggests 'the', 'all', and 'a' as next-word predictions, presumably by analyzing 2-grams. However, it is clear that the most probable next word is 'America', as this is a low-ranked 4-gram.
Beyond the previous considerations, perhaps the most relevant aspect of our results is that the rank dynamics of language use is generic not only for all six languages, but for all five scales studied. Whether the generic properties of rank diversity and related measures are universal still remains to be explored. Yet, we expect this and other research questions to be answered in the coming years as more data on language use and human culture becomes available.

METHODS

Data description
Data was obtained from the Google Books N-gram dataset, filtered and processed to obtain ranked N-grams for each year for each language. Data considers only the first 11,140 ranks, as this was the maximum rank available for all time intervals and languages studied. From these, rank diversity, change probability, rank entropy, and rank complexity were calculated as follows. Rank diversity is given by

d(k) = |X(k)| / T,  (3)

where |X(k)| is the cardinality (i.e. number of elements) of the set X(k) of N-grams that appear at rank k during all T = 155 time intervals (between 1855 and 2009, with one-year differences, or ∆t = 1). The change probability is

p(k) = 1 − (1/(T − 1)) Σ_{t=1}^{T−1} δ(X(k, t), X(k, t + 1)),  (4)

where δ(X(k, t), X(k, t + 1)) is the Kronecker delta, equal to zero if there is a change of N-gram at rank k in ∆t [i.e. the element X(k, t) is different from the element X(k, t + 1)], and equal to one if there is no change. The rank entropy is given by

E(k) = −(1 / log2 |X(k)|) Σ_{i=1}^{|X(k)|} p(x_i) log2 p(x_i),  (5)

where p(x_i) is the probability that element x_i occupies rank k, and the prefactor normalizes E(k) in the interval [0, 1]. Note that |X(k)| is the alphabet length, i.e. the number of elements that have occurred at rank k. Finally, the rank complexity is calculated using Eq. 2 and Eq. 5 (Fernández et al., 2014).

Modelling shuffled data
We first describe a shuffling process that eliminates any structure found within the 2-gram data, while preserving the frequency of individual words. Consider a sequence consisting of the most frequent word a number of times equal to its frequency, followed by the second most frequent word a number of times equal to its frequency, and so on all the way up to the 11,140th most frequent word (i.e. until all the words in the monogram data have been exhausted). Now suppose we shuffle this sequence and obtain the frequencies of 2-grams in the new sequence. Thus, we have neglected any grammatical rules about which words are allowed to follow which others (we can have the same word twice in the same 2-gram, for example), but the frequency of words remains the same.
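The shuffling process can be sketched as follows; the word frequencies below are toy values, not the Google Books data.

```python
# Illustrative sketch of the shuffling null model: build a sequence with each
# word repeated according to its monogram frequency, shuffle it, and count
# the resulting 2-grams (overlapping pairs of consecutive words).
import random
from collections import Counter

monogram_freqs = {'the': 5, 'of': 3, 'cat': 2}     # toy frequencies
sequence = [w for w, f in monogram_freqs.items() for _ in range(f)]
random.shuffle(sequence)   # destroys word order, preserves word frequencies

bigram_counts = Counter(zip(sequence, sequence[1:]))
# Word frequencies are unchanged by the shuffle, and a sequence of L words
# yields L - 1 (possibly ungrammatical) 2-grams such as ('the', 'the').
```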
We now derive an expression for the probability that a 2-gram will have a given frequency after shuffling has been performed. Let f_i denote the number of times the word i appears in the text, and f_ij the number of times the 2-gram ij appears. Additionally, F = Σ_i f_i. We want to know the probability P(f_ij) that ij appears exactly f_ij times in the shuffled sequence. We can think of P(f_ij) as the probability that exactly f_ij occurrences of i are followed by j. Supposing f_i < f_j, f_ij is determined by f_i independent Bernoulli trials with the probability of success equal to the probability that the next word will be j, i.e. f_j/F. In this case we have

P(f_ij) = [f_i! / (f_ij! (f_i − f_ij)!)] (f_j/F)^f_ij (1 − f_j/F)^(f_i − f_ij).

This distribution meets the condition (a large number of trials with a small success probability) that allows it to be approximated by a Poisson distribution, namely

P(f_ij) ≈ λ_ij^f_ij e^(−λ_ij) / f_ij!,

where λ_ij = f_i f_j / F is the mean, and also the variance, of the distribution of values of f_ij.
For each 2-gram we calculate the z-score. This is a normalized frequency of its occurrence: we normalize the actual frequency f_ij by subtracting the mean of the null distribution and dividing by the standard deviation,

z_ij = (f_ij − λ_ij) / √λ_ij.

In other words, the z-score tells us how many standard deviations the actual frequency is from the mean of the distribution derived from the shuffling process. The result is that the 2-grams with the highest z-scores are those which occur relatively frequently while their component words occur relatively infrequently.
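A minimal sketch of this computation, with toy counts rather than corpus data:

```python
# Illustrative z-score under the Poisson null: the expected 2-gram count after
# shuffling is lam = f_i * f_j / F, which is both the mean and the variance.
from math import sqrt

def z_score(f_ij, f_i, f_j, F):
    lam = f_i * f_j / F                  # expected count after shuffling
    return (f_ij - lam) / sqrt(lam)      # deviations in units of the std dev

# A 2-gram observed 25 times whose parts predict only lam = 4 co-occurrences:
z = z_score(25, f_i=40, f_j=10, F=100)   # (25 - 4) / 2 = 10.5
```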
Normalization. To compare z-scores of different languages, we normalize to eliminate the effects of incomplete data. Specifically, we normalize z-scores by dividing by the upper bound (which happens to be equal in order of magnitude to the lower bound). The highest possible z-score occurs in cases where f_ij = f_i = f_j, for which

z_ij = (f_i − f_i^2/F) / (f_i/√F) = √F (1 − f_i/F) < √F,

so an upper bound exists at √F. Similarly, the lowest possible z-score would hypothetically occur when f_ij = 0, giving z_ij = −√λ_ij = −√(f_i f_j / F), which is also of order √F in magnitude. We thus define the normalized z-score as

ẑ_ij = z_ij / √F.

The relationship between rank and z-score. To understand how the z-score changes as a function of rank, we look at another special case: suppose that i is a word that is found to be the first word in a relatively large number of 2-grams, and that all occurrences of the word j are preceded by i. In such cases we have f_ij = f_j, so that

ẑ_ij = (f_j − f_i f_j/F) / (√F √(f_i f_j/F)) = (1 − f_i/F) √(f_j/f_i).

Now consider only the subset of 2-grams that start with i and end with words that are only ever found to be preceded by i. Since f_i is constant within this subset, we have ẑ_ij = A f_j^(1/2), with A a constant. If we now assume that Zipf's law holds for the set of second words in the subset, i.e. that f_j = B r_j^(−1) where r_j is the rank of j and B another constant, then we have ẑ_ij = C r_j^(−1/2), with C a third constant.
Data. Unlike in other parts of this study, the shuffling analysis is applied to the 10^5 lowest ranked 2-grams.

Next-word entropy
The relationship between rank and z-score of 2-grams appears to be, at least partially, a consequence of the existence of high frequency core words that can be followed by many possible next words. This diversity of next words can be quantified by what we call the next-word entropy. Given a word i, we define the next-word entropy, E^nw_i, of i to be the (non-normalized) Shannon entropy of the distribution of frequencies of the 2-grams that have i as the first word,

E^nw_i = −Σ_j p_ij log2 p_ij,

where p_ij = f_ij / Σ_j' f_ij' is the fraction of 2-grams beginning with i whose second word is j.
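A minimal sketch of the next-word entropy, assuming a base-2 Shannon entropy over the successor distribution; the successor counts are toy values.

```python
# Illustrative next-word entropy: the (non-normalized) Shannon entropy of the
# distribution of words that follow a given word i in the 2-gram data.
from math import log2

def next_word_entropy(successor_counts):
    """successor_counts: {next_word: count of the 2-gram (i, next_word)}."""
    total = sum(successor_counts.values())
    return -sum((c / total) * log2(c / total)
                for c in successor_counts.values())

# A word followed by four equally likely words -> entropy log2(4) = 2 bits:
E_nw = next_word_entropy({'cat': 1, 'dog': 1, 'sun': 1, 'sea': 1})
```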

Fitting process
The curve fitting for rank diversity, change probability, and rank entropy has been made with the scipy-numpy package, using the non-linear least squares method (Levenberg-Marquardt algorithm). For rank entropy, we average data over windows of ten ranks: each window is represented by its mean rank k̄_i and the corresponding mean rank entropy Ē(k̄_i). With this averaged data, we adjust a cumulative normal (erf function) over the data of log10(k̄_i) and Ē(k̄_i). For rank diversity and change probability, we average data over points equally spaced in log10(k_i). As for rank entropy, a sigmoid (Eq. 1) is fitted for log10(k) and d(k), as well as for log10(k) and p(k). To calculate the mean quadratic error, we use

e = (1/n) Σ_{i=1}^{n} (X̄_i − X_i)^2,

where X̄_i is the value of the sigmoid adjusted to rank k_i and X_i is the real value of d(k_i). For p(k) and E(k) the error is calculated in the same way.

CONFLICT OF INTEREST STATEMENT
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Figure 1. Rank trajectories across time for selected N-grams in French. The plot is semilogarithmic, so similar changes across ranks correspond to changes proportional to the rank itself. Other languages (not shown) behave in a similar way: changes are more frequent as N increases. We have added a small shift over the y-axis for some curves, to see more clearly how the most frequently used N-grams remain at k = 1 for long periods of time.
Figure 2. Windowing is done averaging d(k) every 0.05 in log10 k.

Figure 6. The range of entropies is segregated into bins of width 1/2, while the probability is calculated as the number of words whose next-word entropy falls inside the bin, divided by the total number of words.

Table 1. Fit parameters for rank diversity for different languages, N-grams, and null model. Mean µ, standard deviation σ, and error e for the sigmoid fit of the rank diversity d(k) according to Equation (1). We also show the fit parameters for the null model of Figure 4.

Table 2. Fit parameters for change probability for different languages. Mean µ, standard deviation σ, and error e for the sigmoid fit of the change probability p(k) according to Equation (1).

Table 3. Fit parameters for rank entropy for different languages. Mean µ, standard deviation σ, and error e for the sigmoid fit of the rank entropy E(k) according to Equation (1).

Table 4. Language core parameters. Upper bound rank log10 k = µ + 2σ for the estimated core size of all languages studied, according to the sigmoid fit of Equation (1).

Figure S1. Difference between rank diversity and change probability across languages and N-grams.
As p(k) grows faster than d(k), these curves decrease to a minimum (close to −0.6), then increase when d(k) starts growing, and become zero when both measures reach their maximum value of one. Windowing is done averaging d(k) every 0.05 in log10 k.

Rank  1700      1800   1900   2000
19    be        by     be     be
20    and       which  not    by
21    their     the    which  or
22    was       on     on     are
23    surveyor  at     at     you
24    with      this   had    from
25    upon      had    or     at
26    but       from   from   his
27    no        have   are    he
28    are       their  have   have
29    have      but    this   an
30    were      you    '      this
Table S1. Top English 1-grams for arbitrary years.

Supplementary Material
Rank  1700          1800       1900       2000
1     of-the        of-the     of-the     of-the
2     in-the        in-the     in-the     in-the
3     to-the        to-the     to-the     to-the
4     the-surveyor  and-the    and-the    on-the
5     and-the       to-be      on-the     and-the
6     upon-the      on-the     to-be      for-the
7     for-the       by-the     by-the     to-be
8     that-the      of-his     of-a       of-a
9     by-the        from-the   for-the    from-the
10    of-his        with-the   from-the   with-the
11    to-be         of-a       with-the   at-the
12    the-stage     for-the    at-the     that-the
13    in-his        it-is      that-the   in-a
14    to-make       at-the     it-is      by-the
15    with-the      that-the   of-his     as-a
16    is-not        in-a       in-a       is-a
17    the-surveyor  all-the    with-a     is-the
18    the-drama     is-the     the-same   it-is
19    as-the        with-a     is-the     do-not
20    of-their      it-was     it-was     with-a
21    not-to        in-his     is-a       did-not
22    does-not      i-have     it-is      can-be
23    at-the        as-the     as-the     as-the
24    and-that      he-was     have-been  to-a
25    in-their      have-been  as-a       the-same
26    he-is         that-he    he-was     for-a
27    from-the      the-same   had-been   is-not
28    the-reader    of-their   may-be     into-the
29    that-the      had-been   all-the    it-was
30    out-of        of-this    that-he    was-a
Table S2. Top English 2-grams for arbitrary years.

Rank  1700                1800               1900               2000
1     so-much-as          one-of-the         one-of-the         one-of-the
2     of-the-stage        as-well-as         the-united-states  as-well-as
3     the-surveyor-'s     part-of-the        part-of-the        the-united-states
4     by-the-surveyor     i-do-not           i-do-not           i-do-not
5     as-the-surveyor     out-of-the         as-well-as         part-of-the
6     up-to-the           the-name-of        out-of-the         the-end-of
7     the-rest-of         in-order-to        can-not-be         out-of-the
8     the-english-stage   i-can-not          and-in-the         in-order-to
9     as-well-as          can-not-be         in-order-to        end-of-the
10    to-this-i           the-same-time      the-end-of         some-of-the
11    part-of-the         and-in-the         in-which-the       the-number-of
12    of-the-fable        of-the-world       of-the-united      to-be-a
13    not-to-be           according-to-the   that-he-was        i-did-not
14    not-so-much         that-it-was        is-to-be           be-able-to
15    of-the-play         of-the-most        i-can-not          the-use-of
16    of-the-drama        that-he-was        of-the-most        a-number-of
17    comedy-and-tragedy  is-to-be           the-same-time      can-not-be
18    but-the-surveyor    the-united-states  that-of-the        the-fact-that
19    upon-the-stage      at-the-same        that-it-was        in-the-united
20    tragedy-and-comedy  it-is-not          the-name-of        do-not-know
21    the-surveyor-is     to-have-been       some-of-the        the-same-time
22    the-reader-may      that-of-the        of-the-world       in-terms-of
23    the-authority-of    that-it-is         that-he-had        in-which-the
24    rest-of-the         that-he-had        the-fact-that      there-is-a
25    one-would-think     of-all-the         in-the-same        a-lot-of
26    is-not-so           to-be-a            side-of-the        there-is-no
27    and-as-for          not-to-be          it-is-not          i-can-not
28    a-man-'s            in-the-same        that-it-is         the-rest-of
29    to-no-purpose       in-the-world       to-be-a            you-do-not
30    to-make-the         there-is-no        at-the-same        side-of-the
Table S3. Top English 3-grams for arbitrary years.

Rank | 1700 | 1800 | 1900 | 2000
1 | not-so-much-as | at-the-same-time | of-the-united-states | in-the-united-states
2 | to-this-i-answer | of-the-united-states | at-the-same-time | the-end-of-the
3 | the-rest-of-the | at-the-head-of | the-end-of-the | i-do-not-know
4 | this-may-serve-to | in-the-midst-of | one-of-the-most | at-the-end-of
5 | the-authority-of-the | the-name-of-the | i-do-not-know | at-the-same-time
6 | i-must-tell-him | in-the-county-of | on-the-part-of | of-the-united-states
7 | but-it-seems-the | one-of-the-most | in-the-case-of | as-well-as-the
8 | as-far-as-it | i-do-not-know | for-the-purpose-of | the-rest-of-the
9 | and-this-may-serve | on-the-part-of | at-the-end-of | on-the-other-hand
10 | a-great-part-of | as-well-as-the | is-one-of-the | one-of-the-most
11 | a-great-deal-of | in-the-course-of | in-the-midst-of | as-a-result-of
12 | whence-it-will-follow | the-rest-of-the | in-the-united-states | is-one-of-the
13 | upon-the-english-stage | in-the-name-of | on-the-other-hand | in-the-case-of
14 | upon-the-account-of | the-end-of-the | as-well-as-the | in-the-form-of
15 | to-take-notice-of | the-head-of-the | for-the-first-time | do-not-want-to
16 | the-moral-of-the | in-a-state-of | was-one-of-the | on-the-basis-of
17 | the-business-of-the | for-the-purpose-of | the-rest-of-the | for-the-first-time
18 | so-much-as-a | in-the-way-of | the-head-of-the | did-not-want-to
19 | reader-may-please-to | the-hands-of-the | in-the-course-of | at-the-same-time
20 | me-in-mind-of | on-the-other-hand | the-part-of-the | on-the-other-hand
21 | it-may-be-so | for-the-most-part | the-middle-of-the | i-do-not-think
22 | in-the-time-of | for-the-sake-of | a-member-of-the | in-the-middle-of
23 | in-the-reign-of | the-middle-of-the | on-the-other-hand | the-top-of-the
24 | i-shall-endeavour-to | the-town-of-mansoul | at-the-time-of | at-the-time-of
25 | he-is-pleased-to | is-one-of-the | at-the-head-of | can-be-used-to
26 | far-as-it-appears | at-the-end-of | in-the-form-of | the-middle-of-the
27 | end-of-the-play | the-part-of-the | a-part-of-the | in-front-of-the
28 | but-i-'-m | into-the-hands-of | for-the-sake-of | i-do-not-want
29 | at-the-end-of | for-the-first-time | the-name-of-the | the-beginning-of-the
30 | as-well-as-the | the-nature-of-the | in-the-hands-of | at-the-university-of

Table S4. Top English 4-grams for arbitrary years.
Rank | 1700 | 1800 | 1900 | 2000
1 | and-this-may-serve-to | on-the-part-of-the | on-the-part-of-the | at-the-end-of-the
2 | as-far-as-it-appears | in-the-name-of-the | at-the-end-of-the | in-the-middle-of-the
3 | will-not-so-much-as | and-at-the-same-time | and-at-the-same-time | i-do-not-want-to
4 | we-have-no-reason-to | at-the-head-of-the | at-the-time-of-the | the-united-states-of-america
5 | there-'s-no-need-of | in-the-midst-of-the | the-other-side-of-the | i-do-not-know-what
6 | the-root-of-all-evil | the-other-side-of-the | at-the-head-of-the | the-other-side-of-the
7 | the-heat-of-the-climate | into-the-hands-of-the | in-the-middle-of-the | at-the-time-of-the
8 | the-end-of-the-play | the-name-of-the-lord | in-the-case-of-the | at-the-beginning-of-the
9 | the-ancient-and-modern-stages | at-the-end-of-the | is-one-of-the-most | as-a-result-of-the
10 | puts-me-in-mind-of | in-the-middle-of-the | in-the-hands-of-the | on-the-part-of-the
11 | not-so-much-as-offer | in-the-hands-of-the | in-the-midst-of-the | is-one-of-the-most
12 | not-so-much-as-a | on-the-banks-of-the | in-the-form-of-a | in-the-united-states-of
13 | may-not-be-amiss-to | at-the-head-of-a | on-the-other-side-of | at-the-top-of-the
14 | it-not-been-for-the | in-the-course-of-the | at-the-foot-of-the | at-the-end-of-the
15 | it-may-not-be-amiss | on-the-other-side-of | in-the-direction-of-the | i-did-not-want-to
16 | is-there-no-difference-between | of-the-town-of-mansoul | as-in-the-case-of | in-the-form-of-a
17 | is-not-so-much-as | the-greater-part-of-the | to-be-found-in-the | at-the-bottom-of-the
18 | is-an-odd-way-of | at-the-foot-of-the | in-the-history-of-the | on-the-other-side-of
19 | i-'-m-afraid-i | at-the-bottom-of-the | i-do-not-know-what | in-such-a-way-that
20 | from-whence-it-will-follow | the-inferior-extremity-of-the | at-the-beginning-of-the | in-the-united-states-and
21 | does-not-so-much-as | in-the-beginning-of-the | in-the-course-of-the | for-the-first-time-in
22 | does-not-in-the-least | at-the-head-of-his | the-greater-part-of-the | and-at-the-same-time
23 | but-i-'-m-afraid | i-do-not-know-what | president-of-the-united-states | will-not-be-able-to
24 | at-the-latter-end-of | the-inner-face-of-the | at-the-close-of-the | in-the-center-of-the
25 | at-the-end-of-the | to-be-found-in-the | in-the-name-of-the | in-the-case-of-the
26 | at-that-time-of-day | is-one-of-the-most | into-the-hands-of-the | i-do-not-know-how
27 | as-it-would-be-to | president-of-the-united-states | at-the-bottom-of-the | printed-in-the-united-states
28 | ancient-and-modern-stages-surveyed | in-the-form-of-a | i-do-not-want-to | by-the-end-of-the
29 | according-to-the-custom-of | the-posterior-face-of-the | was-a-member-of-the | you-do-not-have-to
30 | a-great-part-of-the | the-superior-extremity-of-the | on-the-side-of-the | you-do-not-have-to

Table S5. Top English 5-grams for arbitrary years.

Rank | 1700 | 1800 | 1900 | 2000
1 | … | … | … | …
2 | mil-six-cens | de-tous-les | il-y-a | à-la-fois
3 | il-y-a | plus-ou-moins | point-de-vue | il-y-a
4 | tout-ce-qui | il-y-a | de-la-loi | ne-sont-pas
5 | que-nous-avons | de-toutes-les | au-point-de | de-la-vie
6 | ce-qui-est | le-nom-de | de-la-société | point-de-vue
7 | de-ceux-qui | de-la-nature | plus-ou-moins | à-partir-de
8 | tout-ce-que | de-la-terre | ne-sont-pas | plus-en-plus
9 | duc-de-reims | ne-sont-pas | en-même-temps | la-mise-en
10 | de-toutes-les | tout-ce-qui | il-y-a | de-plus-en
11 | six-cens-quatre | il-y-a | de-tous-les | dans-le-cadre
12 | la-somme-des | de-la-même | à-peu-près | ce-qui-est
13 | ce-que-vous | la-plus-grande | la-loi-du | et-à-la
14 | ce-que-nous | que-nous-avons | que-nous-avons | de-la-société
15 | tous-les-jours | grand-nombre-de | à-la-fois | de-la-population
16 | la-somme-de | dans-tous-les | le-nom-de | il-y-a
17 | ville-de-reims | un-grand-nombre | de-la-vie | par-rapport-à
18 | de-la-nature | de-la-mer | et-à-la | en-tant-que
19 | pieces-de-comparaison | ce-qui-est | ce-qui-concerne | à-la-fin
20 | de-ce-que | et-à-la | de-la-ville | la-fin-de
21 | archevêque-duc-de | de-ceux-qui | le-droit-de | la-fin-du
22 | entre-les-mains | une-espèce-de | de-toutes-les | de-la-ville
23 | amour-de-dieu | dans-toutes-les | de-la-france | plus-ou-moins
24 | à-ceux-qui | la-plupart-des | tout-à-fait | de-la-france
25 | du-mois-de | à-tous-les | ce-qui-est | en-même-temps
26 | aprés-le-choc | que-dans-les | de-plus-en | de-tous-les
27 | que-ce-soit | au-lieu-de | plus-en-plus | de-la-loi
28 | que-vous-ne | de-la-vie | chemins-de-fer | la-plupart-des
29 | que-nous-ne | de-la-france | chemin-de-fer | de-la-république
30 | il-est-évident | on-ne-peut | la-fin-de | ce-qui-concerne

Table S8. Top French 3-grams for arbitrary years.
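Tables like the ones above are obtained by ranking N-grams by their match counts within a single year. A minimal sketch of that ranking step (the function name and the toy counts are illustrative, not the actual Google Books values):

```python
from collections import Counter

def top_ngrams_by_year(counts_by_year, year, top=30):
    """Return the `top` most frequent N-grams for a given year,
    ranked by raw match count (descending)."""
    counts = counts_by_year[year]
    return [ngram for ngram, _ in Counter(counts).most_common(top)]

# Toy per-year counts; words within an N-gram are joined by hyphens,
# as in the tables above.
toy = {
    2000: {"of-the": 900, "in-the": 700, "to-the": 500, "on-the": 400},
}
print(top_ngrams_by_year(toy, 2000, top=3))
# ['of-the', 'in-the', 'to-the']
```

Repeating this for each year of interest (e.g. 1700, 1800, 1900, 2000) and each value of N yields one column per year, as shown in Tables S2 to S8.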