Accurate Prediction of the Statistics of Repetitions in Random Sequences: A Case Study in Archaea Genomes

Régnier, Mireille; Chassignet, Philippe

doi:10.3389/fbioe.2016.00035

ORIGINAL RESEARCH article

Front. Bioeng. Biotechnol., 08 June 2016

Sec. Computational Genomics

Volume 4 - 2016 | https://doi.org/10.3389/fbioe.2016.00035

This article is part of the Research Topic: Repetitive structures in biological sequences: algorithms and applications

Mireille Régnier^1,2*

Philippe Chassignet²

¹Inria, Palaiseau, France
²LIX, Ecole Polytechnique, Palaiseau, France

Repetitive patterns in genomic sequences have great biological significance and also algorithmic implications. Analytic combinatorics allows one to derive formulas for the expected length of repetitions in a random sequence. Asymptotic results, which generalize previous work on a binary alphabet, are easily computable. Simulations on random sequences show their accuracy. As an application, the sample case of Archaea genomes illustrates how biological sequences may differ from random sequences.

1. Introduction

This paper provides combinatorial tools to distinguish biologically significant events from random repetitions in sequences. This is a key issue in several genomic problems as many repetitive structures can be found in genomes. One may cite microsatellites, retrotransposons, DNA transposons, long terminal repeats (LTR), long interspersed nuclear elements (LINE), ribosomal DNA, and short interspersed nuclear elements (SINE). In Treangen and Salzberg (2012), it is claimed that half of the genome consists of different types of repeats. Knowledge about the length of a maximal repeat is a key issue for assembly, notably the design of algorithms that rely upon de Bruijn graphs. In re-sequencing, it is a common assumption for aligners that any sequenced “read” should map to a single position in a genome: in the ideal case where no sequencing error occurs, this implies that the length of the reads is larger than the length of the maximal repetition. Average lengths of the repeats are given in Gu et al. (2000). Recently, heuristics have been proposed and implemented (Devillers and Schbath, 2012; Rizk et al., 2013; Chikhi and Medvedev, 2014).

A similar problem has been extensively studied: the prediction of the length of maximal common prefixes for words in a random set. Typical parameters are the background probability model, the size V of the alphabet, the length n of the sequence, and so on. Deviation from uniformity was studied for a uniform model as early as 1988 (Flajolet et al., 1988). A complexity index that captures the richness of the language is addressed in Janson et al. (2004). A distribution model, valid for binary alphabets and biased distributions, was introduced in Park et al. (2009), the so-called trie profile and extended to Patricia tries in Magner et al. (2014). The authors pointed out different “regimes” of randomness and a phase transition, by means of analytic combinatorics (Sedgewick and Flajolet, 2009). It was observed in Jacquet and Szpankowski (1994) that the average length of maximal common prefixes in a random set of n words is asymptotically equivalent to the average length of maximal repetitions in a random sequence of length n. Sets of words are considered below in the theoretical analysis. A comparison with the distribution of maximal repetitions in random sequences or real Archaea genomic sequences is presented in Section 3.

Our first goal is to extend the results of Park et al. (2009) to the case of a general V-alphabet, including the special case {A, C, G, T} where V is 4. A second goal is to assess the consistency of these results with random data and real genomic data in the finite range.

To achieve the first goal, we rely on an alternative, and simpler, probabilistic and combinatorial approach that is interesting per se. It avoids generating functions and the Poissonization–dePoissonization cycle that is used in Park et al. (2009) and it extends to non-binary alphabets. In that case, there is no closed formula for the asymptotic behavior. Nevertheless, Lagrange multipliers allow one to derive it as the solution of an equation that can be computed numerically.

Explicit and computable bounds for the profile of a random set of n words are provided. Three domains can be observed. A first domain is defined by a threshold k for the length, called the completion length: any prefix with a length smaller than this threshold occurs at least twice. This threshold is extremely stable over the data sets and it is highly predictable. A similar phenomenon was observed for a uniform model in Fagin et al. (1979a) and a biased model (Mahmoud, 1992; Park et al., 2009). For larger lengths, some prefixes occur only once. In a second domain, called the transition phase, the number of maximal common prefixes is sublinear in the size n of the sequence: increasing first, then decreasing slowly, and, finally, dropping rapidly. In the third domain, for a length larger than some extinction length, almost no common prefix of that length occurs. Despite the fact that these bounds are asymptotic, a good convergence is shown in practice for random texts when a second-order term is known.

Differences between the model and the observation are studied on the special case of Archaea genomes. A dependency on the GC-content, which is a characteristic of each genome, is exhibited. Regimes and transitions are studied on these genomic data and theoretical results are confirmed, with a drift in the values of transition thresholds. Notably, the length of the largest repetitions is much larger than expected. This difference between the model and the observation arises from the occurrences of long repeated regions.

Section 2 is devoted to Main Results, to be proved in Section 4. First, some notations are introduced; then, an algebraic expression for the expectation of the number of maximal common prefixes in a sequence is derived in Theorem 2.1. Second, this expression is split between two sums that are computable in practical ranges. Then, it is shown that a Large Deviation principle applies. It yields first and second order asymptotic terms, and oscillations, that are provided in Theorem 2.2. A comparison between exact, approximate, and asymptotic expressions is presented in Section 3.

2. Main Results

It is assumed throughout this study that sequences and words are randomly generated according to a biased Bernoulli model on an alphabet of size V. Let p₁, ⋯, p_V denote the probabilities of the V characters χ₁, ⋯ ,χ_V.

Definition 2.1. For any i in {1, ⋯ ,V }, one notes

β_{i} = \log \frac{1}{p_{i}} .

Additionally,

\begin{align} p_{\min} = \min \{p_{i}; 1 \leq i \leq V\} \quad \text{and} \quad α_{\min} = \frac{1}{\log \frac{1}{p_{\min}}} = \frac{1}{\max (β_{i})}; \end{align}

(1)

\begin{align} p_{\max} = \max \{p_{i}; 1 \leq i \leq V\} \quad \text{and} \quad α_{\max} = \frac{1}{\log \frac{1}{p_{\max}}} = \frac{1}{\min (β_{i})} . \end{align}

(2)

The two values min(β_i) and max(β_i) are different when the Bernoulli model is non-uniform.

2.1. Enumeration

Definition 2.2. Given U a set of words and an integer k, k ≥ 2, a unique k-mer in U is a word wχ_i of length k such that

1. w is a prefix of at least two words in U;

2. and wχ_i is a prefix of a single word.

By convention, a unique 1-mer is a character χ_i that is a prefix of a single word.

Definition 2.3. Let U be a set of n words.

For k ≥ 1, one denotes B(n, k) the number of unique k-mers in U.

One denotes μ(n, k − 1) the expectation of B(n, k) over all sets of n words.

Remark: It follows from Definition 2.2 that quantity B(n, k) is upper bounded by n. Observe that, for each random set U, it is the sum of a large number – V^k – of correlated random variables. Expectation μ(n, k) is studied below and compared in Section 3 with B(n, k + 1).
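Definitions 2.2 and 2.3 can be illustrated on a small set of words. The Python sketch below is only an illustration, not the suffix-tree procedure used in Section 3: the function names (`lcp`, `unique_kmer_profile`, `random_words`) are hypothetical, and words are assumed to be finite strings that are long enough to be told apart. It relies on the fact that, once the words are sorted, the longest prefix a word shares with any other word is shared with one of its two neighbours.

```python
import random
from collections import Counter


def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    m = 0
    for x, y in zip(a, b):
        if x != y:
            break
        m += 1
    return m


def unique_kmer_profile(words):
    """Return a Counter {k: B(n, k)} for a list of words (Definition 2.3).

    A word whose longest prefix shared with another word has length m
    contributes one unique (m + 1)-mer.
    """
    ws = sorted(words)
    profile = Counter()
    for i, w in enumerate(ws):
        left = lcp(w, ws[i - 1]) if i > 0 else 0
        right = lcp(w, ws[i + 1]) if i + 1 < len(ws) else 0
        depth = 1 + max(left, right)
        if depth <= len(w):  # skip words that are not separated within their length
            profile[depth] += 1
    return profile


def random_words(n, alphabet, probs, length, seed=0):
    """Draw n words of a fixed length under a Bernoulli model (illustrative helper)."""
    rng = random.Random(seed)
    return ["".join(rng.choices(alphabet, weights=probs, k=length)) for _ in range(n)]


if __name__ == "__main__":
    words = random_words(1000, "ab", (0.7, 0.3), length=64)
    for k, count in sorted(unique_kmer_profile(words).items()):
        print(k, count)
```

Each word contributes at most one unique k-mer, so the counts sum to at most n, in agreement with the remark above.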

Profiles of repetitions can be expressed as a combinatorial sum.

Theorem 2.1. Given a length k, the expectation μ(n, k) satisfies:

\begin{align} μ (n, k) = n \sum_{k_{1} + \dots + k_{V} = k} \binom{k}{k_{1}, \dots, k_{V}} ϕ (k_{1}, \dots, k_{V}) ψ_{n} (k_{1}, \dots, k_{V}) \end{align}

(3)

where

\begin{align} ϕ (k_{1}, \dots, k_{V}) = p_{1}^{k_{1}} \dots p_{V}^{k_{V}} \end{align}

(4)

\begin{align} ψ_{n} (k_{1}, \dots, k_{V}) = \sum_{i = 1}^{V} p_{i} [{(1 - ϕ (k_{1}, \dots, k_{V}) p_{i})}^{n - 1} - {(1 - ϕ (k_{1}, \dots, k_{V}))}^{n - 1}] . \end{align}

(5)

Proof. A word wχ_i is a unique (k + 1)-mer iff (i) w has length k and is the prefix of at least two words, including wχ_i; (ii) wχ_i is not repeated.

Event (i) has probability

\begin{array}{l} n ϕ (k_{1}, \dots, k_{V}) p_{i} [1 - {(1 - ϕ (k_{1}, \dots, k_{V}))}^{n - 1}] . \end{array}

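Within event (i), the event that wχ_i is repeated, which is the complement of (ii) and a sub-event of (i), has probability

\begin{array}{l} n ϕ (k_{1}, \dots, k_{V}) p_{i} [1 - {(1 - ϕ (k_{1}, \dots, k_{V}) p_{i})}^{n - 1}] . \end{array}

Subtracting this expression from the previous one, then summing over the characters χ_i and over the words w of length k grouped by character distribution (k₁, ⋯ , k_V), yields (3).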

2.2. A Combinatorial Expression

Definition 2.4. Given a k-mer w, let α denote $\frac{k}{\log n}$ and k_i denote the number of occurrences of character χ_i in w. The objective function is

ρ (k_{1}, \dots, k_{V}) = \sum_{i = 1}^{V} \frac{k_{i}}{k} β_{i} - \frac{1}{α} .

(6)

The character distribution (k₁, ⋯ , k_V) of a k-mer may be viewed as barycentric coordinates for a point $β (k_{1}, \dots, k_{V}) = \sum_{i = 1}^{V} \frac{k_{i}}{k} β_{i}$ that lies in $[\min (β_{i}); \max (β_{i})] = [\frac{1}{α_{\max}}; \frac{1}{α_{\min}}]$ . The order of β points on that interval allows for a classification of k-mers that is a key to this study.

Definition 2.5. A k-mer w is said to be

• a common k-mer if ρ(k₁,…, k_V) < 0;

• a transition k-mer if ρ(k₁, ⋯ , k_V) ≥ 0 and its ancestor is a common k-mer;

• a rare k-mer, otherwise.

Remark: If ρ(k₁, ⋯ , k_V) = 0, the condition on the ancestor is trivially satisfied.
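The classification of Definition 2.5 can be sketched in a few lines of Python. Two assumptions are made here and are not stated in the text: the "ancestor" of a k-mer is taken to be its prefix of length k − 1, and the sign of ρ is tested through the equivalent condition Σ k_i β_i < log n (obtained by multiplying (6) by k, with α = k / log n). The function name `classify_kmer` is hypothetical.

```python
import math


def classify_kmer(word, probs, n):
    """Classify a word as 'common', 'transition' or 'rare' (Definition 2.5).

    probs maps each character to its Bernoulli probability; n is the
    number of words in the set.
    """
    log_n = math.log(n)

    def cost(w):
        # sum_i k_i * beta_i, i.e. -log of the probability of w
        return sum(math.log(1.0 / probs[c]) for c in w)

    if cost(word) < log_n:       # rho < 0
        return "common"
    if cost(word[:-1]) < log_n:  # rho >= 0 and the ancestor is common (assumed: ancestor = (k-1)-prefix)
        return "transition"
    return "rare"


if __name__ == "__main__":
    probs = {"a": 0.7, "b": 0.3}
    for w in ("aaaa", "bbbb", "bbbbb"):
        print(w, classify_kmer(w, probs, n=100))
```

With these conventions, a k-mer is common exactly when its expected number of occurrences as a prefix, nϕ(k₁, ⋯ , k_V), exceeds 1.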

Definition 2.6. Given a set U of n words and an integer k, let D_k(n) denote the set of character distributions (k₁, ⋯ , k_V) for rare and transition k-mers. Let E_k(n) denote the set of character distributions for common k-mers.

The set D_k(n) is empty if k < α_min log n and is the set of all character distributions (k₁, ⋯ , k_V) if k > α_max log n. The computation of (3) is split between the two sets D_k(n) and E_k(n). Computations show that the main contribution arises from transition k-mers. A probabilistic interpretation is discussed in Section 2.4.

Notation: Let S(k) and T(k) be

\begin{align} S (k) = n \sum_{D_{k} (n)} \binom{k}{k_{1}, \dots, k_{V}} ϕ (k_{1}, \dots, k_{V}) ψ_{n} (k_{1}, \dots, k_{V}); \end{align}

(7)

\begin{align} T (k) = n \sum_{E_{k} (n)} \binom{k}{k_{1}, \dots, k_{V}} ϕ (k_{1}, \dots, k_{V}) ψ_{n} (k_{1}, \dots, k_{V}) . \end{align}

(8)

So μ(n, k) rewrites

μ (n, k) = S (k) + T (k) .

(9)

These sums S(k) and T(k) can be efficiently computed for moderate k, up to a few hundred, approximately. In practice, α_max log n is below this threshold for the sizes of actual genomes and for their ordinary GC content value. The simulations in Section 3 show that this estimation is rather tight. Behavior and asymptotic estimates are derived and discussed in the next section.
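A direct evaluation of the sums (7) and (8), and hence of μ(n, k) in (3), can be sketched as follows. This is a plain enumeration over character distributions, written under a few assumptions: k ≥ 1, every probability is strictly below 1, and a distribution is assigned to E_k(n) when Σ k_i β_i < log n (that is, ρ < 0). The names `compositions` and `split_expectation` are hypothetical. The difference of powers in ψ_n is computed through logarithms and `expm1` to limit cancellation when nϕ is very small.

```python
import math


def compositions(k, parts):
    """Yield all tuples of `parts` non-negative integers that sum to k."""
    if parts == 1:
        yield (k,)
        return
    for first in range(k + 1):
        for rest in compositions(k - first, parts - 1):
            yield (first,) + rest


def split_expectation(n, k, probs):
    """Return (S, T) as in equations (7)-(8); mu(n, k) is S + T."""
    log_n = math.log(n)
    betas = [math.log(1.0 / p) for p in probs]
    s_sum = t_sum = 0.0
    for ks in compositions(k, len(probs)):
        cost = sum(ki * b for ki, b in zip(ks, betas))  # -log phi
        log_multinomial = math.lgamma(k + 1) - sum(math.lgamma(ki + 1) for ki in ks)
        phi = math.exp(-cost)
        b_log = (n - 1) * math.log1p(-phi)              # log (1 - phi)^(n-1)
        psi = 0.0
        for p in probs:
            a_log = (n - 1) * math.log1p(-phi * p)      # log (1 - phi p)^(n-1)
            # exp(a_log) - exp(b_log), written to avoid cancellation
            psi += p * math.exp(a_log) * (-math.expm1(b_log - a_log))
        term = n * math.exp(log_multinomial - cost) * psi
        if cost < log_n:   # rho < 0: common k-mers, set E_k(n)
            t_sum += term
        else:              # rho >= 0: transition and rare k-mers, set D_k(n)
            s_sum += term
    return s_sum, t_sum


if __name__ == "__main__":
    n, probs = 5_000_000, (0.7, 0.3)
    for k in (15, 25, 40, 61):
        s, t = split_expectation(n, k, probs)
        print(k, s, t, s + t)
```

The number of terms grows like k^(V−1), which is why the enumeration stays cheap for binary and quaternary alphabets at the lengths considered here.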

2.3. Asymptotic Estimates

In this section, asymptotic estimates for (3) are derived. First, some characteristic functions are introduced. Then, it is observed that a Large Deviation Principle applies for the combinatorial sums to be computed and asymptotics for the dominating term follow. Amortized terms are also computed. It is shown in Section 3 that this second-order term cannot be neglected in the finite range.

2.3.1. Notations

For general alphabets, asymptotic behavior is a function of the solution of an equation and depends on domains whose bounds are defined below.

Definition 2.7. Let (p_i)_1≤i≤V be a Bernoulli probability distribution. Let σ₂ denote $\sum_{i = 1}^{V} p_{i}^{2}$ .

The fundamental ratio, noted $\tilde{α}$ , is ${(\sum_{i} p_{i} \log \frac{1}{p_{i}})}^{- 1}$ .

The transition ratio, noted $\bar{α}$ , is $σ_{2} {(\sum_{i} p_{i}^{2} \log \frac{1}{p_{i}})}^{- 1}$ .

The extinction ratio, noted α_ext, is $\frac{2}{\log \frac{1}{σ_{2}}}$ .
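The ratios of Definitions 2.1 and 2.7 are closed-form functions of the probabilities, so they are easy to tabulate. The sketch below does so in Python (natural logarithms are assumed, as in the value log n = 15.4249 quoted in Section 3.1); the function name `thresholds` and the dictionary keys are illustrative choices, not notation from the paper.

```python
import math


def thresholds(probs):
    """Return the ratios alpha_min, alpha_tilde, alpha_bar, alpha_max, alpha_ext."""
    betas = [math.log(1.0 / p) for p in probs]
    sigma2 = sum(p * p for p in probs)
    return {
        "alpha_min": 1.0 / max(betas),
        "alpha_tilde": 1.0 / sum(p * b for p, b in zip(probs, betas)),
        "alpha_bar": sigma2 / sum(p * p * b for p, b in zip(probs, betas)),
        "alpha_max": 1.0 / min(betas),
        "alpha_ext": 2.0 / math.log(1.0 / sigma2),
    }


if __name__ == "__main__":
    n = 5_000_000
    for name, alpha in thresholds((0.7, 0.3)).items():
        # Ratio and corresponding length alpha * log n
        print(f"{name}: alpha = {alpha:.4f}, length = {alpha * math.log(n):.2f}")
```

For the binary model with p_max = 0.7 and n = 5,000,000, the extinction length computed this way falls between 56 and 57, as stated in Section 3.1.1.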

Definition 2.8. Let α be a real value in [α_min, α_max]. Let τ_α be the unique real root of the equation

\frac{1}{α} = \frac{\sum_{i = 1}^{V} β_{i} e^{- β_{i} τ}}{\sum_{i = 1}^{V} e^{- β_{i} τ}}

(10)

Let ψ be the function defined in [α_min, α_ext] as

ψ (α) = \begin{cases} τ_{α} + α \log (\sum_{i = 1}^{V} e^{- β_{i} τ_{α}}), & α_{\min} \leq α \leq \bar{α}; \\ 2 - α \log \frac{1}{σ_{2}}, & \bar{α} \leq α . \end{cases}

Proposition 2.1. The following property holds

α_{\min} \leq \tilde{α} \leq \bar{α} \leq α_{\max} \leq α_{ext} .

Function ψ increases on $[α_{\min}, \tilde{α}]$ and decreases on $[\tilde{α}, \infty]$ . It satisfies

ψ (α_{\min}) = ψ (α_{ext}) = 0 \quad \text{and} \quad ψ (\tilde{α}) = 1 .

(11)

Remark: Uniqueness of τ_α is shown in Section 4.2. As $τ_{\bar{α}} = 2$ , ψ is continuous at α = $\bar{α}$ , with $ψ (\bar{α}) = 2 - \bar{α} \log \frac{1}{σ_{2}}$ .
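Equation (10) has no closed-form solution for a general alphabet, but it is easy to solve numerically. The right-hand side is a weighted mean of the β_i with weights e^{−β_i τ}; it decreases from max(β_i) to min(β_i) as τ grows, so a bisection is enough. The Python sketch below is one possible way to do it, assuming a non-uniform model and α strictly between α_min and α_max; the function names and the bracketing interval [−200, 200] are arbitrary choices, not part of the paper.

```python
import math


def weighted_mean(betas, tau):
    """Right-hand side of equation (10), shifted to avoid overflow."""
    exps = [-b * tau for b in betas]
    m = max(exps)
    w = [math.exp(e - m) for e in exps]
    return sum(b * wi for b, wi in zip(betas, w)) / sum(w)


def log_sum_exp(betas, tau):
    """log(sum_i exp(-beta_i * tau)), computed stably."""
    exps = [-b * tau for b in betas]
    m = max(exps)
    return m + math.log(sum(math.exp(e - m) for e in exps))


def solve_tau(alpha, betas, lo=-200.0, hi=200.0, iters=200):
    """Bisection for the root tau_alpha of equation (10)."""
    target = 1.0 / alpha
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weighted_mean(betas, mid) > target:
            lo = mid  # mean still too large: tau must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


def psi(alpha, probs):
    """Function psi of Definition 2.8."""
    betas = [math.log(1.0 / p) for p in probs]
    sigma2 = sum(p * p for p in probs)
    alpha_bar = sigma2 / sum(p * p * b for p, b in zip(probs, betas))
    if alpha <= alpha_bar:
        tau = solve_tau(alpha, betas)
        return tau + alpha * log_sum_exp(betas, tau)
    return 2.0 - alpha * math.log(1.0 / sigma2)


if __name__ == "__main__":
    probs = (0.1668, 0.3332, 0.3332, 0.1668)
    for alpha in (0.6, 0.7, 0.8, 0.9, 1.2, 1.5):
        print(alpha, psi(alpha, probs))
```

The two branches agree at the transition ratio, where the root of (10) is τ = 2, so the function is continuous there.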

2.3.2. Asymptotic Results

Theorem 2.2. Given a length α log n, when n tends to ∞ the ratio $\frac{\log μ (n, α \log n)}{\log n}$ satisfies:

\begin{align} 0 \leq α \leq α_{\min} \ \text{or} \ α_{ext} \leq α : \frac{\log μ (n, α \log n)}{\log n} \leq 0; \end{align}

(12)

\begin{align} α_{\min} \leq α \leq α_{ext} : \frac{\log μ (n, α \log n)}{\log n} \sim ψ (α) . \end{align}

(13)

Moreover, let ξ be the function defined in [α_min, α_ext] as $ξ (α) = \frac{\log μ (n, α \log n)}{\log n} - ψ (α)$ . It satisfies

\begin{align} α_{\min} \leq α \leq \bar{α} : ξ (α) \sim - \frac{V - 1}{2} \frac{\log (α \log n)}{\log n}; \end{align}

(14)

\begin{align} \bar{α} \leq α \leq α_{ext} : ξ (α) \sim \frac{\log (1 - σ_{2})}{\log n} . \end{align}

(15)

Proof. The key to the proof when α ranges in [α_min, α_max] is that ψ_n(k₁, ⋯ k_V) is maximal when ρ(k₁, ⋯ k_V) is close to 0. Sum T(k) satisfies a Large Deviation Principle.

\frac{\log T (\tilde{k})}{k} \sim \max \{- \sum_{i = 1}^{V} \frac{k_{i}}{k} \log \frac{k_{i}}{k}; ρ (k_{1}, \dots, k_{V}) = 0\} .

(16)

The maximization problem rewrites as

\max \{\sum_{i = 1}^{V} θ_{i} \log \frac{1}{θ_{i}}; \sum_{i = 1}^{V} θ_{i} = 1; \sum_{i = 1}^{V} β_{i} θ_{i} = \frac{1}{α}; 0 \leq θ_{i} \leq 1\}

(17)

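The maximum value is $\frac{τ_{α}}{α} + \log (\sum_{i = 1}^{V} e^{- β_{i} τ_{α}})$ , that is $\frac{1}{α}$ times the expression $τ_{α} + α \log (\sum_{i = 1}^{V} e^{- β_{i} τ_{α}})$ of Definition 2.8; it is reached for the V-tuple ${(θ_{i} = \frac{e^{- β_{i} τ_{α}}}{\sum_{i = 1}^{V} e^{- β_{i} τ_{α}}})}_{1 \leq i \leq V}$ .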

S(k) again satisfies a Large Deviation Principle when α < $\bar{α}$ , which yields the asymptotic result in this range. For larger α, S(k) is approximately $(1 - σ_{2}) n^{2 - α \log \frac{1}{σ_{2}}}$ , which dominates T(k).

Details for the proof, including the short and long lengths, are provided in Section 4.

Remark: The discussion will depend on the ratio $α = \frac{k}{\log n}$ . Possible values for α range over a discrete set as they are constrained to be the ratio of an integer to the log of an integer. An interesting property is that, for any real α, the set T = {n ∈ N; α log n ∈ N} is either empty or infinite. Indeed, when T is non-empty, it contains all values n(α)^p where n(α) denotes the minimum value of T. It is beyond the scope of this paper to establish the number of other possible solutions.

2.3.3. Domains

Different domains arise from this Theorem, which were observed in Park et al. (2009). Equalities ψ(α_min) = 0 and $ψ (\bar{α}) = 2 - \bar{α} \log \frac{1}{σ_{2}}$ show that there is a continuity between domains.

When α lies inside the domain [α_min, α_ext], the ratio $\frac{\log μ (n, α \log n)}{\log n}$ is positive and parameters μ(n, α log n) are sub-linear in the size n of the text: some k-mers – mostly transition k-mers – are unique k-mers. Observe that the maximum value for ψ(α) is 1. When the Bernoulli model is uniform, this central domain is empty.

When the length is smaller than the completion length α_min log n or greater than the extinction length α_ext log n, the ratio $\frac{\log μ (n, α \log n)}{\log n}$ is negative.

2.3.4. Oscillations

Parameters (k₁, ⋯ , k_V) in the combinatorial sums are integers. As the optimum values (kθ_i)_{1≤i≤V} may not be integers, the practical maximum is a close point on the lattice (k₁, ⋯ , k_V). The difference introduces a multiplicative factor whose logarithm ranges in $[- \log \frac{p_{\max}}{p_{\min}}, \log \frac{p_{\max}}{p_{\min}}]$ . This leads to a small oscillation of log μ(n, k). For large n, this contribution to $\frac{\log μ (n, k)}{\log n}$ becomes negligible. As mentioned above, the set of lengths n that are admissible for a given α is very sparse. Nevertheless, an approximate value may be used: for instance, for an integer k′, the ratio $\frac{k^{'}}{\log ⌈ n {(α)}^{\frac{k^{'}}{k}} ⌉}$ is very close to α. This oscillation phenomenon was first observed in Nicodème (2005).

2.3.5. Binary Alphabets

Results for binary alphabets in Park et al. (2009) readily follow from Theorem 2.2. A rewriting of ψ leads to the alternative expression (18). This explicit expression points out the dependency on the distances to α_min and α_max, and the behavior around these points.

Corollary 2.1. Assume that the alphabet is binary. Then

ψ (α) = \frac{α}{\log \frac{p_{\max}}{p_{\min}}} \log [{s_{α}}^{\frac{1}{α} - \frac{1}{α_{\min}}} + {s_{α}}^{\frac{1}{α} - \frac{1}{α_{\max}}}]

(18)

where

s_{α} = \frac{α_{\min}}{α_{\max}} \cdot \frac{α - α_{\min}}{α_{\max} - α} .

(19)

A similar result holds for DNA sequences when the alphabet is 4-ary and the probability distribution satisfies p_A = p_T and p_C = p_G. Such a distribution is defined by its GC-content p_G + p_C.

2.4. A Probabilistic Interpretation

The main contribution to μ(n, k) arises from k-mers with an objective function close to 0, i.e., transition k-mers. Such k-mers exist in the transition phase [α_min log n, α_max log n] where they coexist with rare or common k-mers. Observe that this phase shrinks to a single point when the Bernoulli model is uniform, as p_min = p_max and α_min = α_max. In that case, most unique k-mers are concentrated on the two lengths ⌊α_min log n⌋ and ⌈α_min log n⌉, as observed initially in Fagin et al. (1979b).

Let k be some integer in the transition phase. First, the relative contribution of S(k) and T(k) to μ(n, k) varies with the length k. For lengths close to α_min log n, most words are common and T(k) dominates S(k). When k increases, the proportion of common words decreases and the relative contribution of T(k) decreases.

Second, the dominating term in μ(n, k) arises from transition k-mers. Let w be a word of length k, the character distribution in w be (k₁, ⋯ , k_V) and χ_i be some character. The number of words that admit w or wχ_i as a prefix fluctuates around the expectations nϕ(k₁, ⋯ , k_V) and nϕ(k₁, ⋯ , k_V)p_i, respectively. On the one hand, when word wχ_i is a rare word, nϕ(k₁, ⋯ , k_V) is less than 1. The smaller nϕ(k₁, ⋯ , k_V) is, the less likely it is that w actually occurs at least twice, and the smaller the contribution of wχ_i to S(k), and to μ(n, k), is. On the other hand, let wχ_i be a common (k + 1)-mer; then w is a common k-mer and nϕ(k₁, ⋯ , k_V) is greater than 1. The larger nϕ(k₁, ⋯ , k_V) is, the more likely the word wχ_i is repeated, and the smaller the contribution to T(k), and to μ(n, k), is.

For a short length, i.e., k smaller than the completion length k_min = α_min log n, all words are common. In a given sequence, most k-mers occur at least twice and there are (almost) no unique k-mers.

For a large length k, i.e., k greater than k_max, all words are rare. Nevertheless the number of unique k-mers remains sublinear in n in the range [α_max log n, α_ext log n]: the sum of small contributions arising from a large number of possible words is significant.

A folk theorem (Szpankowski, 2001; Jacquet and Szpankowski, 2015) claims that the objective function is concentrated around $\frac{1}{\tilde{α}} - \frac{1}{α}$ . Consequently, when α = $\tilde{α}$ , most k-mers are transition k-mers and the exponent, the ψ function, is maximal.

3. Experiments and Analysis

Simulations are presented for random and real data. For each simulation, a suffix tree (Ukkonen, 1995) is built, where each leaf represents a unique k-mer. For random cases, Ukkonen’s insertion step is iterated until a tree with exactly n leaves is built. This requires n + k_ins insertions of symbols, where k_ins > 0 is relatively small (a few dozen in practice for the values of n considered). One can observe that the event of having n leaves after n + k − 1 insertions corresponds to the fact that the trailing k-mer is unique in the sequence of length n + k − 1.

Even if a statistical bias exists with respect to the case of a set of n random words analyzed in the previous sections, this bias, for the respective values of k and n, is below the numeric precision used for the tables below.

Then, one simulation related to the case of a set of n random words requires the generation of on the order of n random symbols from a small alphabet, following a Bernoulli scheme. For this range of n, and even in the case of a hundred consecutive simulations, this corresponds to a regular use of a common random number generator (Knuth, 1998).

A first set of simulations deals with the case of random sequences over a binary alphabet, since the results can be compared with previous work. A second set addresses the case of random sequences over a quaternary alphabet {A, C, G, T} with a constrained distribution such that probabilities p_A ≈ p_T and p_C ≈ p_G, as is the case for DNA sequences (where the sum p_C + p_G is also known as the GC-content). Results on such random sequences are then compared with the sample biological sequence of an Archaea (Haloferax volcanii).

An implementation with a suffix array (Manber and Myers, 1993) allows for a compact representation and an efficient counting (Beller et al., 2013).
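The profile of a single sequence can be reproduced at small scale without a suffix tree. The Python sketch below swaps in a naive suffix sort (quadratic memory and time, so only suitable for short sequences) in place of the Ukkonen construction or the compact suffix-array implementation mentioned above. The function names are hypothetical. For each position, it records the length of the shortest substring starting there that occurs only once; suffixes that reach the end of the text before becoming unique are skipped, which is the boundary effect handled in the paper by the extra k_ins insertions.

```python
import random
from collections import Counter


def bernoulli_sequence(length, alphabet, probs, seed=0):
    """Random sequence under a Bernoulli model (illustrative helper)."""
    rng = random.Random(seed)
    return "".join(rng.choices(alphabet, weights=probs, k=length))


def repetition_profile(text):
    """Counter {k: number of positions whose shortest unique substring has length k}."""
    n = len(text)
    # Naive suffix array: sort start positions by the suffix they begin.
    sa = sorted(range(n), key=lambda i: text[i:])

    def lcp(i, j):
        m = 0
        while i + m < n and j + m < n and text[i + m] == text[j + m]:
            m += 1
        return m

    profile = Counter()
    for r, i in enumerate(sa):
        left = lcp(i, sa[r - 1]) if r > 0 else 0
        right = lcp(i, sa[r + 1]) if r + 1 < n else 0
        depth = 1 + max(left, right)
        if i + depth <= n:  # otherwise the suffix ends before it becomes unique
            profile[depth] += 1
    return profile


if __name__ == "__main__":
    seq = bernoulli_sequence(2000, "ACGT", (0.1668, 0.3332, 0.3332, 0.1668))
    for k, count in sorted(repetition_profile(seq).items()):
        print(k, count)
```

The largest key minus one is the length of the longest repetition in the sequence, the quantity referred to as the extinction level in the experiments.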

3.1. Random data

A hundred binary sequences were randomly generated. The number of leaves in each tree was fixed to n = 5000000 and the Bernoulli parameter was p_max = 0.7000. Therefore, p_min = 0.3000, $\tilde{p} = 0.5429$ , and log n = 15.4249. The thresholds for α and the corresponding lengths α log n are:

3.1.1. Statistical Behavior on Random Sets

Throughout experiments, every sample profile for a given sequence fluctuates very little around the expectation.

Table 1 provides experimental results averaged over a hundred binary sequences. Short lengths with no observed unique k-mer are omitted. Column 2 gives the mean of B(k + 1), i.e., the mean number of observed leaves at depth k + 1, over the set of a hundred simulations. Columns 3 to 5 give the computed values for S(k), T(k), and μ(k), using equations (7–9).

Table 1. Mean profile for 100 random binary sequences.

The actual number of leaves B(n, k + 1) is very close to the average value μ(n, k), and simulations show that this is the general case when (only) a hundred simulations are performed: μ(n, k) is a very good prediction.

Observed lengths of extinction also show very little variation. In the array below, each column gives n_k, the number of sequences out of the one hundred sample set for which the longest repetition had length k.

Distribution of the extinction level for 100 random binary sequences. p_max is 0.7.

In the binary case, the predicted extinction length is between 56 and 57. It is noticeable that, in most cases, the observed depth is slightly smaller than this value. In Table 1, value 0.04 for μ(n, 61) means that one expects a total of four leaves at depth 60 over one hundred sequences. In that run, a total of 8 was observed.

3.1.2. Quality of Estimates

1. Tightness of the asymptotic estimates. Asymptotic estimates (13) given in Column 7 significantly overestimate the observed values in Column 6, which are computed directly from Column 2 and n. A first conclusion is that first-order asymptotics provide a poor prediction: the next term is $O (\frac{1}{\log n})$ , which goes slowly to 0.

2. Tightness of the second-order asymptotics. The second-order term ξ(α) ensures a much better approximation in Column 8.

3. Growth of asymptotic estimates. Observed values increase with length until $k = \tilde{k}$ and then decrease. This is consistent with the variation of asymptotic values ψ(α).

3.1.3. Dependency to Probability Bias

Thresholds were computed for a given sequence length n and various probabilities. The more p_max departs from 0.5, the value for the uniform model, the larger the extinction length is. The completion length, k_min, slightly decreases, while the extinction length significantly increases. Nevertheless, this effect is limited when the largest probability p_max remains in the range [0.5;0.7].

Dependency of thresholds to p_max for binary alphabets, n = 5,000,000.

3.2. Long Repetitions in Archaea Genomes

The experimental data set is the sequence from Haloferax volcanii DS2 chromosome, complete genome (Hartman et al., 2010). The alphabet is quaternary. Profile results are shown in Table 2.

Table 2. Profile for the sequence from Haloferax volcanii DS2 chromosome, complete genome.

Sequence length is n = 2847757. The observed symbol frequencies are p_A = 0.1655; p_C = 0.3334; p_G = 0.3330; p_T = 0.1681. Therefore, observed GC-content is 0.6664. Parameters for an approximate degenerated quaternary model are p_A = p_T = p_min = 0.1668; p_C = p_G = p_max = 0.3332; $\tilde{p} = 0.2645$ ; and log n = 14.8620. The thresholds for the domain are

Statistics on one hundred random sequences with same parameters are shown in Table 3. GC-content is 0.6664. Extinction level is provided in Table 4. Observe first a good match between the observed values, the predicted values for μ(n, k), and the asymptotic values for random data. As shown for binary alphabets, the observed extinction level for random sequences departs very little from the predicted k_ext level.

Table 3. Mean profile for 100 random degenerated quaternary sequences.

Table 4. Distribution of the extinction level for 100 random degenerated quaternary sequences.

Numerous differences with random data can be observed on real genomes.

Interestingly, the behavior for short lengths and in the transition phase is similar to the random behavior. Observation and prediction have the same order of magnitude. In particular, the number of unique k-mers is maximum for length $\tilde{k}$ where observation and prediction coincide. For a real genome and a length k smaller than k_min, observed B(n, k + 1) is larger than predicted μ(n, k). This indicates, at a level k + 1 where completion is expected, more leaves in the real trie and more missing words at level k + 2. Simultaneously, fewer internal nodes occur at level k + 1 because the total sum is constant and equal to V^{k+1}.

The effect of (non-random) repetitions is more noticeable in the decreasing domain. First, the number of unique k-mers decreases much more slowly than expected for lengths larger than k_max. A significant gap can be observed around extinction level k_ext. The decrease rate, which was around 0.02–0.04, drops to 0.007 and then becomes even lower. Finally, the extinction level is much larger than the predicted value 23: the largest repetition is 1395 bp long.

To evaluate the contribution of long repetitions, one may erase the longest ones. When a word w is repeated, any proper suffix of w is also repeated. Consequently, once the longest repeated word is erased, one unique k-mer (only) disappears for each length larger than the length of the second largest subsequence (here, 935). The profile remains far from the random profile. This observation is still true if the 10 longest subsequences are erased.

4. Combinatorial and Analytic Derivation

4.1. Lagrange Multipliers

The method of Lagrange multipliers allows one to maximize an expression under constraints. To compute (17), one sets

\begin{align} F = \sum_{i = 1}^{V} θ_{i} \log θ_{i}; \end{align}

(20)

\begin{align} G = \sum_{i = 1}^{V} θ_{i}; \end{align}

(21)

\begin{align} H = \sum_{i = 1}^{V} θ_{i} β_{i} . \end{align}

(22)

Two constraints are given:

G = 1 and H = \frac{1}{α} .

An intermediary function ϕ_α(θ₁, ⋯ , θ_V) is defined

ϕ_{α} = F + λ_{α} G + τ_{α} H

(23)

In order to maximize ϕ_α under these two constraints, the function ϕ_α is differentiated with respect to each variable θ_i. This yields V equations

1 + \log θ_{i} + λ_{α} + τ_{α} β_{i} = 0 .

(24)

Two indices i_min and i_max are chosen that satisfy $β_{i_{\min}} \neq β_{i_{\max}}$ . For instance

\begin{array}{l} β_{i_{\min}} = \min {(β_{i})}_{1 \leq i \leq V} = \log \frac{1}{p_{\max}}; \\ β_{i_{\max}} = \max {(β_{i})}_{1 \leq i \leq V} = \log \frac{1}{p_{\min}} . \end{array}

Solving equation (24) with indices i_min and i_max yields

\begin{align} τ_{α} & = \frac{\log θ_{i_{\min}} - \log θ_{i_{\max}}}{β_{i_{\max}} - β_{i_{\min}}} = \log {\left(\frac{θ_{i_{\min}}}{θ_{i_{\max}}}\right)}^{\frac{1}{β_{i_{\max}} - β_{i_{\min}}}}; \\ 1 + λ_{α} & = \frac{β_{i_{\min}} \log θ_{i_{\max}} - β_{i_{\max}} \log θ_{i_{\min}}}{β_{i_{\max}} - β_{i_{\min}}} . \end{align}

Remaining equations rewrite:

\log θ_{i} = \log θ_{i_{\min}} + τ_{α} (β_{i_{\min}} - β_{i}) .

(25)

Using the constraint $\sum_{i = 1}^{V} θ_{i} = 1$ yields

θ_{i_{\min}} e^{β_{i_{\min}} τ_{α}} \sum_{i = 1}^{V} e^{- β_{i} τ_{α}} = 1,

and an expression for $θ_{i_{\min}}$ follows. Therefore Equation 25 rewrites:

θ_{i} = \frac{e^{- β_{i} τ_{α}}}{\sum_{j = 1}^{V} e^{- β_{j} τ_{α}}} .

(26)

Finally, Equation $\sum_{i = 1}^{V} θ_{i} β_{i} = \frac{1}{α}$ yields equation (10).

\frac{1}{α} = \frac{\sum_{i = 1}^{V} β_{i} e^{- β_{i} τ_{α}}}{\sum_{i = 1}^{V} e^{- β_{i} τ_{α}}} .

For this V-tuple

\begin{array}{l} \sum_{i = 1}^{V} θ_{i} \log θ_{i} = - (\sum_{i = 1}^{V} θ_{i} β_{i}) τ_{α} - (\sum_{i = 1}^{V} θ_{i}) \log (\sum_{i = 1}^{V} e^{- β_{i} τ_{α}}) = - \frac{τ_{α}}{α} - \log (\sum_{i = 1}^{V} e^{- β_{i} τ_{α}}) . \end{array}

4.2. Approximation Orders

Differentiating the RHS of (10) with respect to τ yields $- \frac{\sum_{i < j} {(β_{i} - β_{j})}^{2} e^{- (β_{i} + β_{j}) τ}}{{(\sum_{i} e^{- β_{i} τ})}^{2}}$ , which is negative when the β_i are not all equal. Therefore, the RHS is strictly decreasing in τ and, for any α, the solution to (10) is unique. Moreover, τ_α increases with α. Let

\begin{align} ψ_{1} (α) & = τ_{α} + α \log (\sum_{i = 1}^{V} e^{- β_{i} τ_{α}}); \end{align}

(27)

\begin{align} ψ_{2} (α) & = 2 - α \log \frac{1}{σ_{2}} . \end{align}

(28)

Notably, the solutions τ_α of (10) associated with the four increasing values of α, $(α_{\min}, \tilde{α}, \bar{α}, α_{\max})$ , are (−∞, 1, 2, +∞). Computing ψ for these values yields (11) and the equality $ψ_{1} (\bar{α}) = ψ_{2} (\bar{α})$ .

Differentiating both expressions yields

\begin{align} \frac{\partial ψ_{1}}{\partial α} (α) & = \log (\sum_{i = 1}^{V} e^{- β_{i} τ_{α}}); \end{align}

(29)

\begin{align} \frac{\partial ψ_{1}}{\partial α} (α) - \frac{\partial ψ_{2}}{\partial α} (α) & = \log (\frac{1}{σ_{2}} \sum_{i = 1}^{V} e^{- β_{i} τ_{α}}) . \end{align}

(30)

Both derivatives are monotone functions of τ_α. In equation (30), derivative is 0 when $α = \bar{α}$ . Therefore, ψ is the maximum of the two values ψ₁ and ψ₂ over the interval [α_min, α_max]. The former equation is 0 if α = $\tilde{α}$ . Therefore, ψ is maximum when α = $\tilde{α}$ .

4.3. Approximations

4.3.1. Short Lengths

Assume that k ≤ α_min log n. Each term ϕ(k₁, ⋯ , k_V) is lower bounded by $p_{\min}^{k} = n^{α \log p_{\min}} = n^{- \frac{α}{α_{\min}}}$ . Each term ψ_n(k₁, ⋯ , k_V) is trivially bounded by $e^{- n^{1 - \frac{α}{α_{\min}}}}$ that is upper bounded by 1 and nψ_n(k₁, ⋯ , k_V) tends to 0 when n goes to ∞. As $\sum \binom{k}{k_{1}, \dots, k_{V}} ϕ (k_{1}, \dots, k_{V}) = 1$ , the ratio $\frac{\log μ (n, k)}{\log n}$ is negative.

4.3.2. Moderate and Large Lengths

For a length k in the transition domain [α_min log n, α_max log n], the objective function may be either positive or negative. When k > α_max log n, set E_k(n) is empty and μ(n, k) reduces to S(k).

The maximum M among the terms $e^{k (- \sum_{i} \frac{k_{i}}{k} \log \frac{k_{i}}{k} - \frac{1}{k} \log n ϕ (k_{1}, \dots, k_{V}))}$ in T(k) is reached when ρ(k₁, ⋯ , k_V) is 0. Due to the exponential decrease of $e^{- n ϕ (k_{1}, \dots, k_{V})}$ when nϕ(k₁, ⋯ , k_V) ≥ 1, $\frac{T (k)}{k}$ is upper bounded. Computation of log M is done with Lagrange multipliers, as explained above.

Computation of S(k) relies on the local expansion of ψ_n(k₁, ⋯ , k_V), that is n(1–σ₂)ϕ (k₁, ⋯ , k_V). S(k) can be rewritten as ${σ_{2}}^{k} \tilde{S} (k) + (S (k) - {σ_{2}}^{k} \tilde{S} (k))$ where $\tilde{S} (k) = \sum_{ρ (k_{1}, \dots, k_{V}) \leq 0} \binom{k}{k_{1}, \dots, k_{V}} {(\frac{p_{1}^{2}}{σ_{2}})}^{k_{1}} \dots {(\frac{p_{V}^{2}}{σ_{2}})}^{k_{V}}$. This sum satisfies a Large Deviation Principle when $ρ (k_{1}, \dots, k_{V}) + \frac{1}{α} \geq \frac{1}{\tilde{α}}$, or $α < \tilde{α}$. In this range, $\frac{\tilde{S} (k)}{k} \sim \max \{- \sum_{i = 1}^{V} \frac{k_{i}}{k} \log \frac{k_{i}}{k}\}$, which was shown to be ψ(α).

When $α > \tilde{α}$, the sum $\tilde{S} (k)$ can be rewritten as $1 - \bar{S} (k)$ where

\bar{S} (k) = \sum_{ρ (k_{1}, \dots, k_{V}) + \frac{1}{α} < \frac{1}{\tilde{α}}} \binom{k}{k_{1}, \dots, k_{V}} {(\frac{p_{1}^{2}}{σ_{2}})}^{k_{1}} \dots {(\frac{p_{V}^{2}}{σ_{2}})}^{k_{V}} .

This sum satisfies a Large Deviation Principle and

\frac{\bar{S} (k)}{k} \sim \max \{- \sum_{i = 1}^{V} \frac{k_{i}}{k} \log \frac{k_{i}}{k} + \sum_{i = 1}^{V} \frac{k_{i}}{k} \log \frac{p_{i}^{2}}{σ_{2}}\} .

As $\sum_{i = 1}^{V} \frac{k_{i}}{k} \log \frac{p_{i}^{2}}{σ_{2}} = - \frac{2}{α} + \log \frac{1}{σ_{2}}$ , this maximum is

- \frac{1}{α} [2 - α \log \frac{1}{σ_{2}} - ψ (α)]

that is negative.

4.4. Binary Case

Barycentric coordinates of α are unique. Indeed, equation (10) reduces to a linear equation in the variable $e^{- (β_{2} - β_{1}) τ}$:

\frac{1}{α} = \frac{β_{1} + β_{2} e^{- (β_{2} - β_{1}) τ}}{1 + e^{- (β_{2} - β_{1}) τ}}

where $β_{2} - β_{1} = β_{\max} - β_{\min} = \log \frac{p_{\max}}{p_{\min}}$. Therefore, $e^{- (β_{2} - β_{1}) τ} = \frac{1 - α β_{1}}{α β_{2} - 1}$. Finally

τ_{α} = \frac{1}{\log \frac{p_{\max}}{p_{\min}}} \log \frac{α β_{2} - 1}{1 - α β_{1}} = \frac{1}{\log \frac{p_{\max}}{p_{\min}}} \log \frac{\frac{1}{α_{\min}} - \frac{1}{α}}{\frac{1}{α} - \frac{1}{α_{\max}}} .

Function ψ rewrites, in the binary case:

ψ (α) = τ_{α} + α \log [e^{- \frac{1}{α} τ_{α}} (e^{- (β_{1} - \frac{1}{α}) τ_{α}} + e^{- (β_{2} - \frac{1}{α}) τ_{α}})] .

Observing that $e^{- (β_{2} - β_{1}) τ_{α}} = s_{α}$ and changing variable τ_α into (β₂ − β₁) yields $e^{- (β_{1} - \frac{1}{α}) τ_{α}} = {s_{α}}^{- (\frac{1}{α_{\min}} - \frac{1}{α})}$ and $e^{- (β_{2} - \frac{1}{α}) τ_{α}} = {s_{α}}^{- (\frac{1}{α_{\max}} - \frac{1}{α})}$.
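The closed form for τ_α in the binary case can be checked numerically against the equation it solves. The sketch below is illustrative only. It assumes β₁ = log(1/p_max) and β₂ = log(1/p_min), so that β₂ − β₁ = log(p_max/p_min) and 1/α_min = β₂, 1/α_max = β₁; the function names and the sample probabilities are hypothetical.

```python
import math

def tau_closed_form(alpha, p_min, p_max):
    """Closed-form tau_alpha for a binary alphabet (assumed beta_i = log(1/p_i))."""
    beta1 = math.log(1.0 / p_max)  # smallest beta, equal to 1/alpha_max
    beta2 = math.log(1.0 / p_min)  # largest beta, equal to 1/alpha_min
    if not (1.0 / beta2 < alpha < 1.0 / beta1):
        raise ValueError("alpha must lie strictly between alpha_min and alpha_max")
    ratio = (alpha * beta2 - 1.0) / (1.0 - alpha * beta1)
    return math.log(ratio) / math.log(p_max / p_min)

def binary_rhs(tau, p_min, p_max):
    """Right-hand side of the binary equation: (b1 + b2 * x) / (1 + x), x = exp(-(b2 - b1) * tau)."""
    beta1 = math.log(1.0 / p_max)
    beta2 = math.log(1.0 / p_min)
    x = math.exp(-(beta2 - beta1) * tau)
    return (beta1 + beta2 * x) / (1.0 + x)

p_min, p_max = 0.3, 0.7  # illustrative probabilities, not from the paper
for alpha in (1.0, 1.5, 2.5):
    tau = tau_closed_form(alpha, p_min, p_max)
    # The two printed values should agree up to rounding.
    print(alpha, tau, binary_rhs(tau, p_min, p_max), 1.0 / alpha)
```

Plugging the closed form back into the right-hand side should return 1/α up to rounding, which gives a quick sanity check on the signs of β₂ − β₁ and of the logarithm's argument.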

5. Conclusion

This paper describes the behavior of the number of unique or repeated k-mers in a random sequence over a general alphabet. The derivation relies on a combination of analytic combinatorics and Lagrange multipliers. It simplifies an approach provided for binary alphabets and makes it possible to address larger alphabets, including quaternary alphabets such as the DNA alphabet. Precise asymptotic estimates are provided and a probabilistic interpretation is given. They are validated on random simulated data and shown to be valid in the finite range. Therefore, they provide a valuable tool to estimate a suitable read length for assembly purposes and to tune parameters for assembly algorithms. Real genomes significantly depart from the random behavior for long repetitions. The general shape of the trie profile is observed, with a maximum of the number of unique k-mers at the expected length. However, for real genomes, a number of very short k-mers are missing and, on the contrary, one observes a number of very long repetitions. Apart from these events, the behaviors are rather similar.

In the future, it would be worth extending the method to generalized Patricia tries, Markov models, and approximate repetitions.
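For readers who wish to compare the asymptotic estimates with empirical counts, the quantities discussed here can be measured directly on a simulated sequence. The following is a minimal sketch under stated assumptions, not the simulation code used in this study: it draws an i.i.d. (Bernoulli) sequence, the function names are hypothetical, and the probability vector is illustrative.

```python
import random
from collections import Counter

def random_sequence(n, probs, alphabet="ACGT", seed=0):
    """Draw n i.i.d. letters from `alphabet` with the given probabilities."""
    rng = random.Random(seed)
    return "".join(rng.choices(alphabet, weights=probs, k=n))

def kmer_profile(seq, k):
    """Return (number of k-mers occurring exactly once, number occurring at least twice)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    unique = sum(1 for c in counts.values() if c == 1)
    repeated = sum(1 for c in counts.values() if c >= 2)
    return unique, repeated

seq = random_sequence(100_000, [0.1, 0.2, 0.3, 0.4])  # illustrative parameters
for k in range(4, 25, 4):
    unique, repeated = kmer_profile(seq, k)
    print(k, unique, repeated)
```

Scanning k in this way gives an empirical profile that can be set against the predicted transition domain for the same n and letter probabilities.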

Author Contributions

Both authors contributed equally.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Funding

Inria-Cnrs-Poncelet grant Carnage.

References

Beller, T., Gog, S., Ohlebusch, E., and Schnattinger, T. (2013). Computing the longest common prefix array based on the Burrows–Wheeler transform. J. Discrete Algorithms 18, 22–31. doi:10.1016/j.jda.2012.07.007

Chikhi, R., and Medvedev, P. (2014). Informed and automated k-mer size selection for genome assembly. Bioinformatics 30, 31–37. doi:10.1093/bioinformatics/btt310

Devillers, H., and Schbath, S. (2012). Separating significant matches from spurious matches in DNA sequences. J. Comput. Biol. 19, 1–12. doi:10.1089/cmb.2011.0070

Fagin, R., Nievergelt, J., Pippenger, N., and Strong, H. R. (1979a). Extendible hashing – a fast access method for dynamic files. ACM Trans. Database Syst. 4, 315–344. doi:10.1145/320083.320092

Fagin, R., Nievergelt, J., Pippenger, N., and Strong, R. (1979b). Extendible hashing: a fast access method for dynamic files. ACM Trans. Database Syst. 4, 315–344. doi:10.1145/320083.320092

Flajolet, P., Kirschenhofer, P., and Tichy, R. F. (1988). Deviations from uniformity in random strings. Probab. Theory Relat. Fields 80, 139–150. doi:10.1007/BF00348756

Gu, Z., Wang, H., Nekrutenko, A., and Li, W. H. (2000). Densities, length proportions, and other distributional features of repetitive sequences in the human genome estimated from 430 megabases of genomic sequence. Gene 259, 81–88. doi:10.1016/S0378-1119(00)00434-0

Hartman, A. L., Norais, C., Badger, J. H., Delmas, S., Haldenby, S., Madupu, R., et al. (2010). The complete genome sequence of Haloferax volcanii DS2, a model archaeon. PLoS One 5:e9605. doi:10.1371/journal.pone.0009605

Jacquet, P., and Szpankowski, W. (1994). Autocorrelation on words and its applications: analysis of suffix trees by string-ruler approach. J. Comb. Theory A 66, 237–269. doi:10.1016/0097-3165(94)90065-5

Jacquet, P., and Szpankowski, W. (2015). Analytic Pattern Matching: From DNA to Twitter. Cambridge: Cambridge University Press.

Janson, S., Lonardi, S., and Szpankowski, W. (2004). “On the average sequence complexity,” in Combinatorial Pattern Matching, eds S. C. Sahinalp, S. Muthukrishnan and U. Dogrusoz (Berlin Heidelberg: Springer), 74–88.

Knuth, D. (1998). The Art of Computer Programming, Volume Two, Seminumerical Algorithms. Reading, MA: Addison-Wesley.

Magner, A., Knessl, C., and Szpankowski, W. (2014). “Expected external profile of Patricia tries,” in Proceedings of the Meeting on Analytic Algorithmics and Combinatorics (Society for Industrial and Applied Mathematics), 16–24.

Mahmoud, H. (1992). Evolution of Random Search Trees. New York: John Wiley & Sons.

Manber, U., and Myers, G. (1993). Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 935–948. doi:10.1137/0222058

Nicodème, P. (2005). “Average profiles, from tries to suffix-trees,” in 2005 International Conference on Analysis of Algorithms, Volume AD of DMTCS Proceedings, ed. C. Martínez (Barcelona, Spain: Discrete Mathematics and Theoretical Computer Science), 257–266.

Park, G., Hwang, H.-K., Nicodeme, P., and Szpankowski, W. (2009). Profile of trie. SIAM J. Comput. 38, 1821–1880. doi:10.1137/070685531

Rizk, G., Lavenier, D., and Chikhi, R. (2013). DSK: k-mer counting with very low memory usage. Bioinformatics 29, 652–653. doi:10.1093/bioinformatics/btt020

Sedgewick, R., and Flajolet, P. (2009). Analytic Combinatorics. Cambridge: Cambridge University Press.

Szpankowski, W. (2001). Average Case Analysis of Algorithms on Sequences. New York: John Wiley and Sons.

Treangen, T. J., and Salzberg, S. L. (2012). Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13, 36–46. doi:10.1038/nrg3117

Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica 14, 249–260. doi:10.1007/BF01206331

Keywords: K-mers, combinatorics, probability

Citation: Régnier M and Chassignet P (2016) Accurate Prediction of the Statistics of Repetitions in Random Sequences: A Case Study in Archaea Genomes. Front. Bioeng. Biotechnol. 4:35. doi: 10.3389/fbioe.2016.00035

Received: 03 December 2015; Accepted: 08 April 2016;
Published: 08 June 2016

Edited by:

Marco Pellegrini, Consiglio Nazionale delle Ricerche, Italy

Reviewed by:

Travis Gagie, University of Helsinki, Finland
Solon P. Pissis, King’s College London, UK

Copyright: © 2016 Régnier and Chassignet. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Mireille Régnier, mireille.regnier@inria.fr
