Characterizing viral within-host diversity in fast and non-equilibrium demo-genetic dynamics

High-throughput sequencing has opened the route for a deep assessment of within-host genetic diversity that can be used, e.g., to characterize microbial communities and to infer transmission links in infectious disease outbreaks. The performance of such characterizations and inferences cannot be analytically assessed in general and are often grounded on computer-intensive evaluations. Then, being able to simulate within-host genetic diversity across time under various demo-genetic assumptions is paramount to assess the performance of the approaches of interest. In this context, we built an original model that can be simulated to investigate the temporal evolution of genotypes and their frequencies under various demo-genetic assumptions. The model describes the growth and the mutation of genotypes at the nucleotide resolution conditional on an overall within-host viral kinetics, and can be tuned to generate fast non-equilibrium demo-genetic dynamics. We ran simulations of this model and computed classic diversity indices to characterize the temporal variation of within-host genetic diversity (from high-throughput amplicon sequences) of virus populations under three demographic kinetic models of viral infection. Our results highlight how demographic (viral load) and genetic (mutation, selection, or drift) factors drive variations in within-host diversity during the course of an infection. In particular, we observed a non-monotonic relationship between pathogen population size and genetic diversity, and a reduction of the impact of mutation on diversity when a non-specific host immune response is activated. The large variation in the diversity patterns generated in our simulations suggests that the underlying model provides a flexible basis to produce very diverse demo-genetic scenarios and test, for instance, methods for the inference of transmission links during outbreaks.

Time (days) γ 2 = 0 γ 2 = 0.2 γ 2 = 0.4 γ 2 = 0.6 γ 2 = 0.8 γ 2 = 1 Figure S3: Effect of the shuffling parameter γ 2 on within-host genetic diversity in the presence of lethal-genome elimination (α = 0.4), the other shuffling parameters being equal to their default values (γ 1 , γ 3 ) = (0.8, 70); See table 3, main text, for full parameter specification. Row 1: variation in within-host virion quantity under the models with acute infection (column 1), persistent infection (column 2) and immune response (column 3). Rows 2-5: variation in within-host genetic diversity measured by richness (R), Shannon (H ), Gini-Simpson (D) and Jukes-Cantor (JC) indices, respectively. Time (days) µ = 5x10 −7 µ = 10 −6 µ = 5x10 −6 µ = 10 −5 µ = 5x10 −5 Figure S6: Effect of the mutation rate µ on within-host genetic diversity in the presence of the shuffling process ((γ 1 , γ 2 , γ 3 ) = (0.8, 0.4, 70)) but without the elimination of lethal genomes (α = 0); See table 3, main text, for full parameter specification. Row 1: variation in within-host virion quantity under the models with acute infection (column 1), persistent infection (column 2) and immune response (column 3). Rows 2-5: variation in within-host genetic diversity measured by richness (R), Shannon (H ), Gini-Simpson (D) and Jukes-Cantor (JC) indices, respectively.   Figure S9: Illustration of the non-monotonic behavior of diversity indices with respect to viral load. The arrows and the color palette give temporal information of the evolution of the pair 'viral load-diversity index', from day 0 (dark blue) to day 10 (yellow). These curves correspond to viral load and diversity shown for the acute infection model with the shuffling process and the elimination of lethal genomes, figure 3, main text. If richness (R) is roughly monotonic with respect to viral load, this is not the case for Shannon (H ), Gini-Simpson (D) and Jukes-Cantor (JC) indices, which continue to increase for some time after the viral-load peak, and finally decrease when the pathogen population vanishes. Time (days) V max = 10 6 V max = 0.5x10 6 V max = 10 5 V max = 10 4 Figure S10: Effect of re-scaling the virion dynamics on within-host genetic diversity. Virion dynamics initially simulated with parameters provided by tables 1-2, main text, and then re-scaled by a multiplicative factor of 1, 0.5, 0.1 and 0.01 to obtain maximum viral load V max equal to 10 6 , 0.5 × 10 6 , 10 5 and 10 4 , respectively. Given these kinetics, simulations were performed with the shuffling process ((γ 1 , γ 2 , γ 3 ) = (0.8, 0.4, 70)), the elimination of lethal genomes (α = 0.4) and (L, µ) = (330, 10 −5 ); see table 3 in main text for full parameter specification. Row 1: changes in within-host virion quantity predicted respectively by the three different kinetic models mentioned in Section 2.1. Rows 2-5: variation in within-host genetic diversity assessed by the four indices presented in Section 2.3, main text. Notes: Red curves in the diversity panels correspond to red curves in figure 3, main text. For the virion dynamics generated with the persistent infection model and multiplied by 0.01, the virion quantity at time 1 was too low for the genetic dynamics to be initialized and was therefore increased to 1 virion (the quantity of virions at time one was large enough in the other cases).

S1 Supporting text: Specific study of the shuffling process
Our model includes a component that accounts for episodes of natural selection and strong genetic drift by enhancing, e.g., the growth of low-frequency variants. This component, called shuffling process (Main text, Section 2.2.3), is constructed by noising the variant proportions P with the Gaussian distribution N (P, γ 1 × P γ 2 × (1−P ) γ 3 ) of mean P and variance depending on P and three non-negative parameters γ 1 , γ 2 and γ 3 , by applying a two-side cut-off min{1, max{0, ·}}, and then by dividing the resulting variables by their sum to get a vector of probabilities. To gain insight into the impact of the shuffling parameters on the variant proportions, we apply the noising process to a set of 100 probabilities (summing to one) obtained by drawing a vector from the Dirichlet distribution with rate vector α = (α 1 , ..., α 100 ), using α 1 = ... = α 80 = 0.02 and α 81 = ... = α 100 = 0.06. Figure S13 displays a realization of the Dirichlet draw and illustrates how the proportions of the dominant variants may decrease and the weights of a few low-proportion variants may increase by noising the vector of probabilities with (γ 1 , γ 2 , γ 3 ) = (0.8, 0.4, 70). Using the same values for (γ 1 , γ 2 , γ 3 ), Figure S14 displays the evolution of the noise variance (namely γ 1 × P γ 2 × (1 − P ) γ 3 ) with respect to the initial probability value P generated with the Dirichlet distribution. The highest variance corresponds to intermediate (but rather low) values of P . Note that this variance profile is prior to the re-scaling. To illustrate the variance obtained after the re-scaling, we applied the shuffling process 10000 times to the same vector of Dirichlet probabilities (by taking into account the two-side cut-off and the division by the sum) for assessing the distributions of the noisy versions of the maximum probability (p max ; i.e. the maximum of the set of probabilities generated with the Dirichlet distribution), the minimum positive probability (p min ; i.e. the minimum of the set of probabilities generated with the Dirichlet distribution) and the probability corresponding to the maximum variance (p varmax ; i.e. the probability of the maximum variance value in Figure S14). These distributions are displayed in Figure  S15 and show how variable the probabilities are in one step (i.e., one generation). For evaluating the potential evolution of probabilities throughout 10 generations, we sequentially repeated the shuffling process 10 times. In other words, starting with the initial set of probabilities generated with the Dirichlet distribution, at each generation we applied the shuffling process to the shuffled probabilities of the previous generation (by taking into account the two-side cut-off and the division by the sum). This process was repeated 10000 times. Figure S16 gives the evolution of the maximum probability, the minimum positive probability and the probability that has undergone the greatest variation. We notice a decrease of the maximum probability value. In contrast, we observe an increase in average of the minimum probability and the maximum-variation probability.
Figure S13: Distributions of initial (black) and shuffled (red) probabilities. Figure S14: Values of the Gaussian noise variances with respect to the initial probability values sampled from the Dirichlet distribution. Figure S15: Distributions of the shuffled versions of the maximum probability distribution (red), the minimum positive probability (blue), and the probability corresponding to the maximum noise variance (purple). Figure S16: Evolution of the probabilities during 10 generations by applying the shuffling process. The red, green and blue lines show, respectively, the average evolution of the maximum probability, the minimum positive probability and the probability that has undergone the greatest variation. The dashed lines give the pointwise 95% confidence envelopes.

S2 Supporting text: Discussion on parameter estimation and model validation
In the perspective of fitting our demo-genetic model to data, a two-stage estimation approach (Pagan, 1986;Mrkvička et al., 2014;Dvořák et al., 2019) grounded on the conditional structure of our model (i.e., the genetics is modeled given the kinetics) could be explored: first, one may estimate parameters of the kinetic component of the model for each host separately to account for inter-host heterogeneity (using standard and validated kinetic models for the virus of interest); second, one may estimate parameters of the genetic component of the model conditional on fitted kinetics, assuming that genetic parameters take the same values for all the hosts (except if the amount of information contained in the data allows the estimation of genetic parameters at the host level).
The first stage could be implemented using a standard approach in viral kinetics. For instance, using diverse forms of mean-square error estimation, various kinetic models were fitted to daily viral load data (Baccam et al., 2006;Wang et al., 2020;Pinky and Dobrovolny, 2020;Hernandez-Vargas and Velasco-Hernandez, 2020), to CD8+ T-cells data (Blanco-Rodriguez et al., 2021) or to associations of viral load data and IFN fold change data (Pawelek et al., 2012). Using multiple series for fitting a model might allow the number of kinetic parameters and initial conditions that are preliminary fixed (which is a common practice in viral-kinetics modeling) to be reduced.
The second stage could lie on serial samples collected from each individual and revealing within-host evolution of the pathogen population, typically obtained from whole genome high-throughput sequencing (HTS) or high-throughput amplicon sequencing (HTAS). After the transformation of data in summary statistics, one might fit the genetic component of the model with approximate Bayesian computation (ABC; Beaumont, 2010;Csilléry et al., 2010;Marin et al., 2012). The summary statistics could be some diversity curves such as those presented in this article, or at least some points along these curves depending on HTS or HTAS data frequency in time. Since the summary statistics will contain some temporal dependencies, one might apply an ABC technique for weighting them and hence mitigating the effect of redundant summary statistics while reducing the number of model simulations (e.g., Soubeyrand et al., 2013). Using the Bayesian paradigm offers the possibility to circumvent possible parameter-identifiability issues since the posterior distribution of parameters provided by ABC could reveal the 'equivalence' of different parameter vectors in their ability to 'explain' the observed data. In addition, ABC offers an easy way to take into account the uncertainty of kinetic parameters used as input of the genetic component of the model. Indeed, in the fit of kinetic models to data, some of the authors mentioned above apply a bootstrap step to measure the uncertainty of parameter estimates. This uncertainty could be propagated in ABC by generating multiple likely kinetics for each host and simulating genetic dynamics conditional on each of these kinetics.
The ABC tool kit also provides techniques for model validation and model choice (Csilléry et al., 2010;Pudlo et al., 2016). Such techniques could be applied to measure the goodness-of-fit of the whole model structure and compare models with different components related to both kinetics and genetic. Ideally, data not used for fitting the model should be used for this task. In addition, a classic practice in ABC is to use, for validation and comparison of models, summary statistics that were not used for fitting the model. Thus, one might exploit original diversity curves (not too much correlated with diversity curves used for parameter estimation) and original temporal series related to the kinetics (e.g., IFN data if only viral load data were used for fitting the model).