Application of the Moran Model in Estimating Selection Coefficient of Mutated CSF3R Clones in the Evolution of Severe Congenital Neutropenia to Myeloid Neoplasia

Bone marrow failure (BMF) syndromes, such as severe congenital neutropenia (SCN) are leukemia predisposition syndromes. We focus here on the transition from SCN to pre-leukemic myelodysplastic syndrome (MDS). Stochastic mathematical models have been conceived that attempt to explain the transition of SCN to MDS, in the most parsimonious way, using extensions of standard processes of population genetics and population dynamics, such as the branching and the Moran processes. We previously presented a hypothesis of the SCN to MDS transition, which involves directional selection and recurrent mutation, to explain the distribution of ages at onset of MDS or AML. Based on experimental and clinical data and a model of human hematopoiesis, a range of probable values of the selection coefficient s and mutation rate μ have been determined. These estimates lead to predictions of the age at onset of MDS or AML, which are consistent with the clinical data. In the current paper, based on data extracted from published literature, we seek to provide an independent validation of these estimates. We proceed with two purposes in mind: (i) to determine the ballpark estimates of the selection coefficients and verify their consistency with those previously obtained and (ii) to provide possible insight into the role of recurrent mutations of the G-CSF receptor in the SCN to MDS transition.


INTRODUCTION
Bone marrow failure (BMF) syndromes, such as severe congenital neutropenia (SCN) are leukemia predisposition syndromes. In addition to SCN, these heterogeneous groups of disorders include Fanconi anemia, dyskeratosis congenita, Diamond-Blackfan anemia, Shwachman-Diamond syndrome, and GATA2 deficiency (West and Churpek, 2017;Kennedy and Shimamura, 2019). Each of these clinically defined disorders are monogenic with mutations in one or more genes in a pathway. For example, Fanconi anemia results from germline mutations in genes involved in DNA repair and Diamond-Blackfan anemia in ribosome structure (Oyarbide et al., 2019). What is less well-understood are the somatic mutations that arise during transformation of a BMF syndrome to myeloid neoplasia (Rafei and DiNardo, 2019).
Focusing on the SCN to MDS to AML transition, individuals with a germline mutation in the ELANE gene develop at early age a severe neutropenia (Touw, 2015). This profound neutropenia makes them susceptible to recurrent infections, which can be only partly managed by antibiotics. Treatment introduced in the 1990s involves administration of large doses of recombinant human granulocyte colony stimulating factor (G-CSF), which boosts neutrophil production (Bonilla et al., 1989). Unfortunately, in about 30% of patients, either myelodysplatic syndrome (MDS), a preleukemic disorder, or acute myeloid leukemia (AML) emerges. In 70% MDS or AML cases arising from SCN, somatic mutations of the G-CSF Receptor (CSF3R) occur (Link, 2019). These are almost always nonsense mutations. The truncated CSF3R affects altered signaling, gene expression, and phenotype within the neutrophil lineage. There is enhanced proliferation and impaired neutrophilic differentiation to G-CSF.
Stochastic mathematical models have been conceived, which attempt to explain the transition of SCN to MDS and then to AML, in the most parsimonious way, using suitable extensions of standard processes of population genetics and population dynamics, such as the branching (Kimmel and Corey, 2013) and the Moran processes (Wojdyla et al., 2019). Specifically, the latter paper presented a hypothesis of the SCN → MDS transition, which involves the Moran process with directional selection (Durrett, 2008) and recurrent mutation, to explain the distribution of ages at onset of MDS or AML. As argued in Wojdyla et al. (2019), starting in the fetal life, CSF3R mutations arise as a random process and are selected for when G-CSF is administered to boost neutrophil production. Based on experimental and clinical data and a model of human hematopoiesis, a range of probable values of the selection coefficient s and mutation rate µ have been determined. These estimates lead to predictions of the age at onset of MDS or AML, which are consistent with the clinical data.
In the current paper, based on data extracted from published literature, we seek to provide an independent validation of these estimates. We will use the model of evolution of the mutant receptors in the hematopoietic stem cells (HSC) in the bone marrow in the form of a Moran process with selection and recurrent mutation. This is the same process we used in Wojdyla et al. (2019), except that here, to simplify computations, we assume constant HSC population size and develop an analytical approximation of the expected values of the mutant receptor occurrence among HSC under the assumption that initial count of mutants is already substantial (Methods and Data). We proceed with two purposes in mind. Our first purpose is to determine the ballpark estimates of the selection coefficients and verify their consistency with those obtained in Wojdyla et al. (2019). Our second purpose is to provide insight into the relative role of recurrent mutations of the G-CSF receptor in the SCN to MDS transition.

Moran Process
In the monograph by Durrett (2008), the Moran process with selection is defined as follows • Constant population of N individuals.
• At each discrete time moment, a randomly chosen individual dies, and, at the same moment, another randomly chosen individual proliferates (for mathematical completeness, it can be the same individual). • In the model with directional selection, there are individuals of two types: wildtype (WT) and mutant (M) and the choice of individual that proliferates is biased. The wildtype is chosen with weight 1 − s, s ∈ (0, 1).
It is instructive to consider the discrete-time case first. Let us denote the number of mutants by i. There are four possibilities Only the WT→M and M→WT options lead to change in the number of mutants in the case with selection, which leads to in the neutral case. The continuous-time version is defined by transition intensities which have different denominators than the transition probabilities. However, they lead to the same absorption formula. The expected time to absorption in {N} (fixation of the mutant) has a commonly used asymptotics as N → ∞ in the case with selection, which however is not very accurate.

Time-Continuous Moran Process With Directional Selection and Recurrent Mutations
A time-continuous Moran process with directional selection may be supplemented with recurrent mutation by adding a term of the form µ(N − i) to the q i,i+1 transition intensity. This can be interpreted as an equal and independent chance µ t + o( t), for each of the N − i WT cells, of becoming a mutant in a short time interval (t, t + t). The complete set of transition rules for the chain {X(t), t ≥ 0} assumes the form (1) Because the state space is finite, the chain is eventually absorbed in the state N at random time T N ; cf. Figure 1 for a heuristic illustration.

Simulation of Trajectories of the Moran Process With Recurrent Mutation
Simulation of a time-continuous Markov Chain is based directly on application of the transition intensities as expressed in Equation (1). Briefly, if the mutant cell chain X(t) is in state i at time t, then the time to the next jump is a random variable τ distributed exponentially with parameter q i,i−1 + q i,i+1 . The direction of the jump at time t + τ is then decided by a random choice, i → i − 1 with probability , respectively. This algorithm (know also popularly as the Gillespie algorithm) is based on the properties of holding times and jumps of time-continuous Markov Chain, as explained for example in the book (Grimmett and Stirzaker, 2001). Simulations depicted in Figure 2 were executed using this algorithm.

Approximation of the Moran Process With Recurrent Mutation and Estimation of Selection Coefficient and Mutation Rate
In the current study, we are not as much concerned with a mathematically rigorous theory of the Moran process with selection and recurrent mutation, as with obtaining computable expressions that lead to ballpark estimates of the selection coefficient and mutation rate. Based on transitions spelled out in Equation (1), we obtain the following expression for the conditional expectations: (2) Corresponding expression for variance is more involved. From Equation (2), assuming that x = X(t) can be replaced by its expectation and denoting the latter by x(t) we formally obtain the following ordinary differential equation (ODE) for x(t).
Following a change of variables y(t) = x(t)/N ∈ [0, 1], this leads toẏ This latter equation has an explicit solution This curve is very similar to that derived in the initial part of the well-known study by Gerrish and Lenski (1998), under branching process hypotheses. Let us also notice that population size N does not play a role in the expression for y(t). However, as evidenced by a comparison between the simulations in Figures 2A,F, larger N reduces process variance and the slight bias of y(t) as the estimate of X(t)/N. Let us note that Equation (5) yields which after substitution t = t 1 yields which yields The latter can be written alternatively as Knowing y 0 and y 1 (and therefore also knowing α 0 and α 1 ), we can thus now find the set of values (s, µ) such that y i = y(t i ) i = 0, 1. The latter expression embodies the trade-off between selection and mutation. To understand it, let us notice that the RHS of Equation (10) is equal to C = ln(α 1 /α 0 ) (t 1 −t 0 ) if u = 0, and it changes very little if u is small. The magnitude of C (which is the estimate of s when µ = 0) as computed from data varies between 0.002 and 0.05 if we disregard the sole negative value −0.059. To significantly influence (i.e., by say 10%) the lowest estimate 0.002, the mutation per cell per nucleotide rate should be equal to at least 0.0002, which is five orders of magnitude higher than the standard human rate. We use per nucleotide rates since we are discussing specific mutation sites in each case.
FIGURE 1 | Anatomy of the Moran process with recurrent mutations. Some mutant count trajectories may become transiently extinct (green line), but they will be resurrected by one of the recurring mutations events (red lines). Eventually the mutants are fixed.
In summary, estimates depicted in Table 1 and Figure 3 were obtained using the method of the preceding paragraph under assumption µ = u = 0.

Empirical Observations of Increase in Mutant Receptor Frequency Over Time
Several papers documented the process of increase of the frequency of mutant receptor, based on genome sequencing of bone marrow cells of the SCN patients, in two or more time points. We focus our attention on three of these papers (Beekman et al., 2012;Skokowa et al., 2014;Klimiankou et al., 2019). In these papers, patient data were recorded with changing frequency of mutant receptors over time. Of these cases, we selected only the ones that displayed unambiguous monotonous trend. Theses case are listed in Table 1. The estimates that are obtained using expression (10) are ambiguous since the expression provides only a relationship between s and µ. However, if s estimated under the hypothesis that µ is small, which is plausible unless the mutation rate is orders of magnitude higher than normal, then the estimates of s differ only slightly from those obtained under µ = 0, as explained in the preceding subsection. This effect is very similar to that observed in simulations in Wojdyla et al. (2019).

Approximate Mean Expression vs. Simulations
We address here the accuracy of the agreement of the approximate expression (5) for expected value of the Moran process with directional selection and recurrent mutations, with direct simulations. Figure 2, presents a comparison of 1,000 simulated trajectories and their mean and standard deviation to the y(t) function. We observe an almost complete agreement of the approximate mean and simulation average, which become indistinguishable with cell count N increasing from 1, 000 to 10, 000 [compare panels (A-F)]. Additionally, the simulation variance decreases almost inversely proportionally to N.
It is very important to compare the influence of selection coefficient s and mutation rate µ on the expectation of the process. If µ does not exceed 0.001 per generation, its influence can be disregarded, while the influence of s is decisive [panels (A-C)]. Only when µ reaches 0.01, its influence becomes important. In the estimates of the selection coefficient s, based on expression (10), this effect is represented by the coefficient u = µ/s (also present in expressions for α 0 and α 1 ), the magnitude of which determines the departure from the case µ = 0. Table 1 depicts the estimates of the selection coefficient s obtained using Equation (10) with µ = 0. All details of the data used are included in the Table 1. The estimates depend on the assumed average interdivision time of the HSC (including some self-renewing CMP). It is assumed to be equal to 1/24 of 1 year (15 days). Changing this assumption leads to different estimates, as it can be tested by modifying the parameter λ in the spreadsheet (λ equals the inverse of the interdivision time). Overall, the estimates span a range from 0 to 0.05, with the exception of patient 13 of publication (Klimiankou et al., 2019) who has a negatively estimated selection coefficient. Figure 3 depicts estimated selection coefficientsŝ from Table 1, plotted against the time interval t 1 − t 0 between the first and the second instance of sequencing. There seems to exist a negative association betweenŝ and t 1 − t 0 . In addition, the MDS cases (red circles), seem to have lower values ofŝ than the AML cases (green circles). Cases labeled as "CN-MDS/AML" in Klimiankou et al. (2019) are denoted by black circles.

DISCUSSION
The results of the present paper provide estimates of the selection coefficients that may underlie the fixation of the mutant G-CSF receptor in the SCN to MDS transition, which are consistent with the range deduced in Wojdyla et al. (2019) based on epidemiological evidence. Let us emphasize that our initial and final fractions of mutant receptor data come from sequencing of samples from patients with SCN. Availability of these sequencing data in papers (Beekman et al., 2012;Skokowa et al., 2014;Klimiankou et al., 2019) is at the stem of our results.
Severe congenital neutropenia is not the only inherited BMF syndrome with predisposition to MDS and AML; however, we believe that SCN provides the most robust and accurate disease to model because acquisition of CSF3R mutation is so common (70-80%) as a secondary hit (see discussion of sources in Wojdyla et al., 2019). On the other hand, TP53 mutations in Shwachman-Diamond are controversial in that the mutations do not augur for transformation (Xia et al., 2018). Furthermore, the prevalence of mutations in TP53 and in other genes such RUNX1 in Fanconi anemia or dyskeratosis congenita appears to be much less than that of CSF3R in SCN (Chao et al., 2017;Lane, 2017;Kirschner et al., 2018). Despite this, modified Moran model might be applicable to other bone marrow failure syndromes that are associated with leukemia transformation. Since the variant allele frequencies of CSF3R is not reported for most of these rare patients, but our model provides an accurate prediction, it is conceivable that uncommon secondary mutations, such as TP53 or PPM1D or RUNX1, could be used in our modified Moran model.
We do not use the cases with non-monotonic change in variant receptor frequency. The reason is that, at this range of frequency, Moran model is very unlikely to exhibit persistent reversals. Therefore, it is more likely that factors that cannot be included in the Moran model play a role.  , also equal to the inverse of the expected interdivision time (years), α 0 , α 1 , as defined in Equation (8), andŝ, estimate of the selection coefficient. "pt." is patient, "ph." is phase. Table 1, plotted against the time interval between the first and the second instance of sequencing. Red, MDS; green, AML; black, unclassified. Numbering of cases follows the identifiers in Table 1.

FIGURE 3 | Estimated selection coefficients from
One of the important questions in understanding cancer evolution is the balance among different genetic forces, such as mutation and selection. The problem has been studied for solid cancers, e.g., by Ling et al. (2015). In essence, mutant frequency in the cell population can increase in a similar way with different (negatively associated) values of s and µ. In particular, if a fit to the mutant frequency increase observed over a time interval is obtained under µ = 0, as in Table 1, then under µ > 0, the estimate of s will only be smaller. The decrease will depend on the value of u = µ/s. However, as discussed in the Results, unless the mutation rate in cells is five orders of magnitude higher than in normal cells, i.e., µ ≈ 10 −4 per cell generation per nucleotide, mutation does not make much difference for the estimates. Therefore, recurrent mutation is an important factor only if the CSF3R mutation sites are extremely strong mutational hot-spots. We also examined the magnitude of the correlation coefficient, depending on whether the observed transition was from SCN to MDA or to AML (Figure 3), wherever the data have been available. The selection coefficients in the transitions to AML seem to be greater.
An additional effect may be due to the fact that not one, but several types of mutant receptors are observed in MDS. The most frequent is the truncated D715 variant, but there are a number of other, as documented in Beekman et al. (2012), Skokowa et al. (2014), and Klimiankou et al. (2019). Therefore, the basic mutation rate should be multiplied by the number of alternative mutants. Assuming that there are no more than 10 of these mutants, the effect does not seem to play a major role.
It is interesting to observe the apparent effect of ascertainment bias on the data from papers (Beekman et al., 2012;Skokowa et al., 2014;Klimiankou et al., 2019) which we use in our study. Figure 3 in the Results depicts estimated selection coefficientŝ s from Table 1, plotted against the time interval t = t 1 − t 0 between the first and the second instance of sequencing. The negative association seen in Figure 3 may be explained by the fact that the second time at which the frequency of the mutant receptor is observed, arrives sooner if the progression of the disease is faster, i.e., when the coefficient s is higher. This trend may also lead to our estimated s being in general an overestimate, under the assumption that cases with very low s are never diagnosed. Impact of ascertainment bias on estimates of progression in solid cancers has been studied (see, e.g., Kimmel and Flehinger, 1991), and similar methods may be used. However, such study exceeds the framework of the current paper.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article.

AUTHOR CONTRIBUTIONS
All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication. Specifically, MK conceived the study and derived mathematical expressions, SC searched for relevant data and interpreted the model in the context of data, and KD designed and carried out Moran model computations and simulations.