Automatic differentiation and the optimization of differential equation models in biology

A computational revolution unleashed the power of artificial neural networks. At the heart of that revolution is automatic differentiation, which calculates the derivative of a performance measure relative to a large number of parameters. Differentiation enhances the discovery of improved performance in large models, an achievement that was previously difficult or impossible. Recently, a second computational advance optimizes the temporal trajectories traced by differential equations. Optimization requires differentiating a measure of performance over a trajectory, such as the closeness of tracking the environment, with respect to the parameters of the differential equations. Because model trajectories are usually calculated numerically by multistage algorithms, such as Runge-Kutta, the automatic differentiation must be passed through the numerical algorithm. This article explains how such automatic differentiation of trajectories is achieved. It also discusses why such computational breakthroughs are likely to advance theoretical and statistical studies of biological problems, in which one can consider variables as dynamic paths over time and space. Many common problems are shared among improving success in computational learning models over performance landscapes, improving evolutionary fitness over adaptive landscapes, and improving statistical fits to data over information landscapes.

Introduction

Theoretical studies often analyze improvement on a performance surface. What phenotypes enhance fitness? What differential equation model best describes ecological or biochemical dynamics? What interventions improve the health of organisms or ecosystems?
Similarly, inductive methods of statistics and machine learning optimize prediction, minimize error, or maximize the use of information.1,2 Natural selection can itself be thought of as nature's inductive process of improving the fit of organisms to their environment.3,4 Most of these problems reduce to finding a path that enhances performance.5 In practice, that often means changing model parameter values in the direction along which the slope of performance tilts most strongly toward improvement. Ideally, that best slope is found by differentiating the performance surface with respect to the parameters of the model or the traits of the organism.
Taking the derivative of a performance function with respect to its parameters is, in principle, a simple process. Performance functions typically arise by composition of basic mathematical operations such as addition, multiplication, and exponentiation. Standard procedures yield the derivative for each basic operation. The overall derivative follows by the chain rule for differentiation.
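For example, differentiating the composition $L(x) = e^{ax^2}$ (a hypothetical function chosen only for illustration) combines the derivative rules for exponentiation and multiplication through the chain rule:

$$\frac{\mathrm{d}L}{\mathrm{d}x} = \frac{\mathrm{d}e^u}{\mathrm{d}u}\,\frac{\mathrm{d}u}{\mathrm{d}x} = e^{ax^2}\cdot 2ax, \qquad u = ax^2.$$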
Two aspects hinder the calculation of derivatives. First, performance measures often follow from long chains of operations embedded in complex algorithms. For example, performance may depend on a measure of the distance between some optimal trajectory and the trajectory of a system of differential equations specified by a model with many parameters.
In practice, one often calculates the model trajectory for a given parameter set by a numerical algorithm. The derivative of the distance measure between the optimal and realized trajectories must be calculated through the operations of the numerical algorithm, which is typically embedded in the code of a computer program.6-8 The second difficulty for calculation arises because performance functions often depend on large numbers of parameters. For example, the modern artificial intelligence revolution arose in part from expanding neural networks to multiple deeply connected layers with millions of parameters. It only became possible to calculate an improving pathway of parameter values with sufficient speed after the development of efficient methods for automatic differentiation of performance functions embedded within the calculations of complex computer code.2

Conceptually, automatic differentiation provides insight into the topography of performance surfaces. In biology, many problems of evolutionary dynamics turn on the adaptive topography that maps phenotypes to fitness.9,10 The close similarity of evolutionary dynamics and the many other problems that depend on the topography of performance surfaces provides insight into how natural selection designs organisms and into the prospects and limits of optimization models in biology. In other words, automatic differentiation will help us to learn more about the shape of optimization problems in both natural history and in models of evolutionary biology.
This article focuses on the benefits of automatic differentiation for differential equation modeling and for broader problems in theoretical biology. Highlights include the opportunities for new kinds of theoretical approaches, a better understanding of current developments in statistics and machine learning, and some possible directions for future research motivated by improved understanding of multidimensional performance surfaces.
I start with an overview of differentiation in optimization. I next turn to the techniques and advantages of automatic differentiation for optimizing differential equation models.11,12 I then present two examples to illustrate how the optimization of differential equations by automatic differentiation may enhance the future study of various topics in biology.
The first example fits various differential equation models to the classic data on predator-prey population cycles of lynx and hare. The second example searches for transcription factor network designs that solve the challenge of maintaining an internal circadian rhythm. The internal rhythm must buffer against the intrinsic stochasticity of biochemical dynamics, entrain to an erratic external circadian signal when available, and maintain the internal rhythm when the external signal is absent.
Slopes and curvatures calculated by differentiation also provide the basis for enhanced sensitivity analysis.13 In biology, sensitivity plays a key role in understanding robustness, variability, and evolutionary dynamics. In inference, sensitivity influences the degree of belief in conclusions drawn from models and data, often analyzed by Bayesian methods.

Differentiation in optimization
Optimizing a function $L$ with respect to a parameter vector $\mathbf{p}$ poses a classic optimization problem. For example, what parameters for a system of differential equations minimize the distance between the system's trajectory and the observed trajectory of fluctuating lynx and hare populations? In this case, the total distance, $L$, might be the sum of the squared deviations between the system's trajectory and the observed trajectory when measured at a series of temporal intervals.
In general, it is relatively easy to find a local optimum near some initial point in parameter space and relatively difficult to find a global optimum over the full space of potential parameter values. Many optimization techniques exist. The best methods for a particular problem depend on the shape of the performance surface.5,14,15 For problems with continuous performance surfaces, using the derivative of $L$ with respect to the parameters, $\mathbf{p}$, often greatly enhances the search for better parameters. The vector of partial derivatives of $L$ with respect to the parameters is the gradient or slope of the performance surface, $\nabla L$.
The gradient describes the change in performance with respect to the change in parameters. Thus, the gradient provides much information about how to update parameter values in an iterative search for improving performance. Knowing the gradient does not by itself solve the problem of getting stuck in a local optimum. However, fast calculation of the gradient often transforms problems from hard and beyond reasonable study into ones that can be analyzed relatively easily.2,14,15 Fast calculation of derivatives by automatic methods may also allow calculation of the second derivatives, which describe the curvature of the performance surface. For a problem with $n$ parameters and a single-valued performance measure, $L$, we write $\nabla^2 L$ for the $n \times n$ Hessian matrix of partial second derivatives. The Hessian matrix provides insight into many aspects of optimization.
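As a concrete illustration, automatic differentiation libraries compute both the gradient and the Hessian directly from the code of a performance function. The following minimal sketch uses the JAX library for Python; the quadratic performance function is a hypothetical stand-in for the loss functions discussed in this article.

```python
import jax
import jax.numpy as jnp

# Hypothetical performance function L(p) over a parameter vector p.
def L(p):
    return jnp.sum((p - jnp.array([1.0, 2.0, 3.0])) ** 2)

p = jnp.zeros(3)
grad_L = jax.grad(L)(p)       # gradient: the slope of the performance surface
hess_L = jax.hessian(L)(p)    # n x n Hessian: the curvature of the surface
```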

Optimizing differential equations
Differential equations provide the primary modeling approach for many problems.16,17 Organismal traits often follow trajectories over time as individuals develop and respond to the environment. Systems of gene regulation, biochemistry, and physiology describe temporal changes in molecular concentrations and biological outputs. Continuous models of population genetics describe temporal changes in allelic and genotypic frequencies. Ecological dynamics describe temporal changes in populations and interacting species.
The differential equations describe a vector of values, $\mathbf{x}$, that depend on time, $t$, on a parameter vector, $\mathbf{p}$, on initial conditions, and on various other inputs. If we write the system values at time $t$ as $\mathbf{x}_t$, then we can abbreviate the differential equations as $\mathbf{x}'_t = f(\mathbf{x}_t, \mathbf{p})$, in which the prime denotes differentiation with respect to $t$.
Many studies seek a parameter vector, $\mathbf{p}$, that optimizes a temporal trajectory of values, $\mathbf{x}_t$, over some time span. Analysis optimizes a performance function, $L(\mathbf{X})$, with respect to $\mathbf{p}$ for the system values at various times, $\mathbf{X} = \mathbf{x}_{t_1}, \mathbf{x}_{t_2}, \ldots$ This section illustrates automatic differentiation for the simple differential equation

$$x' = \theta x \tag{1}$$

with parameter vector $\mathbf{p} = \{\theta, x_0\}$. We use the performance function $L(x_T) = (x_T - \hat{x})^2$, the squared Euclidean distance between the system's value at a particular time, $x_T$, and the target value, $\hat{x}$, with the goal of minimizing the performance function. Eqn 1 has a simple explicit solution, which we could use to find optimum values of $\mathbf{p}$. However, in most applications, the system of differential equations must be evaluated numerically. Optimization must be done by choosing some parameters, numerically calculating the solution, $x_T$, and the loss function, $L$, and then updating the parameters to improve performance.11,12 The optimization procedure consists of repeated rounds of calculation and parameter updating. The goal is to improve performance and, ideally, to find an optimal parameter vector that minimizes the loss function.
To illustrate the typical optimization cycle, we evaluate eqn 1 numerically by Euler's method,

$$x_{t+h} = x_t + h\theta x_t, \tag{2}$$

in which $h$ is the stepsize over which we update values. In practice, one uses better numerical methods to approximate the trajectory that $x_t$ follows over time.18 However, the concepts by which one applies automatic differentiation do not depend on the numerical technique, so Euler's method is sufficient to illustrate how one uses automatic differentiation to aid in optimizing the parameters of differential equations.
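As a sketch of the inner computation in that cycle, the following Python fragment iterates eqn 2 from the initial condition and then scores the result with the loss. The parameter values ($\theta = 0.1$, $x_0 = 1$, target $\hat{x} = 2$) are illustrative assumptions, not values from the text.

```python
# Minimal sketch of one round of the optimization cycle: integrate eqn 1
# by Euler's method (eqn 2) and score the result with the squared distance
# to a target. Parameter values here are illustrative assumptions.

def euler_solve(theta, x0, h=0.01, steps=100):
    x = x0
    for _ in range(steps):
        x = x + h * theta * x    # eqn 2: x(t+h) = x(t) + h*theta*x(t)
    return x

def loss(theta, x0, target=2.0):
    x_T = euler_solve(theta, x0)
    return (x_T - target) ** 2   # L(x_T) = (x_T - target)^2

print(loss(0.1, 1.0))            # performance for one candidate parameter set
```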

Automatic differentiation
Automatic differentiation breaks a long calculation into small component pieces, each of which can be differentiated by standard procedures.7,8 The chain rule tells us how to combine the differentiated pieces into the overall derivative. For example, given the function $f = (v_1 v_2)^2$, we can set $u = v_1 v_2$ and obtain the partial derivative of $f$ with respect to $v_1$ by the chain rule,

$$\frac{\partial f}{\partial v_1} = \frac{\partial f}{\partial u}\,\frac{\partial u}{\partial v_1}.$$

The final value multiplies the two component derivatives, $\partial f/\partial u = 2u$ and $\partial u/\partial v_1 = v_2$. To use this piece as part of a longer calculation, we substitute the numerical value of $v_2$ and store only that numerical value. For example, for eqn 2 with $t = 0$ and $h = 0.01$, the loss is $L = (x_h - \hat{x})^2$, in which optimization seeks to minimize the loss function, $L$. Table 1 shows the intermediate results, $v_i$, with $v_6 = x_h$. This computational graph can be recursively expanded $n$ times by using $v_6 = x_h$ as the starting point for additional applications of eqn 2 to calculate $x_{nh}$ for any integer, $n$. The resulting graph quickly becomes too large to visualize or to trace the pathways of calculation without some automatic method. However, computer code readily expands such graphs and calculations.

Forward value trace

In general, long calculations can be broken into a sequence of simple steps. At each step, we calculate the numerical value of the component derivative, combine that component's value with prior component values, and then store the resulting numerical value to use in subsequent steps. Thus, we need to store only a small amount of information as we sequentially combine the component values to obtain the final overall value. The calculation has the same theoretical exactness as a complete symbolic derivative but, by substituting numerical values as we go along, we can proceed faster and with much less computational storage space. Alternatively, many optimization methods estimate the derivatives numerically, which requires relatively little storage space but has much lower accuracy. The errors in estimation impede optimization or cause optimization to fail.
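To see why numerical estimation of derivatives can be less reliable, compare a finite-difference approximation against the exact derivative for the one-step loss above. This sketch assumes the illustrative values $\theta = 0.1$, $x_0 = 1$, and $\hat{x} = 2$; the finite-difference stepsize trades truncation error against round-off error, a tension that exact differentiation avoids.

```python
# Finite-difference estimate vs. exact derivative of the one-step loss
# L = (x_h - target)^2 with x_h = x0 + h*theta*x0. Parameter values are
# illustrative assumptions.

h, theta, x0, target = 0.01, 0.1, 1.0, 2.0

def loss(theta):
    x_h = x0 + h * theta * x0
    return (x_h - target) ** 2

exact = 2.0 * (x0 + h * theta * x0 - target) * h * x0   # dL/dtheta by hand
eps = 1e-6
estimate = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
print(exact, estimate)   # the estimate carries truncation and round-off error
```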
The following subsections introduce two general methods of automatic differentiation and one special method for differentiating trajectories of differential equations. Those methods differ in how they apply the chain rule.
Forward method

The forward method begins with input parameters and then traces the chain of derivatives to the output. The center box of Table 1 illustrates the calculation for the computational graph of eqn 2, shown in Fig. 1. That calculation yields the derivative of the output, $L$, with respect to the input parameter, $\theta$, written as $\dot{L} = \partial L/\partial\theta$. Two early steps in the forward derivative trace show application of the chain rule. First, for $v_1 = \theta$, we obtain $\dot{v}_1 = 1$, in which the overdot denotes the partial derivative with respect to $\theta$. Second, for $v_4$, the chain rule expands $\dot{v}_4$ in terms of the derivatives of the prior steps, using the constant $h = 0.01$ and substituting the other values given in the table. Continuing forward, we eventually arrive at the final value, $\dot{L} = -0.01998$. The forward mode is simple and relatively easy to implement in computer code. A forward derivative trace can be done in parallel with the computations to calculate the performance value, $L$, for given inputs, as shown in the left box of Table 1.
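A compact way to implement the forward derivative trace is with dual numbers, which carry a value together with its derivative through every operation. The following sketch propagates $\partial/\partial\theta$ through one Euler step of eqn 2. The parameter values ($\theta = 0.1$, $x_0 = 1$, target $\hat{x} = 2$) are assumptions chosen so that the trace reproduces the final value $\dot{L} = -0.01998$ quoted above.

```python
# Forward-mode automatic differentiation with dual numbers: each Dual
# carries a value and its derivative with respect to one chosen input
# (here theta), updated by the standard derivative rules at every step.

class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __sub__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

h = 0.01
theta = Dual(0.1, 1.0)             # seed: d(theta)/d(theta) = 1
x0 = Dual(1.0, 0.0)                # x0 held fixed in this trace
x_h = x0 + h * theta * x0          # one Euler step of eqn 2
loss = (x_h - 2.0) * (x_h - 2.0)   # L = (x_h - target)^2
print(loss.value, loss.deriv)      # deriv gives dL/dtheta = -0.01998
```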
Forward mode's primary disadvantage arises from the need to do a separate derivative trace for each input parameter. For large models with many parameters, full calculation of the gradient by forward automatic differentiation can be slow and limit application. Large optimization analyses often gain by using reverse mode automatic differentiation, which can calculate the gradient over all parameters in a single derivative trace.
Reverse method

The right box of Table 1 illustrates the reverse derivative trace for the same problem. Starting at the end, with $L$, and going backwards, the first step calculates $\bar{v}_6 = \partial L/\partial v_6$. Then, by the chain rule, $\bar{v}_5 = \partial L/\partial v_5 = \bar{v}_6\,\partial v_6/\partial v_5$. By this method, the calculation continues in the reverse direction. In the final steps, at the top of that box, one can calculate the derivative of $L$ with respect to each input parameter, yielding the full gradient $\nabla L$ in one backward trace.
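In practice, reverse-mode libraries perform this backward trace automatically. The sketch below uses JAX to recover the gradient of the one-step loss with respect to both $\theta$ and $x_0$ in a single reverse pass; the numerical values repeat the earlier illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def loss(params):
    theta, x0 = params
    x_h = x0 + 0.01 * theta * x0   # one Euler step of eqn 2
    return (x_h - 2.0) ** 2        # L = (x_h - target)^2

# A single reverse trace yields dL/dtheta and dL/dx0 together.
print(jax.grad(loss)(jnp.array([0.1, 1.0])))   # approx [-0.01998, -2.0]
```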
Modern neural network models often have millions of parameters. A single backward derivative trace calculates the full gradient of the performance function with respect to all parameters. That technique, often called back propagation in the neural network literature, provided a necessary advance in computation for the great recent breakthroughs in modern neural network modeling.2 The same method also enhances the potential to optimize large differential equation models. However, for differential equations, one typically needs to optimize through the numerical methods used to approximate the temporal trajectories of systems. To accomplish a practical method for reverse mode differentiation through the numerical approximation algorithm required another conceptual advance.11,12

Differentiating trajectories of differential equations

In essence, we apply the reverse method to the computational graph for the trajectory of the differential equation.
This approach leads to a new differential equation that propagates the derivative of the performance function $L$ backward in time, similarly to how we traced the derivative of $L$ back to the parameters in the reverse mode approach of Table 1. Here, I describe the concepts in an approximate and intuitive way. I follow Chen et al.,11 who provide a full explanation and complete derivation.
The goal is to calculate the derivative of $L$ as a function of the location of the trajectory at various times. For simplicity, we consider a univariate differential equation with location at a single time, $x_T$, and associated performance, $L(x_T)$. We seek the gradient of $L$ with respect to the parameters, $\mathbf{p}$, and the initial condition, $x_0$.
Start with the definition $a(t) = \mathrm{d}L/\mathrm{d}x_t$. We then consider how this value changes with time, $t$, creating a new differential equation, $a'(t)$, with the prime denoting differentiation with respect to time. For small $h$,

$$a(t - h) = a(t)\,\frac{\partial x_t}{\partial x_{t-h}}.$$

The term $\partial x_t/\partial x_{t-h}$ is a backward chain rule expansion, like the steps in the reverse mode trace of Table 1. Noting from eqn 2 that $\partial x_t/\partial x_{t-h} = 1 + h\,\partial f/\partial x_{t-h}$, and letting $h$ go to zero, we obtain

$$a'(t) = -a(t)\,\frac{\partial f}{\partial x_t}. \tag{3}$$

As in the reverse trace of Table 1, we can trace derivatives backwards, in this case tracing back in time for $x_T, x_{T-h}, \ldots, x_h$ to calculate the trajectory of $a(t)$.
In the final backward step, we apply the derivative of $x_h$ relative to the input parameters, $\mathbf{p}$, as $\partial x_h/\partial\mathbf{p}$, in essence allowing us to transform the last term of eqn 3 at time $t = h$ from a derivative with respect to the system location into a derivative with respect to the parameters,

$$\frac{\partial f}{\partial x_h}\,\frac{\partial x_h}{\partial \mathbf{p}} \rightarrow \frac{\partial f}{\partial \mathbf{p}}. \tag{4}$$

Using that method to analyze changes in $L$ and $a$ with respect to $\mathbf{p}$ instead of with respect to $x_t$, we can write the solution for eqn 3 as

$$\frac{\mathrm{d}L}{\mathrm{d}\mathbf{p}} = -\int_T^0 a(t)\,\frac{\partial f(x_t, \mathbf{p})}{\partial \mathbf{p}}\,\mathrm{d}t. \tag{5}$$

Here, the backward integral trace in eqn 5 equals the solution backward in time of the differential equation in eqn 3, modified by a final transformation that yields the derivative of the performance function, $L$, with respect to the parameters. Standard numerical methods to analyze differential equations can be used to find the solution for eqn 5.
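For intuition, one can also differentiate straight through the unrolled steps of the numerical solver rather than solving the adjoint equation 3. The sketch below applies JAX's reverse mode to a short Euler integration of eqn 1; Chen et al.11 instead integrate eqn 3 backward in time, which uses far less memory for long trajectories, but both routes yield $\mathrm{d}L/\mathrm{d}\mathbf{p}$. The parameter values remain the earlier illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def solve_euler(theta, x0, h=0.01, steps=100):
    # Each Euler step (eqn 2) becomes a node in the computational graph.
    x = x0
    for _ in range(steps):
        x = x + h * theta * x
    return x

def loss(params, target=2.0):
    theta, x0 = params
    return (solve_euler(theta, x0) - target) ** 2

# Reverse mode through all solver steps: dL/dtheta and dL/dx0 in one trace.
print(jax.grad(loss)(jnp.array([0.1, 1.0])))
```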
The formal derivation of eqn 5 uses a more rigorous analysis to change $\partial f/\partial x_t$ in eqn 3 into $\partial f/\partial\mathbf{p}$ in eqn 5, arriving at the same conclusion.11

Examples
Fitting data for temporal trajectories

One can fit differential equation models to time series data. For example, Fig. 2 shows the dynamics of hare and lynx populations. The blue curve shows the fluctuation of the hare population, and the green curve shows the fluctuation of the lynx population. The gold curves show a Bayesian fit of a differential equation model to the initial 2/3 of the data and the model's prediction compared to the data for the final 1/3 of the time series.
Each model has $n$ variables, two for the log transformation of hare and lynx population sizes and $n - 2$ for dummy variables tracking unobserved factors. The models seek to match the data of hare and lynx population sizes observed over 91 years in a classic ecological study summarized by Odum & Barrett19 and Bonnaffé et al.20 The differential equation for the vector of variables $\mathbf{u}$ in the ODE models has the form

$$\mathbf{u}' = f(\mathbf{S}\mathbf{u} + \mathbf{b}), \tag{6}$$

in which the first two components of $\mathbf{u}$ are the log-transformed population sizes, and the $n^2 + n$ parameters are in the $n \times n$ matrix, $\mathbf{S}$, and the $n$-vector, $\mathbf{b}$. The function $f$ maps the $n$-dimensional input to an $n$-dimensional output, potentially inducing nonlinearity in the model. One typically selects $f$ from the set of common activation functions used in neural network models. For all runs in Frank,21 I used $f = \tanh$ applied independently to each dimension. It would be easy to study alternative ODE forms. However, the analyses in Frank21 focused on eqn 6.
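The model family of eqn 6 is only a few lines of code. In this sketch, the dimension $n$ and the parameter values are illustrative placeholders rather than fitted values from Frank.21

```python
import jax.numpy as jnp

# Right-hand side of eqn 6: u' = f(Su + b) with f = tanh applied elementwise.
def rhs(u, S, b):
    return jnp.tanh(S @ u + b)

n = 3                               # two observed variables plus one dummy
S = jnp.zeros((n, n))               # n x n interaction matrix (placeholder)
b = jnp.zeros(n)                    # n-vector of biases (placeholder)
u0 = jnp.array([1.0, 0.5, 0.0])     # log hare, log lynx, dummy variable
du = rhs(u0, S, b)                  # instantaneous change in the n variables
```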
The three sets of plots in Fig. 2 correspond to models with $n = 2, 3, 4$ variables. For $n = 2$, the model has one variable to track the hare population and one to track the lynx population. That model roughly matches the frequency but not the amplitude of population fluctuations during the fitting period for the initial 2/3 of the data, corresponding to the first 60 years. The model fails to predict the frequency or amplitude accurately during the prediction period, years 60-90.

For $n = 3$, the model adds another variable to account for unobserved factors. That model matched the fitting period better because it has an extra dummy variable and more parameters. The extra parameters determine the dynamics of the dummy variable and its interactions with the hare and lynx populations. The model's dynamics during the prediction phase roughly match the frequency of population fluctuations. Although a bit off on amplitude, the fit is reasonably good given the very limited data and complex dynamics.
For $n = 4$, the model has two extra dummy variables and more parameters. The match to the fitting period is very good, with little variation among the sampled trajectories from the Bayesian posterior distribution of parameters. However, the match during the prediction phase is highly erratic.
Overall, the model with $n = 2$ appears underfit, with insufficient parameters and flexibility to match the observed pattern or predict future observations. The model with $n = 4$ appears overfit, with a close match to the fitted period but a poor match for the predicted period. The model with $n = 3$ appears to be a good compromise, a conclusion supported by further analyses in Frank.21

The Bayesian fitting procedure used automatic differentiation through a numerical differential equation solver. That approach provided a fast computational method to find a good fit to the data. The computational benefit does not alter the challenge of deciding what model to use or the criteria for deciding the relative success of different models. Instead, the method provides a simple and practical way of doing the computations that may be difficult to achieve with other methods. Technical innovation often leads to conceptual advance.

Figure 2: Dynamics of hare (blue) and lynx (green) populations19,20 and fitted models (gold).21 The ordinary differential equation (ODE) models were fit to the first 61 yearly observations. The gold curves past year 60 show predictions for subsequent dynamics. For $n = 2$, the model had one variable for hare and one for lynx. For $n = 3, 4$, the model had one or two extra variables to account for possible unobserved factors, providing more parameters to fit the data. Smaller loss means a better fit to the first 60 years. The model fitting was done by an approximate Bayesian method based on the gradient of the loss calculated by automatic differentiation.22 For each model, gold trajectories arise from 30 randomly chosen parameter combinations of the Bayesian posterior distribution. From Fig. 4 of Frank,21 which provides methods, a broader analysis of these data, and other fitting approaches such as neural ODEs that use artificial neural networks to fit differential equations to target trajectories.

Analyzing adaptive traits
One can also fit a model to a theoretical challenge rather than to data. In particular, how can one build systems that produce good trajectories relative to some goal? How does natural selection design biological systems to solve environmental challenges and increase fitness?
Consider how cells control gene expression. Much of that control arises from transcription factors (TFs), which are proteins that bind to DNA and alter the rate at which nearby genes transcribe mRNA. Following the details in Frank,23 we can study a simple model.
Suppose we have $n$ genes. Each gene expresses an mRNA, which makes a TF protein. Thus, our differential equation model has to track the temporal trajectories of the $n$ mRNAs within the cell and the associated $n$ TF proteins. The expression level of each gene is influenced by the abundances of the $n$ TFs. We can think of the $n$ TFs as inputs to the TF computational network, which produces $n$ outputs that influence the expression levels of the $n$ genes.24-26 We seek parameters to match a target temporal trajectory of TF abundances. Suppose we measure success by how well the first TF, labeled TF 1, follows a circadian pattern.23

Figure 3 shows an example, in which there are $n = 4$ TFs (left) and matching mRNA (right) levels. Panel (a) shows the abundance of TF 1 (blue curve) on a $\log_{10}(1 + y)$ scale for $y$ molecules per cell, over a time period of 6 days. Each day is divided by the dotted vertical line, which denotes entry into daytime. The solid vertical line denotes entry into nighttime. We transform the molecular number (blue curve) into a cellular state by a commonly used Hill function,27,28 yielding the green curve that describes the cellular circadian rhythm.
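The Hill transformation maps an unbounded molecular count into a saturating cellular state between 0 and 1. In this sketch, the half-maximal abundance $K$ and Hill coefficient $m$ are illustrative assumptions, not the values used in Frank.23

```python
import jax.numpy as jnp

# Hill function: transforms molecular abundance y into a bounded state.
# K (half-maximal abundance) and m (Hill coefficient) are placeholders.
def hill(y, K=100.0, m=4.0):
    return y**m / (K**m + y**m)

state = hill(jnp.array([10.0, 100.0, 1000.0]))   # approx [0.0001, 0.5, 0.9999]
```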
We measure success by the total Euclidean distance between the gold curve that describes the environmental circadian pattern and the green curve that describes the internal cellular circadian rhythm. We search for good parameters by optimization algorithms that use the gradient of the success measure relative to the parameters of the differential equation, calculating the gradient by automatic differentiation through the numerical algorithm that calculates the temporal trajectories.
In this example, the stochastic differential equation has random perturbations of the molecular numbers per cell. This example also imposes environmental randomness, shown in panel (b). The external daylight signal (gold curve) is initially off, then comes on in the middle of day 3 and stays on for the remainder of the period shown. In general, the external signal turns on and off randomly, such that cells may pass many days entirely in the dark. The blue curve of that panel tracks the abundance of TF 2. The external daylight signal strongly stimulates production of TF 2, providing an internal signal within the cell that could be used to entrain the circadian dynamics.

The computational challenge starts with random parameters. One then searches for those parameters that lead to an internal circadian rhythm that buffers the internal stochastic molecular perturbations, uses the external daylight signal for entrainment when it is present, and maintains a good internal rhythm when the external signal is absent. The particular example illustrated in Fig. 3 handles all of those challenges.23 This model has 164 parameters. It would be difficult to find a good parameter combination without the remarkable computational advantages of automatic differentiation. In general, one can use this approach to generate and analyze hypotheses about how natural processes design biological systems to produce dynamic traits.

I finish my summary of examples with a few comments about using automatic differentiation in various applications. For the hare-lynx problem, fitting ODE models typically took several hours or less on a 2022 Apple M1 Ultra Mac Studio computer. In that study, I also fit neural ODE (NODE) models that use artificial neural networks.21 Those NODE models have many more parameters and take longer to optimize but typically finish within many hours or less than one day.
For the TF problem, optimization time was typically several days, sometimes a week or more per run. These stochastic models pose much more challenging problems for optimization through the differential equation solvers.
Although I optimized most individual algorithms for both problems, I made no attempt to minimize overall runtimes for any of these problems. So the times quoted here are only meant as very rough guidelines for those who might want to compare with their own approaches.
A question sometimes arises whether common derivative-free optimization methods could solve such problems as well as or better than automatic differentiation. A full study of that question would require tuning the computer code to match various alternative methods. That has not been done for these problems. However, to make a quick test, I compared runs of the TF problem using automatic differentiation and the commonly used Nelder-Mead algorithm, which is a derivative-free optimization method. Using the same code with the alternative optimization methods, in three independent runs Nelder-Mead failed completely to track the circadian pattern. A single comparison run with automatic differentiation converged to a good solution.
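The mechanics of such a comparison are easy to reproduce on a toy problem. The following sketch contrasts Nelder-Mead with a gradient-based method on the Euler-integrated model of eqn 1, propagating the derivative forward by hand; this illustrates the two approaches, not the TF model from the text. On a smooth one-parameter problem both methods succeed, whereas the TF model's failures under Nelder-Mead reflect its far higher dimension and stochasticity.

```python
import numpy as np
from scipy.optimize import minimize

def solve_with_grad(theta, x0=1.0, h=0.01, steps=100):
    x, dx = x0, 0.0                 # dx propagates dx/dtheta forward by hand
    for _ in range(steps):
        x, dx = x + h * theta * x, dx + h * (x + theta * dx)
    return x, dx

def loss(p):
    x, _ = solve_with_grad(p[0])
    return (x - 2.0) ** 2

def loss_grad(p):
    x, dx = solve_with_grad(p[0])
    return np.array([2.0 * (x - 2.0) * dx])

p0 = np.array([0.1])
fit_nm = minimize(loss, p0, method="Nelder-Mead")          # derivative-free
fit_gr = minimize(loss, p0, jac=loss_grad, method="BFGS")  # gradient-based
print(fit_nm.x, fit_gr.x)
```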

Discussion
Many aspects of model interpretation depend on sensitivity.13 How much do model predictions change with a change in a parameter? For example, if one wishes to change dynamics, altering a parameter with a strong effect on outcome would be better than altering a parameter with little effect.
Confidence in parameter estimates also relates to sensitivity. If large changes in a parameter have little effect on outcome, then estimates for that parameter will vary widely and confidence in a particular estimate is low.
In evolutionary models of biological traits, sensitivity may relate to genetic variability. For a parameter with low sensitivity, changes in that parameter have relatively little effect on performance. Such parameters are likely to accumulate much associated genetic variability. By contrast, changes in sensitive parameters strongly affect performance, suggesting relatively little genetic variability associated with such parameters.

Because sensitivity often means the change in performance with respect to a change in parameters, one often evaluates sensitivity by the derivative of performance with respect to parameters. For models with many parameters, automatic differentiation provides computational benefits.13
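As a minimal sketch of that idea, the magnitude of each gradient component measures how strongly performance responds to its parameter. The model and values below repeat the earlier illustrative assumptions rather than any analysis from ref. 13.

```python
import jax
import jax.numpy as jnp

def loss(params, target=2.0):
    theta, x0 = params
    x = x0
    for _ in range(100):
        x = x + 0.01 * theta * x     # eqn 2
    return (x - target) ** 2

# |dL/dp_i| as a sensitivity measure: large values mark parameters whose
# small changes strongly alter performance; small values mark parameters
# that could accumulate variability with little performance cost.
sensitivity = jnp.abs(jax.grad(loss)(jnp.array([0.1, 1.0])))
```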
In evolutionary analyses, the performance surface that defines sensitivity is the adaptive or fitness landscape.9,10 Many conceptual aspects of evolutionary dynamics and of optimizing artificial neural networks come down to understanding how various factors alter the geometry of performance surfaces in models with large numbers of parameters.29

For studies of sensitivity and performance surface geometry, Bayesian posterior distributions of parameters provide a complementary approach. A narrow distribution typically means that small changes in a parameter provide much information, which also typically means that small changes strongly alter model outcome. In biology, narrow Bayesian posteriors may associate with less genetic variability than broad posteriors, because narrowness associates with large fitness effects for small changes in trait values.
The approximate Bayesian method used to generate the predicted trajectories in Fig. 2 provides estimates for the posterior distribution of parameters. That method depends on automatic differentiation to calculate the gradient of performance with respect to the parameters. Roughly speaking, the gradient gives the directional change of the parameters favored to improve performance, in the same way that natural selection favors changes in traits that enhance fitness.
Balanced against that directional change in improved performance, the method introduces random fluctuations in parameters, similar to the way that mutation causes random changes in traits. The balance between directional selection and random mutation creates a distribution of parameter values that approximates the Bayesian posterior distribution. Those opposing forces also match the notion of genetic variability maintained by a balance between selection and mutation. The similarity between mutation-selection genetics and computational Bayesian analysis hints at the broad conceptual relations between evolutionary dynamics and the study of optimizing large systems in artificial neural networks and other domains.
As computational techniques for automatic differentiation improve, many new opportunities for theoretical advances will arise in domains for which optimization provides an important tool. Such opportunities for theoretical application will be matched by opportunities for greater conceptual understanding of processes that improve performance.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 3: Circadian dynamics with stochastic fluctuations and random daylight signal. (a-d) Abundance of TFs on $\log_{10}(1 + y)$ scale for $y$ molecules per cell. (e-h) Abundance of mRNAs for associated TFs. See text for explanation. From Fig. 1 of Frank.23

Funding

The Donald Bren Foundation, National Science Foundation grant DEB-1939423, and DoD grant W911NF2010227 support my research.

Table 1: Calculations through the computational graph in Fig. 1. The left box traces the calculation of values forward from the inputs to the output, yielding the final value for $L$. The center box traces the forward chain-rule calculation of the derivatives with respect to the input parameter, $\theta$, yielding $\dot{L} = \partial L/\partial\theta$ at the bottom. The variable $v$ represents the left side of each equation. A separate derivative trace is required for each input parameter. The right box traces the reverse chain-rule calculation of the derivatives, starting from the bottom and proceeding to the top. Reverse mode has the advantage that a single trace calculates the partial derivatives of $L$ with respect to all of the input parameters. A constant value $h = 0.01$ is used in calculations. Structure of the table is based on Baydin et al.'s (2018) Table 2.