Data-Driven Modeling and Prediction of Complex Spatio-Temporal Dynamics in Excitable Media

Spatio-temporal chaotic dynamics in a two-dimensional excitable medium is (cross-) estimated using a machine learning method based on a convolutional neural network combined with a conditional random field. The performance of this approach is demonstrated using the four variables of the Bueno-Orovio-Fenton-Cherry model describing electrical excitation waves in cardiac tissue. Using temporal sequences of two-dimensional fields representing the values of one or more of the model variables as input, the network successfully cross-estimates all variables and provides excellent forecasts when applied iteratively.


INTRODUCTION
In the life sciences mathematical models based on first principles are scarce, and often a variety of approximate models of different complexity exists for describing a given (experimental) dynamical process. For example, electrical excitation waves in cardiac tissue can be described using partial differential equations (PDEs) with 2 to more than 60 variables, covering the range from simple qualitative models [1,2] to detailed ionic cell models including not only cell membrane voltage but also different ionic currents and gating variables [3,4]. While there are several modalities for measuring membrane voltage (electrical sensors, fluorescent dyes [5]), it is in general much more difficult and expensive (if not impossible) to directly measure the other variables of the mathematical model, such as ionic currents or gating variables. In such cases it is desirable to cross-estimate the variables that are difficult to assess from those that can be easily measured. In control theory this task is addressed by constructing an observer based on a given mathematical model describing the process of interest. Once all state variables of the model have been estimated, the model (e.g., a PDE) can be used to simulate and forecast the future evolution of the dynamical process. This combination of cross estimation and prediction of dynamical variables is the core of all data assimilation methods [6][7][8][9][10], where again the model equations are involved and have to be known. In this contribution, we present a machine learning method for estimating all state variables and forecasting their evolution from limited observations. This "black-box model" consists of a convolutional neural network (CNN) combined with a conditional random field (CRF) and will be introduced in section 2.
For training and evaluating the network two dimensional spatio-temporal time series are used, which were generated by the Bueno-Orovio-Fenton-Cherry (BOCF) model [11] describing complex electrical excitation waves in cardiac tissue. This model is introduced in section 3. As modeling tasks we consider cross estimation of variables, forecasting dynamics using an iterative feedback scheme, and a combination of forecasting and cross estimation providing future values of not measured variables. These results are presented in section 4. A summary and a brief discussion of potential future developments are given in section 5.

DATA DRIVEN MODELING
In data-driven modeling mathematical models are not based on first principles (e.g., Newton's laws, Maxwell's equations, ...) but are directly derived from experimental measurement data or other physical observations. The model should describe the experiment as precisely as possible, but it should also possess a high level of generalizability, i.e., the ability to provide a suitable and good description for data from a very similar experiment. Therefore, overfitting has to be avoided and all irrelevant aspects that are not necessary to describe the desired effect should be discarded when generating the model (without employing human expert knowledge). Many approaches for generating (dynamical) models from (training) data have been devised, including autoregressive models [12], evolutionary algorithms, in particular genetic algorithms [13], local modeling [14], reservoir computing [15][16][17][18][19], symbolic regression [20], and adaptive fuzzy rule-based models [21]. Furthermore, Monte Carlo techniques may be used for assessing uncertainty in model parameters [22]. In this work we present a modeling ansatz which combines deep convolutional neural networks [23] for feature extraction and dimension reduction with conditional random fields (CRFs) [24] for modeling the properties of temporal sequences.

Artificial Neural Network
Artificial neural networks (ANNs) [25][26][27] are parameterizable models for approximating an (unknown) function F implicitly given by the data. The actual function provided by the ANN,

f : R^O → R^P,  X ↦ f(X),   (1)

should be a good approximation of F, i.e., f ≃ F. Here O ∈ N and P ∈ N denote the dimension of the input and the output of f, respectively. A widely used type of ANN are feed-forward neural networks (FNNs) where, in general, f is given by

f(X) = ψ(W X + b)   (2)

with a non-linear function ψ applied component-wise, an input vector X ∈ R^O, a weight matrix W ∈ R^(P×O), and a bias b ∈ R^P. Equation (2) is called a one-layer FNN. By recursively applying the output of one layer as input to the next layer, a multi-layer FNN can be constructed:

f(X) = ψ^(L)( W^(L) ψ^(L−1)( ... ψ^(1)( W^(1) X + b^(1) ) ... ) + b^(L) ).   (3)

Equation (3) describes a multi-layer FNN with L ∈ N layers. In the following an input with several variables is considered and the input is given by X ∈ R^(h×w×d), with h ∈ N rows and w ∈ N columns of the input field, and the number of variables d. To improve the approximation properties of the network Equation (3), FNNs may contain additional convolutional layers, leading to state-of-the-art models for data classification, so-called convolutional neural networks (CNNs) [23].
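The layer recursion of Equations (2) and (3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the network used in the paper; the dimensions and the choice of ψ = tanh are arbitrary examples.

```python
import numpy as np

def fnn_layer(X, W, b, psi=np.tanh):
    """One FNN layer, Equation (2): f(X) = psi(W X + b), psi applied component-wise."""
    return psi(W @ X + b)

def fnn(X, weights, biases, psi=np.tanh):
    """Multi-layer FNN, Equation (3): feed each layer's output to the next layer."""
    for W, b in zip(weights, biases):
        X = fnn_layer(X, W, b, psi)
    return X

# Hypothetical dimensions: O = 4 inputs, one hidden layer of width 8, P = 2 outputs.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 4)), rng.standard_normal((2, 8))]
bs = [np.zeros(8), np.zeros(2)]
y = fnn(rng.standard_normal(4), Ws, bs)   # output vector of dimension P = 2
```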

Network Architecture
The network used in the following for prediction of multivariate time series is built on the architecture of a convolutional autoencoder [28] with residual connections [29], consisting of an encoding path (left half of the network, from 512 × 512 to 64 × 64) to retrieve the features of interest and a symmetric decoding path (right half of the network, from 64 × 64 back to 512 × 512). As illustrated in Figure 1, each encoding/decoding path consists of multiple levels, i.e., resolutions, for feature extraction on different scales and noise reduction. The conditional random field block has a special role: based on the selected features, the CRF should map a sequence of features of previous time steps up to t to the next time step t + Δt. The other four components of the network are basic building blocks, like regular convolutional layers followed by rectified linear unit activation and batch normalization (these blocks are omitted in Figure 1 for simplicity). Each residual layer consists of three convolutional blocks and a residual skip connection. A max-pooling layer is located between levels in the encoding path to perform downsampling for feature compression. The deconvolutional layer [30] is located between levels in the decoding path to up-sample the input data using learnable interpolations. The input for the network consists of all four system variables of the BOCF model, which will be introduced in section 3.1, or of a sequence of the four system variables, as introduced in section 4.1. The output of the network always consists of the four system variables.

Convolution Layer
Convolutional neural networks [23,26,27] receive a training data set X = {X_1, X_2, ..., X_m}, where X_α ∈ R^(h×w×d). The data processing through the network is described layer-wise, i.e., in the l-th convolutional layer the input X^(l) is transformed into the raw output o^(l), which is in turn the input to the next layer l + 1, where the dimension changes depending on the number and size of convolutions, padding, and stride of the layers as illustrated in Figure 1. The padding parameter P^(l) ∈ N, for layer l, describes the number of zeros at the edges of a field by which the field is extended. This is necessary since every convolution larger than 1 × 1 decreases the output size. The stride parameter S^(l) ∈ N determines how far the kernel is shifted in each step to compute the next spatial position (x, y). This specifies the overlap between individual output pixels, and it is here set to 1. Each layer l is specified by its number of kernels K^(l) = {K^(l,1), K^(l,2), ..., K^(l,d^(l))}, where d^(l) ∈ N is the number of kernels in layer l, and its additive bias terms b^(l) = {b^(l,1), b^(l,2), ..., b^(l,d^(l))} with b^(l,d) ∈ R. Note that the input X^(l) ∈ R^(h^(l)×w^(l)) in the l-th layer, with size h^(l) × w^(l), is processed by the set of kernels {K^(l,d)}. For each kernel d the raw output o^(l,d) is computed element by element as

o^(l,d)_(x,y) = b^(l,d) + Σ_i Σ_j K^(l,d)_(i,j) X^(l)_(x+i, y+j).   (4)

The result is passed through an activation function ψ to obtain the activation of each unit in layer l:

ψ(o^(l,d)_(x,y)).   (5)

To obtain o^(l) = {o^(l,1), ..., o^(l,d^(l))}, Equation (5) needs to be calculated ∀d = 1, ..., d^(l) and ∀(x, y). Each spatial calculation of o^(l,d)_(x,y) is considered as a unit and ψ(o^(l,d)_(x,y)) as the feedforward activation of the unit. The value of an element of a kernel (K^(l,d)_(i,j)) between two units is the weight of the feedforward connection.
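The element-wise convolution of Equation (4), including the padding and stride parameters described above, can be sketched as follows for a single channel (a naive loop implementation for illustration, not an efficient one):

```python
import numpy as np

def conv2d(X, K, b=0.0, padding=0, stride=1):
    """Single-channel 2D convolution, Equation (4):
    o_{x,y} = b + sum_{i,j} K_{i,j} * X_{x*S+i, y*S+j}."""
    if padding > 0:
        X = np.pad(X, padding)          # extend the field with zeros at the edges
    kh, kw = K.shape
    h = (X.shape[0] - kh) // stride + 1  # output height shrinks without padding
    w = (X.shape[1] - kw) // stride + 1
    o = np.empty((h, w))
    for x in range(h):
        for y in range(w):
            patch = X[x*stride:x*stride+kh, y*stride:y*stride+kw]
            o[x, y] = np.sum(patch * K) + b
    return o

relu = lambda o: np.maximum(o, 0.0)      # activation psi, Equation (5)

X = np.arange(16.0).reshape(4, 4)
K = np.ones((3, 3))
act = relu(conv2d(X, K, padding=1, stride=1))  # padding 1 keeps the 4 x 4 size
```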
Such systems are well-suited for feature extraction [28], but their linear structure does not allow a direct modeling of temporal changes or the possibility to process a sequence of data. To enable temporal modeling, we employ linear-chain conditional random fields [31] that will be introduced in the next section.

Linear-Chain Conditional Random Fields
To implement a probabilistic forecasting block we consider the output of the convolutional layer o and the corresponding forecast q as random variables O and Q, respectively. Both random variables O and Q are jointly distributed and in a predictive framework we aim at constructing a conditional model P(Q|O) from paired observation and forecast sequences.
Let G = (V, E) be a graph such that Q = (Q_v)_(v∈V), i.e., Q is indexed by the vertices of G. Each vertex in G represents a state, a history, or a forecast. Then (O, Q) is a conditional random field (CRF) if, conditioned on O, the random variables Q_v obey the Markov property with respect to the graph [24]. A linear-chain conditional random field, where o is a sequence of historical extracted features and q a corresponding forecasted feature in the future, is given by

P(q | o; θ) = (1/Z(o)) Σ_h exp( Ψ(q, h, o; θ) ),   (6)

where Z(o) is a normalization factor, q ∈ Q, Q is a set of future events, and h ∈ H, H being the set of layers of the CRF where each element h_i of h represents a historical state of an event at time t. θ is the set of parameters.
Ψ(q, h, o; θ) is a so-called potential function (also called local or compatibility function) which measures the compatibility (i.e., high probability) between a forecast, a set of observed features, and a configuration of historical states, such that

Ψ(q, h, o; θ) = Σ_(j=1)^n φ_j(o, ω) · θ_h[h_j] + Σ_(j=1)^n θ_q[q, h_j] + Σ_(j,k) θ_ǫ[q, h_j, h_k].   (7)

Here n is the number of historical states and φ_j(o, ω) is a vector that can include any feature of the observation specific for a specific time window ω, and θ = [θ_h, θ_q, θ_ǫ, θ_p] are model parameters. To restrict the search space for possible parametrizations only sine, cosine, and a linear interpolation were used as feature functions. Equation (7) captures the influence of the past features on the forecast. For training the following likelihood function is defined:

L(θ) = Σ_(α=1)^n log P(q_α | o_α; θ),   (8)

where n is the number of training examples. By maximizing the likelihood for the forecasted training data the optimal parameter set θ* is determined. To find θ*, Equation (8) can be evaluated by the same gradient descent method which is used for optimizing/training the autoencoder. To forecast the input sequence with a linear-chain CRF it is necessary to compute the q sequence that maximizes the following equation:

q* = argmax_q P(q | o; θ*).   (9)

The sequence maximizing this is then used by the deconvolutional part of the network to map the features back to the desired system variables at t + Δt.
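The forecasting step of Equation (9) can be illustrated with a heavily simplified toy sketch: a discrete set of candidate events, a single linear potential instead of the full hidden-state sum, and a hypothetical feature function `phi` (the sine/cosine features mentioned above). This is not the CRF block of the paper, only the argmax-over-candidates idea.

```python
import numpy as np

def crf_forecast(o_seq, theta, feature_fn, candidates):
    """Pick the forecast q maximizing a linear potential
    Psi(q, o; theta) = theta . phi(o, q) over a set of candidate events."""
    scores = np.array([np.dot(theta, feature_fn(o_seq, q)) for q in candidates])
    # Softmax normalization turns scores into P(q | o); the argmax is unaffected.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return candidates[int(np.argmax(probs))], probs

# Hypothetical sine/cosine features of a candidate q plus the latest observation.
phi = lambda o, q: np.array([np.sin(q), np.cos(q), o[-1] * q])

best, p = crf_forecast(np.array([0.1, 0.2, 0.3]),       # toy observation sequence
                       np.array([1.0, 0.5, 2.0]),       # toy parameters theta
                       phi, candidates=[0, 1, 2, 3])
```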

MODELING EXCITABLE MEDIA
Excitable systems are non-linear dynamical systems with a stable fixed point. Small perturbations of the stable equilibrium decay, but stronger perturbations above some characteristic threshold lead to a high amplitude excursion in state space until the trajectory returns to the stable fixed point. In neural or cardiac cells this response leads to a so-called action potential. After such a strong response a so-called refractory period has to pass until the next excitation can be initialized by perturbing the system again.
An excitable medium consists of excitable systems (e.g., cells), which are spatially coupled. Electric coupling of neighboring cardiac cells, for example, can be modeled by means of a diffusion term for local currents. The resulting partial differential equations (PDEs) describe the propagation of undamped solitary excitation waves. Due to the refractory time of local excitations, spiral or scroll waves are very common and typical hallmarks of excitable media, which can lead to stable periodically rotating wave patterns or may break up, forming complex chaotic wave dynamics. From the large selection of different PDE models describing excitable media we have chosen the Bueno-Orovio-Cherry-Fenton (BOCF) model which was devised as an efficient model for cardiac tissue [11].

Bueno-Orovio-Cherry-Fenton Model
The Bueno-Orovio-Cherry-Fenton (BOCF) model [11] provides a compact description of excitable cardiac dynamics. We use this model as a benchmark to validate our approach for forecasting and cross-estimation of complex wave patterns in excitable media. The evolution of the four system variables of the BOCF model is given by four PDEs,

∂u/∂t = ∇·(D∇u) − (J_si + J_fi + J_so),
∂v/∂t = (1 − H(u − θ_v)) (v_∞ − v)/τ_v⁻ − H(u − θ_v) v/τ_v⁺,
∂w/∂t = (1 − H(u − θ_w)) (w_∞ − w)/τ_w⁻ − H(u − θ_w) w/τ_w⁺,
∂s/∂t = ((1 + tanh(k_s (u − u_s)))/2 − s)/τ_s,

where u represents the membrane voltage and H(·) denotes the Heaviside function. The three currents J_si, J_fi, and J_so are given by the equations

J_si = −H(u − θ_w) w s / τ_si,
J_fi = −v H(u − θ_v) (u − θ_v) (u_u − u) / τ_fi,
J_so = (u − u_o)(1 − H(u − θ_w))/τ_o + H(u − θ_w)/τ_so.

Furthermore, seven voltage-dependent variables are required. The characteristic model dynamics is determined through 28 parameters. In our simulations we used a set of parameters [11] given in Table 1 for which the BOCF model exhibits chaotic excitation wave dynamics similar to the Ten Tusscher-Noble-Noble-Panfilov (TNNP) model [32]. The spatio-temporal chaotic dynamics of this system is actually transient chaos whose lifetime grows exponentially with system size [33,34]. To obtain chaotic dynamics with a sufficiently long lifetime the system has been simulated on a domain of 512 × 512 grid points with a grid constant of Δx = 1.0 space units and a constant diffusion coefficient.
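The numerical scheme behind such simulations can be sketched with an explicit Euler step of a reaction-diffusion system. To keep the sketch short, the reaction terms below are a generic two-variable FitzHugh-Nagumo-type surrogate, not the four-variable BOCF right-hand side; the grid, time step, and parameter values are illustrative assumptions only (the paper's domain is 512 × 512, here shrunk to 64 × 64).

```python
import numpy as np

def laplacian(u, dx=1.0):
    """Five-point stencil; periodic wrap-around is used here for brevity
    (the actual simulations would use no-flux boundary conditions)."""
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
            np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u) / dx**2

def step(u, v, dt=0.1, D=0.2, a=0.1, eps=0.01, b=0.5):
    """One explicit Euler step of a generic excitable medium:
    diffusion acts on the voltage-like variable u, v is a slow recovery variable."""
    du = D * laplacian(u) + u * (1.0 - u) * (u - a) - v
    dv = eps * (b * u - v)
    return u + dt * du, v + dt * dv

u = np.zeros((64, 64))
u[28:36, 28:36] = 1.0          # localized super-threshold stimulus
v = np.zeros_like(u)
for _ in range(100):           # propagate the excitation wave for 100 steps
    u, v = step(u, v)
```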

RESULTS
The proposed network model was trained with simulated data generated by the BOCF model with the parameter values given in Table 1. Ten trajectories with different initial conditions for the variables u, v, w, and s were generated by simulating the BOCF model, each yielding a time series of 50,000 samples spanning a period of 5 s.¹ Five of these data sets, randomly chosen, were used to train the network, while the other solutions were used for validation. For training the Adam optimizer [35] was used, with a learning rate lr = 0.0001 and β_1 = 0.9, β_2 = 0.999. In order to quantify the performance of the estimation and prediction methods the similarity of target fields and output fields of the network has to be quantified. For this purpose we use the Jensen-Shannon divergence (JSD) [36] applied to normalized fields of the variables of the BOCF model. The JSD of two discrete probability distributions A and B is defined as

JSD(A ‖ B) = (1/2) D_KL(A ‖ M) + (1/2) D_KL(B ‖ M),

where M = (1/2)(A + B) and D_KL(A ‖ M) is the Kullback-Leibler divergence [37]:

D_KL(A ‖ B) = Σ_i A_i log( A_i / B_i ).

During training the JSD was used as the objective function to be minimized (for a GPU implementation of the JSD see [38]). The JSD is bounded by 0 and 1 and a value below 0.02 was considered to indicate no discernible differences between the two distributions (fields). An alternative for quantifying the deviation would be the Fractions Brier Score [39]. For training the network, for each trajectory at each time step, sequences of lengths up to m = 10 were used as input.

¹ We consider all variables and parameters of the BOCF model as dimensionless. The parameter values given in Table 1 are, however, consistent with the choice of a time unit equalling 1 ms. In this case all t-values given in this article would correspond to milliseconds.
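The JSD computation above is straightforward to implement. A minimal sketch, assuming base-2 logarithms (which give the bound JSD ≤ 1 stated in the text) and fields normalized to discrete probability distributions:

```python
import numpy as np

def kld(a, b):
    """Kullback-Leibler divergence D_KL(A || B) for discrete distributions,
    with the convention 0 * log 0 = 0."""
    mask = a > 0
    return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

def jsd(a, b):
    """Jensen-Shannon divergence: 0.5*D_KL(A||M) + 0.5*D_KL(B||M), M = (A+B)/2."""
    m = 0.5 * (a + b)
    return 0.5 * kld(a, m) + 0.5 * kld(b, m)

def normalize(field):
    """Turn a non-negative 2D field into a discrete probability distribution."""
    field = field.ravel()
    return field / field.sum()

rng = np.random.default_rng(1)
a = normalize(rng.random((8, 8)))     # stand-in for a normalized target field
b = normalize(np.ones((8, 8)))        # stand-in for a normalized output field
d = jsd(a, b)                         # 0 <= d <= 1 with base-2 logarithms
```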
The input of the network consisted of fields of variables that were assumed to be measured and random fields representing variables that were considered to be not available.

Forecast
For forecasting, the input of the network consisted of sequences of length m = 10 of u, v, w, and s given by {u_(t−m+1), ..., u_t}, {v_(t−m+1), ..., v_t}, {w_(t−m+1), ..., w_t}, and {s_(t−m+1), ..., s_t}. The desired output of the network is then u_(t+Δt), v_(t+Δt), w_(t+Δt), and s_(t+Δt). By using the output of the network as a new input the system can be run iteratively in a closed loop for long-term prediction. The development of the JSDs of u, v, w, s through time is shown in Figure 3A. Since the u and the s fields look quite similar (see Figures 2A,D) their JSD-values are almost the same. The w-field (Figure 2C) exhibits relatively high values at all spatial locations and therefore the JSD of two such fields is rather low. On the other hand, the v field (Figure 2B) possesses only very localized structures with high values and this leads to rather high values of the JSD for (slightly) different fields. Figure 3B shows for comparison the root normalized mean squared error (RNMSE) of all variables u, v, w, s, which is given by

RNMSE(t) = sqrt( (1/M²) Σ_(i,j) ( v_(ij)(t) − v^BOCF_(ij)(t) )² ) / sqrt( (1/(T_F M²)) Σ_(t,i,j) ( v^BOCF_(ij)(t) − v̄ )² ).

Here v̄ denotes the temporal and spatial mean value of the BOCF sequence of length T_F, M² = 512 · 512 is the number of grid points of the domain, and v^BOCF_(ij) denotes the value of variable v at grid point (i, j) for the reference solution generated by the BOCF model. As can be seen in Figure 3A all four curves possess very similar values and indicate an increase of the error already during the initial period for t ∈ [0, 1000]. Figure 4 shows a comparison of the error dynamics of the forecast obtained with the iterated network with feedback (orange curve) and the dynamics of a BOCF model starting from slightly perturbed initial conditions (blue curve). Both curves give the root normalized mean squared error (RNMSE) with respect to the same reference orbit generated by the BOCF model.
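The RNMSE as reconstructed above (RMSE over the grid per time step, normalized by the spatio-temporal standard deviation of the reference sequence; this normalization is an assumption recovered from the text's mention of v̄, T_F, and M²) can be computed as:

```python
import numpy as np

def rnmse(v_pred, v_ref):
    """Root normalized mean squared error per time step.

    v_pred, v_ref: arrays of shape (T, M, M) -- a sequence of 2D fields.
    Returns an array of length T: the spatial RMSE at each time step,
    normalized by the spatio-temporal standard deviation of the reference."""
    v_bar = v_ref.mean()                              # temporal and spatial mean
    norm = np.sqrt(np.mean((v_ref - v_bar) ** 2))     # std of the reference sequence
    return np.sqrt(np.mean((v_pred - v_ref) ** 2, axis=(1, 2))) / norm

rng = np.random.default_rng(2)
v_ref = rng.random((5, 8, 8))           # toy reference sequence, T_F = 5
err = rnmse(v_ref + 0.1, v_ref)         # constant offset gives a constant error
```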
The perturbation of the initial condition of the second BOCF solution with respect to the initial condition of the reference orbit was chosen to be very small. Therefore, during the initial phase the deviation still remains so small that (with semilogarithmic axes) a linear segment of the error curve occurs that can be used to estimate the largest Lyapunov exponent [40]. Once the error of the perturbed BOCF orbit (blue curve) reaches the level of the network prediction error (orange curve) both error curves continue to increase in the same way, indicating that the network almost perfectly learned the true dynamics of the BOCF model.

[Figure 6 caption: (A) Cross estimation for the cases (v_t, w_t, s_t → u_t), (w_t, s_t → u_t, v_t), (u_t, v_t → w_t, s_t), (u_t → v_t, w_t, s_t), and (w_t → u_t, v_t, s_t), based on the input from the BOCF simulation. (B) Cross estimation of future values of not measured variables for the cases (v_t*, w_t*, s_t* → u_τ), (w_t*, s_t* → u_τ, v_τ), (u_t*, v_t* → w_τ, s_τ), (u_t* → v_τ, w_τ, s_τ), and (w_t* → u_τ, v_τ, s_τ), based on the forecast of the data-driven model for a period of τ = 1,000, where t* denotes 10 successive snapshots at times 0, 0.1, ..., 0.9 constituting the input. In both diagrams the orange line is the median value for each case, the box extends from the lower to the upper quartile values, and the whiskers extend from the box to show the range of the data; flier points are those past the end of the whiskers.]
To illustrate the deviation between the u field forecasted by the network and the (true) u field provided by the simulation of the BOCF PDE, Figure 5 shows snapshots at times t = 500, t = 1,500, t = 3,000, and t = 5,000. While at t = 500 original (A) and forecast (E) are almost indistinguishable, the snapshots at t = 1,500 exhibit minor differences (Figures 5B,F). At t = 3,000 only rough structures agree (Figures 5C,G), until at t = 5,000 forecast and simulation appear completely decorrelated (Figures 5D,H). The full evolution of the forecast compared to the original dynamics generated with the BOCF model is also available as a movie (Supplemental Data). Compared to a typical spiral rotation period of approximately T_sp = 350, good forecasting results can be obtained for about five spiral rotations, corresponding to 5 · 350 / 4 ≈ 437 Lyapunov times T_L = 1/λ_1 ≈ 4 given by the largest Lyapunov exponent λ_1 ≈ 0.25 (see Figure 4).

Cross-Estimation
For cross-estimation only a part of the system variables is considered as being directly observable or measurable. Based on these available variables, the other, not measurable, variables have to be estimated (a task also called cross prediction). In the context of the BOCF model we shall, for example, estimate v_t, w_t, s_t from observations of u_t only. Since the network expects all system variables as input, the not observed variables were replaced by uniform noise in the range 0-0.3. For this purpose, for every t ∈ [0, 1000] the data of the BOCF model were used as single-time-step input for the network and the cases (v_t, w_t, s_t → u_t), (w_t, s_t → u_t, v_t), (u_t, v_t → w_t, s_t), (u_t → v_t, w_t, s_t), and (w_t → u_t, v_t, s_t) were considered as estimation tasks. Figure 6 shows the JSD statistics for all these cases. The low JSD values for (v_t, w_t, s_t → u_t) indicate that the variable u can be very well estimated from the variables v, w, s, which could be expected because the variable u is part of the PDEs of the other variables. Similarly good estimation results are obtained for (u_t → v_t, w_t, s_t), which is remarkable, because the membrane potential u is the variable which can be measured most easily in experiments, and the result shows that this information is sufficient to recover the other variables v, w, and s of the BOCF model. The worst performance is achieved if only w is used to cross-estimate all other system variables. These cross-estimation results are in very good agreement with the performance of an Echo State Network applied to similar data [19].
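Preparing such an input — keeping the measured fields and substituting uniform noise for the unobserved ones — can be sketched as follows (the function name and dictionary layout are illustrative, not from the paper):

```python
import numpy as np

def mask_unobserved(fields, observed, low=0.0, high=0.3, seed=None):
    """Replace every variable not in `observed` by uniform noise in [low, high),
    mimicking the network input when only some variables are measurable."""
    rng = np.random.default_rng(seed)
    out = {}
    for name, field in fields.items():
        out[name] = field if name in observed else rng.uniform(low, high, field.shape)
    return out

# Toy stand-ins for the four BOCF fields on the 512 x 512 grid.
fields = {name: np.zeros((512, 512)) for name in "uvws"}
x = mask_unobserved(fields, observed={"u"}, seed=0)   # the case (u_t -> v_t, w_t, s_t)
```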

Forecast and Cross-Estimation
This investigation combines the two previous ones. In this case, however, the data from the BOCF model were not used at every time step; instead, only ten time steps from the BOCF model were used to initialize the forecast of the network. Depending on which variables were to be estimated, the corresponding BOCF variables used for initialization were replaced by uniform noise, as before. Figure 6B shows the JSD statistics for the four estimation cases considered, and in Figure 7 snapshots of the input and the true and estimated fields are presented, illustrating the very good performance at time t = 100.

DISCUSSION
Spatio-temporal non-linear dynamical systems like extended systems (described by PDEs) or networks of interacting oscillators may exhibit very high-dimensional chaotic dynamics. A typical example are the complex wave patterns occurring in some excitable media. As a representative of this class of systems we used the BOCF model describing electrical excitation waves in cardiac tissue, where chaotic dynamics is associated with cardiac arrhythmias. For future applications like monitoring and predicting the dynamical state of the heart or the impact of interventions, mathematical models are required that describe the temporal evolution or the relation between different (physical) variables. As an alternative to the large number of simple qualitative or detailed (ionic) models (incorporating many biophysical details and corresponding variables) we presented a machine learning approach for data-driven modeling of the spatio-temporal dynamics. A convolutional neural network combined with a linear chain of conditional random fields was trained and validated with data generated by a simulation of the BOCF model. To mimic experimental limitations when measuring cardiac dynamics we considered different cases where only some of the variables of the BOCF model were assumed to be available as input of the generated model and the not measurable variables were replaced by random numbers. Running the trained network in a closed-loop (feedback) configuration, iterated prediction provided forecasts of the complex dynamics that turned out to follow the true (chaotic!) evolution of the BOCF simulation for about five periods of the intrinsic spiral rotation. These results clearly show that machine learning methods like those employed here provide faithful models of the underlying complex dynamics of excitable media and, when suitably trained, can provide powerful tools for predicting the spatio-temporal evolution and for cross-estimating not directly observed variables.

AUTHOR CONTRIBUTIONS
SH performed numerical simulations. UP and SH identified the scientific topic. UP, FW, and SH devised the strategy for solving it, and wrote the manuscript.