An Information Criterion for Inferring Coupling of Distributed Dynamical Systems

The behaviour of many real systems can be modelled by nonlinear dynamical systems whereby a latent system state is observed through a filter. We are interested in interacting subsystems of this form, which we model as a synchronous graph dynamical systems. Specifically, we study the structure learning problem for dynamical systems coupled via a directed acyclic graph. Unlike established structure learning procedures that find locally maximum posterior probabilities of a network structure containing latent variables, our work shows that we can exploit the properties of certain dynamical systems to compute globally optimal approximations of these distributions. We arrive at this result by use of time delay embedding theorems. Taking an information-theoretic perspective, we show that the log-likelihood has an intuitive interpretation in terms of information transfer.


Introduction
Complex systems are broadly defined as systems that comprise interacting nonlinear components [1].Discrete-time complex systems can be represented using graphical models such as graph dynamical systems (GDSs) [2], where dynamical units are coupled via a directed graph.The task of learning the structure of such a system is to infer directed relationships between variables; in the case of dynamical systems, these variables are typically hidden [3,4].In this paper, we study the structure learning problem for complex networks of nonlinear dynamical systems coupled via a directed acyclic graph (DAG).Specifically, we formulate synchronous update GDSs as dynamic Bayesian networks (DBNs) and study this problem from the perspective of information theory.
The structure learning problem for nonlinear dynamical networks is a precursor to inference in systems that are not fully observable.This case encompasses many practical problems of known artificial, biological and chemical systems, such as neural networks [5,6], multi-agent systems [7,8], and various others [1].Modelling a partially observable system as a dynamical network presents a challenge in synthesising these models and capturing their global properties [1].In addressing this challenge, we draw on two parallel fields of research: Bayesian network (BN) structure learning and nonlinear time series analysis.
BN structure learning comprises two subproblems: evaluating the fitness of a graph, and identifying the optimal graph given this fitness criterion [9].The evaluation problem is particularly challenging in the case of graph dynamical systems, which include both latent and observed variables.A number of theoretically optimal techniques exist for the evaluation problem for BNs with complete data [10,11,12], which have been extended to DBNs [13].With incomplete data, however, the common approach is to resort to approximations that find local optima, e.g., expectation-maximisation arXiv:1605.06931v1[cs.LG] 23 May 2016 (EM) [13,14].An additional caveat of the traditional approach to structure learning is that algorithms find an equivalence class of networks with the same Markov structure, and not a unique solution [9].
In nonlinear time series analysis, the problem of inferring coupling strength and causality in complex systems has received significant attention recently [15,16].The additive noise model approach [16,17] attempts to model unidirectional cause and effect relationships with observed random variables, and finds a unique DAG (as opposed to an equivalence class).These studies have been extended by exploring weakly additive noise models for learning the structure of systems of observed variables with nonlinear coupling [18].Another recent approach to inferring causality is convergent crossmapping (CCM), which tests for causation (predictability) by considering the history of observed data of a hidden variable in predicting the outcome of another [19].However, CCM was derived heuristically from Takens' theorem [3] and is not specifically designed for structure learning.
In this paper we exploit the properties of discrete-time multivariate dynamical systems in inferring coupling between latent variables in a DAG.Specifically, the main focus of this paper is to analytically derive a measure (score) for evaluating the fitness of a candidate DAG, given data.We assume the data are generated by a certain family of multivariate dynamical system and are thus able to overcome the issue of latent variables faced by established structure learning algorithms.That is, under certain assumptions of the dynamical system, we are able to employ time delay embedding theorems [20,21] to compute our scores.
Our main result is a tractable form of the log-likelihood function for synchronous GDSs.Using this result, we are able to directly compute the Bayesian information criterion (BIC) [22] and Akaike information criterion (AIC) [23] and thus achieve globally optimal approximations of the posterior distribution of the graph.Finally, we show that the log-likelihood and log-likelihood ratio can be expressed in terms of collective transfer entropy [24,5].This result places our work in the context of effective network analysis [25,26] based on information transfer [27,6] and relates to the information processing intrinsic to distributed computation [28].

Background
This section summarises relevant technical concepts used throughout the paper.First, a stochastic temporal process X is defined as a sequence of random variables (X 1 , X 2 , . . ., X N ) with a realisation (x 1 , x 2 , . . ., x N ) for countable time indices n ∈ N. Consider a collection of M processes, and denote the ith process X i to have associated realisation x i n at temporal index n, and x n as all realisations at that index x n = x 1 n , x 2 n , . . ., x M n .If X i n is a discrete random variable, the number of values the variable can take on is denoted |X i n |.The following sections collect results from DBN literature, attractor reconstruction, and information theory that are relevant to this work.

Dynamic Bayesian networks (DBNs)
DBNs are a general graphical representation of a temporal model, representing a probability distribution over infinite trajectories of random variables (Z 1 , Z 2 , . ..) compactly [13].These models are a more expressive framework than the hidden Markov model (HMM) and Kalman filter model (KFM) (or linear dynamical system) [13].In this work, we denote Z n = {X n , Y n } as the set of hidden and observed variables, respectively, where n ∈ {1, 2, . ..} is the temporal index.
BNs B = (G, Θ) represent a joint distribution p(z) graphically and consists of: a DAG G and a set of conditional probability distribution (CPD) parameters Θ. DBNs B = (B 1 , B → ) extend the BN to model temporal processes and comprises two parts: the prior BN B 1 = (G 1 , Θ 1 ), which defines the joint distribution p B1 (z 1 ); and the two-time-slice Bayesian network (2TBN) B → = (G → , Θ → ), which defines a first-order Markov process p B→ (z n+1 | z n ) [13].This formulation allows for a variable to be conditioned on its respective parent set Π G→ (Z i n+1 ) that can come from the preceding time slice or the current time slice, as long as G → forms a DAG.The 2TBN probability distribution factorises according to G → with a local CPD p D estimated from an observed dataset.That is, given a set of stochastic processes (Z 1 , Z 2 , . . ., Z N ), the realisation of which constitutes a dataset D = (z 1 , z 2 , . . ., z N ), we obtain the 2TBN distribution as where π G→ (Z i n+1 ) denotes the (index-ordered) set of realisations {z j o : Z j o ∈ Π G→ (Z i n+1 )}.Note that, when convenient, we will use z ij n j to denote the parent realisations π G→ (Z i n+1 ) from time n.

Attractor reconstruction
Attractor reconstruction refers to methods for inferring the (hidden) state of a dynamical system from observations.The state of a discrete-time dynamical system is given by a point x n confined to a d-dimensional manifold M. The time evolution of this state is described by a map f : M → M, so that the sequence of states (x n ) is given by x n+1 = f (x n ).In many situations we only have access to a filtered, scalar representation of the state, i.e., the measurement y n = ψ(x n ) given by some measurement function ψ : M → R [3,4].
The standard Takens' theorem [3] shows that for typical f and ψ, it is possible to reconstruct f from the observed time series up to some smooth coordinate change.More precisely, fix some κ (the embedding dimension) and some τ (the time delay), then define the delay embedding map In differential topology, an embedding refers to a smooth map Ψ : M → N between manifolds M and N if it maps M diffeomorphically onto its image; and, therefore, Φ f,ψ has a smooth inverse Φ −1 f,ψ .The implication of Takens' theorem is that for typical f and ψ, the image Φ f,ψ (M) of M is completely equivalent to M itself, apart from the smooth invertible change of coordinates given by the mapping Φ f,ψ .An important consequence of this theorem is that we can define a map n ) [4].There are technical assumptions for Takens' theorem (and the generalised versions employed herein) to hold.These assumptions require (f, ψ) to be generic functions (in terms of Baire space), a restricted number of period points and distinct eigenvalues at each neighbourhood of these points [3,4,20,21].

Information theoretic measures
Conditional entropy represents the uncertainty of a random variable X after taking into account the outcomes of another random variable Y by Multivariate transfer entropy is a measure that computes the information transfer from a set of source processes to a set of destination process [6].In this work, we use the formulation of collective transfer entropy [24], where the information transfer from m source processes V = {Y 1 , Y 2 , . . ., Y m } to a single destination process Y can be decomposed as a sum of conditional entropy terms: where )τ i for some κ i and τ i , and similarly for Y (κ) n .

Representing nonlinear dynamical networks as DBNs
We model multivariate dynamical systems [21] as synchronous update GDSs.With this model, we can express the time evolution of the GDS as a stationary DBN, and perform inference and learning on the subsequent graph.We formally restate the network of dynamical systems from [2] as a special case of the sequential GDS [29] with an observation function for each vertex.Definition 1 (Synchronous graph dynamical system (GDS)).A synchronous GDS is a tuple (G, x n , {f i }, {ψ i }) that consists of: a finite, directed graph G = (V, E) with edge-set E = {E i } and M vertices comprising the vertex set V = {V i }; a network state x n = x i n , composed of states for each vertex V i confined to a d i -dimensional manifold x i n ∈ M i ; a set of local maps {f i } of the form f i : M → M i , which update synchronously and induce a global map f : M → M; and a set of local observation functions {ψ 1 , ψ 2 , . . ., ψ M } of the form Figure 1: Representation of (a) the GDS with three vertices (V 1 , V 2 and V 3 ), and (b) the rolled-out DBN of the equivalent structure.Subsystem V 1 is coupled to both subsystems V 2 and V 3 by means of the edges between latent variables Without loss of generality, we can use local functions: Here, υ f i is i.i.d.additive noise and υ ψ i is noise that is either i.i.d. or dependent on the state, i.e., υ ψ i (x i n+1 ).The subsystem dynamics (4) are therefore a function of the subsystem state x i n and the subsystem parents' state x ij n j at the previous time index, s.t., Each subsystem observation is given by (5).We assume the functions f i and ψ i are invariant w.r.t.time and thus the graph G is stationary.
The time evolution of a synchronous GDS can be modelled as a DBN.First, each subsystem vertex V i = {X i n , Y i n } has an associated state variable X i n and observation variable Y i n ; and, the parents of subsystem V i is denoted Π G (V i ).Since the graph G → is stationary and synchronous, parents of X i n+1 come strictly from the preceding time slice, and additionally Π G→ (Y i n+1 ) = X i n+1 .Thus, we can build the edge set E = {E 1 , E 2 , . . ., E M } in the GDS by means of the DBN.That is, each edge subset E i is build by the DBN edges As an example, consider the synchronous GDS in Fig. 1(a), the subsystem V 1 is coupled to both subsystem V 2 and V 3 through the edge set The time-evolution of this network is shown in Fig. 1(b), where the top two rows (processes X 1 and Y 1 ) are associated with subsystem V 1 , and similarly for V 2 and V 3 .The distributions for the state (4) and observation (5) of M arbitrary subsystems can therefore be factorised according to (1): The scoring functions for the 2TBN network B → can be computed independently of the prior network B 1 [13].We will assume the prior network is given, and focus on learning the 2TBN.As a result, we drop the subscript and note that all references to the network B are to the 2TBN.Since B → is stationary, learning B → is equivalent to learning the synchronous GDS.

Learning synchronous GDSs from data
In this section we develop the theory for learning the synchronous update GDS from data.We will focus on techniques for learning graphical models using the score and search paradigm, the objective of which is to find a DAG G * that maximises a score g(B : D).Given such a score, we can then employ established search procedures to find the optimal graph G * .Thus, we can state that our main goal is to derive a tractable scoring function g(B : D) for synchronous GDSs that gives a parsimonious model for describing the data.This is effectively employing the minimum description length (MDL) principle [30], which is a mathematical formulation of parsimony in model building [22,30].
To derive the score we use the DBN formulation of synchronous GDSs (Sec.3) to show that we cannot directly compute the posterior probability of the network structure (Sec.4.1).By making some assumptions about the system, however, we are able to compute scores for GDSs by use of attractor reconstruction methods (Sec.4.2).We conclude this section by giving an interpretation of the log-likelihood in terms of information transfer (Sec.4.3).

Structure learning for DBNs
Ideally, we want to be able to compute the posterior probability of the network structure G, given data D. Using Bayes' rule, we can express this distribution as , where p(G) encodes any prior assumptions we want to make about the network G. Thus, the problem becomes that of computing the likelihood of the data, given the model, p(D | G).The likelihood can be written in terms of distributions over network parameters [13]: where we denote ( ΘG : D) = log p(D | G, ΘG ) as the log-likelihood function for a choice of parameters ΘG that maximise p(D | G, Θ), given a graph G.
A common approach to compute (7) in closed form is by using Dirichlet priors; this leads to the BD (Bayesian-Dirichlet) score and variants [12,13].However, to obtain this analytic solution, we require counts of the tuples (z i n , π G (Z i n )), which involve hidden variables.We will instead use Schwarz's [22] asymptotic approximation of the posterior distribution, which states that where C(G) is the model dimension (i.e., number of parameters needed for the graph G [13]) and O( 1) is a constant bounded by the number of potential models.The approximation of the posterior (8) requires data come from an exponential family of likelihood functions with conjugate priors over the model G, and the parameters given the model Θ G [22].
Akaike [23] gives a similar criterion by approximating the KL-divergence of any model from the data.We can compute both criteria in terms of the log-likelihood function ( ΘG : D) and the model dimension C(G), and thus the problem can be generalised to that of deriving a MDL scoring function, due to the relationship to the coding scheme described in [30] g MDL (B : When f (N ) = 1, we have the AIC score [23], f (N ) = log(N )/2 is the BIC score [22], and f (N ) = 0 gives the maximum likelihood score.It should be noted that the MDL principle can also be employed via cross-entropy with a different model encoding scheme [10].

Deriving the MDL score for synchronous GDSs
To calculate the MDL score (9), we require tractable expressions for the log-likelihood function ( ΘG : D) and the model dimension C(G).The form of the CPD in (6) specifies these functions, and for (8) to hold, this distribution must come from an exponential family [22].Modelling the dynamical systems with non-parametric techniques requires the number of parameters scale linearly in the size of the data, and thus C(G) scales linearly with N .Instead, we assume the observation data are discretised, such that there are |Y i n | possible outcomes for a observed random variable Y i n .The log-likelihood function can then be expressed as the maximum likelihood estimate for multinomial distributions [13], from (6) the log-likelihood The log-likelihood function (10) involves distributions over latent variables.To compute this function, we resort to state-space (attractor) reconstruction.First, Lemma 1 shows that a future observation from a given subsystem can be predicted from a sequence of past observations.Building on this result, we present a computable formulation of the 2TBN distribution p B→ (z n+1 | z n ) via Lemma 2.
We then derive a tractable form of the log-likelihood function, presented in Lemma 3. It is then shown in Theorem 4 that these lemmas allow us to compute the MDL score (9).Lemma 1.Consider a synchronous GDS (G, x n , {f i }, {ψ i }), where the graph G is a DAG.Each subsystem state follows the dynamics x i n+1 = f i (x i n , x ij n j ) and emits an observation y i n+1 = ψ i (x i n+1 ); the subsystem observation can be estimated, for some map G i , by Proof.Consider a forced system x n+1 = f (x n , w n ) with forcing dynamics w n+1 = h(w n ) and observation y n = ψ(x n+1 ).Given this type of forced system, the bundle delay embedding theorem [4,20] states that the delay map n is an embedding for generic f , ψ, and h.Stark [4] proved this result in the case of forcing dynamics h that are independent of the state x. 1 For notational simplicity, we omit dependence on the noise process for the map Φ f,h,ψ ; the noise can be considered an additional forcing system so long as υ f is i.i.d and υ ψ is either i.i.d or dependent on the state [20].
Given a DAG G, any ancestor of the subsystem V i is not dependent on V i .As such, the sequence is an embedding, since x ij n j is independent of x i n .Let x ijk n k be the index ordered set of parents of node X ij n (which itself is the jth parent of the node X i n ).Under the constraint that G is a DAG, where the state x i n+1 = f i (x i n , x ij n j ) + υ f i , it follows from the bundle delay embedding theorem [4,20] that there exists a map F i that is well defined and a diffeomorphism between observation sequences.From (12) we can write this map The last κ i + j κ ij components of F i are trivial.Denote the first component as G i : R κ i × j R κ ij → R, then we arrive at (11).
Lemma 2. Given an observed dataset D = (y 1 , y 2 , . . ., y N ), where y n ∈ R M , generated by a directed and acyclic synchronous GDS (G, x n , {f i }, {ψ i }), the 2TBN distribution can be written as Proof.The generalised time delay embedding theorem [21] states that, under certain technical assumptions, and given M inhomogeneous observation functions {ψ 1 , ψ2 , . . ., ψ M }, the map is an embedding where each subsystem (local) map Φ f i ,ψ i : M → R κ i , and, at time index n is described by where i κ i = 2d + 1 [21]. 2Therefore, the global map ( 15) is given by Φ f,ψ (x n ) = y i,(κ i ) n and there must exist an inverse map . Given Lemma 1, the existence of Φ −1 f,ψ , and since ∀i, {y , we arrive at the following equation: Rearranging (16) gives the equality in (14).
Lemma 3. Consider a synchronous GDS (G, x n , {f i }, {ψ i }), where the graph G is a DAG.Each subsystem state follows the dynamics x i n+1 = f i (x i n , x ij n j ) and generates an observation y i n+1 = ψ i (x i n+1 ); a complete dataset is given by the sequence of observations D = (y 1 , y 2 , . . ., y N ).The log-likelihood of the data given a network structure can be computed in terms of conditional entropy: Proof.Substituting ( 14) into (10) gives In (18) we have removed arguments of the joint distributions that will be nullified when multiplied with the CPD.Expressing (18) in terms of conditional entropy (2), we arrive at (17).
Theorem 4. The MDL score (9) for synchronous GDS can be computed as: Proof.The distributions for H(X n | Y i,(κ i ) n ) in ( 18) do not depend on the graph G, and therefore We can now compute the number of parameters needed to specify the model as [13] C Since we are searching for the graph G * = max G g(B : D), holding N constant, we can substitute ( 20) and ( 21) into ( 9) and ignore the constant term O(N ) in (20).

The log-likelihood and information transfer
To conclude our study of the scores, we look at the log-likelihood in the context of information transfer.First, rearranging the terms of collective transfer entropy (3) we can rewrite the log-likelihood function (17), leading to the following result.Proposition 1.The log-likelihood function for the synchronous GDS (17) decomposes as follows: Similarly, we can express the log-likelihood ratio in terms of collective transfer entropy.This ratio quantifies the gain in likelihood by modelling the data D by a candidate network B instead of the empty network B ∅ : Recall that the empty DAG G ∅ is one with no parents for all vertices ∀i, Π G (V i ) = Y ij,(κ ij ) n j = ∅.Substituting this definition into (17) gives the following result.Proposition 2. The ratio of the log-likelihood (17) of a candidate DAG G to the empty network G ∅ can be expressed in terms of collective transfer entropy: 5 Discussion and future work We have presented a principled method to score the structure of nonlinear dynamical networks, where dynamical units are coupled via a DAG.We approached the problem by modelling the time evolution of a synchronous GDS as a DBN.We then derived the AIC and BIC scoring functions for the DBN based on time delay embedding theorems.Finally, we have shown that the log-likelihood of the synchronous GDS can be interpreted in the context of information transfer.
The representation of synchronous GDSs as DBNs allows for inference of coupling in dynamical networks and facilitates techniques for synthesis in these systems.DBNs are an expressive framework that allow representation of generic systems, as well as a numerous general purpose inference techniques that can be used for filtering, prediction, and smoothing [13].Our representation therefore allows for probabilistic reasoning for purposes of planning and prediction in complex systems.
Theorem 4 captures an interesting parallel between learning from complete data and learning nonlinear dynamical networks.If the embedding dimension κ and time delay τ are unity, then the MDL score becomes identical to learning a DBN from complete data [13].Thus, our result could be considered a generalisation of typical structure learning procedures.
The results presented here provoke new insights into the concepts of structure learning, nonlinear time series analysis and effective network analysis [25,26] based on information transfer [27,6].
The information-theoretic interpretation of the log-likelihood has interesting consequences in the context of information dynamics and information thermodynamics of nonlinear dynamical networks.The transfer entropy terms in Propositions 1 and 2 show that the optimal structure of a synchronous GDS is immediately related to the information processing of distributed computation [28], as well as the thermodynamic costs of information transfer [31].
In the future, we aim to perform empirical studies to exemplify the properties of the presented scoring functions.Specifically, the empirical studies should yield insight into the effect of weak, moderate and strong coupling between dynamical units.In particular, we are interested in the effect of synchrony in these networks and the relationship to previous results for dynamical systems coupled by spanning trees [2].We conjecture that the approach used here will allow us to derive scoring functions without the assumption of multinomial observations, and thus afford the use of non-parametric density estimators.Parametric techniques, such as learning the parameters of dynamical systems [32,33], could be considered in place of the posterior approximations.